Make WordPress Core

Opened 4 months ago

Closed 2 months ago

#63837 closed enhancement (fixed)

Update wp_check_invalid_utf8()

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version: 6.9
Component: Charset Keywords: has-patch has-unit-tests
Focuses: Cc:

Description (last modified by dmsnell)

There are a few challenges with wp_check_invalid_utf8()

  • Its behavior is dependent on Unicode support in the PCRE functions.
  • PCRE Unicode support has changed across versions, with older versions allowing invalid UTF-8.
  • It returns false if $strip = true is requested.
  • When a system lacks support there’s zero fallback.
  • It assumes that input strings are encoded with blog_charset.

The last point is inherent to how the function works, but the other points can be updated by relying on the newer wp_is_valid_utf8() and by providing a custom fallback method to strip out invalid byte sequences.

Improving clarity around this function involves removing the blurred line between determining if content is allegedly UTF-8 and handling it as if it is. A new function, wp_scrub_utf8(), can be used to produce a valid UTF-8 string formed by replacing invalid sequences of bytes with the Unicode replacement character U+FFFD (). This behaves as wp_check_invalid_bytes( $s, true ) should work if the blog_charset is UTF-8.

Ideally, data should be transformed from some encoding into UTF-8 as soon as it enters WordPress and then code which previously called wp_check_invalid_utf8() can now call wp_scrub_utf8() if it truly needs to remove invalid bytes.

Change History (11)

This ticket was mentioned in PR #9498 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#1

  • Keywords has-patch has-unit-tests added

Trac Ticket: Core-63837

wp_check_invalid_utf8() has been dependant on the runtime configuration of the system running it. For systems without the necessary Unicode support it would simply not perform any validation of the input. Worse, the use of icon() is not only system-dependant, but it also fails to perform substitution when requested, returning false instead of the substituted string as described in the function docblock.

In this patch a newer spec-compliant fallback is provided while the function itself defers to mb_scrub(). Both of these implementations are spec-compliant (to UTF-8) and perform “Maximal subpart replacement” when encountering invalid byte sequences — an important aspect to maintain secure and quality interop between systems.

#2 @dmsnell
4 months ago

  • Description modified (diff)

#3 @siliconforks
4 months ago

I don't know if this is something worth worrying about, but the new tests are causing about a hundred test failures on Ubuntu 22.04 LTS.

It appears that in the version of PHP being used (8.1.2, which is very old), mb_scrub() does not perform maximal subpart replacement. (It still performs replacement, just not "maximal" replacement.) This appears to be fixed in later 8.1.x versions.

A simple program to demonstrate the issue:

<?php

echo PHP_VERSION . "\n";

// https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G67519
$original = "\xC0\xAF\xE0\x80\xBF\xF0\x81\x82\x41";

$scrubbed = mb_scrub( $original, 'UTF-8' );
var_dump( $scrubbed );

Running this on PHP 8.1.2 and PHP 8.1.33, this is what I get:

$ ~/php-8.1.2/bin/php mb_scrub.php
8.1.2
string(7) "??????A"
$ ~/php-8.1.33/bin/php mb_scrub.php
8.1.33
string(9) "????????A"

Again, I'm not sure if this is actually something to be concerned about - the reason I bring it up is that 8.1.2 is Ubuntu's LTS version so it is still being used.

#4 follow-up: @dmsnell
4 months ago

This is great testing @siliconforks — thank you! I’ve pushed one update which falls to the failsafe if running on a PHP version earlier than 8.1.6 where the change was introduced.

My first thought was to change the tests to only validate that substitution on the newer versions, but considered the value of having universal behavior to be greater.

Ultimately this substitution isn’t super important, but it’s nice to have where possible.


Where did you see the tests fail? I’m only seeing passing tests but am glad someone is looking out for version differences.

#5 in reply to: ↑ 4 @siliconforks
4 months ago

Replying to dmsnell:

Where did you see the tests fail? I’m only seeing passing tests but am glad someone is looking out for version differences.

I noticed it while running the tests on my own Ubuntu 22.04 machine, using the version of PHP provided by the OS. (I don't think any of the CI tests use this configuration.)

#6 @jonsurrell
4 months ago

It returns false if $strip = true is requested.

The strip behavior is very interesting and the proposed change in behavior is important. It's tempting to rename the parameter to something like $replace or $substitute, but that's a potential breaking change with PHP8 named parameters.


The unicode standard is clear on the practice of U+FFFD substitution and notes that ignoring "bad" bytes represents a security risk:

If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.

The mentioned technical report gives some examples, but also clearly states:

If characters are to be substituted for ill-formed subsequences, it is important that those characters be relatively safe.

  • Deletion (substituting the empty string) can be quite nasty, because it joins characters that would have been separate…
  • Substituting characters that are valid syntax for constructs such as file names has similar problems. For example, the '.' can be very problematic.
    • U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it does not have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing.
    • Where U+FFFD is not available, a common alternative is "?". While this character may occur syntactically, it appears to be less subject to attack than most others.

I was reviewing the proposed changes to understand the difference and discovered this myself. The existing strip implementation relies on iconv in this way:

<?php
var_dump(
    iconv( 'utf-8', 'utf-8', ".\xC0." ),              // C0 is never valid.
    iconv( 'utf-8', 'utf-8', ".\xE2\x8C." ),          // Missing A3 at end.
    iconv( 'utf-8', 'utf-8', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
    iconv( 'utf-8', 'utf-8', ".\xC1\xBF." ),         // Overlong sequence.
    iconv( 'utf-8', 'utf-8', ".\xED\xA0\x80." )    // Surrogate half.
);

Each of those generates a notice and returns false because the conversion fails. I reviewed iconv documentation to confirm this behavior and noticed the following:

If the string IGNORE is appended, characters that cannot be represented in the target charset are silently discarded.

That's likely what was intended in the original implementation, but is certainly not the documented behavior of this function. Each of these returns the string ".." with no warning:

<?php
var_dump(
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xC0." ),              // C0 is never valid.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C." ),          // Missing A3 at end.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xC1\xBF." ),         // Overlong sequence.
    iconv( 'utf-8', 'utf-8//IGNORE', ".\xED\xA0\x80." )    // Surrogate half.
);

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


4 months ago

#8 @dmsnell
4 months ago

It’s rare for me to go on a hike and not come back with interesting ideas about how to proceed on these long problems. To that end, I am sorry to report that I made this issue much more expansive, bringing back ideas from WordPress/wordpress-develop#6883. I probably need to create a new Ticket for this, but all of the work revolves around one remaining issue that was bothering me about the proposed PR: can we avoid maintaining two or more UTF-8 decoders?

I would like to split off the work I have done to stage a broader update for 6.9.0.


@siliconforks the code in the proposed PR and all updated versions of it account for that change. thanks again!


The strip behavior is very interesting and the proposed change in behavior is important.

I’m not sure on this @jonsurrell but torn myself. It’s not a proposed change in behavior, just a bug-fix for a defect that’s been in the code for 16 years. I did look for tickets reporting this but couldn’t find any, and I suspect that the bug has been noticed, but hard to report because it’s not obvious where it appears. The failing behavior is to return an empty string, and this is the same behavior when a string is deemed “unsafe” by other means.

What this means is that this specific bug is masked by other intended rejections where the function is called. Fixing this bug might actually solve a number of unreported issues. The function docs clearly indicate that invalid bytes are to be “stripped” but doesn’t indicate how. Regardless, there’s a clear statement that only the invalid bytes are to be stripped, not the entire string.

Given that we should avoid removing invalid bytes, I propose substituting.

That's likely what was intended in the original implementation, but is certainly not the documented behavior of this function

There’s also //TRANSLIT but these all seem a bit blurry to understand from the PHP perspective. That is, WordPress doesn’t have control over them. We can guess that it was supposed to be //IGNORE, but I read the documented behavior to be exactly that. “Strip” implies removing characters, which is what //IGNORE does.

That being said, I feel like “strip” is broad enough to incorporate the substitution. It’s not a stretch to me to say that putting in the � in place of the invalid bytes is tantamount to stripping them out. Something is still there, but they aren’t.

#9 @dmsnell
4 months ago

  • Description modified (diff)

#10 @dmsnell
4 months ago

  • Component changed from Formatting to Charset

#11 @dmsnell
2 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 60793:

Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipeline.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

wp_check_invalid_utf8() has long been dependent on the runtime configuration of the system running it. This has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when requesting to strip invalid bytes it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the mbstring extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, wp_scrub_utf8() is created with a proper fallback so that the remaining logic inside of wp_check_invalid_utf8() can be minimized. The defect in this function has been fixed, but instead of stripping the invalid bytes it will replace them with the Unicode replacement character for stronger security guarantees.

Developed in https://github.com/WordPress/wordpress-develop/pull/9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

Note: See TracTickets for help on using tickets.