Opened 4 months ago
Closed 2 months ago
#63837 closed enhancement (fixed)
Update wp_check_invalid_utf8()
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 6.9 | Priority: | normal |
| Severity: | normal | Version: | 6.9 |
| Component: | Charset | Keywords: | has-patch has-unit-tests |
| Focuses: | Cc: |
Description (last modified by )
There are a few challenges with wp_check_invalid_utf8()
- Its behavior is dependent on Unicode support in the PCRE functions.
- PCRE Unicode support has changed across versions, with older versions allowing invalid UTF-8.
- It returns
falseif$strip = trueis requested. - When a system lacks support there’s zero fallback.
- It assumes that input strings are encoded with
blog_charset.
The last point is inherent to how the function works, but the other points can be updated by relying on the newer wp_is_valid_utf8() and by providing a custom fallback method to strip out invalid byte sequences.
Improving clarity around this function involves removing the blurred line between determining if content is allegedly UTF-8 and handling it as if it is. A new function, wp_scrub_utf8(), can be used to produce a valid UTF-8 string formed by replacing invalid sequences of bytes with the Unicode replacement character U+FFFD (�). This behaves as wp_check_invalid_bytes( $s, true ) should work if the blog_charset is UTF-8.
Ideally, data should be transformed from some encoding into UTF-8 as soon as it enters WordPress and then code which previously called wp_check_invalid_utf8() can now call wp_scrub_utf8() if it truly needs to remove invalid bytes.
Related
- Resolves #43224
Change History (11)
This ticket was mentioned in PR #9498 on WordPress/wordpress-develop by @dmsnell.
4 months ago
#1
- Keywords has-patch has-unit-tests added
#3
@
4 months ago
I don't know if this is something worth worrying about, but the new tests are causing about a hundred test failures on Ubuntu 22.04 LTS.
It appears that in the version of PHP being used (8.1.2, which is very old), mb_scrub() does not perform maximal subpart replacement. (It still performs replacement, just not "maximal" replacement.) This appears to be fixed in later 8.1.x versions.
A simple program to demonstrate the issue:
<?php echo PHP_VERSION . "\n"; // https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G67519 $original = "\xC0\xAF\xE0\x80\xBF\xF0\x81\x82\x41"; $scrubbed = mb_scrub( $original, 'UTF-8' ); var_dump( $scrubbed );
Running this on PHP 8.1.2 and PHP 8.1.33, this is what I get:
$ ~/php-8.1.2/bin/php mb_scrub.php 8.1.2 string(7) "??????A" $ ~/php-8.1.33/bin/php mb_scrub.php 8.1.33 string(9) "????????A"
Again, I'm not sure if this is actually something to be concerned about - the reason I bring it up is that 8.1.2 is Ubuntu's LTS version so it is still being used.
#4
follow-up:
↓ 5
@
4 months ago
This is great testing @siliconforks — thank you! I’ve pushed one update which falls to the failsafe if running on a PHP version earlier than 8.1.6 where the change was introduced.
My first thought was to change the tests to only validate that substitution on the newer versions, but considered the value of having universal behavior to be greater.
Ultimately this substitution isn’t super important, but it’s nice to have where possible.
Where did you see the tests fail? I’m only seeing passing tests but am glad someone is looking out for version differences.
#5
in reply to:
↑ 4
@
4 months ago
Replying to dmsnell:
Where did you see the tests fail? I’m only seeing passing tests but am glad someone is looking out for version differences.
I noticed it while running the tests on my own Ubuntu 22.04 machine, using the version of PHP provided by the OS. (I don't think any of the CI tests use this configuration.)
#6
@
4 months ago
It returns false if $strip = true is requested.
The strip behavior is very interesting and the proposed change in behavior is important. It's tempting to rename the parameter to something like $replace or $substitute, but that's a potential breaking change with PHP8 named parameters.
The unicode standard is clear on the practice of U+FFFD substitution and notes that ignoring "bad" bytes represents a security risk:
If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.
The mentioned technical report gives some examples, but also clearly states:
If characters are to be substituted for ill-formed subsequences, it is important that those characters be relatively safe.
- Deletion (substituting the empty string) can be quite nasty, because it joins characters that would have been separate…
- Substituting characters that are valid syntax for constructs such as file names has similar problems. For example, the '.' can be very problematic.
- U+FFFD is usually unproblematic, because it is designed expressly for this kind of purpose. That is, because it does not have syntactic meaning in programming languages or structured data, it will typically just cause a failure in parsing.
- Where U+FFFD is not available, a common alternative is "?". While this character may occur syntactically, it appears to be less subject to attack than most others.
I was reviewing the proposed changes to understand the difference and discovered this myself. The existing strip implementation relies on iconv in this way:
<?php var_dump( iconv( 'utf-8', 'utf-8', ".\xC0." ), // C0 is never valid. iconv( 'utf-8', 'utf-8', ".\xE2\x8C." ), // Missing A3 at end. iconv( 'utf-8', 'utf-8', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately. iconv( 'utf-8', 'utf-8', ".\xC1\xBF." ), // Overlong sequence. iconv( 'utf-8', 'utf-8', ".\xED\xA0\x80." ) // Surrogate half. );
Each of those generates a notice and returns false because the conversion fails. I reviewed iconv documentation to confirm this behavior and noticed the following:
If the string IGNORE is appended, characters that cannot be represented in the target charset are silently discarded.
That's likely what was intended in the original implementation, but is certainly not the documented behavior of this function. Each of these returns the string ".." with no warning:
<?php var_dump( iconv( 'utf-8', 'utf-8//IGNORE', ".\xC0." ), // C0 is never valid. iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C." ), // Missing A3 at end. iconv( 'utf-8', 'utf-8//IGNORE', ".\xE2\x8C\xE2\x8C." ), // Maximal subparts replaced separately. iconv( 'utf-8', 'utf-8//IGNORE', ".\xC1\xBF." ), // Overlong sequence. iconv( 'utf-8', 'utf-8//IGNORE', ".\xED\xA0\x80." ) // Surrogate half. );
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
4 months ago
#8
@
4 months ago
It’s rare for me to go on a hike and not come back with interesting ideas about how to proceed on these long problems. To that end, I am sorry to report that I made this issue much more expansive, bringing back ideas from WordPress/wordpress-develop#6883. I probably need to create a new Ticket for this, but all of the work revolves around one remaining issue that was bothering me about the proposed PR: can we avoid maintaining two or more UTF-8 decoders?
I would like to split off the work I have done to stage a broader update for 6.9.0.
@siliconforks the code in the proposed PR and all updated versions of it account for that change. thanks again!
The strip behavior is very interesting and the proposed change in behavior is important.
I’m not sure on this @jonsurrell but torn myself. It’s not a proposed change in behavior, just a bug-fix for a defect that’s been in the code for 16 years. I did look for tickets reporting this but couldn’t find any, and I suspect that the bug has been noticed, but hard to report because it’s not obvious where it appears. The failing behavior is to return an empty string, and this is the same behavior when a string is deemed “unsafe” by other means.
What this means is that this specific bug is masked by other intended rejections where the function is called. Fixing this bug might actually solve a number of unreported issues. The function docs clearly indicate that invalid bytes are to be “stripped” but doesn’t indicate how. Regardless, there’s a clear statement that only the invalid bytes are to be stripped, not the entire string.
Given that we should avoid removing invalid bytes, I propose substituting.
That's likely what was intended in the original implementation, but is certainly not the documented behavior of this function
There’s also //TRANSLIT but these all seem a bit blurry to understand from the PHP perspective. That is, WordPress doesn’t have control over them. We can guess that it was supposed to be //IGNORE, but I read the documented behavior to be exactly that. “Strip” implies removing characters, which is what //IGNORE does.
That being said, I feel like “strip” is broad enough to incorporate the substitution. It’s not a stretch to me to say that putting in the � in place of the invalid bytes is tantamount to stripping them out. Something is still there, but they aren’t.
Trac Ticket: Core-63837
wp_check_invalid_utf8()has been dependant on the runtime configuration of the system running it. For systems without the necessary Unicode support it would simply not perform any validation of the input. Worse, the use oficon()is not only system-dependant, but it also fails to perform substitution when requested, returningfalseinstead of the substituted string as described in the function docblock.In this patch a newer spec-compliant fallback is provided while the function itself defers to
mb_scrub(). Both of these implementations are spec-compliant (to UTF-8) and perform “Maximal subpart replacement” when encountering invalid byte sequences — an important aspect to maintain secure and quality interop between systems.