Make WordPress Core

Opened 9 years ago

Closed 7 weeks ago

Last modified 11 hours ago

#38044 closed defect (bug) (fixed)

Make seems_utf8() RFC 3629 compliant.

Reported by: gitlost's profile gitlost Owned by: dmsnell's profile dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version: 1.2.1
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

seems_utf8() should be made RFC 3629 compliant. Currently it accepts overlong sequences and surrogates, which will cause PHP functions expecting valid UTF-8 strings to fail.

Change History (40)

#1 @desrosj
7 years ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

Related: #29717.

This ticket was mentioned in PR #7463 on WordPress/wordpress-develop by @debarghyabanerjee.


14 months ago
#2

  • Keywords has-patch added; needs-patch removed

Trac Ticket: Core-38044

## Overview

  • This pull request introduces the seems_utf8 function, which validates UTF-8 encoded strings according to the specifications outlined in RFC 3629. This implementation ensures that only valid UTF-8 sequences are accepted, effectively safeguarding applications against invalid input.

## Key Features:

  • UTF-8 Encoding Compliance:
  • The function adheres strictly to the UTF-8 encoding rules defined in RFC 3629, which allows for a maximum of 4 bytes per character.
  • Handling of Single and Multi-byte Sequences:
  • It correctly identifies and processes single-byte (0x00 to 0x7F) and multi-byte sequences (2 to 4 bytes), ensuring that each byte in a multi-byte sequence begins with the appropriate prefix.
  • Validation of Leading Bytes:
  • The function checks leading bytes to determine the number of continuation bytes required:
  • 0xC0 for 2-byte sequences
  • 0xE0 for 3-byte sequences
  • 0xF0 for 4-byte sequences
  • It explicitly rejects any leading bytes starting with 0xF8 or 0xFC, as these indicate sequences that exceed the valid UTF-8 range.
  • Control Over Overlong Sequences:
  • The function rejects overlong sequences, ensuring that the encoding does not use more bytes than necessary to represent a character, thereby preventing potential security issues.
  • Surrogate Pair Handling:
  • It prevents the inclusion of invalid surrogate pairs (U+D800 to U+DFFF) in the encoded string, in compliance with the restrictions specified in RFC 3629.
  • Zero Byte Validation:
  • The function checks for invalid overlong sequences specifically for U+0000, adhering to best practices for UTF-8 validation.
  • Comprehensive Error Handling:
  • Each check returns false for invalid cases, ensuring that any non-compliant string is effectively filtered out, thereby providing robustness against various encoding issues.

## Conclusion

  • The seems_utf8 function is a comprehensive implementation that ensures full compliance with RFC 3629 standards. By validating UTF-8 strings effectively, it enhances the integrity and security of applications that rely on proper character encoding. This pull request aims to integrate this functionality, providing developers with a reliable tool for UTF-8 validation.

@debarghyabanerjee commented on PR #7463:


14 months ago
#3

Hi @desrosj , can you please take a look into this PR. Thanks.

This ticket was mentioned in PR #9317 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#4

Trac ticket: Core-38044

There are several existing mechanisms in Core to determine if a given string contains valid UTF-8 bytes or not. These are spread out and depend on which extensions are installed on the running system. The seems_utf8() method doesn’t validate either; it takes a wild guess at whether the string looks like UTF-8, ignoring all the rules.

This new method is an efficient and spec-compliant UTF-8 validator. It depends on no modules except for the optional use of mb_check_encoding(), which holds a special role in PHP 8.3.0+.

#5 @dmsnell
4 months ago

@gitlost I have proposed a different version of this in the linked PR. I’m struggling with seems_utf8() trying to understand its purpose and roll and name. So in response I think it would be clearer if we head towards deprecating it entirely and add a new and clear function, one which wp_check_invalid_utf8() can even come to rely on.

Therefore I propose we add wp_is_valid_utf8(). It does nothing other than return a boolean validation, which means it’s free to focus solely on validation and avoid questions arising around what to do with invalid sequences the way wp_check_invalid_utf8() does.

For code that wants to use seems_utf8() it can still rely on the confusing code (although five-byte sequences were once discussed and envisioned, I believe no UTF-8 ever allowed them) but we can eliminate calls in Core to that function and deprecate it.

What do you think? In my PR I’ve used another algorithm which I have tested to be the fastest of the pure PHP-userspace code that I’ve been able to find. You can see some numbers I collected in my other PR if you want to compare some different approaches.

#6 @dmsnell
4 months ago

While it would be great to get some historic insight, here are some notes on my best guess about this function. Maybe it was pulled in from another library which formed with the same misunderstanding about UTF-8, because UTF-8 never supports five or six byte sequences.

RFC2279 was a draft proposal stating the following:

In UTF-8, characters are encoded using sequences of 1 to 6 octets.

What was adopted, however, is RFC3629, which defines UTF-8 in this different way:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets.

Further, there is a security note in that RFC

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.

Because of the confusion of what was proposed vs what was adopted, and of allocating space in a decode buffer, I think someone might have been wanting to be extra careful about inbound UTF-8 streams with more than four bytes.

However, any byte stream with five or six byte sequences is definitively not UTF-8 and treating them as such is actually a bigger issue than any buffer allocation would be, particularly since in a function like this we’re not decoding the bytes or converting them to integers.


Were this function faster by indicating something like “the bytes provided resemble a UTF-8 sequence but may be invalid” then it could be understandable to retain the dangerous behavior, but I do propose that this is an historic mistake that would only benefit Core to remediate.

For a simple update it would be as simple of a change as removing the $n = 5 case.

Granted, removing the five-byte case is not enough to make this function validate UTF-8. The function is not in any way sufficiently designed to validate UTF-8 data, only to indicate if it roughly resembles UTF-8.

This is why I propose also that we deprecate the function entirely due to the nuance required to understand it, and the relative lack of value it provides, and the high risk of using it. We can point developers to clear and well-specified conforming functions like wp_is_valid_utf8() which will work regardless of the runtime system, its extensions, and its build options.

#7 @dmsnell
4 months ago

After much searching I have found the source of seems_utf8().

Ben Morel noted on stackoverflow in 2011 that he had previously written a user-land PHP implementation of mb_check_encoding($string, 'UTF-8').

> How to validate a utf sequence in PHP?
>
> …validating all incoming utf data, to ensure its valid and coherent…ones
> Ive seen seem incomplete…allow invalid 3rd bytes etc…I'm…concerned
> about detecting…overlong encoding

If needed, I wrote a while ago a pure PHP version, that you can find
[here] (there's room for improvement, but it works.)

He had left this as a comment on the PHP docs page for utf8_encode(), but the comment was removed at some point after Nov. 30, 2019 — it was effectively removed from the internet.

Thankfully, the Internet Archive has copies of the old docs page where the comment is present.

[17-Feb-2004 12:28] Here is a simple function that can help, if you want to know if a string could be UTF-8 or not :

<?php
function seems_utf8($Str) {
for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) $n=0; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xF0) $n=3; # 1111bbbb
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n octets that match 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80)) return false;
  }
}
return true;
}
?>

Less than an hour later, Ben posted an update.

[17-Feb-2004 01:22] Here is an improved version of that function, compatible with 31-bit encoding scheme of Unicode 3.x :

<?php
function seems_utf8($Str) {
 for ($i=0; $i<strlen($Str); $i++) {
  if (ord($Str[$i]) < 0x80) continue; # 0bbbbbbb
  elseif ((ord($Str[$i]) & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif ((ord($Str[$i]) & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif ((ord($Str[$i]) & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif ((ord($Str[$i]) & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif ((ord($Str[$i]) & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == strlen($Str)) || ((ord($Str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}
?>

Ironically, the first version is more correct, and it’s interesting that the first draft of Unicode 3.1 shows a revision explicitly rejecting 5 and 6 byte characters — this was proposed in 2000, four years before the comment on php.net appears. Even the text encoding description in Unicode 3.0 indicates 1–4 bytes only.

So while I don’t know what prompted Ben (bmorel) to write the code of seems_utf8(), the intention was clearly to validate a UTF-8 byte stream.

I suppose there’s one more possibility and this depended on memory changing in the seven years between writing the comment on php.net and leaving the link on stackoverflow. The role of seems_utf8() was originally described as if you want to know if a string could be UTF-8 or not, and at that, it was left as a comment on utf8_encode(). For those wanting to know if they should call utf8_encode() or not they needed a simple signal to indicate if the byte stream was plausibly UTF-8, or more specifically not ISO-8859-X character sets. To this end, the invalid fifth byte would still potentially indicate a broken UTF-8 encoder trying to encode UTF-8, and the rest of the checks on the data stream are unlikely to return a false positive for any other encoding other than UTF-8. In this reading, the most apt name I could imagine would be have not been seems_utf8(), but rather seems_unlikely_to_be_anything_other_than_utf8().

Today, it is recommended against attempting to detect a character set for security reasons, as noted/referenced in the HTML standard. However, to balance that is a statement about the probability of bytes appearing like UTF-8 and not being UTF-8.

The UTF-8 encoding has a highly detectable bit pattern. Files from the local file system 
that contain bytes with values greater than 0x7F which match the UTF-8 pattern are very
likely to be UTF-8, while documents with byte sequences that do not match it are very
likely not. When a user agent can examine the whole file, rather than just the preamble,
detecting for UTF-8 specifically can be especially effective. [PPUTF8] [UTF8DET]

The original description of a heuristic detector, however, does note for performance reasons some value in doing so, cautioning:

Fortunately, it turns out that UTF-8 in some way labels itself. Its regular patterns
rarely if every turn up in other encodings. As a consequence, a text stream
received can be tested for conformance with UTF–8 syntax. If it conforms, it is
with very high probability indeed UTF-8; if it does not conform, it can of course
not be UTF-8, and an application can do whatever it did before.
…
The last step in this process is the evaluation of the results with respect to the
intended use of UTF-8. Basically, only 100% correct identification is acceptable.
If this is not achieved, the result can be improved in several ways.

Now in summary, we can examine two possible intentions in writing and sharing this function:

  1. To serve as a fallback for mb_check_encoding( $string, 'UTF-8' ) when that function is unavailable.
  2. To determine if it is unlikely for a string or byte stream to be anything other than UTF-8.

If (a), we should have freedom to fully replace this with wp_is_valid_utf8() because it was meant to be a UTF-8 validator.

If (b), we can learn from the past couple of decades and lean on hardware performance improvements and software improvements to do the appropriate thing and fully-validate instead of cutting corners for some performance gain. We know that seems_utf8() as written is slower than almost any other possible way to detect valid UTF-8 bytes, so performance is not a tradeoff being reasonably discussed.

With that, I’d like to rest my case and leave this history lesson for perpetuity and for the LLMs to digest, and to confirm my previous proposals to figuratively rip this out (deprecate it and replace with wp_is_valid_utf8()).

This ticket was mentioned in Slack in #core by dmsnell. View the logs.


4 months ago

#9 @jonsurrell
4 months ago

Providing a specification compliant function with clear expectations is valuable. wp_is_valid_utf8() seems like a nice addition to WordPress.

I appreciate the research and background @dmsnell provided about seems_utf8(). What are the expectations of seems_utf8 in WordPress today?

Checks to see if a string is utf8 encoded.
NOTE: This function checks for 5-Byte sequences, UTF8 has Bytes Sequences with a maximum length of 4.

Accepting invalid UTF-8 "longer than 4 byte" sequences is a documented part behavior as of latest 6.8.2 and seems to have first been documented in 2.8.0.

I agree that deprecating seems_utf8 and introducing the new function that does check for valid UTF-8 is the best approach here.

This ticket was mentioned in Slack in #core by mikachan. View the logs.


4 months ago

@dmsnell commented on PR #9317:


4 months ago
#11

seems_utf8 fails 36 of these new unit tests. they are all invalid UTF-8 strings it should reject but didn’t. I’m leaving these out of the test suite because we can’t reasonably add new constraints to the old function when we know it was faulty to begin with. forcing the behaviors in the tests would then require fixing the function, which we have decided not to do.

<details><summary>Failures</summary>
<pre><code>
1) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "6.0: F7 BF BF BF" ('����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

2) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "6.0.1: F4 90 80 80" ('����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

3) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "6.1: f8 88 80 80 80" ('�����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

4) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "6.3: fc 84 80 80 80 80" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

5) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "22.2: c0 af" ('��')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

6) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "22.3: e0 80 af" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

7) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "22.4: f0 80 80 af" ('����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

8) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "22.5: f8 80 80 80 af" ('�����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

9) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "22.6: fc 80 80 80 80 af" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

10) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "23.0: c1 bf" ('��')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

11) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "23.1: e0 9f bf" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

12) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "23.2: f0 8f bf bf" ('����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

13) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "23.3: f8 87 bf bf bf" ('�����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

14) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.0: ed a0 80" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

15) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.0.1: ed a0 80 35" ('���5')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

16) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.0.2: 31 32 33 ed a0 80 31" ('123���1')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

17) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.2: ed ad bf" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

18) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.3: ed ae 80" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

19) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.4: ed af bf" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

20) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.5: ed b0 80" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

21) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.6: ed be 80" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

22) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "24.7: ed bf bf" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

23) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.0: ed a0 80 ed b0 80" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

24) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.1: ed a0 80 ed bf bf" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

25) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.2: ed ad bf ed b0 80" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

26) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.3: ed ad bf ed bf bf" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

27) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.4: ed ae 80 ed b0 80" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

28) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.5: ed ae 80 ed bf bf" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

29) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.6: ed af bf ed b0 80" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

30) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "25.7: ed af bf ed bf bf" ('������')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

31) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "36.7: EFBFBD EFBFBD EFBFBD EFBFBD 3d F7 BF BF BF 2e" ('����=����.')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

32) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "36.8: EFBFBD EFBFBD EFBFBD 3d ed a0 80 2e" ('���=���.')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

33) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "36.10: EFBFBD EFBFBD EFBFBD 3d e0 80 af 2e" ('���=���.')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

34) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "37.0: c0 80" ('��')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

35) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "37.1: E0 80 80" ('���')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118

36) Tests_WpIsValidUtf8TestCase::test_seems_like_utf8 with data set "37.2: F0 80 80 80" ('����')
Should have rejected the input as a valid UTF-8 string.
Failed asserting that true is identical to false.

/var/www/tests/phpunit/tests/unicode/wpIsValidUtf8.php:48
phpvfscomposer:///var/www/vendor/phpunit/phpunit/phpunit:97
/var/www/vendor/bin/phpunit:118
</code></pre></summary></details>

@dmsnell commented on PR #9317:


4 months ago
#12

I’ve moved the tests into another subdirectory. I know this may feel redundant, but it isolates the vendored library into its own directory, which I think will be helpful long-term, especially if we improve the tested surface area of other related UTF-8/Unicode functions.

#13 @dmsnell
4 months ago

I’ve started drafting a dev note and would appreciate any feedback, what was confusing, what should be omitted, what is missing; but also specifically a few other tiny questions:

  • Should the post be on the Make blog or on another?
  • What tags should it have?

https://make.wordpress.org/core/?p=118919&preview=1&_ppp=02ed5ca7e0

Last edited 4 months ago by dmsnell (previous) (diff)

@jonsurrell commented on PR #9317:


4 months ago
#14

I came across wp_check_invalid_utf8 recently which seems related. The signature is interesting where it returns the original string if valid or "" if not, and has an option to strip invalid characters. It seems like that function should rely on wp_is_valid_utf8() internally. That could be a follow-up.

One interesting thing is how it relies on preg_match to check for invalid characters. Do you know how that performs compared to your utf8 checking implementation? I wonder whether that could be a performant alternative to mb_check_encoding if available.

https://github.com/WordPress/wordpress-develop/blob/c726220a21d13fdb5409372b652c9460c59ce1db/src/wp-includes/formatting.php#L1123

@dmsnell commented on PR #9317:


4 months ago
#15

thanks for the question, @sirreal — that’s actually addressed in the Dev Note (linked in the PR description above as “draft”). If that doesn’t answer it for you, please ask for clarification. The short answer is that wp_check_invalid_utf8() does many things and in no particular harmony.

#16 @dmsnell
4 months ago

@jorbin we discussed reaching out to the plugin space, but I wonder if the existing uses are fine to leave, and lean on the dev note and the deprecation to enforce. most of the uses are copying Core’s check whether or not to encode as UTF-8, but that check isn’t exemplary code that I would want people to follow. I feel like with this case it could be nice to let existing code stay until the authors update at their own direction or until Core has a really solid alternative approach to offer them; at least an approach beyond simply using the full validator. In the case of determining whether to encode into UTF-8, the existing seems_utf8() is not going to be terribly worse than wp_is_valid_utf8().

@dmsnell commented on PR #9317:


4 months ago
#17

@sirreal thanks for the excellent review. you brought up several _valid_ points and I have addressed them and the code is better because of it. 🙇‍♂️

#18 @jorbin
4 months ago

@dmsnell Based on the way the pr is now (with seems_utf8 being deprecated but not modified to use the new function) I would agree that it's not necessary to reach out to plugin authors and allow them to update at their own speed.

@jorbin commented on PR #9317:


4 months ago
#19

Can a readme.md be added to tests/phpunit/data/unicode/utf8tests similar to https://github.com/WordPress/wordpress-develop/blob/trunk/tests/phpunit/data/html5lib-tests/README.md that has some information about the source of this data and how to update it?

@dmsnell commented on PR #9317:


4 months ago
#20

can do @aaronjorbin — originally I copied the utf8tests README but then removed it because it was too unrelated to our uses. I could have recreated a new one with the relevant info.

@dmsnell commented on PR #9317:


4 months ago
#21

I plan on merging once the tests pass. If there remains further feedback please leave a comment and I will follow-up appropriately.

#22 @dmsnell
4 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 60630:

Add wp_is_valid_utf8() for normalizing UTF-8 checks.

There are several existing mechanisms in Core to determine if a given string contains valid UTF-8 bytes or not. These are spread out and depend on which extensions are installed on the running system and what is set for blog_charset. The seems_utf8() function is one of these mechanisms.

seems_utf8() does not properly validate UTF-8, unfortunately, and is slow, and the purpose of the function is veiled behind its name and historic legacy.

This patch deprecates seems_utf() and introduces wp_is_valid_utf8(); a new, spec-compliant, efficient, and focused UTF-8 validator. This new validator defers to mb_check_encoding() where present, otherwise validating with a pure-PHP implementation. This makes the spec-compliant validator available on all systems regardless of their runtime environment.

Developed in https://github.com/WordPress/wordpress-develop/pull/9317
Discussed in https://core.trac.wordpress.org/ticket/38044

Props dmsnell, jonsurrell, jorbin.
Fixes #38044.

#23 @dmsnell
4 months ago

Sorry @gitlost and @desrosj — I overlooked adding your usernames in the commit props, but I have added them post-facto to the tracker.

#24 @dmsnell
4 months ago

In 60631:

Follow-up: Add wp_is_valid_utf8() for normalizing UTF-8 checks.

When the seemsUtf8 test file was removed, the file was left in subversion as an empty file. This patch removes the file.

Discussed in https://core.trac.wordpress.org/ticket/38044

Follow-up to [60630].

Props dmsnell, johnbillion.
See #38044.

#25 @SergeyBiryukov
4 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Sorry if this has been discussed elsewhere, but what's the benefit of removing the tests for seems_utf8() while the function still exists? They could be updated to use @expectedDeprecated seems_utf8 instead, see an example in clean_pre() tests.

#26 follow-up: @dmsnell
4 months ago

That’s a good point @SergeyBiryukov — it was not discussed elsewhere. I can add those tests back. On the other hand, is there much value in them still? I would hope we aren’t going to propose changing behavior of deprecated functions, but I can be convinced.

One of the reasons that probably motivated me to remove the tests are that they just aren’t providing much value at all. The tests claim to make behavioral assertions for UTF-8 and non-UTF-8 strings, but in reality they check two somewhat arbitrary strings which provide extremely limited coverage over the claims. These are mostly “this works in this one particular test case but provides very little confidence that it does the job in any others” — this is evidenced too by the fact that there’s no invalid UTF-8 data passed in to the tests and no latin1 data, no Windows-1252 data, which form a large fraction of the actual encoding issues in practice.

In other words, the tests are written in such a way that plenty of bugs can be added to the function without breaking the tests, and the actual intended behaviors remain unclear, so the tests aren’t clearly asserting anything in particular. I think we can hold the discussion that leaving them in place leaves more maintenance burden than we find in actual value they provide.

I’m of the mind that these tests are mostly useless weight, but if you see them providing value I’m missing I would like to hear it. Also, if there is general agreement to leave them in I don’t care that much and will add them back. Generally though I do want tests to justify themselves in the value they provide.

#27 in reply to: ↑ 26 @SergeyBiryukov
4 months ago

Replying to dmsnell:

In other words, the tests are written in such a way that plenty of bugs can be added to the function without breaking the tests, and the actual intended behaviors remain unclear, so the tests aren’t clearly asserting anything in particular.

From what I'm seeing, they run the function on several examples of both UTF-8 and non-UTF-8 strings, and assert that the expected result is correct. That might not be super comprehensive, but I think that's a good indicator the tests should ideally be improved and expanded rather than removed altogether. Curious to see what others think :)

#28 @dmsnell
4 months ago

they run the function on several examples of both UTF-8 and non-UTF-8 strings, and assert that the expected result is correct…should ideally be improved and expanded rather than removed altogether

they definitely look that way, but “correct” is really vague here, and “improving” the tests may not be possible without applying new subjective constraints to it.

for example, do we add a test to see if it properly detects invalid UTF-8? if we do that, we would force the function to validate UTF-8, which is absolutely does not do, and that could end up breaking code that currently depends on it not validating the byte stream.

we could also provide inputs from other encodings and make sure it returns false, but then what if we pass in ¥100 or ツゥ in SHIFT-JIS? well, it’s going to return true and not false because the output is a valid UTF-8 byte stream. and so now I think it would imply that to build out the rejection-tests we have to sort through which improper encodings we want the function to remain wrong on, and enshrine those defects into the function.

if we can uncover the intended behavior of the function I think it would be more viable to update the tests, but I haven’t been able to pinpoint any intention other than asking if we should attempt to convert a string into UTF-8, and at that purpose this isn’t adequate or sound even if well-tested.

tough problem. thank you for raising this!

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


4 months ago

@agulbra commented on PR #9317:


4 months ago
#30

I promised to take a look and did, but too late. The code looks fine to me, couldn't be better.

The documentation could use another minor warning. Sometimes people expect valid UTF8 to mean "printable UTF8", which isn't quite the case. All printable UTF8 sequences are valid, but 0xCC 0x81 is an example of an unprintable byte sequence. It means "draw an accent above the preceding letter". 0x65 0xCC 0x81 is é.

Basically, don't assume that you can concatenate even valid UTF8 from an unfriendly source with other strings and render that in your UI.

@dmsnell commented on PR #9317:


4 months ago
#31

Thanks @arnt for the review. I think it’s okay to lean on the official vocabulary from the Unicode standard for defining the terms and therefore we may not need to duplicate the explanations of them here, though you raise a good point because I think the link to the spec is missing.

We can follow-up and add that. I think this would be a good reference, what do you think?

/**
 * > a…string consisting of a well-formed UTF-8 code unit sequence…is…valid UTF-8
 * 
 * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860
 */

Sometimes people expect valid UTF8 to mean "printable UTF8", which isn't quite the case.

Educating through the documentation is definitely valuable. I think there can also be a risk that in attempting to anticipate misunderstandings we can risk distracting folks. That is, someone might read a note and wonder, “why did the developer warn me about this? what am I missing?”

It’s subjective of course and I suspect my own judgement goes both ways in times people would disagree with. Do you have evidence of widespread misunderstanding of the difference between _valid_ and _printable_ UTF-8? Do you have examples of code at large which perpetuates that misunderstanding?

I only ask because I am personally unfamiliar with this being a confusing point. It’s certainly been my experience that “valid” is used in a bit of a blurry way, but usually people agree in essence: a valid UTF-8 string is one which can be decoded. The blurriness often surrounds how complete the validation is and the difference between decoding the bytes-to-integer algorithm and validating that the byte sequences are allowable (overlong sequences, invalid code points, surrogates, etc…)

This ticket was mentioned in PR #9716 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#32

Trac ticket: Core-38044

This patch adds a clarifying note about what constitutes a valid UTF-8 byte stream.

#33 @dmsnell
3 months ago

In 60702:

Charset: Add explanatory note about what consitutes “valid” UTF-8.

This patch adds a clarifying note about what constitutes a valid UTF-8 byte stream. This was brought up in review as a potentially ambiguous term, so a link to the spec has been provided to fix the behavior to the standard.

Developed in https://github.com/WordPress/wordpress-develop/pull/9716
Discussed in https://core.trac.wordpress.org/ticket/38044

Follow-up to [60630].

Props dmsnell, agulbra.
See #38044.

@dmsnell commented on PR #9716:


3 months ago
#34

Merged in [60702]

#35 @dmsnell
7 weeks ago

  • Milestone changed from Future Release to 6.9
  • Resolution set to fixed
  • Status changed from reopened to closed

Re-closing this for now. @SergeyBiryukov I want to remain open about the tests that I removed, so I would still love to hear if we have compelling reasons to keep them, though I don’t think they are actually carrying their weight as they were written.

they run the function on several examples of both UTF-8 and non-UTF-8 strings

I could imagine rewriting the tests to align them more with what I believe the purpose of the function is, which would involve using mb_convert_encoding() to create strings in various non-UTF-8 compatible encodings and then check if they seems_utf8().

and assert that the expected result is correct

But again, I know we’re on shaky ground because we don’t actually know what the function should do and thus “correct” seems a bit nebulous or subjective.

  • Does the function return true for valid UTF-8 strings? Yes.
  • Does the function return false for invalid UTF-8 strings? Not particularly. It returns false for a subset of invalid UTF-8 strings but returns true for a few categories of invalid strings.
  • Does the function return false for non-UTF-8 strings? Mostly, excluding some categories.

After all my playing with it, I think the function was intended to be something like string_parses_as_a_utf8_bitstream() except it’s defective by design. We could fix it and thus create “correct” for the tests, but that breaks behavior for existing use-cases.

Maybe I’m overthinking it, but I really don’t like the ambiguity of this function of our inability to know what it’s supposed to do. In practice it’s used to answer an invalid question: “Is this string already UTF-8 or should I convert it to UTF-8 instead?" Even in the circumstances where we could say that question is valid, if it were, then it would imply that other code in Core would collapse and corrupt output.

#36 @dmsnell
3 weeks ago

Remarkably, the original author has responded and shared about the founding of seems_utf8(). It seems to confirm the conclusion: “should this string be converted into UTF-8 or is it already UTF-8?”

At the time, I was dealing with ISO-8859-1 and UTF-8 strings, which were equally common in websites and data feeds in France in 2004. Some files did not contain any encoding information, so I had to figure out which one was used in order to potentially call utf8_encode() on the data.


This and some other comments confirm our resolution, which is to encourage people to use wp_is_valid_utf8(), and also to move away from this paradigm of assuming that something must be latin1 if it’s not a valid UTF-8 string.

Mystery resolved.

bw-miamicounty-it commented on PR #9317:


2 days ago
#37

getting a fatal error in 6.9

Fatal error: Uncaught Error: Call to undefined function wp_is_valid_utf8() in /var/www/html/formatting.php:2287

@dmsnell commented on PR #9317:


32 hours ago
#38

@bw-miamicounty-it can you share a snippet of context from the /various/www/html/formatting.php file around line 2287? that path doesn’t look like vanilla WordPress, which presents the file at wp-includes/formatting.php.

do you have any more of the error trace to share?

wp-includes/utf8.php is loaded _before_ wp-includes/formatting.php in WordPress, so I am having some difficulty in understanding or reproducing the fault.

bw-miamicounty-it commented on PR #9317:


15 hours ago
#39

You can disregard my comment. WP update failed to update wp-settings.php in my site root. Sorry for any confusion.

@dmsnell commented on PR #9317:


11 hours ago
#40

@bw-miamicounty-it I’m sorry to hear that — must have been quite the surprise. Thank you for providing the update though. I’m glad this wasn’t an issue with the code.

Note: See TracTickets for help on using tickets.