Opened 11 years ago
Last modified 10 months ago
#29717 assigned enhancement
wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | |
Component: | Formatting | Keywords: | has-patch dev-feedback has-unit-tests |
Focuses: | performance | Cc: |
Description
Used in core in these 4 functions.
- esc_attr()
- esc_js()
- esc_html()
- sanitize_text_field()
It's the first function to execute for all 4, and especially for sanitize_text_field it gets called quite a bit and is pretty important.
Its purpose is to check a string for invalid UTF-8. It uses preg_match with the '/u' modifier, which makes PCRE treat both the pattern and the subject as UTF-8. PCRE automatically checks both for invalid UTF-8 and, if it finds any, exits with an error code/constant.
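To illustrate that mechanism, here is a minimal sketch (not the patch itself). It assumes a PCRE build with UTF-8 support, and the helper name is_valid_utf8_via_pcre is made up for the example. An empty pattern with the u modifier means the only real work is PCRE's own validity check of the subject.

```php
<?php
// Hypothetical helper; assumes PCRE was built with UTF-8 support.
function is_valid_utf8_via_pcre( $string ) {
    // The empty pattern matches anything, so a failure can only come
    // from PCRE rejecting the subject as invalid UTF-8.
    $result = @preg_match( '//u', (string) $string );

    // 1 means the empty pattern matched a valid subject.
    return 1 === $result;
}

var_dump( is_valid_utf8_via_pcre( "\xc3\xb1" ) ); // valid 2 octet sequence
var_dump( is_valid_utf8_via_pcre( "\xc3\x28" ) ); // invalid 2 octet sequence

// After a failed check, preg_last_error() reports why the call failed.
var_dump( preg_last_error() === PREG_BAD_UTF8_ERROR );
```

Comparing against 1 rather than checking for false keeps the sketch correct on PHP versions where an invalid subject simply "matches nothing".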
The changes here: PCRE is normally compiled with UTF support, but it can also be built without it, or built so that UTF support is disallowed. If UTF support is compiled in and enabled, the '/u' modifier for preg_match is available, which turns on the automatic UTF validation.
For setups where the '/u' modifier is unavailable, there is a trick that enables the same functionality from inside the pattern. Per the PCRE documentation quoted here, this still requires PCRE to have been built with UTF-8 support.
http://www.pcre.org/pcre.txt
In order process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
UTF-8 strings instead of strings of 1-byte characters.
So the first change to this function was to allow a fallback to that pattern option trick in case '/u' wasn't supported.
@preg_match( '//u', '' ) !== false
@preg_match( '/(*UTF8)/', '' ) !== false
- Fallback to a regex that doesn't require UTF support; instead of relying on PCRE's UTF validation, it searches for invalid byte sequences itself.
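The selection between those three options could look roughly like this. It is a sketch under assumptions, not the code from the attached patches: the function name and the mode labels are hypothetical, and the probe results are cached in a static variable so the two preg_match probes run at most once per request.

```php
<?php
// Hypothetical helper: decide once per request which check is usable.
function utf8_check_mode() {
    static $mode = null;

    if ( null === $mode ) {
        if ( @preg_match( '//u', '' ) !== false ) {
            $mode = 'modifier'; // the /u modifier works
        } elseif ( @preg_match( '/(*UTF8)/', '' ) !== false ) {
            $mode = 'pattern';  // only the in-pattern (*UTF8) option works
        } else {
            $mode = 'regex';    // no UTF support in PCRE: use a byte-level regex
        }
    }

    return $mode;
}

echo utf8_check_mode(), "\n";
```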
I also wanted it to have better performance, especially due to its use in those 4 core functions I use often. I benchmarked it pretty thoroughly to try and gain more speed. This patch is about 10-20% faster.
Many gains were from refactoring the logic and control structures, chaining within if statements using bools, and utilizing the static variables to the fullest. This is especially crucial since this function gets called repeatedly. I also gained some cycles by replacing an in_array() check with stripos().
One of the bigger gains came from replacing the strlen( $string ) == 0 check that ran on every call with the isset() check below. Since the $string variable has already been cast to a string, that should always work and keeps things a little cheaper.
$string = (string) $string;
// if string length is 0 (faster than strlen) return empty
if ( ! isset( $string[0] ) ) return '';
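A quick way to convince yourself the two emptiness checks agree after the cast is a sketch like the one below. The helper name is hypothetical. Note that '0' is a non-empty string here, which is why empty() would not be a safe substitute.

```php
<?php
// Hypothetical helper mirroring the isset() check from the patch.
function is_empty_string( $value ) {
    $string = (string) $value;

    // Offset 0 exists only if the string has at least one byte.
    return ! isset( $string[0] );
}

foreach ( array( '', '0', 'a', null, 0, false ) as $value ) {
    $string = (string) $value;

    // Compare the old strlen() check with the isset() check.
    var_dump( ( strlen( $string ) == 0 ) === is_empty_string( $value ) );
}
```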
The final change was to the second parameter, $strip, which if true is supposed to strip the invalid UTF-8 out of the string and return what remains. Nowhere in core is that parameter being used (yet), which explains the deprecated-looking iconv call. I also added a fallback to mb_convert_encoding in case iconv is missing.
// try to use iconv if exists
if ( function_exists( 'iconv' ) )
    return @iconv( 'utf-8', 'utf-8//ignore', $string );

// otherwise try to use mb_convert_encoding, setting the substitute_character to none to mimic strip
if ( function_exists( 'mb_convert_encoding' ) ) {
    @ini_set( 'mbstring.substitute_character', 'none' );
    return @mb_convert_encoding( $string, 'utf-8', 'utf-8' );
}
Here are some of the test strings I used, I also used the utf-8-test file at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. I did testing on 4.0 using php 5.6, 5.4, 5.3, and 5.4. I verified the output and the strip feature as well. For all tests I had php error_reporting set to the max:
ini_set( 'error_reporting', 2147483647 );
$valid_utf = array(
    "\xc3\xb1", // Valid 2 Octet Sequence
    "\xe2\x82\xa1", // Valid 3 Octet Sequence
    "\xf0\x90\x8c\xbc", // Valid 4 Octet Sequence
    "\xf8\xa1\xa1\xa1\xa1", // Valid 5 Octet Sequence (but not Unicode!)
    "\xfc\xa1\xa1\xa1\xa1\xa1", // Valid 6 Octet Sequence (but not Unicode!)
    "Iñtërnâtiônàlizætiøn\xf0\x90\x8c\xbcIñtërnâtiônàlizætiøn", // valid four octet id
    'Iñtërnâtiônàlizætiøn', // valid UTF-8 string
    "\xc3\xb1", // valid two octet id
    "Iñtërnâtiônàlizætiøn\xe2\x82\xa1Iñtërnâtiônàlizætiøn", // valid three octet id
);

$invalid_utf = array(
    "\xc3\x28", // Invalid 2 Octet Sequence
    "\xa0\xa1", // Invalid Sequence Identifier
    "\xe2\x28\xa1", // Invalid 3 Octet Sequence (in 2nd Octet)
    "\xe2\x82\x28", // Invalid 3 Octet Sequence (in 3rd Octet)
    "\xf0\x28\x8c\xbc", // Invalid 4 Octet Sequence (in 2nd Octet)
    "\xf0\x90\x28\xbc", // Invalid 4 Octet Sequence (in 3rd Octet)
    "\xf0\x28\x8c\x28", // Invalid 4 Octet Sequence (in 4th Octet)
    chr(0xE3) . chr(0x80) . chr(0x22), // Invalid malformed because 0x22 is not a valid second trailing byte following the leading byte 0xE3. http://www.unicode.org/reports/tr36/
    chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Invalid UTF-8, overlong 5 byte encoding.
    chr(0xD0) . chr(0x01), // High code-point without trailing characters.
    chr(0xC0) . chr(0x80), // Overlong encoding of code point 0
    chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 5 byte encoding
    chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 6 byte encoding
    chr(0xD0) . chr(0x01), // High code-point without trailing characters
    "Iñtërnâtiôn\xe9àlizætiøn", // invalid UTF-8 string
    "Iñtërnâtiônàlizætiøn\xfc\xa1\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid six octet sequence
    "Iñtërnâtiônàlizætiøn\xf0\x28\x8c\xbcIñtërnâtiônàlizætiøn", // invalid four octet sequence
    "Iñtërnâtiônàlizætiøn \xc3\x28 Iñtërnâtiônàlizætiøn", // invalid two octet sequence
    "this is an invalid char '\xe9' here", // invalid ASCII string
    "Iñtërnâtiônàlizætiøn\xa0\xa1Iñtërnâtiônàlizætiøn", // invalid id between two and three
    "Iñtërnâtiônàlizætiøn\xf8\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn", // invalid five octet sequence
    "Iñtërnâtiônàlizætiøn\xe2\x82\x28Iñtërnâtiônàlizætiøn", // invalid three octet sequence third
    "Iñtërnâtiônàlizætiøn\xe2\x28\xa1Iñtërnâtiônàlizætiøn", // invalid three octet sequence second
);
Notes and more info:
In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters.

UTF-8 was devised in September 1992 by Ken Thompson, guided by design criteria specified by Rob Pike, with the objective of defining a UCS transformation format usable in the Plan9 operating system in a non-disruptive manner.

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A UTF-8 string is a sequence of octets representing a sequence of UCS characters. An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
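For reference, the ABNF above translates almost line for line into a byte-level PCRE pattern that needs no UTF support at all. This is only a sketch with a hypothetical function name; it is not the regex from the patch, which looks for invalid sequences instead. One caveat to assume: on older PCRE builds without JIT, a very long subject may hit pcre.recursion_limit, in which case preg_match() fails and this sketch reports the string as invalid.

```php
<?php
// Hypothetical helper; a byte-level transcription of the RFC 3629 ABNF.
function matches_rfc3629_abnf( $string ) {
    static $pattern = null;

    if ( null === $pattern ) {
        $tail    = '[\x80-\xBF]';                  // UTF8-tail
        $pattern = '/\A(?:'
            . '[\x00-\x7F]'                        // UTF8-1
            . '|[\xC2-\xDF]' . $tail               // UTF8-2
            . '|\xE0[\xA0-\xBF]' . $tail           // UTF8-3 (four alternatives)
            . '|[\xE1-\xEC]' . $tail . '{2}'
            . '|\xED[\x80-\x9F]' . $tail
            . '|[\xEE-\xEF]' . $tail . '{2}'
            . '|\xF0[\x90-\xBF]' . $tail . '{2}'   // UTF8-4 (three alternatives)
            . '|[\xF1-\xF3]' . $tail . '{3}'
            . '|\xF4[\x80-\x8F]' . $tail . '{2}'
            . ')*+\z/';                            // possessive: no backtracking into the run
    }

    // No /u modifier: the pattern works on raw bytes.
    return 1 === preg_match( $pattern, (string) $string );
}

var_dump( matches_rfc3629_abnf( "\xe2\x82\xa1" ) ); // valid 3 octet sequence
var_dump( matches_rfc3629_abnf( "\xc0\x80" ) );     // overlong encoding of code point 0
```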
- http://www.pcre.org/pcre.txt
- http://us1.php.net/manual/en/pcre.constants.php
- http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
- http://en.wikipedia.org/wiki/Unicode
- http://unicode.org/faq/utf_bom.html
- http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
- http://tools.ietf.org/rfc/rfc3629.txt
- http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
- http://www.unicode.org/reports/tr36/
Attachments (8)
Change History (38)
#2 (follow-up: ↓ 3) @ 11 years ago
Impressive. So the main benefits are 10% faster and more compatibility? Are there any systems currently running WordPress that need this patch? A more concise, big picture description would help.
Also, I learned in feedback from the 4.0 release that we need to specifically test PHP versions less than 5.4.9 and 5.3.19, because they exhibit crashes when PCRE is used to perform certain types of alternation and backtracking. I found that version 5.2.13 is particularly easy to download. It is not necessary to add unit tests for that, but we need to see that if someone posts a 10kb or 100kb block of text that it won't suddenly crash due to a server bug.
#3 (in reply to: ↑ 2) @ 11 years ago
Replying to miqrogroove:
> Impressive. So the main benefits are 10% faster and more compatibility? Are there any systems currently running WordPress that need this patch? A more concise, big picture description would help.
> Also, I learned in feedback from the 4.0 release that we need to specifically test PHP versions less than 5.4.9 and 5.3.19, because they exhibit crashes when PCRE is used to perform certain types of alternation and backtracking. I found that version 5.2.13 is particularly easy to download. It is not necessary to add unit tests for that, but we need to see that if someone posts a 10kb or 100kb block of text that it won't suddenly crash due to a server bug.
The updates don't actually change the behaviour of this function unless:
- You have a site with an older PCRE lacking UTF-8 support, in which case those 4 functions will now correctly filter and check for invalid UTF-8.
- You use the strip parameter to actually remove invalid UTF-8 for a plugin or theme, in which case it will now work correctly. That was a bug fix.
Some folks have PCRE compiled without UTF support, or with it disabled, so for them the '/u' modifier doesn't work, which results in essentially this entire check being skipped.
This is also somewhat of a security issue, given the IDN domain issues and other UTF exploits. The big picture is to make the function easier to develop and use; it hadn't been updated for quite a while. This should make it easier to update/extend/move this function down the road. I think some people may have wrongly assumed that it was doing more than it was. It's kind of a strange function: it takes a string as input and either returns it as is or returns a blank string in case of invalid UTF-8. But that's actually really clever; it's much safer and faster that way, just not so clear.
I've noticed several plugins like Disqus and Yoast SEO have started to build their own incarnations of this function; this update should help make clear what it is and isn't.
I have tested on PHP 5.2; I approached this with extreme caution to avoid causing any issues. IOW, this function will also work on 5.0. The only reason it wouldn't work for PHP 4.x is that stripos wasn't available as a built-in function until 5.0, but I noticed it's being used in several places in core, so... (I am still used to having to code backwards for 4.x, so I'm happy that's officially over for WP.)
The big changes are the 2 new fallbacks to the original preg_match, including the custom regex, which will be the fallback for those with absolutely no UTF PCRE capability. It has to be a rarity for that to ever actually be needed, but that's the only place where I can see a possible issue with buffer or memory problems. preg_match isn't as efficient as a built-in function such as strpos, but it is pretty darn efficient.
The other big change is making the 'strip' parameter work; since it isn't actually being used anywhere in core, it seems to have been forgotten about a little. With it now working, I will start using it in plugins and themes to sanitize UTF-8 (because this is super fast). That's actually why I initially started on this.
#4 (follow-up: ↓ 5) @ 11 years ago
The checks and ini_set() in the if ( $strip ) {} block only need to be run once.
The stripos() is only run once so does not save much and is probably wrong, since it matches 'utf-16', 'utf-7', etc.
Also the inline doc in the patch claims too much, while the old one said too little. Maybe this:
* @return string If the string is valid UTF-8 or the blog_charset is not UTF-8, the string is returned unmodified. Otherwise, an empty string is returned, or optionally the string stripped of invalid chars.
#5 (in reply to: ↑ 4) @ 10 years ago
Replying to kitchin:
> The checks and ini_set() in the if ( $strip ) {} block only need to be run once.
> The stripos() is only run once so does not save much and is probably wrong, since it matches 'utf-16', 'utf-7', etc.
> Also the inline doc in the patch claims too much, while the old one said too little. Maybe this:
> * @return string If the string is valid UTF-8 or the blog_charset is not UTF-8, the string is returned unmodified. Otherwise, an empty string is returned, or optionally the string stripped of invalid chars.
Great feedback kitchin, I have just updated the patch to 29717.3.patch with all of your improvements. Please check it.
#6 (follow-ups: ↓ 7, ↓ 13) @ 10 years ago
Cool stuff. Comments:
(1) I still think the old blog_charset check is clearest. No need to confuse people into having to look up obscure docs. Old code:
in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) )
vs. your new code
stripos( $is_utf8, 'utf' ) !== false && strpos( $is_utf8, '8' ) !== false
(2) The WP code base never checks the result of ini_set() or @ini_set() but in this case it seems wise to do so. Hosts can disallow it. Most robust way is probably:
static $mb_convert;
if ( function_exists( 'mb_convert_encoding' ) ) {
    @ini_set( 'mbstring.substitute_character', 'none' );
    $mb_convert = @ini_get( 'mbstring.substitute_character' ) === 'none';
}
I don't imagine anybody is worried about changing that ini value without restoring it, but it should probably be noted in the inline doc as a side effect.
As for WP coding standards nits, WP wants braces on all clauses (if ... {}). Also, no parentheses around function_exists() at line 775.
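To make the difference between the two charset checks in (1) concrete, here is a small comparison sketch. The function names are hypothetical and $charset stands in for get_option( 'blog_charset' ). The exact list only accepts the four listed spellings, while the substring version accepts anything containing both 'utf' (case-insensitive) and '8'.

```php
<?php
// Hypothetical wrappers around the two checks quoted above.
function charset_is_utf8_exact( $charset ) {
    return in_array( $charset, array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );
}

function charset_is_utf8_substring( $charset ) {
    return stripos( $charset, 'utf' ) !== false && strpos( $charset, '8' ) !== false;
}

// 'Utf-8' shows the two checks disagreeing on a mixed-case spelling.
foreach ( array( 'UTF-8', 'utf8', 'Utf-8', 'UTF-16', 'UTF-7' ) as $charset ) {
    printf(
        "%-7s exact=%s substring=%s\n",
        $charset,
        var_export( charset_is_utf8_exact( $charset ), true ),
        var_export( charset_is_utf8_substring( $charset ), true )
    );
}
```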
#7 (in reply to: ↑ 6) @ 10 years ago
Replying to kitchin:
> Cool stuff. Comments:
> (1) I still think the old blog_charset check is clearest. No need to confuse people into having to look up obscure docs. Old code:
> in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) )
> vs. your new code
> stripos( $is_utf8, 'utf' ) !== false && strpos( $is_utf8, '8' ) !== false
> (2) The WP code base never checks the result of ini_set() or @ini_set() but in this case it seems wise to do so. Hosts can disallow it. Most robust way is probably:
> static $mb_convert;
> if ( function_exists( 'mb_convert_encoding' ) ) {
>     @ini_set( 'mbstring.substitute_character', 'none' );
>     $mb_convert = @ini_get( 'mbstring.substitute_character' ) === 'none';
> }
> I don't imagine anybody is worried about changing that ini value without restoring it, but it should probably be noted in the inline doc as a side effect.
> As for WP coding standards nits, WP wants braces on all clauses (if ... {}). Also, no parentheses around function_exists() at line 775.
As the blog_charset check does only run once, I agree that the old code is better. I also added your suggestion to verify that the ini value is correctly set to 'none' as part of the requirement for using mb_convert_encoding if iconv is unavailable.
I also went ahead and added braces, and removed the parentheses from the function_exists statement, nice one.
#8 @ 10 years ago
We could also use that same preg_match regex as a preg_replace failsafe for strip, in case a machine can't use either iconv or mb_convert_encoding. I don't want to add it if that situation rarely happens, since nowhere in core is the strip parameter being used.
Thoughts?
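One possible shape for such a failsafe, sketched under assumptions: instead of reusing the invalid-sequence regex with preg_replace, this version collects every well-formed sequence with preg_match_all and glues them back together, which avoids lookbehinds entirely. The function name is hypothetical and this is not part of any attached patch. It builds one array entry per character, so it is only meant as a last resort.

```php
<?php
// Hypothetical last-resort strip: keep only well-formed UTF-8 sequences.
function strip_invalid_utf8_with_pcre( $string ) {
    static $valid = null;

    if ( null === $valid ) {
        // Byte-level alternatives for one well-formed character (RFC 3629).
        $valid = '/[\x00-\x7F]'
            . '|[\xC2-\xDF][\x80-\xBF]'
            . '|\xE0[\xA0-\xBF][\x80-\xBF]'
            . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
            . '|\xED[\x80-\x9F][\x80-\xBF]'
            . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
            . '|[\xF1-\xF3][\x80-\xBF]{3}'
            . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';
    }

    // preg_match_all() skips over bytes that start no valid sequence.
    if ( ! preg_match_all( $valid, (string) $string, $matches ) ) {
        return ''; // nothing valid found, or a PCRE error
    }

    return implode( '', $matches[0] );
}

var_dump( strip_invalid_utf8_with_pcre( "a\xc3\x28b" ) );
```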
#10 @ 10 years ago
Did some benchmarking on both valid and invalid, super long and normal length strings.
At first I was also using mb_check_encoding, but it would cause max execution time errors even on medium sized strings.
/*
BENCHMARKS ON INVALID STRING (750,000 iterations)

mb_strlen 16,049,664       750k time    avg from 15k iterations
preg_match_modifier)       0.73318      0.014659
preg_match_backtrack)      0.73956      0.014787
htmlspecialchars)          45.36456     0.907278
preg_match_pattern)        2.06490      0.041293
mb_check_encoding)         CRASHED, IT CHECKS ENTIRE STRING SO TAKES FOREVERRRRRRR

mb_strlen 2,674,944        750k time    avg from 15k iterations
preg_match_modifier)       0.76279      0.015250
preg_match_backtrack)      0.75758      0.015147
htmlspecialchars)          0.83401      0.016673
preg_match_pattern)        2.15377      0.043068

mb_strlen 344              750k time    avg from 15k iterations
preg_match_modifier)       0.74996      0.014995
preg_match_backtrack)      0.73503      0.014697
htmlspecialchars)          0.70115      0.014019
preg_match_pattern)        2.06986      0.041393

BENCHMARKS ON VALID STRING (750,000 iterations)

strlen 26,873,856          750k time    avg from 15k iterations
preg_match_modifier)       0.74948      0.014984
preg_match_backtrack)      0.75690      0.015133
htmlspecialchars)          44.17337     0.883453
preg_match_pattern)        10.71417     0.214273

strlen 16                  750k time    avg from 15k iterations
preg_match_modifier)       0.79939      0.015984
preg_match_backtrack)      0.80240      0.016044
htmlspecialchars)          0.86205      0.017237
preg_match_pattern)        10.63511     0.212693
*/
class utf_validity {

    public function preg_match_modifier($string) {
        return ( preg_match( '//u', $string ) !== false );
    }

    public function preg_match_backtrack($string) {
        return ( preg_match( '/(*UTF8)/', $string ) !== false );
    }

    public function htmlspecialchars($string) {
        return ( htmlspecialchars( $string, null, 'utf-8' ) != '' );
    }

    public function mb_check_encoding($string) {
        return ( mb_check_encoding( $string, 'UTF-8' ) );
    }

    public function preg_match_pattern($string) {
        static $pattern;
        if ( $pattern == null ) {
            $pattern = '/('
                . '[\xC0-\xC1]' # Invalid UTF-8 Bytes
                . '|[\xF5-\xFF]' # Invalid UTF-8 Bytes
                . '|\xE0[\x80-\x9F]' # Overlong encoding of prior code point
                . '|\xF0[\x80-\x8F]' # Overlong encoding of prior code point
                . '|[\xC2-\xDF](?![\x80-\xBF])' # Invalid UTF-8 Sequence Start
                . '|[\xE0-\xEF](?![\x80-\xBF]{2})' # Invalid UTF-8 Sequence Start
                . '|[\xF0-\xF4](?![\x80-\xBF]{3})' # Invalid UTF-8 Sequence Start
                . '|(?<=[\x0-\x7F\xF5-\xFF])[\x80-\xBF]' # Invalid UTF-8 Sequence Middle
                . '|(?<![\xC2-\xDF]|[\xE0-\xEF]|[\xE0-\xEF][\x80-\xBF]|[\xF0-\xF4]|[\xF0-\xF4][\x80-\xBF]|[\xF0-\xF4][\x80-\xBF]{2})[\x80-\xBF]' # Overlong Sequence
                . '|(?<=[\xE0-\xEF])[\x80-\xBF](?![\x80-\xBF])' # Short 3 byte sequence
                . '|(?<=[\xF0-\xF4])[\x80-\xBF](?![\x80-\xBF]{2})' # Short 4 byte sequence
                . '|(?<=[\xF0-\xF4][\x80-\xBF])[\x80-\xBF](?![\x80-\xBF])' # Short 4 byte sequence (2)
                . ')/';
        }
        return ( preg_match( $pattern, $string ) != 1 );
    }
}
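The timing loop itself isn't shown above, so here is a rough sketch of how such numbers could be collected. It is not the harness that produced the figures in this comment; the helper names, the subject string, and the iteration count are all assumptions chosen to keep the example quick to run.

```php
<?php
// Hypothetical timing helper: total wall-clock seconds for N calls.
function time_callable( $callable, $argument, $iterations ) {
    $start = microtime( true );

    for ( $i = 0; $i < $iterations; $i++ ) {
        call_user_func( $callable, $argument );
    }

    return microtime( true ) - $start;
}

// Stand-in for one of the checks being compared.
function check_with_modifier( $string ) {
    return preg_match( '//u', $string ) !== false;
}

$iterations = 15000;
$subject    = str_repeat( "I\xc3\xb1t\xc3\xabrn\xc3\xa2ti\xc3\xb4n\xc3\xa0liz\xc3\xa6ti\xc3\xb8n ", 100 );
$elapsed    = time_callable( 'check_with_modifier', $subject, $iterations );

printf( "%d iterations: %.5f s total, %.9f s per call\n", $iterations, $elapsed, $elapsed / $iterations );
```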
#11 @ 10 years ago
Spent a lot of time looking into making this more compatible without sacrificing the speed gains of the new version. Patch 29717.5.patch incorporates the results from the benchmarking; most significantly, it removes the custom preg_match regex and replaces it with a much quicker and safer htmlspecialchars call to check for validity.
The preg_match checks are still a tiny bit faster than htmlspecialchars, but in my testing they had the exact same results for detecting invalid UTF-8. And htmlspecialchars is a core function from /ext/standard/html.c that has supported this type of checking since before 5.0.
I also determined in the testing of the new strip code (this version prefers mb_convert_encoding over iconv, and uses the mb_substitute_character function instead of ini_set) that the current wp_check_invalid_utf8 function is definitely broken when using the strip parameter. It needs the //IGNORE suffix and should also use '@'. As it is now, when you run it with the strip parameter turned on and an invalid UTF-8 string, it returns boolean false and triggers some PHP Notices.
PHP Notice: iconv(): Detected an illegal character in input string in /wp-includes/formatting.php on line 738
PHP Notice: iconv(): Detected an incomplete multibyte character in input string in /wp-includes/formatting.php on line 738
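As a rough sketch of the strip order described in this comment (mb_convert_encoding first, iconv with //IGNORE second), something along these lines would avoid both the notices and the false return. It is not the code from 29717.5.patch: the function name is hypothetical, and restoring the previous substitute character is an extra step assumed here so the call leaves no lasting side effect. Exactly which bytes get dropped can vary between mbstring and iconv builds.

```php
<?php
// Hypothetical helper, not the patch code.
function strip_invalid_utf8( $string ) {
    $string = (string) $string;

    if ( function_exists( 'mb_convert_encoding' ) && function_exists( 'mb_substitute_character' ) ) {
        $previous = mb_substitute_character();  // remember the current setting
        mb_substitute_character( 'none' );      // drop invalid bytes instead of substituting
        $clean = mb_convert_encoding( $string, 'UTF-8', 'UTF-8' );
        mb_substitute_character( $previous );   // restore the earlier setting
        return $clean;
    }

    if ( function_exists( 'iconv' ) ) {
        // //IGNORE discards illegal sequences; @ hides the notice some builds still raise.
        $clean = @iconv( 'UTF-8', 'UTF-8//IGNORE', $string );
        if ( false !== $clean ) {
            return $clean;
        }
    }

    return ''; // neither extension could clean the string
}

var_dump( strip_invalid_utf8( "I\xc3\xb1t\xe9rn" ) );
```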
Just for fun, here are a bunch of notes from researching this stuff. Please re-test and examine this patch. Below are just notes.
PCRE & UTF

@link http://www.pcre.org/pcre.txt
@author Philip Hazel - University of Cambridge

UTF-8 AND UNICODE PROPERTY SUPPORT

From release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0 this was greatly extended to cover most common requirements, and in release 5.0 additional support for Unicode general category properties was added.

In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big.

If you are using PCRE in a non-UTF application that permits users to supply arbitrary patterns for compilation, you should be aware of a feature that allows users to turn on UTF support from within a pattern, provided that PCRE was built with UTF support. For example, an 8-bit pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and subjects as strings of UTF-8 characters instead of individual 8-bit characters. This causes both the pattern and any data against which it is matched to be checked for UTF-8 validity. If the data string is very long, such a check might use sufficiently many resources as to cause your application to lose performance. Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at compile time. This causes an compile time error if a pattern contains a UTF-setting sequence.

In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters.

VALIDITY OF UTF-8 STRINGS

When you set the PCRE_UTF8 flag, the byte strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. The entire string is checked before any other processing takes place. From release 7.3 of PCRE, the check is according the rules of RFC 3629, which are themselves derived from the Unicode specification. Earlier releases of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 to U+10FFFF, excluding the surrogate area. (From release 8.33 the so-called "non-character" code points are no longer excluded because Unicode corrigendum #9 makes it clear that they should not be.)

Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to encode codepoints with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and UTF-32.)

If an invalid UTF-8 string is passed to PCRE, an error return is given.

PCRE CHANGELOG

// Release 8.33 28-May-2013

Version 8.33 28-May-2013
---------------------
00. (*LIMIT_MATCH=d), (*LIMIT_RECURSION=d) added so the pattern can specify lower limits for the matching process.
35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, blocking (*UTF) etc.

Version 8.32 30-November-2012
---------------------
14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be set
24. Add support for 32-bit character strings, and UTF-32
25. (*UTF) can now be used to start a pattern in any of the three libraries.
30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints greater than 0x7fffffff (which cannot be represented in UTF-8, even under the "old" RFC 2279). Instead, it ended up passing a negative length to pcre_exec()

Version 7.9 11-Apr-09
---------------------
28. Added support for (*UTF8) at the start of a pattern.

Version 7.3 28-Aug-07
---------------------
15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629. This restricts code points to be within the range 0 to 0x10FFFF, excluding the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still does: it's just the validity check that is more restrictive.

Version 4.4 21-Aug-03
---------------------
15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629. PCRE checks UTF-8 strings for validity by default. There is an option to suppress this, just in case anybody wants that teeny extra bit of performance.

Version 4.4 13-Aug-03
---------------------
10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at both compile and run time, and gives an error if an invalid UTF-8 sequence is found. There is a option for disabling this check in cases where the string is known to be correct and/or the maximum performance is wanted.

Version 3.3 01-Aug-00
---------------------
7. Added the beginnings of support for UTF-8 character strings.

PCRE PHP.INI CONFIGURATION OPTIONS

@link http://php.net/manual/en/pcre.configuration.php "PCRE Configuration Options"

2 PCRE INI options are available since PHP 5.2.0

pcre.backtrack_limit 1000000
    PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
pcre.recursion_limit 100000
    PCRE's recursion limit. Please note that if you set this value too high you may consume all the available process stack and eventually crash PHP (due to reaching the stack size limit imposed by the OS).

PCRE CRASHES FROM REGEXES

// Release 8.33 28-May-2013
// (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) have been added so that the creator of a pattern can specify lower (but not higher) limits for the matching process.

PCRE_EXTRA_MATCH_LIMIT can be accessed through the set_match_limit() and match_limit() member functions. Setting match_limit to a non-zero value will limit the execution of pcre to keep it from doing bad things like blowing the stack or taking an eternity to return a result. A value of 5000 is good enough to stop stack blowup in a 2MB thread stack. Setting match_limit to zero disables match limiting. Alternatively, you can call match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE recurses. match_limit() limits the number of matches PCRE does; match_limit_recursion() limits the depth of internal recursion, and therefore the amount of stack that is used.

The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running patterns that are not going to match, but which have a very large number of possibilities in their search trees. The classic example is the use of nested unlimited repeats.

Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set by match_limit is imposed on the number of times this function is called during a match, which has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string.

The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all but the most extreme cases. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller than match_limit.

Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use memory on the heap instead of the stack, the amount of heap memory that can be used.

The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value as the default for match_limit. You can override the default by suppling pcre_exec() with a pcre_extra block in which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

preg_match()

preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred.

u (PCRE_UTF8)
    This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have been regarded as valid UTF-8.

With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings containing invalid UTF-8 byte sequences. It does not reject character codes above U+10FFFF (represented by 4 or more octets), though.

Originally, this function checked according to RFC 2279, allowing for values in the range 0 to 0x7fffffff, up to 6 bytes long, but ensuring that they were in the canonical format. Once somebody had pointed out RFC 3629 to me (it obsoletes 2279), additional restrictions were applied. The values are now limited to be between 0 and 0x0010ffff, no more than 4 bytes long, and the subrange 0xd000 to 0xdfff is excluded. However, the format of 5-byte and 6-byte characters is still checked.

BACKTRACKING CONTROL

The following are recognized only at the start of a pattern:

(*LIMIT_MATCH=d)      set the match limit to d (decimal number) ( added 8.33 28-May-2013 )
(*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number) ( added 8.33 28-May-2013 )
(*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8) ( added 7.9 11-Apr-09 )
(*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16) ( added 7.9 11-Apr-09 )
(*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32) ( added 7.9 11-Apr-09 )
(*UTF)                set appropriate UTF mode for the library in use ( added 7.9 11-Apr-09 )

In order process UTF-8 strings, you must build PCRE's 8-bit library with UTF support, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8) or (*UTF). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of individual 1-byte characters.

PCRE UTF ERRORS

From release 8.13 more information about the details of the error are passed back in the returned value:

PCRE_UTF8_ERR0   No error
PCRE_UTF8_ERR1   Missing 1 byte at the end of the string
PCRE_UTF8_ERR2   Missing 2 bytes at the end of the string
PCRE_UTF8_ERR3   Missing 3 bytes at the end of the string
PCRE_UTF8_ERR4   Missing 4 bytes at the end of the string
PCRE_UTF8_ERR5   Missing 5 bytes at the end of the string
PCRE_UTF8_ERR6   2nd-byte's two top bits are not 0x80
PCRE_UTF8_ERR7   3rd-byte's two top bits are not 0x80
PCRE_UTF8_ERR8   4th-byte's two top bits are not 0x80
PCRE_UTF8_ERR9   5th-byte's two top bits are not 0x80
PCRE_UTF8_ERR10  6th-byte's two top bits are not 0x80
PCRE_UTF8_ERR11  5-byte character is not permitted by RFC 3629
PCRE_UTF8_ERR12  6-byte character is not permitted by RFC 3629
PCRE_UTF8_ERR13  4-byte character with value > 0x10ffff is not permitted
PCRE_UTF8_ERR14  3-byte character with value 0xd000-0xdfff is not permitted
PCRE_UTF8_ERR15  Overlong 2-byte sequence
PCRE_UTF8_ERR16  Overlong 3-byte sequence
PCRE_UTF8_ERR17  Overlong 4-byte sequence
PCRE_UTF8_ERR18  Overlong 5-byte sequence (won't ever occur)
PCRE_UTF8_ERR19  Overlong 6-byte sequence (won't ever occur)
PCRE_UTF8_ERR20  Isolated 0x80 byte (not within UTF-8 character)
PCRE_UTF8_ERR21  Byte with the illegal value 0xfe or 0xff
PCRE_UTF8_ERR22  Unused (was non-character)

PHP PCRE CONSTANTS

Constant | Description | Since
PREG_NO_ERROR | Returned by preg_last_error() if there were no errors. | 5.2.0
PREG_INTERNAL_ERROR | Returned by preg_last_error() if there was an internal PCRE error. | 5.2.0
PREG_BACKTRACK_LIMIT_ERROR | Returned by preg_last_error() if backtrack limit was exhausted. | 5.2.0
PREG_RECURSION_LIMIT_ERROR | Returned by preg_last_error() if recursion limit was exhausted. | 5.2.0
PREG_BAD_UTF8_ERROR | Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when running a regex in UTF-8 mode). | 5.2.0
PREG_BAD_UTF8_OFFSET_ERROR | Returned by preg_last_error() if the offset didn't correspond to the begin of a valid UTF-8 code point (only when running a regex in UTF-8 mode). | 5.3.0
PCRE_VERSION | PCRE version and release date (e.g. "7.0 18-Dec-2006"). | 5.2.4

PCRE CONSTANTS ON MY INSTALL

get_defined_constants()

'PREG_PATTERN_ORDER' => 1,
'PREG_SET_ORDER' => 2,
'PREG_OFFSET_CAPTURE' => 256,
'PREG_SPLIT_NO_EMPTY' => 1,
'PREG_SPLIT_DELIM_CAPTURE' => 2,
'PREG_SPLIT_OFFSET_CAPTURE' => 4,
'PREG_GREP_INVERT' => 1,
'PREG_NO_ERROR' => 0,
'PREG_INTERNAL_ERROR' => 1,
'PREG_BACKTRACK_LIMIT_ERROR' => 2,
'PREG_RECURSION_LIMIT_ERROR' => 3,
'PREG_BAD_UTF8_ERROR' => 4,
'PREG_BAD_UTF8_OFFSET_ERROR' => 5,
'PCRE_VERSION' => '8.34 2013-12-15',

iconv()

https://www.gnu.org/software/libiconv/

If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. Otherwise, str is cut from the first illegal character and an E_NOTICE is generated. ( since GNU libiconv 2002-01-13 )

In other words, iconv() appears to be intended for use when converting the contents of files - whereas mb_convert_encoding() is intended for use when juggling strings internally, e.g. strings that aren't being read/written to/from files, but exchanged with some other media.

ICONV CHARACTER SET ENCODINGS CONTAINING "UTF"

$ iconv -l
- ISO-10646UTF-8
- ISO-10646UTF8
- UTF-7
- UTF-8
- UTF-16
- UTF-16BE
- UTF-16LE
- UTF-32
- UTF-32BE
- UTF-32LE
- UTF7
- UTF8
- UTF16
- UTF16BE
- UTF16LE
- UTF32
- UTF32BE
- UTF32LE

If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.
ICONV IMPLEMENTATIONS - ICONV_IMPL CONSTANT @link http://www.gnu.org/software/libc/manual/html_node/Other-iconv-Implementations.html "Some Details about other iconv Implementations" @link http://www.gnu.org/software/libc/manual/html_node/Locales.html "Locales and Internationalization" "libiconv" - GNU libiconv is the native FreeBSD iconv implementation since 2002. "BSD iconv" - Konstantin Chugeuv's iconv "glibc" - GNU Glibc's "unknown" - Not one of the above
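As a rough illustration of the notes above (preg_match() returning FALSE on error, and preg_last_error() reporting PREG_BAD_UTF8_ERROR), here is a minimal sketch of a validity check that probes for '/u' first and then for the (*UTF8) pattern prefix. The function names are hypothetical and not taken from any patch on this ticket, and the sketch assumes a PHP version where an invalid subject makes preg_match() return FALSE rather than 0.

```php
<?php
// Hypothetical helper: returns an empty pattern that puts PCRE into UTF-8
// mode, or null when neither form compiles on this PCRE build.
function sketch_utf8_probe_pattern() {
    static $pattern = false; // false means "not probed yet"
    if ( false === $pattern ) {
        if ( false !== @preg_match( '//u', '' ) ) {
            $pattern = '//u';
        } elseif ( false !== @preg_match( '/(*UTF8)/', '' ) ) {
            $pattern = '/(*UTF8)/';
        } else {
            $pattern = null;
        }
    }
    return $pattern;
}

// Returns true for valid UTF-8, false for invalid UTF-8, and null when
// PCRE cannot answer (no UTF support, or some other PCRE error).
function sketch_is_valid_utf8( $text ) {
    $pattern = sketch_utf8_probe_pattern();
    if ( null === $pattern ) {
        return null;
    }
    if ( false !== @preg_match( $pattern, $text ) ) {
        return true;
    }
    // Only treat the failure as "invalid UTF-8" when PCRE says so.
    return PREG_BAD_UTF8_ERROR === preg_last_error() ? false : null;
}
```

Returning null instead of false for the "cannot tell" case keeps a missing UTF build distinguishable from genuinely bad input, so a caller can decide whether to fall back to a bytewise check.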
#13
in reply to:
↑ 6
@
10 years ago
Replying to kitchin:
Cool stuff. Comments:
(1) I still think the old blog_charset check is clearest. No need to confuse people into having to look up obscure docs. Old code:
    in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) )
vs. your new code:
    stripos( $is_utf8, 'utf' ) !== false && strpos( $is_utf8, '8' ) !== false
(2) The WP code base never checks the result of ini_set() or @ini_set(), but in this case it seems wise to do so. Hosts can disallow it. The most robust way is probably:
    static $mb_convert;
    if ( function_exists( 'mb_convert_encoding' ) ) {
        @ini_set( 'mbstring.substitute_character', 'none' );
        $mb_convert = @ini_get( 'mbstring.substitute_character' ) === 'none';
    }
I don't imagine anybody is worried about changing that ini value without restoring it, but it should probably be noted in the inline doc as a side effect.
As for WP coding standards nits, WP wants braces on all clauses (if ... {}). Also, no parentheses around function_exists() at line 775.
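To make the trade-off in point (1) concrete, here is a small sketch that wraps both checks in functions so they can be compared side by side. The function names are hypothetical, and the charset is passed in as a parameter rather than read via get_option() so the snippet runs outside WordPress.

```php
<?php
// The old check: an explicit whitelist of spellings.
function sketch_is_utf8_charset_strict( $charset ) {
    return in_array( $charset, array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ), true );
}

// The new check: any value containing "utf" (case-insensitive) and an "8".
function sketch_is_utf8_charset_loose( $charset ) {
    return false !== stripos( $charset, 'utf' ) && false !== strpos( $charset, '8' );
}

// Print how each check treats a few sample values.
foreach ( array( 'UTF-8', 'Utf-8', 'ISO-8859-1' ) as $charset ) {
    printf(
        "%-12s strict=%s loose=%s\n",
        $charset,
        var_export( sketch_is_utf8_charset_strict( $charset ), true ),
        var_export( sketch_is_utf8_charset_loose( $charset ), true )
    );
}
```

The whitelist is easier to read but misses mixed-case spellings such as 'Utf-8'; the substring test accepts those, at the cost of also accepting any other value that happens to contain both "utf" and "8".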
Hey kitchin, see any room for improvement on the latest patch? Would love more constructive feedback..
#14
@
10 years ago
Yikes, I didn't realize Trac was lonelier than Myspace. Don't take it personally, but it seems there is a slight lag issue. Or maybe my code is bad, or maybe contributions aren't wanted? Or maybe I submitted it wrong? Is this lag OK with everyone, and is it just a part of doing business? Wow. New features are great, but isn't there anyone who desires to put core code improvements ahead of new bells and whistles? lol, sorry in advance; well aware this type of complaint won't win me any friends. :)
#15
follow-up:
↓ 16
@
10 years ago
Hi askapache, my Myspace account has lapsed but I did see this and can tell you I only have certain hours to work on Wordpress each week. I don't comment right away since that just clogs up everyone's emails.
#16
in reply to:
↑ 15
@
10 years ago
Replying to kitchin:
Hi askapache, my Myspace account has lapsed but I did see this and can tell you I only have certain hours to work on Wordpress each week. I don't comment right away since that just clogs up everyone's emails.
That'll teach me to drink&corecomment eesh. It is notable that this ticket replaces others, some of which were several years old. Looking forward to your feedback kitchin!
#17
@
10 years ago
FYI, I've been running WP with this modification in place for 4.0 and 4.0.1, no issues.
#20
follow-up:
↓ 22
@
10 years ago
- Keywords needs-codex added
- Severity changed from normal to major
What is the holdup on this? How can I help?
#22
in reply to:
↑ 20
@
10 years ago
Replying to askapache:
What is the holdup on this? How can I help?
I think it's a question of "who has the time and tenacity to tackle this right now?" - I will ask around a bit to see if we can at least get some more feedback, though there's quite a bit going on right at the moment. It's definitely not in my individual wheelhouse.
#23
@
6 years ago
- Keywords needs-refresh needs-unit-tests added
Related: #38044.
The current patch no longer applies cleanly to trunk. @askapache, are you able to refresh? I'd also like to see some unit tests here.
#24
@
6 years ago
Proof of concept, needs unit tests. Passes my ad-hoc testing with @askapache's test strings.
- Speed up testing for an empty string. Indeed, Stack Exchange says 0==strlen($string) is slower than isset($string[0]). But ''==$string is almost as fast and matches the WP codebase.
- For stripping, iconv() misses some patterns in the test strings, on my platform at least. But the bytewise regex in wpdb::strip_invalid_text() finds them all (4 byte version). So use that.
- Add a new parameter $bytewise that controls use of the regex from wpdb. "Bytewise" here means without using "/u".
The new parameter (set to 'always') should solve #38044 by providing a better check than seems_utf8().
By default the patch works the same as trunk, when $strip is off. For $strip the patch uses the wpdb regex instead of iconv(). Note there's a slight bug in trunk, since the return can be null instead of string if iconv() fails, and also iconv() should be @iconv().
Compared to @askapache 29757.5.patch, this patch does not try to use '(*UTF8)' or htmlspecialchars() as fallbacks. The wpdb regex may be slower, but it's only used when "/u" is not available, or for the "not recommended" strip. It's five years later now, so platforms are better, and "not recommended" has been in the codebase longer than that.
Note the code patched has not changed logically since WP 4.0, approx. when this bug started.
(I'm going to post an updated patch that fixes a bug.)
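For readers unfamiliar with the "bytewise" approach described in this comment, here is a minimal sketch of validating and stripping UTF-8 without the "/u" modifier. It is written in the spirit of the regex in wpdb::strip_invalid_text() but is not a copy of it or of the attached patch; the constant and function names are hypothetical, and the byte ranges are the well-formed sequences from RFC 3629.

```php
<?php
// One well-formed UTF-8 character, expressed as byte ranges (used with /x).
const SKETCH_UTF8_CHAR = '
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, overlongs excluded
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, overlongs excluded
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
    | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, surrogates excluded
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, overlongs excluded
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
';

// True when the whole string consists of well-formed UTF-8 characters.
function sketch_is_valid_utf8_bytewise( $text ) {
    return 1 === preg_match( '/\A(?:' . SKETCH_UTF8_CHAR . ')*+\z/x', $text );
}

// Keeps runs of well-formed characters and drops every other byte.
// The {1,40} bound keeps each run short, as a guard against deep
// recursion on older PCRE builds.
function sketch_strip_invalid_utf8_bytewise( $text ) {
    return preg_replace( '/((?:' . SKETCH_UTF8_CHAR . '){1,40})|./xs', '$1', $text );
}
```

Because the pattern runs in byte mode, it does not depend on PCRE having UTF support at all, which is the reason it can serve as a fallback when "/u" is unavailable.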
#26
@
6 years ago
@kitchin I was happy to see some activity after 5 years! Now it's been 5 weeks waiting on what I considered an easy review 5 years ago!! Something is broken if this is too hard to get reviewed, something is broken if it's not too hard, something is broken. 5 years.
If it's just bad code, that would be a great 10s response. I stopped contributing. I won't contribute in the future. Huge waste of time and effort. 5 years.
This ticket was mentioned in PR #6320 on WordPress/wordpress-develop by @pbearne.
12 months ago
#27
- Keywords has-unit-tests added; needs-refresh needs-unit-tests removed
Trac ticket: 29717
#28
@
12 months ago
- Owner set to pbearne
- Status changed from new to assigned
refreshed patch
Added some tests
This ticket was mentioned in Slack in #core-performance by pbearne. View the logs.
12 months ago
#30
@
10 months ago
Hello
With WP 6.5.2
PHP Notice: iconv(): Detected an illegal character in input string in /.../wordpress/wp-includes/formatting.php on line 1141
In wp-includes/formatting.php, line 1141:
// Attempt to strip the bad chars if requested (not recommended).
if ( $strip && function_exists( 'iconv' ) ) {
return iconv( 'utf-8', 'utf-8', $text );
}
If the aim is to "Attempt to strip the bad chars", perhaps changing the target encoding to "utf-8//IGNORE" will do the job?
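A minimal sketch of that suggestion follows, with a hypothetical function name. It assumes nothing about which iconv implementation is installed: //IGNORE handling is known to differ between implementations, and iconv() may still return false on bad input, so the sketch suppresses the notice and checks the return value instead of passing it straight through.

```php
<?php
// Returns the input with unconvertible bytes dropped, or null when iconv
// is unavailable or the conversion fails on this platform.
function sketch_strip_with_iconv( $text ) {
    if ( ! function_exists( 'iconv' ) ) {
        return null;
    }
    // "@" silences the "Detected an illegal character" notice.
    $clean = @iconv( 'UTF-8', 'UTF-8//IGNORE', $text );
    return false === $clean ? null : $clean;
}
```

A caller would treat null as "stripping was not possible here" and fall back to another method, rather than returning false or null to code that expects a string.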
29717.wp_check_invalid_utf8.patch