id,summary,reporter,owner,description,type,status,priority,milestone,component,version,severity,resolution,keywords,cc,focuses
29717,"wp_check_invalid_utf8 - pcre tricks and failsafes, +mb_convert_encoding, iconv fix, performance",askapache,,"Used in core in these 4 functions.

 * esc_attr()
 * esc_js()
 * esc_html()
 * sanitize_text_field()

It's the first function to execute for all 4, and especially for sanitize_text_field it gets called quite a bit and is pretty important.

Its purpose is to check a string for invalid utf. It utilizes preg_match with the '/u' modifier to parse both the pattern and subject for utf. PCRE automatically checks both the pattern and subject for invalid utf, upon which it will exit with an error code/constant.

The changes here:

Normally pcre is compiled with utf support. It can also be compiled to disallow utf support, and it can be compiled without utf support. If utf is compiled and enabled, the '/u' modifier for preg_match is available, which turns on the automatic utf validation. For older dists or those with utf support turned off at compile, there is a trick to enable the same functionality as the '/u' provides.

http://www.pcre.org/pcre.txt

In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters.

So the first change to this function was to allow a fallback to that pattern option trick in case '/u' wasn't supported.

 1. `@preg_match( '//u', '' ) !== false`
 2. `@preg_match( '/(*UTF8)/', '' ) !== false`
 3. 
Fallback to a regex that doesn't require UTF support; instead of relying on PCRE's utf validation, the pattern does the checking itself.

I also wanted it to have better performance, especially due to its use in those 4 core functions I use often. I benchmarked it pretty thoroughly to try and gain more speed. This patch is about 10-20% faster. Many gains were from refactoring the logic and control structures, chaining within if statements using bools, and utilizing the static variables to the fullest. This is especially crucial since this function gets called repeatedly. I also gained some cycles by replacing an in_array() check with a `stripos`.

One of the bigger gains came from replacing the `strlen( $string ) == 0` check that ran on every call with the `isset( $string[0] )` test shown here. Since the $string variable was already cast to a string, that should always work and keep things a little cheaper.

{{{
$string = (string) $string;

// if string length is 0 (faster than strlen) return empty
if ( ! isset( $string[0] ) )
	return '';
}}}

The final change was to the 2nd parameter, $strip, which if true is supposed to strip the invalid utf out of the string and return the valid remainder. Nowhere in core is that parameter being used (yet), which explains the deprecated-looking iconv. I also added a fallback to use mb_convert_encoding in case iconv is missing.

{{{
// try to use iconv if exists
if ( function_exists( 'iconv' ) )
	return @iconv( 'utf-8', 'utf-8//ignore', $string );

// otherwise try to use mb_convert_encoding, setting the substitute_character to none to mimic strip
if ( function_exists( 'mb_convert_encoding' ) ) {
	@ini_set( 'mbstring.substitute_character', 'none' );
	return @mb_convert_encoding( $string, 'utf-8', 'utf-8' );
}
}}}

Here are some of the test strings I used; I also used the utf-8-test file at http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt. I did testing on 4.0 using php 5.6, 5.4, 5.3, and 5.4. I verified the output and the strip feature as well. 
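
The three-step detection described earlier in this ticket (the '/u' modifier, then the (*UTF8) pattern prefix, then a plain byte-level regex) could be sketched roughly as follows. This is only an illustration and not the attached patch: the function names are hypothetical, and the byte-level pattern is an assumption transcribed from the RFC 3629 ABNF quoted in the notes at the end of this ticket.

```php
<?php
// Illustration only: all function names here are hypothetical.

// Pick the cheapest available strategy once and cache it in a static.
function example_utf8_check_mode() {
	static $mode = null;
	if ( null === $mode ) {
		if ( false !== @preg_match( '//u', '' ) ) {
			$mode = 'modifier'; // the '/u' modifier is supported
		} elseif ( false !== @preg_match( '/(*UTF8)/', '' ) ) {
			$mode = 'verb'; // the (*UTF8) pattern prefix is supported
		} else {
			$mode = 'bytes'; // PCRE has no usable UTF support
		}
	}
	return $mode;
}

// Byte-level check that needs no UTF support in PCRE.
// Assumption: the alternation mirrors the RFC 3629 ABNF.
function example_is_valid_utf8_bytes( $string ) {
	$pattern = '/\A(?:[\x00-\x7F]'
		. '|[\xC2-\xDF][\x80-\xBF]'
		. '|\xE0[\xA0-\xBF][\x80-\xBF]'
		. '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
		. '|\xED[\x80-\x9F][\x80-\xBF]'
		. '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
		. '|[\xF1-\xF3][\x80-\xBF]{3}'
		. '|\xF4[\x80-\x8F][\x80-\xBF]{2}'
		. ')*\z/';
	return 1 === preg_match( $pattern, $string );
}

// Returns true when $string is valid UTF-8, false otherwise.
function example_is_valid_utf8( $string ) {
	$string = (string) $string;
	if ( ! isset( $string[0] ) ) {
		return true; // empty string: nothing to validate
	}
	switch ( example_utf8_check_mode() ) {
		case 'modifier':
			// preg_match() returns false when the subject is not valid UTF-8
			return 1 === @preg_match( '//u', $string );
		case 'verb':
			return 1 === @preg_match( '/(*UTF8)/', $string );
		default:
			return example_is_valid_utf8_bytes( $string );
	}
}
```

For example, `example_is_valid_utf8( hex2bin( 'c3b1' ) )` should report a valid string and `example_is_valid_utf8( hex2bin( 'c328' ) )` an invalid one (hex2bin() needs PHP 5.4 or later). On long inputs the byte-level pattern may run into PCRE backtrack or JIT limits, so real code would also want to look at preg_last_error().
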
For all tests I had php error_reporting set to the max:

{{{
ini_set( 'error_reporting', 2147483647 );
}}}

{{{
$valid_utf = array(
	""\xc3\xb1"", // 'Valid 2 Octet Sequence'
	""\xe2\x82\xa1"", // 'Valid 3 Octet Sequence'
	""\xf0\x90\x8c\xbc"", // 'Valid 4 Octet Sequence'
	""\xf8\xa1\xa1\xa1\xa1"", // 'Valid 5 Octet Sequence (but not Unicode!)'
	""\xfc\xa1\xa1\xa1\xa1\xa1"", // 'Valid 6 Octet Sequence (but not Unicode!)'
	""Iñtërnâtiônàlizætiøn\xf0\x90\x8c\xbcIñtërnâtiônàlizætiøn"", // valid four octet id
	'Iñtërnâtiônàlizætiøn', // valid UTF-8 string
	""\xc3\xb1"", // valid two octet id
	""Iñtërnâtiônàlizætiøn\xe2\x82\xa1Iñtërnâtiônàlizætiøn"", // valid three octet id
);

$invalid_utf = array(
	""\xc3\x28"", // 'Invalid 2 Octet Sequence'
	""\xa0\xa1"", // 'Invalid Sequence Identifier'
	""\xe2\x28\xa1"", // 'Invalid 3 Octet Sequence (in 2nd Octet)'
	""\xe2\x82\x28"", // 'Invalid 3 Octet Sequence (in 3rd Octet)'
	""\xf0\x28\x8c\xbc"", // 'Invalid 4 Octet Sequence (in 2nd Octet)'
	""\xf0\x90\x28\xbc"", // 'Invalid 4 Octet Sequence (in 3rd Octet)'
	""\xf0\x28\x8c\x28"", // 'Invalid 4 Octet Sequence (in 4th Octet)'
	chr(0xE3) . chr(0x80) . chr(0x22), // Invalid malformed because 0x22 is not a valid second trailing byte following the leading byte 0xE3. http://www.unicode.org/reports/tr36/
	chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Invalid UTF-8, overlong 5 byte encoding.
	chr(0xD0) . chr(0x01), // High code-point without trailing characters.
	chr(0xC0) . chr(0x80), // Overlong encoding of code point 0
	chr(0xF8) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 5 byte encoding
	chr(0xFC) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80) . chr(0x80), // Overlong encoding of 6 byte encoding
	chr(0xD0) . 
	chr(0x01), // High code-point without trailing characters
	""Iñtërnâtiôn\xe9àlizætiøn"", // invalid UTF-8 string
	""Iñtërnâtiônàlizætiøn\xfc\xa1\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn"", // invalid six octet sequence
	""Iñtërnâtiônàlizætiøn\xf0\x28\x8c\xbcIñtërnâtiônàlizætiøn"", // invalid four octet sequence
	""Iñtërnâtiônàlizætiøn \xc3\x28 Iñtërnâtiônàlizætiøn"", // invalid two octet sequence
	""this is an invalid char '\xe9' here"", // invalid ASCII string
	""Iñtërnâtiônàlizætiøn\xa0\xa1Iñtërnâtiônàlizætiøn"", // invalid id between two and three
	""Iñtërnâtiônàlizætiøn\xf8\xa1\xa1\xa1\xa1Iñtërnâtiônàlizætiøn"", // invalid five octet sequence
	""Iñtërnâtiônàlizætiøn\xe2\x82\x28Iñtërnâtiônàlizætiøn"", // invalid three octet sequence third
	""Iñtërnâtiônàlizætiøn\xe2\x28\xa1Iñtërnâtiônàlizætiøn"", // invalid three octet sequence second
);
}}}

----

Notes and more info:

{{{
In order process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When either of these is the case, both the pattern and any subject strings that are matched against it are treated as UTF-8 strings instead of strings of 1-byte characters.

UTF-8 was devised in September 1992 by Ken Thompson, guided by design criteria specified by Rob Pike, with the objective of defining a UCS transformation format usable in the Plan9 operating system in a non-disruptive manner.

Char. number range  |        UTF-8 octet sequence
   (hexadecimal)    |              (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A UTF-8 string is a sequence of octets representing a sequence of UCS characters. 
An octet sequence is valid UTF-8 only if it matches the following syntax, which is derived from the rules for encoding UTF-8 and is expressed in the ABNF of [RFC2234].

UTF8-octets = *( UTF8-char )
UTF8-char   = UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4
UTF8-1      = %x00-7F
UTF8-2      = %xC2-DF UTF8-tail
UTF8-3      = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
              %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4      = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
              %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail   = %x80-BF
}}}

 * http://www.pcre.org/pcre.txt
 * http://us1.php.net/manual/en/pcre.constants.php
 * http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
 * http://en.wikipedia.org/wiki/Unicode
 * http://unicode.org/faq/utf_bom.html
 * http://www.unicode.org/versions/Unicode6.1.0/ch03.pdf
 * http://tools.ietf.org/rfc/rfc3629.txt
 * http://www.unicode.org/faq/utf_bom.html
 * http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf
 * http://www.unicode.org/reports/tr36/

Related Tickets:

 * https://core.trac.wordpress.org/ticket/11175
 * https://core.trac.wordpress.org/ticket/28786 ",enhancement,new,normal,Awaiting Review,Formatting,,normal,,has-patch dev-feedback needs-refresh needs-unit-tests,,performance