Changes between Initial Version and Version 1 of Ticket #29717, comment 11


Timestamp:
10/17/2014 04:00:49 PM (5 years ago)
Author:
DrewAPicture
Comment:

  • Ticket #29717, comment 11

    initial v1  
    1414
    1515
    16 Just for fun, here are a bunch of notes from researching this stuff. Please re-test and examine this patch.  Below are just notes.
     16Just for fun, here are a bunch of notes from researching this stuff. Please re-test and examine this patch.  The notes can be found here: http://pastebin.com/M5jBF8Dz
    1717
    1818
    19 {{{
    20  ____   ____ ____  _____    ___     _   _ _____ _____
    21 |  _ \ / ___|  _ \| ____|  ( _ )   | | | |_   _|  ___|
    22 | |_) | |   | |_) |  _|    / _ \/\ | | | | | | | |_
    23 |  __/| |___|  _ <| |___  | (_>  < | |_| | | | |  _|
    24 |_|    \____|_| \_\_____|  \___/\/  \___/  |_| |_|
    25 
    26 
    27 @link http://www.pcre.org/pcre.txt @author Philip Hazel - University of Cambridge
    28 UTF-8 AND UNICODE PROPERTY SUPPORT
    29 
    30 From release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0
    31 this was greatly extended to cover most common requirements, and in release 5.0 additional support for Unicode
    32 general category properties was added.
    33 
    34 In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition,
    35 you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject
    36 strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes.
    37 
    38 If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the
    39 additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big.
    40 
    41 If you are using PCRE in a non-UTF application that permits users to supply arbitrary patterns for compilation, you
    42 should be aware of a feature that allows users to turn on UTF support from within a pattern, provided that PCRE was
    43 built with UTF support. For example, an 8-bit pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
    44 which interprets patterns and subjects as strings of UTF-8 characters instead of individual 8-bit characters. This
    45 causes both the pattern and any data against which it is matched to be checked for UTF-8 validity. If the data string
    46 is very long, such a check might use sufficiently many resources as to cause your application to lose performance.
    47 
    48 Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at compile time. This
    49 causes a compile-time error if a pattern contains a UTF-setting sequence.
    50 
    51 In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you
    52 must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When
    53 either of these is the case, both the pattern and any subject strings that are matched against it are treated as
    54 UTF-8 strings instead of strings of 1-byte characters.
    55 
    56 
    57 VALIDITY OF UTF-8 STRINGS
    58 
    59 When you set the PCRE_UTF8 flag, the byte strings passed as patterns and subjects are (by default) checked for
    60 validity on entry to the relevant functions. The entire string is checked before any other processing takes
    61 place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are themselves derived from
    62 the Unicode specification. Earlier releases of PCRE followed the rules of RFC 2279, which allows the full range
    63 of 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 to U+10FFFF, excluding
    64 the surrogate area. (From release 8.33 the so-called "non-character" code points are no longer excluded because
    65 Unicode corrigendum #9 makes it clear that they should not be.)
    66 
    67 Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to
    68 encode codepoints with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available
    69 independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16
    70 which unfortunately messes up UTF-8 and UTF-32.)
    71 
    72 If an invalid UTF-8 string is passed to PCRE, an error return is given.
    73 
    74 
    75 
    76 
    77 
    78 
    79 
    80  ___  ___ ___ ___    ___ _                       _
    81 | _ \/ __| _ \ __|  / __| |_  __ _ _ _  __ _ ___| |___  __ _
    82 |  _/ (__|   / _|  | (__| ' \/ _` | ' \/ _` / -_) / _ \/ _` |
    83 |_|  \___|_|_\___|  \___|_||_\__,_|_||_\__, \___|_\___/\__, |
    84                                        |___/           |___/
    85 // Release 8.33 28-May-2013
    86 
    87 Version 8.33 28-May-2013
    88 ---------------------
    89 00. (*LIMIT_MATCH=d), (*LIMIT_RECURSION=d) added so the pattern can specify lower limits for the matching process.
    90 35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, blocking (*UTF) etc.
    91 
    92 Version 8.32 30-November-2012
    93 ---------------------
    94 14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be set
    95 24. Add support for 32-bit character strings, and UTF-32
    96 25. (*UTF) can now be used to start a pattern in any of the three libraries.
    97 30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints greater than 0x7fffffff (which cannot be
    98     represented in UTF-8, even under the "old" RFC 2279). Instead, it ended up passing a negative length to pcre_exec()
    99 
    100 Version 7.9 11-Apr-09
    101 ---------------------
    102 28. Added support for (*UTF8) at the start of a pattern.
    103 
    104 Version 7.3 28-Aug-07
    105 ---------------------
    106 15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629.
    107     This restricts code points to be within the range 0 to 0x10FFFF, excluding
    108     the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the
    109     full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still
    110     does: it's just the validity check that is more restrictive.
    111 
    112 Version 4.4 21-Aug-03
    113 ---------------------
    114 15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629.
    115 PCRE checks UTF-8 strings for validity by default. There is an option to suppress
    116 this, just in case anybody wants that teeny extra bit of performance.
    117 
    118 Version 4.4 13-Aug-03
    119 ---------------------
    120 10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
    121     both compile and run time, and gives an error if an invalid UTF-8 sequence
    122     is found. There is an option for disabling this check in cases where the
    123     string is known to be correct and/or the maximum performance is wanted.
    124 
    125 Version 3.3 01-Aug-00
    126 ---------------------
    127 7. Added the beginnings of support for UTF-8 character strings.
    128 
    129 
    130 
    131 
    132 
    133 PCRE PHP.INI CONFIGURATION OPTIONS
    134 
    135 @link http://php.net/manual/en/pcre.configuration.php "PCRE Configuration Options"
    136 
    137 2 PCRE INI options are available since PHP 5.2.0
    138 
    139 pcre.backtrack_limit 1000000
    140     PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7.
    141 
    142 pcre.recursion_limit 100000
    143     PCRE's recursion limit. Please note that if you set this value too high you may consume all the available
    144     process stack and eventually crash PHP (due to reaching the stack size limit imposed by the OS).
    145 
    146 
    147 
    148 
    149 
    150 PCRE CRASHES FROM REGEXES
    151 
    152 // Release 8.33 28-May-2013
    153 // (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) have been added so that the creator of a pattern can specify lower (but not higher) limits for the matching process.
    154 
    155 
    156 PCRE_EXTRA_MATCH_LIMIT can be accessed through the set_match_limit()
    157 and match_limit() member functions. Setting match_limit to a non-zero value will limit the execution of
    158 pcre to keep it from doing bad things like blowing the stack or taking an eternity to return a result. A value
    159 of 5000 is good enough to stop stack blowup in a 2MB thread stack. Setting match_limit to zero disables match
    160 limiting. Alternatively, you can call match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit
    161 how much PCRE recurses. match_limit() limits the number of matches PCRE does; match_limit_recursion() limits the
    162 depth of internal recursion, and therefore the amount of stack that is used.
    163 
    164 The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running
    165 patterns that are not going to match, but which have a very large number of possibilities in their search trees. The
    166 classic example is the use of nested unlimited repeats.
    167 
    168 Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set
    169 by match_limit is imposed on the number of times this function is called during a match, which has the effect of
    170 limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts
    171 from zero for each position in the subject string.
    172 
    173 The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all
    174 but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which
    175 match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec()
    176 returns PCRE_ERROR_MATCHLIMIT.
    177 
    178 The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times
    179 that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total
    180 number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller
    181 than match_limit.
    182 
    183 Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use
    184 memory on the heap instead of the stack, the amount of heap memory that can be used.
    185 
    186 The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value
    187 as the default for match_limit. You can override the default by supplying pcre_exec() with a pcre_extra block in
    188 which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit
    189 is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
    190 
    191 
    192 
    193 
    194 
    195                                 _      _     ____
    196  _ __ _ _ ___ __ _   _ __  __ _| |_ __| |_  / /\ \
    197 | '_ \ '_/ -_) _` | | '  \/ _` |  _/ _| ' \| |  | |
    198 | .__/_| \___\__, |_|_|_|_\__,_|\__\__|_||_| |  | |
    199 |_|          |___/___|                      \_\/_/
    200 
    201 preg_match() returns 1 if the pattern matches given subject, 0 if it does not, or FALSE if an error occurred.
    202 
    203 u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and
    204 subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP
    205 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will
    206 cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and
    207 six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have
    208 been regarded as valid UTF-8.
    209 
    210 With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings containing invalid UTF-8 byte sequences. It
    211 does not reject character codes above U+10FFFF (represented by 4 or more octets), though.
    212 
    213 Originally, this function checked according to RFC 2279, allowing for values in the range 0 to 0x7fffffff, up to 6
    214 bytes long, but ensuring that they were in the canonical format. Once somebody had pointed out RFC 3629 to me (it
    215 obsoletes 2279), additional restrictions were applied. The values are now limited to be between 0 and 0x0010ffff,
    216 no more than 4 bytes long, and the subrange 0xd800 to 0xdfff is excluded. However, the format of 5-byte and 6-byte
    217 characters is still checked.
    218 
    219 
    220 
    221 BACKTRACKING CONTROL
    222 
    223 The following are recognized only at the start of a pattern:
    224 
    225 (*LIMIT_MATCH=d) set the match limit to d (decimal number) ( added 8.33 28-May-2013 )
    226 (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) ( added 8.33 28-May-2013 )
    227 
    228 (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) ( added 7.9 11-Apr-09 )
    229 (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) ( added 8.30 04-February-2012 )
    230 (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) ( added 8.32 30-November-2012 )
    231 (*UTF) set appropriate UTF mode for the library in use ( added 8.32 30-November-2012 )
    232 
    233 In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF support, and, in addition, you
    234 must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8) or
    235 (*UTF). When either of these is the case, both the pattern and any subject strings that are matched against it
    236 are treated as UTF-8 strings instead of strings of individual 1-byte characters.
    237 
    238 
    239 
    240 PCRE UTF ERRORS
    241 
    242 From release 8.13 more information about the details of the error are passed back in the returned value:
    243 
    244 PCRE_UTF8_ERR0 No error
    245 PCRE_UTF8_ERR1 Missing 1 byte at the end of the string
    246 PCRE_UTF8_ERR2 Missing 2 bytes at the end of the string
    247 PCRE_UTF8_ERR3 Missing 3 bytes at the end of the string
    248 PCRE_UTF8_ERR4 Missing 4 bytes at the end of the string
    249 PCRE_UTF8_ERR5 Missing 5 bytes at the end of the string
    250 PCRE_UTF8_ERR6 2nd-byte's two top bits are not 0x80
    251 PCRE_UTF8_ERR7 3rd-byte's two top bits are not 0x80
    252 PCRE_UTF8_ERR8 4th-byte's two top bits are not 0x80
    253 PCRE_UTF8_ERR9 5th-byte's two top bits are not 0x80
    254 PCRE_UTF8_ERR10 6th-byte's two top bits are not 0x80
    255 PCRE_UTF8_ERR11 5-byte character is not permitted by RFC 3629
    256 PCRE_UTF8_ERR12 6-byte character is not permitted by RFC 3629
    257 PCRE_UTF8_ERR13 4-byte character with value > 0x10ffff is not permitted
    258 PCRE_UTF8_ERR14 3-byte character with value 0xd800-0xdfff is not permitted
    259 PCRE_UTF8_ERR15 Overlong 2-byte sequence
    260 PCRE_UTF8_ERR16 Overlong 3-byte sequence
    261 PCRE_UTF8_ERR17 Overlong 4-byte sequence
    262 PCRE_UTF8_ERR18 Overlong 5-byte sequence (won't ever occur)
    263 PCRE_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur)
    264 PCRE_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character)
    265 PCRE_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff
    266 PCRE_UTF8_ERR22 Unused (was non-character)
    267 
    268 
    269 PHP PCRE CONSTANTS
    270 
    271 PREG_NO_ERROR   Returned by preg_last_error() if there were no errors.  5.2.0
    272 PREG_INTERNAL_ERROR  Returned by preg_last_error() if there was an internal PCRE error.  5.2.0
    273 PREG_BACKTRACK_LIMIT_ERROR  Returned by preg_last_error() if backtrack limit was exhausted.  5.2.0
    274 PREG_RECURSION_LIMIT_ERROR  Returned by preg_last_error() if recursion limit was exhausted.  5.2.0
    275 PREG_BAD_UTF8_ERROR  Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when
    276                      running a regex in UTF-8 mode).  5.2.0
    277 PREG_BAD_UTF8_OFFSET_ERROR  Returned by preg_last_error() if the offset didn't correspond to the beginning of a valid
    278                             UTF-8 code point (only when running a regex in UTF-8 mode).  5.3.0
    279 PCRE_VERSION  PCRE version and release date (e.g. "7.0 18-Dec-2006").  5.2.4
    280 
    281 PCRE CONSTANTS ON MY INSTALL get_defined_constants()
    282 
    283 'PREG_PATTERN_ORDER' => 1,
    284 'PREG_SET_ORDER' => 2,
    285 'PREG_OFFSET_CAPTURE' => 256,
    286 'PREG_SPLIT_NO_EMPTY' => 1,
    287 'PREG_SPLIT_DELIM_CAPTURE' => 2,
    288 'PREG_SPLIT_OFFSET_CAPTURE' => 4,
    289 'PREG_GREP_INVERT' => 1,
    290 'PREG_NO_ERROR' => 0,
    291 'PREG_INTERNAL_ERROR' => 1,
    292 'PREG_BACKTRACK_LIMIT_ERROR' => 2,
    293 'PREG_RECURSION_LIMIT_ERROR' => 3,
    294 'PREG_BAD_UTF8_ERROR' => 4,
    295 'PREG_BAD_UTF8_OFFSET_ERROR' => 5,
    296 'PCRE_VERSION' => '8.34 2013-12-15',
    297 
    298 
    299 
    300 
    301  _                 ____
    302 (_)__ ___ _ ___ __/ /\ \
    303 | / _/ _ \ ' \ V / |  | |
    304 |_\__\___/_||_\_/| |  | |
    305                   \_\/_/
    306 
    307 https://www.gnu.org/software/libiconv/
    308 
    309 If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded.
    310 Otherwise, str is cut from the first illegal character and an E_NOTICE is generated.  ( since GNU libiconv 2002-01-13 )
    311 
    312 In other words, iconv() appears to be intended for use when converting the contents of files - whereas mb_convert_encoding() is intended
    313 for use when juggling strings internally, e.g. strings that aren't being read/written to/from files, but exchanged with some other media.
    314 
    315 ICONV CHARACTER SET ENCODINGS CONTAINING "UTF"
    316 
    317 $ iconv -l
    318  - ISO-10646UTF-8
    319  - ISO-10646UTF8
    320  - UTF-7
    321  - UTF-8
    322  - UTF-16
    323  - UTF-16BE
    324  - UTF-16LE
    325  - UTF-32
    326  - UTF-32BE
    327  - UTF-32LE
    328  - UTF7
    329  - UTF8
    330  - UTF16
    331  - UTF16BE
    332  - UTF16LE
    333  - UTF32
    334  - UTF32BE
    335  - UTF32LE
    336 
    337 If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion.
    338 
    339 ICONV IMPLEMENTATIONS - ICONV_IMPL CONSTANT
    340 
    341 @link http://www.gnu.org/software/libc/manual/html_node/Other-iconv-Implementations.html "Some Details about other iconv Implementations"
    342 @link http://www.gnu.org/software/libc/manual/html_node/Locales.html "Locales and Internationalization"
    343 
    344 "libiconv" - GNU libiconv is the native FreeBSD iconv implementation since 2002.
    345 "BSD iconv" - Konstantin Chuguev's iconv
    346 "glibc" - GNU Glibc's
    347 "unknown" - Not one of the above
    348 }}}
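As a rough illustration of the RFC 3629 rules quoted in the notes (no surrogates, nothing above U+10FFFF, no overlong or 5/6-byte forms), here is a minimal Python sketch. It leans on Python's strict built-in UTF-8 decoder rather than PCRE or PHP, so it only approximates what the PCRE_UTF8 validity check does. The helper name `is_valid_utf8` and the sample labels are made up for this example.

```python
# Sketch only: uses Python's strict UTF-8 decoder as a stand-in for the
# RFC 3629 validity check described in the notes. Not PCRE, not PHP.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict (RFC 3629) UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample inputs, one per rule mentioned in the notes.
SAMPLES = {
    "ascii": b"abc",
    "two_byte": b"\xc3\xa9",                # U+00E9
    "four_byte_max": b"\xf4\x8f\xbf\xbf",   # U+10FFFF, highest allowed
    "non_character": b"\xef\xbf\xbe",       # U+FFFE, still decodable
    "truncated": b"\xe2\x82",               # missing 1 byte at the end
    "overlong_2": b"\xc0\xaf",              # overlong 2-byte sequence
    "surrogate": b"\xed\xa0\x80",           # U+D800, surrogate area
    "above_max": b"\xf4\x90\x80\x80",       # U+110000, beyond U+10FFFF
    "five_byte": b"\xf8\x88\x80\x80\x80",   # 5-byte form from RFC 2279
    "isolated_0x80": b"\x80",               # continuation byte on its own
    "illegal_0xff": b"\xff",                # byte value never used in UTF-8
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        print(f"{label:15} {'valid' if is_valid_utf8(raw) else 'INVALID'}")
```

Only the first four samples should come back valid. In PHP terms, the invalid ones are the kind of subject that would be expected to make a preg_* call with the `u` modifier match nothing.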
    349 
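The notes say match_limit caps the number of calls to PCRE's internal match() function, and that nested unlimited repeats are the classic way to hit it. The toy backtracking matcher below is a sketch of that idea for the single hard-coded pattern `^(a+)+b`. It is not PCRE's algorithm; `count_match_calls` and `MatchLimitExceeded` are hypothetical names, the latter standing in for PCRE_ERROR_MATCHLIMIT.

```python
# Toy model of a match limit: a naive backtracking matcher for ^(a+)+b
# that counts its own recursive calls. Illustrative only, not PCRE.

class MatchLimitExceeded(Exception):
    """Stand-in for PCRE_ERROR_MATCHLIMIT in this sketch."""


def count_match_calls(subject: str, match_limit: int):
    """Match ^(a+)+b against subject; return (matched, number_of_calls)."""
    calls = 0

    def match(pos: int) -> bool:
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise MatchLimitExceeded(f"gave up after {match_limit} calls")
        # Find how far the run of "a" characters extends from pos.
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # One iteration of (a+): try every run length, longest first.
        for stop in range(end, pos, -1):
            if match(stop):                      # outer + repeats the group
                return True
            if stop < len(subject) and subject[stop] == "b":
                return True                      # pattern continues with b
        return False

    return match(0), calls


if __name__ == "__main__":
    print(count_match_calls("aaab", 1000))
    print(count_match_calls("a" * 10 + "c", 10_000_000))
    try:
        count_match_calls("a" * 25 + "c", 100_000)
    except MatchLimitExceeded as exc:
        print("limit hit:", exc)
```

On a subject that cannot match, each extra "a" doubles the number of calls in this toy matcher. That is why a limit such as pcre.backtrack_limit ends the attempt early instead of letting it run.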
    350 
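The iconv notes describe two behaviours: with //IGNORE, unrepresentable characters are silently dropped; without it, the string is cut at the first illegal character. The Python sketch below mimics both using the standard codecs instead of libiconv, so treat it as an analogy only. `convert_ignore` and `convert_cut` are hypothetical helper names.

```python
# Analogy for iconv's //IGNORE versus cut-at-first-illegal-character,
# built on Python's own codecs. This does not call libiconv.

def convert_ignore(text: str, target: str) -> bytes:
    """Drop characters the target charset cannot represent (like //IGNORE)."""
    return text.encode(target, errors="ignore")


def convert_cut(text: str, target: str) -> bytes:
    """Keep only the part before the first unrepresentable character."""
    try:
        return text.encode(target)
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that failed.
        return text[:exc.start].encode(target)


if __name__ == "__main__":
    sample = "caf\u00e9 \u2603 ok"   # U+2603 (snowman) is not in Latin-1
    print(convert_ignore(sample, "latin-1"))
    print(convert_cut(sample, "latin-1"))
```

The first call keeps the text on both sides of the snowman; the second stops just before it.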
    351