19 | | {{{ |
20 | | PCRE & UTF |
25 | | |
26 | | |
27 | | @link http://www.pcre.org/pcre.txt @author Philip Hazel - University of Cambridge |
28 | | UTF-8 AND UNICODE PROPERTY SUPPORT |
29 | | |
30 | | From release 3.3, PCRE has had some support for character strings encoded in the UTF-8 format. For release 4.0 |
31 | | this was greatly extended to cover most common requirements, and in release 5.0 additional support for Unicode |
32 | | general category properties was added. |
33 | | |
34 | | In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, |
35 | | you must call pcre_compile() with the PCRE_UTF8 option flag. When you do this, both the pattern and any subject |
36 | | strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes. |
37 | | |
38 | | If you compile PCRE with UTF-8 support, but do not use it at run time, the library will be a bit bigger, but the |
39 | | additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big. |
40 | | |
41 | | If you are using PCRE in a non-UTF application that permits users to supply arbitrary patterns for compilation, you |
42 | | should be aware of a feature that allows users to turn on UTF support from within a pattern, provided that PCRE was |
43 | | built with UTF support. For example, an 8-bit pattern that begins with "(*UTF8)" or "(*UTF)" turns on UTF-8 mode, |
44 | | which interprets patterns and subjects as strings of UTF-8 characters instead of individual 8-bit characters. This |
45 | | causes both the pattern and any data against which it is matched to be checked for UTF-8 validity. If the data string |
46 | | is very long, such a check might use sufficiently many resources as to cause your application to lose performance. |
47 | | |
48 | | Alternatively, from release 8.33, you can set the PCRE_NEVER_UTF option at compile time. This |
49 | | causes a compile-time error if a pattern contains a UTF-setting sequence. |
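
For applications that cannot set PCRE_NEVER_UTF, user-supplied patterns could be screened before compilation. The following is a minimal Python sketch, not part of PCRE: the function name requests_utf_mode is hypothetical, and the assumption that leading items always look like (*NAME) or (*NAME=number) is a simplification of PCRE's real parsing.

```python
import re

# Assumed shape of a start-of-pattern item: (*NAME) or (*NAME=123).
_LEADING_ITEM = re.compile(r"\(\*([A-Z_0-9]+)(?:=\d+)?\)")
_UTF_ITEMS = {"UTF", "UTF8", "UTF16", "UTF32"}


def requests_utf_mode(pattern):
    """Return True if the pattern starts with a UTF-setting sequence.

    Leading items such as (*LIMIT_MATCH=10) are skipped, because a
    UTF-setting sequence may follow them at the start of the pattern.
    """
    pos = 0
    while True:
        item = _LEADING_ITEM.match(pattern, pos)
        if item is None:
            return False          # no more leading items
        if item.group(1) in _UTF_ITEMS:
            return True
        pos = item.end()          # look at the next leading item
```

A caller could reject or log any pattern for which this returns True before handing it to the matching engine.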
50 | | |
51 | | In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in the code, and, in addition, you |
52 | | must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8). When |
53 | | either of these is the case, both the pattern and any subject strings that are matched against it are treated as |
54 | | UTF-8 strings instead of strings of 1-byte characters. |
55 | | |
56 | | |
57 | | VALIDITY OF UTF-8 STRINGS |
58 | | |
59 | | When you set the PCRE_UTF8 flag, the byte strings passed as patterns and subjects are (by default) checked for |
60 | | validity on entry to the relevant functions. The entire string is checked before any other processing takes |
61 | | place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are themselves derived from |
62 | | the Unicode specification. Earlier releases of PCRE followed the rules of RFC 2279, which allows the full range |
63 | | of 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 to U+10FFFF, excluding |
64 | | the surrogate area. (From release 8.33 the so-called "non-character" code points are no longer excluded because |
65 | | Unicode corrigendum #9 makes it clear that they should not be.) |
66 | | |
67 | | Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, where they are used in pairs to |
68 | | encode codepoints with values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs are available |
69 | | independently in the UTF-8 and UTF-32 encodings. (In other words, the whole surrogate thing is a fudge for UTF-16 |
70 | | which unfortunately messes up UTF-8 and UTF-32.) |
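
The pairing described above can be worked through numerically. This is a small Python sketch of the standard UTF-16 surrogate arithmetic; the function name to_surrogate_pair is chosen here for illustration only.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points 0x10000..0x10FFFF use a pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits, 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits, 0xDC00..0xDFFF
    return high, low
```

Because every value from 0xD800 to 0xDFFF is consumed by this scheme, those values cannot stand for characters of their own, which is why a UTF-8 validity check excludes them.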
71 | | |
72 | | If an invalid UTF-8 string is passed to PCRE, an error return is given. |
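
To get a feel for which byte strings pass such a check, Python's strict UTF-8 decoder can serve as a stand-in. This is only an illustration under the assumption that Python's decoder applies the same RFC 3629 limits (U+0 to U+10FFFF, no surrogates, no overlong forms); it is not PCRE's own validity routine.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")      # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "plain text": "h\u00e9llo".encode("utf-8"),
    "surrogate U+D800": b"\xed\xa0\x80",
    "above U+10FFFF": b"\xf4\x90\x80\x80",
    "overlong slash": b"\xc0\xaf",
    "non-character U+FFFF": b"\xef\xbf\xbf",
}
for label, raw in samples.items():
    print(label, is_valid_utf8(raw))
```

Only the first and last samples are accepted, which matches the rules listed above: non-character code points are allowed, surrogates and out-of-range values are not.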
73 | | |
74 | | |
75 | | |
76 | | |
77 | | |
78 | | |
79 | | |
80 | | PCRE CHANGELOG |
85 | | // Release 8.33 28-May-2013 |
86 | | |
87 | | Version 8.33 28-May-2013 |
88 | | --------------------- |
89 | | 00. (*LIMIT_MATCH=d), (*LIMIT_RECURSION=d) added so the pattern can specify lower limits for the matching process. |
90 | | 35. Implement PCRE_NEVER_UTF to lock out the use of UTF, in particular, blocking (*UTF) etc. |
91 | | |
92 | | Version 8.32 30-November-2012 |
93 | | --------------------- |
94 | | 14. Applied user-supplied patch to pcrecpp.cc to allow PCRE_NO_UTF8_CHECK to be set |
95 | | 24. Add support for 32-bit character strings, and UTF-32 |
96 | | 25. (*UTF) can now be used to start a pattern in any of the three libraries. |
97 | | 30. In 8-bit UTF-8 mode, pcretest failed to give an error for data codepoints greater than 0x7fffffff (which cannot be |
98 | | represented in UTF-8, even under the "old" RFC 2279). Instead, it ended up passing a negative length to pcre_exec() |
99 | | |
100 | | Version 7.9 11-Apr-09 |
101 | | --------------------- |
102 | | 28. Added support for (*UTF8) at the start of a pattern. |
103 | | |
104 | | Version 7.3 28-Aug-07 |
105 | | --------------------- |
106 | | 15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629. |
107 | | This restricts code points to be within the range 0 to 0x10FFFF, excluding |
108 | | the "low surrogate" sequence 0xD800 to 0xDFFF. Previously, PCRE allowed the |
109 | | full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still |
110 | | does: it's just the validity check that is more restrictive. |
111 | | |
112 | | Version 4.4 21-Aug-03 |
113 | | --------------------- |
114 | | 15. PCRE checks UTF-8 strings for validity by default. There is an option to suppress |
115 | | this, just in case anybody wants that teeny extra bit of performance. |
116 | | |
117 | | |
118 | | Version 4.4 13-Aug-03 |
119 | | --------------------- |
120 | | 10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at |
121 | | both compile and run time, and gives an error if an invalid UTF-8 sequence |
122 | | is found. There is an option for disabling this check in cases where the |
123 | | string is known to be correct and/or the maximum performance is wanted. |
124 | | |
125 | | Version 3.3 01-Aug-00 |
126 | | --------------------- |
127 | | 7. Added the beginnings of support for UTF-8 character strings. |
128 | | |
129 | | |
130 | | |
131 | | |
132 | | |
133 | | PCRE PHP.INI CONFIGURATION OPTIONS |
134 | | |
135 | | @link http://php.net/manual/en/pcre.configuration.php "PCRE Configuration Options" |
136 | | |
137 | | Two PCRE INI options have been available since PHP 5.2.0: |
138 | | |
139 | | pcre.backtrack_limit 1000000 |
140 | | PCRE's backtracking limit. Defaults to 100000 for PHP < 5.3.7. |
141 | | |
142 | | pcre.recursion_limit 100000 |
143 | | PCRE's recursion limit. Please note that if you set this value too high you may consume all the available |
144 | | process stack and eventually crash PHP (due to reaching the stack size limit imposed by the OS). |
145 | | |
146 | | |
147 | | |
148 | | |
149 | | |
150 | | PCRE CRASHES FROM REGEXES |
151 | | |
152 | | // Release 8.33 28-May-2013 |
153 | | // (*LIMIT_MATCH=d) and (*LIMIT_RECURSION=d) have been added so that the creator of a pattern can specify lower (but not higher) limits for the matching process. |
154 | | |
155 | | |
156 | | PCRE_EXTRA_MATCH_LIMIT can be accessed through the set_match_limit() |
157 | | and match_limit() member functions. Setting match_limit to a non-zero value will limit the execution of |
158 | | pcre to keep it from doing bad things like blowing the stack or taking an eternity to return a result. A value |
159 | | of 5000 is good enough to stop stack blowup in a 2MB thread stack. Setting match_limit to zero disables match |
160 | | limiting. Alternatively, you can call match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit |
161 | | how much PCRE recurses. match_limit() limits the number of matches PCRE does; match_limit_recursion() limits the |
162 | | depth of internal recursion, and therefore the amount of stack that is used. |
163 | | |
164 | | The match_limit field provides a means of preventing PCRE from using up a vast amount of resources when running |
165 | | patterns that are not going to match, but which have a very large number of possibilities in their search trees. The |
166 | | classic example is the use of nested unlimited repeats. |
167 | | |
168 | | Internally, PCRE uses a function called match() which it calls repeatedly (sometimes recursively). The limit set |
169 | | by match_limit is imposed on the number of times this function is called during a match, which has the effect of |
170 | | limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts |
171 | | from zero for each position in the subject string. |
172 | | |
173 | | The default value for the limit can be set when PCRE is built; the default default is 10 million, which handles all |
174 | | but the most extreme cases. You can override the default by supplying pcre_exec() with a pcre_extra block in which |
175 | | match_limit is set, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() |
176 | | returns PCRE_ERROR_MATCHLIMIT. |
177 | | |
178 | | The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times |
179 | | that match() is called, it limits the depth of recursion. The recursion depth is a smaller number than the total |
180 | | number of calls, because not all calls to match() are recursive. This limit is of use only if it is set smaller |
181 | | than match_limit. |
182 | | |
183 | | Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use |
184 | | memory on the heap instead of the stack, the amount of heap memory that can be used. |
185 | | |
186 | | The default value for match_limit_recursion can be set when PCRE is built; the default default is the same value |
187 | | as the default for match_limit. You can override the default by supplying pcre_exec() with a pcre_extra block in |
188 | | which match_limit_recursion is set, and PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit |
189 | | is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
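
The effect of a call-count limit on nested unlimited repeats can be seen in a toy model. The Python sketch below is not PCRE's match() function: nested_repeat_match is a hypothetical hand-written backtracking matcher for the single pattern ^(a+)+b$, with a counter that plays the role of match_limit.

```python
def nested_repeat_match(subject, match_limit):
    """Toy backtracking matcher for ^(a+)+b$ with a call-count limit.

    Returns (result, calls). result is True or False for a completed
    match attempt, or None if the limit was exceeded first.
    """
    calls = 0

    class LimitExceeded(Exception):
        pass

    def match(pos):
        nonlocal calls
        calls += 1
        if calls > match_limit:
            raise LimitExceeded()
        # Find how far a run of 'a' extends from pos.
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Try every length for this (a+) group, longest first.
        for stop in range(end, pos, -1):
            if subject[stop:] == "b":      # pattern can finish here
                return True
            if match(stop):                # or another group follows
                return True
        return False

    try:
        return match(0), calls
    except LimitExceeded:
        return None, calls


print(nested_repeat_match("aaab", 1000))      # matches on the first call
print(nested_repeat_match("a" * 4, 1000))     # fails after 2**4 calls
print(nested_repeat_match("a" * 30, 1000))    # stopped by the limit
```

In this model a failing subject of n letters costs 2**n calls, so each extra character doubles the work; the limit turns that into a bounded failure, much as PCRE_ERROR_MATCHLIMIT does for pcre_exec().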
190 | | |
191 | | |
192 | | |
193 | | |
194 | | |
195 | | preg_match() |
200 | | |
201 | | preg_match() returns 1 if the pattern matches the given subject, 0 if it does not, or FALSE if an error occurred. |
202 | | |
203 | | u (PCRE_UTF8) This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern and |
204 | | subject strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP |
205 | | 4.2.3 on win32. UTF-8 validity of the pattern and the subject is checked since PHP 4.3.5. An invalid subject will |
206 | | cause the preg_* function to match nothing; an invalid pattern will trigger an error of level E_WARNING. Five and |
207 | | six octet UTF-8 sequences are regarded as invalid since PHP 5.3.4 (resp. PCRE 7.3 2007-08-28); formerly those have |
208 | | been regarded as valid UTF-8. |
209 | | |
210 | | With the PCRE_UTF8 modifier 'u', preg_match() fails silently on strings containing invalid UTF-8 byte sequences. It |
211 | | does not reject character codes above U+10FFFF (represented by 4 or more octets), though. |
212 | | |
213 | | Originally, this function checked according to RFC 2279, allowing for values in the range 0 to 0x7fffffff, up to 6 |
214 | | bytes long, but ensuring that they were in the canonical format. Once somebody had pointed out RFC 3629 to me (it |
215 | | obsoletes 2279), additional restrictions were applied. The values are now limited to be between 0 and 0x0010ffff, |
216 | | no more than 4 bytes long, and the subrange 0xd800 to 0xdfff is excluded. However, the format of 5-byte and 6-byte |
217 | | characters is still checked. |
218 | | |
219 | | |
220 | | |
221 | | BACKTRACKING CONTROL |
222 | | |
223 | | The following are recognized only at the start of a pattern: |
224 | | |
225 | | (*LIMIT_MATCH=d) set the match limit to d (decimal number) ( added 8.33 28-May-2013 ) |
226 | | (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) ( added 8.33 28-May-2013 ) |
227 | | |
228 | | (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) ( added 7.9 11-Apr-09 ) |
229 | | (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) ( added 8.30 ) |
230 | | (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) ( added 8.32 30-November-2012 ) |
231 | | (*UTF) set appropriate UTF mode for the library in use ( added 8.32 30-November-2012 ) |
232 | | |
233 | | In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF support, and, in addition, you |
234 | | must call pcre_compile() with the PCRE_UTF8 option flag, or the pattern must start with the sequence (*UTF8) or |
235 | | (*UTF). When either of these is the case, both the pattern and any subject strings that are matched against it |
236 | | are treated as UTF-8 strings instead of strings of individual 1-byte characters. |
237 | | |
238 | | |
239 | | |
240 | | PCRE UTF ERRORS |
241 | | |
242 | | From release 8.13, more information about the details of the error is passed back in the returned value: |
243 | | |
244 | | PCRE_UTF8_ERR0 No error |
245 | | PCRE_UTF8_ERR1 Missing 1 byte at the end of the string |
246 | | PCRE_UTF8_ERR2 Missing 2 bytes at the end of the string |
247 | | PCRE_UTF8_ERR3 Missing 3 bytes at the end of the string |
248 | | PCRE_UTF8_ERR4 Missing 4 bytes at the end of the string |
249 | | PCRE_UTF8_ERR5 Missing 5 bytes at the end of the string |
250 | | PCRE_UTF8_ERR6 2nd-byte's two top bits are not 0x80 |
251 | | PCRE_UTF8_ERR7 3rd-byte's two top bits are not 0x80 |
252 | | PCRE_UTF8_ERR8 4th-byte's two top bits are not 0x80 |
253 | | PCRE_UTF8_ERR9 5th-byte's two top bits are not 0x80 |
254 | | PCRE_UTF8_ERR10 6th-byte's two top bits are not 0x80 |
255 | | PCRE_UTF8_ERR11 5-byte character is not permitted by RFC 3629 |
256 | | PCRE_UTF8_ERR12 6-byte character is not permitted by RFC 3629 |
257 | | PCRE_UTF8_ERR13 4-byte character with value > 0x10ffff is not permitted |
258 | | PCRE_UTF8_ERR14 3-byte character with value 0xd800-0xdfff is not permitted |
259 | | PCRE_UTF8_ERR15 Overlong 2-byte sequence |
260 | | PCRE_UTF8_ERR16 Overlong 3-byte sequence |
261 | | PCRE_UTF8_ERR17 Overlong 4-byte sequence |
262 | | PCRE_UTF8_ERR18 Overlong 5-byte sequence (won't ever occur) |
263 | | PCRE_UTF8_ERR19 Overlong 6-byte sequence (won't ever occur) |
264 | | PCRE_UTF8_ERR20 Isolated 0x80 byte (not within UTF-8 character) |
265 | | PCRE_UTF8_ERR21 Byte with the illegal value 0xfe or 0xff |
266 | | PCRE_UTF8_ERR22 Unused (was non-character) |
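
How a byte string ends up with one of these codes can be sketched in Python. The function utf8_error_code below is hypothetical: it returns the number from the table above (0 for no error), the order of the checks is an assumption rather than a copy of PCRE's routine, and codes 18, 19 and 22 are never produced. Surrogates are taken to be 0xD800 to 0xDFFF.

```python
def utf8_error_code(data):
    """Return 0 for valid UTF-8, else a code numbered like the table."""
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if c < 0x80:                      # plain ASCII byte
            i += 1
            continue
        if c < 0xC0:                      # continuation byte on its own
            return 20
        if c >= 0xFE:                     # 0xfe and 0xff are never legal
            return 21
        # Continuation bytes announced by the lead byte.
        extra = (1 if c < 0xE0 else 2 if c < 0xF0 else
                 3 if c < 0xF8 else 4 if c < 0xFC else 5)
        missing = extra - (n - i - 1)
        if missing > 0:                   # string ends mid-character
            return missing                # codes 1..5
        for k in range(1, extra + 1):
            if (data[i + k] & 0xC0) != 0x80:
                return 5 + k              # codes 6..10
        d = data[i + 1]
        if extra == 1 and (c & 0x3E) == 0:
            return 15                     # overlong 2-byte sequence
        if extra == 2:
            if c == 0xE0 and (d & 0x20) == 0:
                return 16                 # overlong 3-byte sequence
            if c == 0xED and d >= 0xA0:
                return 14                 # surrogate 0xd800..0xdfff
        if extra == 3:
            if c == 0xF0 and (d & 0x30) == 0:
                return 17                 # overlong 4-byte sequence
            if c > 0xF4 or (c == 0xF4 and d > 0x8F):
                return 13                 # value above 0x10ffff
        if extra == 4:
            return 11                     # 5-byte character
        if extra == 5:
            return 12                     # 6-byte character
        i += extra + 1
    return 0
```

For example, a string that ends after the first two bytes of a three-byte character yields 1, and the bytes ED A0 80 (an encoded surrogate) yield 14.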
267 | | |
268 | | |
269 | | PHP PCRE CONSTANTS |
270 | | |
271 | | PREG_NO_ERROR Returned by preg_last_error() if there were no errors. 5.2.0 |
272 | | PREG_INTERNAL_ERROR Returned by preg_last_error() if there was an internal PCRE error. 5.2.0 |
273 | | PREG_BACKTRACK_LIMIT_ERROR Returned by preg_last_error() if backtrack limit was exhausted. 5.2.0 |
274 | | PREG_RECURSION_LIMIT_ERROR Returned by preg_last_error() if recursion limit was exhausted. 5.2.0 |
275 | | PREG_BAD_UTF8_ERROR Returned by preg_last_error() if the last error was caused by malformed UTF-8 data (only when |
276 | | running a regex in UTF-8 mode). 5.2.0 |
277 | | PREG_BAD_UTF8_OFFSET_ERROR Returned by preg_last_error() if the offset didn't correspond to the beginning of a valid |
278 | | UTF-8 code point (only when running a regex in UTF-8 mode). 5.3.0 |
279 | | PCRE_VERSION PCRE version and release date (e.g. "7.0 18-Dec-2006"). 5.2.4 |
280 | | |
281 | | PCRE CONSTANTS ON MY INSTALL get_defined_constants() |
282 | | |
283 | | 'PREG_PATTERN_ORDER' => 1, |
284 | | 'PREG_SET_ORDER' => 2, |
285 | | 'PREG_OFFSET_CAPTURE' => 256, |
286 | | 'PREG_SPLIT_NO_EMPTY' => 1, |
287 | | 'PREG_SPLIT_DELIM_CAPTURE' => 2, |
288 | | 'PREG_SPLIT_OFFSET_CAPTURE' => 4, |
289 | | 'PREG_GREP_INVERT' => 1, |
290 | | 'PREG_NO_ERROR' => 0, |
291 | | 'PREG_INTERNAL_ERROR' => 1, |
292 | | 'PREG_BACKTRACK_LIMIT_ERROR' => 2, |
293 | | 'PREG_RECURSION_LIMIT_ERROR' => 3, |
294 | | 'PREG_BAD_UTF8_ERROR' => 4, |
295 | | 'PREG_BAD_UTF8_OFFSET_ERROR' => 5, |
296 | | 'PCRE_VERSION' => '8.34 2013-12-15', |
297 | | |
298 | | |
299 | | |
300 | | |
301 | | iconv() |
306 | | |
307 | | https://www.gnu.org/software/libiconv/ |
308 | | |
309 | | If you append the string //IGNORE, characters that cannot be represented in the target charset are silently discarded. |
310 | | Otherwise, str is cut from the first illegal character and an E_NOTICE is generated. ( since GNU libiconv 2002-01-13 ) |
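
The difference between the two behaviours can be illustrated by analogy. The Python sketch below does not call iconv; it uses Python's own codecs, where errors="ignore" drops unrepresentable characters much like //IGNORE, and the default strict mode raises an exception instead of truncating the string. The function name convert is made up for this example.

```python
def convert(text, target_charset, ignore=False):
    """Encode text, optionally discarding unrepresentable characters."""
    errors = "ignore" if ignore else "strict"
    return text.encode(target_charset, errors=errors)


source = "caf\u00e9 \u20ac"                       # the euro sign is not in Latin-1
print(convert(source, "latin-1", ignore=True))    # euro sign silently dropped
try:
    convert(source, "latin-1")
except UnicodeEncodeError as exc:
    print("strict conversion failed at index", exc.start)
```

Silent discarding is convenient but loses data without any signal, so the strict form is usually the safer default when the input is not already known to fit the target charset.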
311 | | |
312 | | In other words, iconv() appears to be intended for use when converting the contents of files - whereas mb_convert_encoding() is intended |
313 | | for use when juggling strings internally, e.g. strings that aren't being read/written to/from files, but exchanged with some other media. |
314 | | |
315 | | ICONV CHARACTER SET ENCODINGS CONTAINING "UTF" |
316 | | |
317 | | $ iconv -l |
318 | | - ISO-10646UTF-8 |
319 | | - ISO-10646UTF8 |
320 | | - UTF-7 |
321 | | - UTF-8 |
322 | | - UTF-16 |
323 | | - UTF-16BE |
324 | | - UTF-16LE |
325 | | - UTF-32 |
326 | | - UTF-32BE |
327 | | - UTF-32LE |
328 | | - UTF7 |
329 | | - UTF8 |
330 | | - UTF16 |
331 | | - UTF16BE |
332 | | - UTF16LE |
333 | | - UTF32 |
334 | | - UTF32BE |
335 | | - UTF32LE |
336 | | |
337 | | If the string //IGNORE is appended to to-encoding, characters that cannot be converted are discarded and an error is printed after conversion. |
338 | | |
339 | | ICONV IMPLEMENTATIONS - ICONV_IMPL CONSTANT |
340 | | |
341 | | @link http://www.gnu.org/software/libc/manual/html_node/Other-iconv-Implementations.html "Some Details about other iconv Implementations" |
342 | | @link http://www.gnu.org/software/libc/manual/html_node/Locales.html "Locales and Internationalization" |
343 | | |
344 | | "libiconv" - GNU libiconv is the native FreeBSD iconv implementation since 2002. |
345 | | "BSD iconv" - Konstantin Chuguev's iconv |
346 | | "glibc" - GNU glibc's iconv |
347 | | "unknown" - Not one of the above |
348 | | }}} |
349 | | |
350 | | |
351 | | |