#25709 closed defect (bug) (fixed)
Searches with stopwords can trigger E_WARNING when PCRE is lacking UTF-8 support
Reported by: | nacin | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.7.1 | Priority: | high |
Severity: | major | Version: | 3.7 |
Component: | Query | Keywords: | has-patch commit |
Focuses: | Cc: |
Description
This is a curious PHP warning:
preg_match() [function.preg-match]: Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-includes/query.php on line 1990
Original report: http://wordpress.org/support/topic/preg_match-functionpreg-match-compilation-failed-queryphp?replies=3
Attachments (2)
Change History (23)
#2
@
11 years ago
I'd say we silence it, certainly for 3.7.1, yes.
$ ack --php "preg_.*u['\"]" src/wp-includes/formatting.php 529: $utf8_pcre = @preg_match( '/^./u', 'a' ); 2230: preg_match_all( '/./u', $text, $words_array ); src/wp-includes/pomo/po.php 138: preg_match_all('/./u', $line, $chars); src/wp-includes/query.php 1990: if ( ! $term || preg_match( '/^\p{L}$/u', $term ) ) src/wp-includes/shortcodes.php 295: $text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text); src/wp-includes/SimplePie/Misc.php 1926: return (bool) preg_match('/^([A-Za-z0-9\-._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!$&\'()*+,;=@]|(%[0-9ABCDEF]{2}))+$/u', $string); src/wp-includes/wp-diff.php 495: $words = preg_split( '/([^\w])/u', $string, -1, PREG_SPLIT_DELIM_CAPTURE );
Also, allowing for an additional character between "u" and the closing quote:
$ ack --php "preg_.*u.['\"]" src/wp-includes/compat.php 30: preg_match_all( '/./us', $str, $match ); src/wp-includes/formatting.php 537: if ( 1 === @preg_match( '/^./us', $string ) ) {
These are tough to match for; there might be others.
pomo/po.php isn't used in core. Other than that, how have we never received a report of this for shortcode_parse_atts() or elsewhere? Something smells funny.
#3
@
11 years ago
Something smells funny.
My gut feeling is that /u
is simply used as a flag to tell PCRE to enable extra UTF8 support, if it's not compiled with UTF8, nothing happens, .
continues to match partial UTf8 characters. However, this warning is triggered when a UTF8-specific character selector (such as \p{L}) is used but is unsupported.
If that's correct, all the other cases are fine, but may not act as expected if there's no UTF8 support, it's just this one will also emit a PHP Warning.
#6
@
11 years ago
I think the warning is about lacking of Unicode properties support which is differrnt from UTF-8 support. Here are some codes might be useful from Kohana framework.
<th>PCRE UTF-8</th> <?php if ( ! @preg_match('/^.$/u', 'ñ')): $failed = TRUE ?> <td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with UTF-8 support.</td> <?php elseif ( ! @preg_match('/^\pL$/u', 'ñ')): $failed = TRUE ?> <td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with Unicode property support.</td> <?php else: ?> <td class="pass">Pass</td> <?php endif ?>
https://github.com/kohana/kohana/blob/3.3/master/install.php#L99
One more thing I need to tell is the regex
/^\p{L}$/u
matches to a single Japanese character, because L property includes Lo property which actually matches to it. So the regex matches a non Japanese (or Chinese) single letter would be like
/^(\p{Ll}|\p{Lu}|\p{Lm}|\p{Lt})$/u
#7
@
11 years ago
So, this check wasn't actually tested with East Asian characters? Maybe we should just convert it to /^[a-z]$/i
for 3.7.1.
#8
follow-up:
↓ 9
@
11 years ago
As far as I remember \X
matches any unicode character. If \p{L}
includes all letters, this should match:
$s = '畑'; var_dump( preg_match( '/^\p{L}$/u', $s ) );
There is definitely something more going on, perhaps excluding only ASCII letters is good for now.
#9
in reply to:
↑ 8
@
11 years ago
perhaps excluding only ASCII letters is good for now.
seems much less problematic.
#10
@
11 years ago
Perhaps a better way to do that in the future would be to match specific unicode character ranges: http://www.ssec.wisc.edu/~tomw/java/unicode.html.
#11
@
11 years ago
if ( ! $term || ( 1 === strlen( $term ) && preg_match( '/^[a-z]$/i', $term ) ) )
Thoughts? Good?
#13
@
11 years ago
Tests well here. This will need more feedback from different translators before it can cover all languages.
#15
@
11 years ago
if ( ! $term || ( 1 === strlen( $term ) && preg_match( '/^[a-z]$/i', $term ) ) )
That doesn't sound right to me, specifically, the strlen() component.
A non-ascii character will always be >1char (unless mbstring overloading is enabled, but lets not go there) so that preg_match is only checking to see if a non-a-z ascii character is present.
#16
follow-up:
↓ 17
@
11 years ago
preg_match is only checking to see if a non-a-z ascii character is present.
Right, the conditional is true only for single letters so they are removed from the separate term matching. strlen()
checks if the term is only one byte long, so preg_match() is not run when not needed.
#17
in reply to:
↑ 16
@
11 years ago
Replying to azaozz:
preg_match is only checking to see if a non-a-z ascii character is present.
Right, the conditional is true only for single letters so they are removed from the separate term matching.
strlen()
checks if the term is only one byte long, so preg_match() is not run when not needed.
Right, I missed the part where it was decided to just ignore single-letter Ascii characters for now.
There are few places where the
u
PCRE modifier is used. Should we silence all like http://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L529?