Make WordPress Core

Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#25709 closed defect (bug) (fixed)

Searches with stopwords can trigger E_WARNING when PCRE is lacking UTF-8 support

Reported by: nacin's profile nacin Owned by: nacin's profile nacin
Milestone: 3.7.1 Priority: high
Severity: major Version: 3.7
Component: Query Keywords: has-patch commit
Focuses: Cc:

Description

This is a curious PHP warning:

preg_match() [function.preg-match]: Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-includes/query.php on line 1990

Original report: http://wordpress.org/support/topic/preg_match-functionpreg-match-compilation-failed-queryphp?replies=3

Attachments (2)

25709.diff (824 bytes) - added by nacin 11 years ago.
25709.2.diff (604 bytes) - added by nacin 11 years ago.

Download all attachments as: .zip

Change History (23)

#1 @azaozz
11 years ago

There are few places where the u PCRE modifier is used. Should we silence all like http://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L529?

#2 @nacin
11 years ago

I'd say we silence it, certainly for 3.7.1, yes.

$ ack --php "preg_.*u['\"]"
src/wp-includes/formatting.php
529:		$utf8_pcre = @preg_match( '/^./u', 'a' );
2230:		preg_match_all( '/./u', $text, $words_array );

src/wp-includes/pomo/po.php
138:			preg_match_all('/./u', $line, $chars);

src/wp-includes/query.php
1990:			if ( ! $term || preg_match( '/^\p{L}$/u', $term ) )

src/wp-includes/shortcodes.php
295:	$text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text);

src/wp-includes/SimplePie/Misc.php
1926:		return (bool) preg_match('/^([A-Za-z0-9\-._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!$&\'()*+,;=@]|(%[0-9ABCDEF]{2}))+$/u', $string);

src/wp-includes/wp-diff.php
495:		$words  = preg_split( '/([^\w])/u', $string, -1, PREG_SPLIT_DELIM_CAPTURE );

Also, allowing for an additional character between "u" and the closing quote:

$ ack --php "preg_.*u.['\"]"
src/wp-includes/compat.php
30:	preg_match_all( '/./us', $str, $match );

src/wp-includes/formatting.php
537:	if ( 1 === @preg_match( '/^./us', $string ) ) {

These are tough to match for; there might be others.

pomo/po.php isn't used in core. Other than that, how have we never received a report of this for shortcode_parse_atts() or elsewhere? Something smells funny.

#3 @dd32
11 years ago

Something smells funny.

My gut feeling is that /u is simply used as a flag to tell PCRE to enable extra UTF8 support, if it's not compiled with UTF8, nothing happens, . continues to match partial UTf8 characters. However, this warning is triggered when a UTF8-specific character selector (such as \p{L}) is used but is unsupported.

If that's correct, all the other cases are fine, but may not act as expected if there's no UTF8 support, it's just this one will also emit a PHP Warning.

#4 @nacin
11 years ago

That sounds exactly it.

#5 @nacin
11 years ago

In 25932:

In search stopwords filtering, suppress possible PHP warnings when PCRE lacks UTF-8 support.

see #25709.

@nacin
11 years ago

#6 @tenpura
11 years ago

I think the warning is about lacking of Unicode properties support which is differrnt from UTF-8 support. Here are some codes might be useful from Kohana framework.

<th>PCRE UTF-8</th>
<?php if ( ! @preg_match('/^.$/u', 'ñ')): $failed = TRUE ?>
	<td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with UTF-8 support.</td>
<?php elseif ( ! @preg_match('/^\pL$/u', 'ñ')): $failed = TRUE ?>
	<td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with Unicode property support.</td>
<?php else: ?>
	<td class="pass">Pass</td>
<?php endif ?>

https://github.com/kohana/kohana/blob/3.3/master/install.php#L99

One more thing I need to tell is the regex

/^\p{L}$/u

matches to a single Japanese character, because L property includes Lo property which actually matches to it. So the regex matches a non Japanese (or Chinese) single letter would be like

/^(\p{Ll}|\p{Lu}|\p{Lm}|\p{Lt})$/u

#7 @nacin
11 years ago

So, this check wasn't actually tested with East Asian characters? Maybe we should just convert it to /^[a-z]$/i for 3.7.1.

#8 follow-up: @azaozz
11 years ago

As far as I remember \X matches any unicode character. If \p{L} includes all letters, this should match:

$s = '畑';

var_dump( preg_match( '/^\p{L}$/u', $s ) );

There is definitely something more going on, perhaps excluding only ASCII letters is good for now.

#9 in reply to: ↑ 8 @tenpura
11 years ago

perhaps excluding only ASCII letters is good for now.

seems much less problematic.

#10 @azaozz
11 years ago

Perhaps a better way to do that in the future would be to match specific unicode character ranges: http://www.ssec.wisc.edu/~tomw/java/unicode.html.

@nacin
11 years ago

#11 @nacin
11 years ago

if ( ! $term || ( 1 === strlen( $term ) && preg_match( '/^[a-z]$/i', $term ) ) )

25709.2.diff

Thoughts? Good?

#12 @nacin
11 years ago

  • Keywords has-patch commit added

Let's get 25709.2.diff evaluated and in.

#13 @azaozz
11 years ago

Tests well here. This will need more feedback from different translators before it can cover all languages.

#14 @tenpura
11 years ago

I've tested and looks good to me too.

#15 @dd32
11 years ago

if ( ! $term
( 1 === strlen( $term ) && preg_match( '/[a-z]$/i', $term ) ) )

That doesn't sound right to me, specifically, the strlen() component.
A non-ascii character will always be >1char (unless mbstring overloading is enabled, but lets not go there) so that preg_match is only checking to see if a non-a-z ascii character is present.

Version 0, edited 11 years ago by dd32 (next)

#16 follow-up: @azaozz
11 years ago

preg_match is only checking to see if a non-a-z ascii character is present.

Right, the conditional is true only for single letters so they are removed from the separate term matching. strlen() checks if the term is only one byte long, so preg_match() is not run when not needed.

#17 in reply to: ↑ 16 @dd32
11 years ago

Replying to azaozz:

preg_match is only checking to see if a non-a-z ascii character is present.

Right, the conditional is true only for single letters so they are removed from the separate term matching. strlen() checks if the term is only one byte long, so preg_match() is not run when not needed.

Right, I missed the part where it was decided to just ignore single-letter Ascii characters for now.

#18 @alex-ye
11 years ago

  • Cc nashwan.doaqan@… added

#19 @nacin
11 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 25954:

Query stopwords: Only eliminate single A-Z letters as search terms.

Stop trying to match any single letters that are not East Asian characters, as this requires PCRE with UTF-8 support; and because it doesn't actually work.

fixes #25709.

#20 @nacin
11 years ago

In 25955:

Query stopwords: Only eliminate single A-Z letters as search terms.

Merges [25954] to the 3.7 branch.

Stop trying to match any single letters that are not East Asian characters, as this requires PCRE with UTF-8 support; and because it doesn't actually work.

fixes #25709.

#21 @nacin
11 years ago

  • Component changed from General to Query
Note: See TracTickets for help on using tickets.