WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#25709 closed defect (bug) (fixed)

Searches with stopwords can trigger E_WARNING when PCRE is lacking UTF-8 support

Reported by: nacin Owned by: nacin
Milestone: 3.7.1 Priority: high
Severity: major Version: 3.7
Component: Query Keywords: has-patch commit
Focuses: Cc:

Description

This is a curious PHP warning:

preg_match() [function.preg-match]: Compilation failed: support for \P, \p, and \X has not been compiled at offset 2 in wp-includes/query.php on line 1990

Original report: http://wordpress.org/support/topic/preg_match-functionpreg-match-compilation-failed-queryphp?replies=3

Attachments (2)

25709.diff (824 bytes) - added by nacin 5 years ago.
25709.2.diff (604 bytes) - added by nacin 5 years ago.

Download all attachments as: .zip

Change History (23)

#1 @azaozz
5 years ago

There are few places where the u PCRE modifier is used. Should we silence all like http://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L529?

#2 @nacin
5 years ago

I'd say we silence it, certainly for 3.7.1, yes.

$ ack --php "preg_.*u['\"]"
src/wp-includes/formatting.php
529:		$utf8_pcre = @preg_match( '/^./u', 'a' );
2230:		preg_match_all( '/./u', $text, $words_array );

src/wp-includes/pomo/po.php
138:			preg_match_all('/./u', $line, $chars);

src/wp-includes/query.php
1990:			if ( ! $term || preg_match( '/^\p{L}$/u', $term ) )

src/wp-includes/shortcodes.php
295:	$text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text);

src/wp-includes/SimplePie/Misc.php
1926:		return (bool) preg_match('/^([A-Za-z0-9\-._~\x{A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E1000}-\x{EFFFD}!$&\'()*+,;=@]|(%[0-9ABCDEF]{2}))+$/u', $string);

src/wp-includes/wp-diff.php
495:		$words  = preg_split( '/([^\w])/u', $string, -1, PREG_SPLIT_DELIM_CAPTURE );

Also, allowing for an additional character between "u" and the closing quote:

$ ack --php "preg_.*u.['\"]"
src/wp-includes/compat.php
30:	preg_match_all( '/./us', $str, $match );

src/wp-includes/formatting.php
537:	if ( 1 === @preg_match( '/^./us', $string ) ) {

These are tough to match for; there might be others.

pomo/po.php isn't used in core. Other than that, how have we never received a report of this for shortcode_parse_atts() or elsewhere? Something smells funny.

#3 @dd32
5 years ago

Something smells funny.

My gut feeling is that /u is simply used as a flag to tell PCRE to enable extra UTF8 support, if it's not compiled with UTF8, nothing happens, . continues to match partial UTf8 characters. However, this warning is triggered when a UTF8-specific character selector (such as \p{L}) is used but is unsupported.

If that's correct, all the other cases are fine, but may not act as expected if there's no UTF8 support, it's just this one will also emit a PHP Warning.

#4 @nacin
5 years ago

That sounds exactly it.

#5 @nacin
5 years ago

In 25932:

In search stopwords filtering, suppress possible PHP warnings when PCRE lacks UTF-8 support.

see #25709.

@nacin
5 years ago

#6 @tenpura
5 years ago

I think the warning is about lacking of Unicode properties support which is differrnt from UTF-8 support. Here are some codes might be useful from Kohana framework.

<th>PCRE UTF-8</th>
<?php if ( ! @preg_match('/^.$/u', 'ñ')): $failed = TRUE ?>
	<td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with UTF-8 support.</td>
<?php elseif ( ! @preg_match('/^\pL$/u', 'ñ')): $failed = TRUE ?>
	<td class="fail"><a href="http://php.net/pcre">PCRE</a> has not been compiled with Unicode property support.</td>
<?php else: ?>
	<td class="pass">Pass</td>
<?php endif ?>

https://github.com/kohana/kohana/blob/3.3/master/install.php#L99

One more thing I need to tell is the regex

/^\p{L}$/u

matches to a single Japanese character, because L property includes Lo property which actually matches to it. So the regex matches a non Japanese (or Chinese) single letter would be like

/^(\p{Ll}|\p{Lu}|\p{Lm}|\p{Lt})$/u

#7 @nacin
5 years ago

So, this check wasn't actually tested with East Asian characters? Maybe we should just convert it to /^[a-z]$/i for 3.7.1.

#8 follow-up: @azaozz
5 years ago

As far as I remember \X matches any unicode character. If \p{L} includes all letters, this should match:

$s = '畑';

var_dump( preg_match( '/^\p{L}$/u', $s ) );

There is definitely something more going on, perhaps excluding only ASCII letters is good for now.

#9 in reply to: ↑ 8 @tenpura
5 years ago

perhaps excluding only ASCII letters is good for now.

seems much less problematic.

#10 @azaozz
5 years ago

Perhaps a better way to do that in the future would be to match specific unicode character ranges: http://www.ssec.wisc.edu/~tomw/java/unicode.html.

@nacin
5 years ago

#11 @nacin
5 years ago

if ( ! $term || ( 1 === strlen( $term ) && preg_match( '/^[a-z]$/i', $term ) ) )

25709.2.diff

Thoughts? Good?

#12 @nacin
5 years ago

  • Keywords has-patch commit added

Let's get 25709.2.diff evaluated and in.

#13 @azaozz
5 years ago

Tests well here. This will need more feedback from different translators before it can cover all languages.

#14 @tenpura
5 years ago

I've tested and looks good to me too.

#15 @dd32
5 years ago

if ( ! $term || ( 1 === strlen( $term ) && preg_match( '/^[a-z]$/i', $term ) ) )

That doesn't sound right to me, specifically, the strlen() component. A non-ascii character will always be >1char (unless mbstring overloading is enabled, but lets not go there) so that preg_match is only checking to see if a non-a-z ascii character is present.

Last edited 5 years ago by SergeyBiryukov (previous) (diff)

#16 follow-up: @azaozz
5 years ago

preg_match is only checking to see if a non-a-z ascii character is present.

Right, the conditional is true only for single letters so they are removed from the separate term matching. strlen() checks if the term is only one byte long, so preg_match() is not run when not needed.

#17 in reply to: ↑ 16 @dd32
5 years ago

Replying to azaozz:

preg_match is only checking to see if a non-a-z ascii character is present.

Right, the conditional is true only for single letters so they are removed from the separate term matching. strlen() checks if the term is only one byte long, so preg_match() is not run when not needed.

Right, I missed the part where it was decided to just ignore single-letter Ascii characters for now.

#18 @alex-ye
5 years ago

  • Cc nashwan.doaqan@… added

#19 @nacin
5 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 25954:

Query stopwords: Only eliminate single A-Z letters as search terms.

Stop trying to match any single letters that are not East Asian characters, as this requires PCRE with UTF-8 support; and because it doesn't actually work.

fixes #25709.

#20 @nacin
5 years ago

In 25955:

Query stopwords: Only eliminate single A-Z letters as search terms.

Merges [25954] to the 3.7 branch.

Stop trying to match any single letters that are not East Asian characters, as this requires PCRE with UTF-8 support; and because it doesn't actually work.

fixes #25709.

#21 @nacin
5 years ago

  • Component changed from General to Query
Note: See TracTickets for help on using tickets.