Opened 5 months ago
Last modified 4 months ago
#63913 new enhancement
WordPress assumes that the UTF-8 PCRE flag is available.
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Milestone: | Future Release | Priority: | low |
| Severity: | normal | Version: | 6.9 |
| Component: | Charset | Keywords: | has-patch |
| Focuses: | Cc: |
Description
There are several places in Core where code assumes that the UTF-8 flag (PCRE_UTF8) is available and performs matching or substitution with no alternative. When this happens, the preg_ functions fail silently. Data corruption ensues in ways which are hard to track to this source.
The _wp_can_use_pcre_u() function indicates support for this flag, but resolving this issue is not as simple as wrapping the preg_ calls in a check. The existing code provides no fallback mechanism and it may be necessary to define such in the absence of support for the flag.
One such example is shortcode_parse_atts().
<?php $text = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text );
In this case, when PCRE_UTF8 is not supported, $text becomes NULL and the function fails to parse any shortcode attributes. This is an example of a function which doesn’t require UTF-8 regex patterns, because the above replacement is trivial to replace.
A more complicated example is get_avatar_data(), which searches based on script properties of characters which has no replacement. In the lack of support there, it may not be possible to provide an equivalent fallback.
<?php if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) { $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' ); } else { $first = mb_substr( $name, 0, 1, 'UTF-8' ); $last = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' ); $initials = $first . $last; }
An incomplete PCRE pattern I used to find sources in the codebase using the /u flag follows. It would be better to build a search based off of a PHP parser, but finding a comprehensive list of places assuming the UTF-8 flag is left as an exercise for future work on this ticket.
('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1
This looks for string literals starting and ending with the same delimiter, followed by a set of PCRE modifiers including the u, terminated by the opening quote. It does not find HEREDOC or NOWDOC patterns and it does not find unmatched delimiters like parentheses or other brackets.
h/t @tusharbharti
Change History (6)
#2
@
5 months ago
I created the scanner and got these:
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/functions.php#L363
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/functions.php#L386
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/shortcodes.php#L616
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/formatting.php#L1296
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/formatting.php#L1304
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/formatting.php#L2222
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/formatting.php#L4236
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/SimplePie/src/Misc.php#L1878
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/class-wp-plugin-dependencies.php#L606
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/class-wp-text-diff-renderer-inline.php#L29
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/link-template.php#L4567
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/pomo/po.php#L158
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/comment.php#L100
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/comment.php#L1418
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/compat.php#L70
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/compat.php#L172
https://github.com/WordPress/wordpress-develop/blob/trunk/src/wp-includes/compat.php#L257
I will also do a manual scan via editor but I think we have atleast consolidated most of the modifier locations
This ticket was mentioned in PR #9724 on WordPress/wordpress-develop by @tusharbharti.
5 months ago
#3
- Keywords has-patch added
This PR adds _wp_can_use_pcre_u guards to all the functions that use pcre_u modifier flag in regex.
Currently WordPress assumes that u flag is available by default but when the pcre_u support isn't present this falls apart and functions like parse_shortcodes_atts like breaks returning NULL.
Trac ticket: https://core.trac.wordpress.org/ticket/63913
#4
@
5 months ago
Right now, While writing this patch, I realized that the only way to solve/add guards is to either try to find a function that supports utf-8 by default or mbstring extension or some other hacks like IntlChar etc.
I believe moving to mbstring extension as standard would help a lot.
cc: @dmsnell
#5
@
5 months ago
If anyone is interested with the scanner, this uses https://github.com/nikic/PHP-Parser
<?php require 'vendor/autoload.php'; use PhpParser\Error; use PhpParser\Node; use PhpParser\NodeTraverser; use PhpParser\NodeVisitorAbstract; use PhpParser\ParserFactory; class RegexVisitor extends NodeVisitorAbstract { private string $file; public function __construct(string $file) { $this->file = $file; } public function enterNode(Node $node): void { // Look for *any* string literal (single, double, HEREDOC, NOWDOC, encapsed) if ($node instanceof Node\Scalar\String_ || $node instanceof Node\Scalar\Encapsed) { $value = $node instanceof Node\Scalar\String_ ? $node->value : implode('', array_map(function ($p) { return $p instanceof Node\Scalar\EncapsedStringPart ? $p->value : ''; }, $node->parts)); // Regex: delimiter, body, delimiter, modifiers if (preg_match( '/^(?P<delim>[^a-zA-Z0-9\s\\\\]) (?P<body>(?:\\\\.|(?!\1).)*) \1(?P<modifiers>[imsxADSUXJu]+)$/x', $value, $m )) { if (strpos($m['modifiers'], 'u') !== false) { printf( "%s:%d: %s\n", $this->file, $node->getStartLine(), $value ); } } } } } $root = __DIR__ . '/src'; // adjust path to WP root $parser = (new ParserFactory)->createForNewestSupportedVersion(); $rii = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root)); foreach ($rii as $file) { if ($file->isDir() || pathinfo($file->getPathname(), PATHINFO_EXTENSION) !== 'php') { continue; } $code = file_get_contents($file->getPathname()); try { $ast = $parser->parse($code); } catch (Error $e) { fwrite(STDERR, "Parse error in {$file}: {$e->getMessage()}\n"); continue; } $traverser = new NodeTraverser(); $traverser->addVisitor(new RegexVisitor($file->getPathname())); $traverser->traverse($ast); }
Regex version
<?php $root = __DIR__ . '/src'; $iterator = new RecursiveIteratorIterator( new RecursiveDirectoryIterator($root) ); $pattern = '/(?P<quote>[\'"])(?P<delim>[^a-zA-Z0-9\s\\\\]) (?P<regex>(?:\\\\.|(?!\2|\1).)*) (?P=delim)(?P<modifiers>[imsxADSUXJu]+)(?P=quote)/x'; foreach ($iterator as $file) { if ($file->isDir()) { continue; } if (pathinfo($file, PATHINFO_EXTENSION) !== 'php') { continue; } $lines = file($file->getPathname()); foreach ($lines as $lineNumber => $line) { if (preg_match_all($pattern, $line, $matches, PREG_SET_ORDER)) { foreach ($matches as $match) { if (strpos($match['modifiers'], 'u') !== false) { printf( "%s:%d: %s\n", $file->getPathname(), $lineNumber + 1, trim($line) ); } } } } }
#6
@
4 months ago
moving to mbstring extension as standard would help a lot.
It absolutely would. There are places WordPress currently crashes if mbstring isn’t available anyway. That issue, however, is discussed in #55603.
the only way to solve/add guards is to either try to find a function that supports utf-8 by default or mbstring extension or some other hacks like IntlChar etc.
Yes, anyone will discover that resolving these is not a simple matter of adding a function_exists() check, because the code calling these functions is likely wrong already and there is no alternative in the absence of these functions. This is why I created this separate tracking issue to take take of the long-term migration of functions assuming the flag.
hi @dmsnell, thanks for mention,
for
We can possibly use
IntlCharclass to detect the script and get the initialsHmm, I will see if I can write the scanner but I can meantime improve the regex.