Context Navigation

← Previous Ticket
Next Ticket →

#63913 new enhancement

WordPress assumes that the UTF-8 PCRE flag is available.

Reported by:	dmsnell	Owned by:
Milestone:	Future Release	Priority:	low
Severity:	normal	Version:	6.9
Component:	Charset	Keywords:	has-patch
Focuses:		Cc:

Description

There are several places in Core where code assumes that the UTF-8 flag (PCRE_UTF8) is available and performs matching or substitution with no alternative. When this happens, the preg_ functions fail silently. Data corruption ensues in ways which are hard to track to this source.

The _wp_can_use_pcre_u() function indicates support for this flag, but resolving this issue is not as simple as wrapping the preg_ calls in a check. The existing code provides no fallback mechanism and it may be necessary to define such in the absence of support for the flag.

One such example is shortcode_parse_atts().

<?php
$text    = preg_replace( "/[\x{00a0}\x{200b}]+/u", ' ', $text );

In this case, when PCRE_UTF8 is not supported, $text becomes NULL and the function fails to parse any shortcode attributes. This is an example of a function which doesn’t require UTF-8 regex patterns, because the above replacement is trivial to replace.

A more complicated example is get_avatar_data(), which searches based on script properties of characters which has no replacement. In the lack of support there, it may not be possible to provide an equivalent fallback.

<?php
if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) {
        $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
} else {
        $first    = mb_substr( $name, 0, 1, 'UTF-8' );
        $last     = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' );
        $initials = $first . $last;
}

An incomplete PCRE pattern I used to find sources in the codebase using the /u flag follows. It would be better to build a search based off of a PHP parser, but finding a comprehensive list of places assuming the UTF-8 flag is left as an exercise for future work on this ticket.

('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1

This looks for string literals starting and ending with the same delimiter, followed by a set of PCRE modifiers including the u, terminated by the opening quote. It does not find HEREDOC or NOWDOC patterns and it does not find unmatched delimiters like parentheses or other brackets.

h/t @tusharbharti

Change History (6)

#1 @tusharbharti
5 months ago

hi @dmsnell, thanks for mention,
for

<?php
if ( preg_match( '/\p{Han}|\p{Hiragana}|\p{Katakana}|\p{Hangul}/u', $name ) || false === strpos( $name, ' ' ) ) {
        $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
} else {
        $first    = mb_substr( $name, 0, 1, 'UTF-8' );
        $last     = mb_substr( $name, strrpos( $name, ' ' ) + 1, 1, 'UTF-8' );
        $initials = $first . $last;
}

We can possibly use IntlChar class to detect the script and get the initials

<?php
$firstChar = mb_substr( $name, 0, 1, 'UTF-8' );
$codepoint = IntlChar::ord( $firstChar );
$block     = IntlChar::getBlockCode( $codepoint );

$cjkBlocks = array(
    IntlChar::BLOCK_CODE_CJK_UNIFIED_IDEOGRAPHS,
    IntlChar::BLOCK_CODE_HANGUL_SYLLABLES,
    IntlChar::BLOCK_CODE_HIRAGANA,
    IntlChar::BLOCK_CODE_KATAKANA,
    );

if ( in_array( $block, $cjkBlocks, true ) ) {
    $initials = mb_substr( $name, 0, min( 2, mb_strlen( $name, 'UTF-8' ) ), 'UTF-8' );
}

An incomplete PCRE pattern I used to find sources in the codebase using the /u flag follows. It would be better to build a search based off of a PHP parser, but finding a comprehensive list of places assuming the UTF-8 flag is left as an exercise for future work on this ticket.

('|")((?!\1)[^_a-zA-Z0-9-])((?!\1).)+\2[idsxumrADSUXJ]*?u[idsxumrADSUXJ]*?\1
This looks for string literals starting and ending with the same delimiter, followed by a set of PCRE modifiers including the u, terminated by the opening quote. It does not find HEREDOC or NOWDOC patterns and it does not find unmatched delimiters like parentheses or other brackets.

Hmm, I will see if I can write the scanner but I can meantime improve the regex.

#2 @tusharbharti
5 months ago

I created the scanner and got these:

I will also do a manual scan via editor but I think we have atleast consolidated most of the modifier locations

Last edited 5 months ago by tusharbharti (previous) (diff)

This ticket was mentioned in PR #9724 on WordPress/wordpress-develop by @tusharbharti.

5 months ago #3

Keywords has-patch added

This PR adds _wp_can_use_pcre_u guards to all the functions that use pcre_u modifier flag in regex.

Currently WordPress assumes that u flag is available by default but when the pcre_u support isn't present this falls apart and functions like parse_shortcodes_atts like breaks returning NULL.

Trac ticket: https://core.trac.wordpress.org/ticket/63913

#4 @tusharbharti
5 months ago

Right now, While writing this patch, I realized that the only way to solve/add guards is to either try to find a function that supports utf-8 by default or mbstring extension or some other hacks like IntlChar etc.

I believe moving to mbstring extension as standard would help a lot.

cc: @dmsnell

#5 @tusharbharti
5 months ago

If anyone is interested with the scanner, this uses https://github.com/nikic/PHP-Parser

<?php
require 'vendor/autoload.php';

use PhpParser\Error;
use PhpParser\Node;
use PhpParser\NodeTraverser;
use PhpParser\NodeVisitorAbstract;
use PhpParser\ParserFactory;

class RegexVisitor extends NodeVisitorAbstract {
    private string $file;

    public function __construct(string $file) {
        $this->file = $file;
    }

    public function enterNode(Node $node): void {
        // Look for *any* string literal (single, double, HEREDOC, NOWDOC, encapsed)
        if ($node instanceof Node\Scalar\String_ || $node instanceof Node\Scalar\Encapsed) {
            $value = $node instanceof Node\Scalar\String_
                ? $node->value
                : implode('', array_map(function ($p) {
                    return $p instanceof Node\Scalar\EncapsedStringPart ? $p->value : '';
                }, $node->parts));

            // Regex: delimiter, body, delimiter, modifiers
            if (preg_match(
                '/^(?P<delim>[^a-zA-Z0-9\s\\\\])
                  (?P<body>(?:\\\\.|(?!\1).)*)
                  \1(?P<modifiers>[imsxADSUXJu]+)$/x',
                $value,
                $m
            )) {
                if (strpos($m['modifiers'], 'u') !== false) {
                    printf(
                        "%s:%d: %s\n",
                        $this->file,
                        $node->getStartLine(),
                        $value
                    );
                }
            }
        }
    }
}

$root = __DIR__ . '/src'; // adjust path to WP root
$parser = (new ParserFactory)->createForNewestSupportedVersion();

$rii = new RecursiveIteratorIterator(new RecursiveDirectoryIterator($root));

foreach ($rii as $file) {
    if ($file->isDir() || pathinfo($file->getPathname(), PATHINFO_EXTENSION) !== 'php') {
        continue;
    }

    $code = file_get_contents($file->getPathname());
    try {
        $ast = $parser->parse($code);
    } catch (Error $e) {
        fwrite(STDERR, "Parse error in {$file}: {$e->getMessage()}\n");
        continue;
    }

    $traverser = new NodeTraverser();
    $traverser->addVisitor(new RegexVisitor($file->getPathname()));
    $traverser->traverse($ast);
}

Regex version

<?php

$root = __DIR__ . '/src';
$iterator = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root)
);

$pattern = '/(?P<quote>[\'"])(?P<delim>[^a-zA-Z0-9\s\\\\])
            (?P<regex>(?:\\\\.|(?!\2|\1).)*)
            (?P=delim)(?P<modifiers>[imsxADSUXJu]+)(?P=quote)/x';

foreach ($iterator as $file) {
    if ($file->isDir()) {
        continue;
    }
    if (pathinfo($file, PATHINFO_EXTENSION) !== 'php') {
        continue;
    }

    $lines = file($file->getPathname());
    foreach ($lines as $lineNumber => $line) {
        if (preg_match_all($pattern, $line, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $match) {
                if (strpos($match['modifiers'], 'u') !== false) {
                    printf(
                        "%s:%d: %s\n",
                        $file->getPathname(),
                        $lineNumber + 1,
                        trim($line)
                    );
                }
            }
        }
    }
}

#6 @dmsnell
4 months ago

moving to mbstring extension as standard would help a lot.

It absolutely would. There are places WordPress currently crashes if mbstring isn’t available anyway. That issue, however, is discussed in #55603.

the only way to solve/add guards is to either try to find a function that supports utf-8 by default or mbstring extension or some other hacks like IntlChar etc.

Yes, anyone will discover that resolving these is not a simple matter of adding a function_exists() check, because the code calling these functions is likely wrong already and there is no alternative in the absence of these functions. This is why I created this separate tracking issue to take take of the long-term migration of functions assuming the flag.

Note: See TracTickets for help on using tickets.

Trac UI Preferences

Download in other formats:

Make WordPress Core

Context Navigation

#63913 new enhancement

WordPress assumes that the UTF-8 PCRE flag is available.

Description

Change History (6)

#1 @tusharbharti 5 months ago

#2 @tusharbharti 5 months ago

This ticket was mentioned in ​PR #9724 on ​WordPress/wordpress-develop by ​@tusharbharti. 5 months ago #3

#4 @tusharbharti 5 months ago

#5 @tusharbharti 5 months ago

#6 @dmsnell 4 months ago