Make WordPress Core

Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#57207 closed enhancement (fixed)

Consider adding the Unicode regex flag in wp_check_comment_disallowed_list

Reported by: bonjour52 Owned by: SergeyBiryukov
Milestone: 6.2 Priority: normal
Severity: normal Version:
Component: Comments Keywords: has-patch
Focuses: Cc:

Description

Hello,

SHORT VERSION:

Consider adding the Unicode regex flag ("u") in wp_check_comment_disallowed_list by replacing the line:

$pattern = "#$word#i";

with:

$pattern = "#$word#iu";

LONG VERSION:

My site is being bombarded with Russian spam. In my site's settings, I have set up a list of blacklisted words used by Russian spammers, such as "установка", and they are correctly stored in disallowed_keys. However, wp_check_comment_disallowed_list does not catch them reliably. In theory, wp_check_comment_disallowed_list should be case-insensitive and should block "Установка", "УСТАНОВКА", etc. In practice, it is not case-insensitive for Russian words unless you explicitly add the Unicode regex flag ("u").

I am certain of this. I have observed Russian spam passing through the filter, and I confirmed the behavior with the following debug test:

DEBUG TEST CODE:

// SimpleLogger() is the logging helper available on my site.
SimpleLogger()->notice( 'Debug: start', array() );
$test = 'установка'; // Comment text, lowercase.
$word = 'Установка'; // Disallowed word, capitalized.
$word = preg_quote( $word, '#' );
// Case-insensitive flag only.
$pattern = "#$word#i";
if ( preg_match( $pattern, $test ) ) {
    SimpleLogger()->notice( 'Debug: i - passed' );
} else {
    SimpleLogger()->notice( 'Debug: i - failed' );
}
// Case-insensitive flag plus Unicode flag.
$pattern = "#$word#iu";
if ( preg_match( $pattern, $test ) ) {
    SimpleLogger()->notice( 'Debug: iu - passed' );
} else {
    SimpleLogger()->notice( 'Debug: iu - failed' );
}
SimpleLogger()->notice( 'Debug: end' );

DEBUG TEST RESULT:

Debug: start
Debug: i - failed
Debug: iu - passed
Debug: end

Attachments (2)

57207.diff (1.5 KB) - added by SergeyBiryukov 2 years ago.
57207.2.diff (3.2 KB) - added by SergeyBiryukov 2 years ago.


Change History (7)

#1 @bonjour52
2 years ago

I found official documentation on what the Unicode regex flag ("u") does for case-insensitive matching. It is not in the PHP documentation, which is very brief on the subject of regex flags (it calls them "PCRE modifiers"), but in the "Perl-compatible Regular Expressions (PCRE)" documentation:

https://pcre.org/pcre.txt

Here is what this documentation says:

"If you want to use caseless matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF-8 support."

and also:

"Case-insensitive matching applies only to characters whose values are less than 128, unless PCRE is built with Unicode property support."

This means that the line:

$pattern = "#$word#i";

works only for ASCII characters (characters whose values are less than 128), while:

$pattern = "#$word#iu";

works for non-ASCII characters as well.
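The difference can be checked in isolation with a short script. This is only a sketch, assuming the file is saved as UTF-8 and that PHP's PCRE has Unicode support (the default in standard builds); the variable names are made up for the illustration.

```php
<?php
// Illustration only: caseless matching with and without the "u" modifier.
$subject = 'установка';                    // Lowercase text, as it might appear in a comment.
$word    = preg_quote( 'Установка', '#' ); // Capitalized entry from the disallowed list.

// Without "u", pattern and subject are handled as raw bytes.
// The two letters below share no byte, and "i" alone does not relate them.
$bytes_upper = bin2hex( 'У' ); // UTF-8 bytes of U+0423.
$bytes_lower = bin2hex( 'у' ); // UTF-8 bytes of U+0443.

$match_i  = preg_match( "#$word#i", $subject );    // Byte-wise, ASCII-only case folding.
$match_iu = preg_match( "#$word#iu", $subject );   // UTF-8 aware case folding.
$ascii_i  = preg_match( '#install#i', 'INSTALL' ); // ASCII works with "i" alone.

var_dump( $bytes_upper, $bytes_lower, $match_i, $match_iu, $ascii_i );
```

The byte dump shows why "i" on its own has nothing to work with: the uppercase and lowercase Cyrillic letters are different two-byte sequences, and only the "u" modifier tells PCRE to decode them as characters before folding case.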


#2 @SergeyBiryukov
2 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 6.2
  • Owner set to SergeyBiryukov
  • Status changed from new to accepted

Hi there, welcome to WordPress Trac! Thanks for the report.

I was able to reproduce the issue and confirm that adding the u (PCRE_UTF8) modifier fixes it.

57207.diff includes a unit test.

#3 @SergeyBiryukov
2 years ago

57207.2.diff includes the same change for check_comment(), which handles the Comment Moderation list.

#4 @SergeyBiryukov
2 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

In 54888:

Comments: Make moderated or disallowed key check case-insensitive for non-Latin words.

The check_comment() and wp_check_comment_disallowed_list() functions are expected to be case-insensitive, but that only worked for words using Latin script and consisting of ASCII characters.

This commit adds the Unicode flag to the regular expression used for the check in these functions, so that both pattern and subject can be treated as UTF-8 strings.
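A simplified sketch of this kind of check is shown below. It is not the core implementation: example_contains_disallowed_word() is a hypothetical helper written for this illustration, and it assumes the list is a newline-separated string and that both the list and the text are UTF-8.

```php
<?php
/**
 * Hypothetical helper for illustration, not a WordPress core function.
 * Returns true if any non-empty line of $list occurs in $text, ignoring case.
 */
function example_contains_disallowed_word( string $text, string $list ): bool {
	foreach ( explode( "\n", $list ) as $word ) {
		$word = trim( $word );
		if ( '' === $word ) {
			continue; // Skip blank lines in the list.
		}

		// Escape the word, then match caselessly ("i") as UTF-8 ("u").
		$pattern = '#' . preg_quote( $word, '#' ) . '#iu';

		// preg_match() returns 1 on a match, 0 on no match, and false on error,
		// for example when the subject is not valid UTF-8 and "u" is set.
		if ( 1 === preg_match( $pattern, $text ) ) {
			return true;
		}
	}

	return false;
}

$list = "установка\nviagra";
var_dump( example_contains_disallowed_word( 'УСТАНОВКА окон', $list ) );
var_dump( example_contains_disallowed_word( 'Hello world', $list ) );
```

One thing to keep in mind with the "u" modifier is that a subject containing invalid UTF-8 makes preg_match() return false rather than 0. The sketch treats that case as "no match" by comparing strictly against 1.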

Reference: PHP Manual: Pattern Modifiers.

Follow-up to [984], [2075], [48121], [48575].

Props bonjour52, SergeyBiryukov.
Fixes #57207.

#5 @bonjour52
2 years ago

Thank you very much, Sergey! Very professionally tested and documented. Thank you, Seryozha!
