#57207 closed enhancement (fixed)
Consider adding the Unicode regex flag in wp_check_comment_disallowed_list
Reported by: | | Owned by: | |
---|---|---|---|
Milestone: | 6.2 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Comments | Keywords: | has-patch |
Focuses: | | Cc: | |
Description
Hello,
SHORT VERSION:
Consider adding the Unicode regex flag ("u") in wp_check_comment_disallowed_list
by replacing line:
$pattern = "#$word#i";
with:
$pattern = "#$word#iu";
LONG VERSION:
My site is being bombarded with Russian spam. In my site's settings, I have set up a list of blacklisted words used by Russian spammers, such as "установка", and they are correctly stored in disallowed_keys. However, wp_check_comment_disallowed_list is performing poorly. In theory, wp_check_comment_disallowed_list should be case-insensitive and should block "Установка", "УСТАНОВКА", etc. In practice, it does not do so for Russian unless you explicitly add the Unicode regex flag ("u").
I am 100% sure of what I am saying. I observed Russian spam passing through the filter, and I also ran the debug test below. I wrote it for myself, so the log messages are in French: "réussi" means the match succeeded and "échoué" means it failed.
DEBUG TEST CODE:
SimpleLogger()->notice( 'Débogage — début', array() );
$test = 'установка';
$word = 'Установка';
$word = preg_quote( $word, '#' );
$pattern = "#$word#i";
if ( preg_match( $pattern, $test ) ) {
    SimpleLogger()->notice( 'Débogage — i - réussi' );
} else {
    SimpleLogger()->notice( 'Débogage — i - échoué' );
}
$pattern = "#$word#iu";
if ( preg_match( $pattern, $test ) ) {
    SimpleLogger()->notice( 'Débogage — iu - réussi' );
} else {
    SimpleLogger()->notice( 'Débogage — iu - échoué' );
}
SimpleLogger()->notice( 'Débogage — fin' );
DEBUG TEST RESULT:
Débogage — début
Débogage — i - échoué
Débogage — iu - réussi
Débogage — fin
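For readers without the SimpleLogger plugin, the same check can be run as a standalone script. This is only a minimal sketch, not code from the ticket's patch or from WordPress: the helper name matches_disallowed_word() is hypothetical, the file is assumed to be saved as UTF-8, and PHP's PCRE is assumed to be built with Unicode support.

```php
<?php
// Hypothetical helper (not a WordPress function): builds the pattern the way
// the ticket describes and reports whether $text matches $word.
function matches_disallowed_word( string $word, string $text, string $flags ): bool {
    $quoted  = preg_quote( $word, '#' );
    $pattern = "#$quoted#$flags";

    // preg_match() returns 1 on a match, 0 on no match, and false on error.
    return 1 === preg_match( $pattern, $text );
}

$word = 'установка';

// Compare the "i" and "iu" flag sets against three case variants.
foreach ( array( 'установка', 'Установка', 'УСТАНОВКА' ) as $text ) {
    printf(
        "%s: i=%s, iu=%s\n",
        $text,
        var_export( matches_disallowed_word( $word, $text, 'i' ), true ),
        var_export( matches_disallowed_word( $word, $text, 'iu' ), true )
    );
}
```

With the "iu" flags all three variants should be reported as matches, which is the behaviour the ticket asks for.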
Attachments (2)
Change History (7)
#2
2 years ago
- Keywords has-patch added
- Milestone changed from Awaiting Review to 6.2
- Owner set to SergeyBiryukov
- Status changed from new to accepted
Hi there, welcome to WordPress Trac! Thanks for the report.
I was able to reproduce the issue and confirm that adding the u (PCRE_UTF8) modifier fixes it.
57207.diff includes a unit test.
#3
2 years ago
57207.2.diff includes the same change for check_comment(), which handles the Comment Moderation list.
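To illustrate how the flag applies when a whole list of words is checked, here is a simplified sketch of a keyword check over newline-separated entries. It is not the WordPress implementation and not the content of either diff: the function name text_hits_disallowed_list() and its parameters are hypothetical, and the real functions inspect more than a single string.

```php
<?php
// Hypothetical, simplified stand-in for a disallowed-list check.
// $list is a newline-separated set of words; $text is the content to test.
function text_hits_disallowed_list( string $list, string $text ): bool {
    foreach ( explode( "\n", $list ) as $word ) {
        $word = trim( $word );

        // Skip empty lines so they do not produce a pattern that matches everything.
        if ( '' === $word ) {
            continue;
        }

        // Escape regex metacharacters, including the "#" delimiter.
        $word = preg_quote( $word, '#' );

        // "i" = caseless, "u" = treat pattern and subject as UTF-8.
        $pattern = "#$word#iu";

        // preg_match() returns false (not 0) on error, for example when the
        // subject is not valid UTF-8, so only an explicit 1 counts as a hit.
        if ( 1 === preg_match( $pattern, $text ) ) {
            return true;
        }
    }

    return false;
}

$list = "установка\nviagra";

var_dump( text_hits_disallowed_list( $list, 'УСТАНОВКА окон' ) ); // bool(true)
var_dump( text_hits_disallowed_list( $list, 'Hello world' ) );    // bool(false)
```

One trade-off in this sketch: with the u modifier, a subject that is not valid UTF-8 makes preg_match() return false, so such text would not match any word in the list.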
I found official documentation on the purpose of the Unicode regex flag ("u") for case-insensitive matching. It is not in the PHP documentation, which is extremely brief on the subject of regex flags (it calls them "PCRE modifiers"), but in the "Perl-compatible Regular Expressions (PCRE)" documentation.
Here is what this documentation says:
"If you want to use caseless matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF-8 support."
and also:
"Case-insensitive matching applies only to characters whose values are less than 128, unless PCRE is built with Unicode property support."
This means that the line:
$pattern = "#$word#i";
works only for ASCII characters (characters whose values are less than 128), while:
$pattern = "#$word#iu";
works for all characters in general.
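The quoted passages can be seen at the byte level. The sketch below is illustrative only and assumes the source file is saved as UTF-8: it shows that uppercase and lowercase "У" are two-byte sequences with no byte in common, and that a pattern without the "u" flag sees those bytes rather than one character.

```php
<?php
// Uppercase and lowercase "У" are two-byte UTF-8 sequences that share no byte,
// so byte-wise caseless matching limited to values below 128 cannot pair them.
echo bin2hex( 'У' ), "\n"; // d0a3
echo bin2hex( 'у' ), "\n"; // d183

// Without "u", "." matches a single byte, so one Cyrillic letter is two units.
var_dump( preg_match( '#^.$#', 'У' ) );  // int(0)
var_dump( preg_match( '#^..$#', 'У' ) ); // int(1)

// With "u", the subject is decoded as UTF-8 and the letter is one character.
var_dump( preg_match( '#^.$#u', 'У' ) ); // int(1)
```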