#11528 closed defect (bug) (fixed)
sanitize_text_field() issue with UTF-8 characters
Reported by: | SergeyBiryukov | Owned by: | |
---|---|---|---|
Milestone: | 2.9.1 | Priority: | normal |
Severity: | major | Version: | 2.9 |
Component: | Formatting | Keywords: | |
Focuses: | | Cc: | |
Description
sanitize_text_field() is the new function in /wp-includes/formatting.php which sanitizes a string from user input or from the database.
The following line of the function is not fully compatible with UTF-8:
$filtered = trim( preg_replace('/\s+/', ' ', $filtered) );
It creates problems with characters like Р (capital Cyrillic Er, U+0420), which is encoded as D0 A0 (hexadecimal) in UTF-8 and becomes D0 20 after the replacement. To reproduce the issue, try to create a category named оРангутанг or САПР. The rest of the word after Р is not displayed, and the slug is incorrect too. If a title starts with Р, it is not displayed at all.
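The byte-level effect can be sketched as follows. This is only an illustration: whether the byte-oriented \s actually matches 0xA0 depends on the PCRE build and locale, so the sketch assumes an affected system and uses an explicit \xA0 pattern as a stand-in. The variable names are hypothetical.

```php
<?php
// "Р" (U+0420) written as raw bytes so the example does not depend on file encoding.
$title  = "\xD0\xA0";
$before = bin2hex( $title );

// Assumption: this explicit byte pattern stands in for '/\s+/' on a PCRE
// build or locale where the byte 0xA0 is classified as whitespace.
$broken = preg_replace( '/\xA0+/', ' ', $title );
$after  = bin2hex( $broken );

echo $before, ' -> ', $after, "\n"; // d0a0 -> d020
```

The lone D0 byte that remains is not valid UTF-8 on its own, which would explain why the rest of the title is not displayed.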
The problem was reported on the Russian support forums soon after the release. Currently the filter is included in local files to avoid this replacement; however, I think the issue is relevant to other languages using the Cyrillic alphabet.
Change History (12)
#2
15 years ago
Yes, a0 (Unicode code point U+00A0) is the non-breaking space (&nbsp;) character. Can you try replacing that line with:
$filtered = trim( preg_replace('/[\r\n\t ]+/', ' ', $filtered) );
to see if it fixes this.
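A minimal sketch of why the explicit character class is safe for multibyte input: it lists only ASCII bytes, so no byte of a UTF-8 sequence can match. The helper name collapse_ascii_whitespace() is hypothetical and is not the core function.

```php
<?php
// Hypothetical helper mirroring the proposed line; not the actual core function.
function collapse_ascii_whitespace( $text ) {
	// Only CR, LF, tab and space are listed, so no byte >= 0x80 can match.
	return trim( preg_replace( '/[\r\n\t ]+/', ' ', $text ) );
}

// "САПР" as raw UTF-8 bytes, surrounded by mixed whitespace.
$sapr   = "\xD0\xA1\xD0\x90\xD0\x9F\xD0\xA0";
$result = collapse_ascii_whitespace( '  ' . $sapr . " \t\r\n test " );

echo bin2hex( $result ), "\n";
```

The trade-off is that other whitespace characters, such as a real non-breaking space, are no longer collapsed.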
#7
15 years ago
I've added these examples to a set of new unit tests for sanitize_text_field() as well.
#10
15 years ago
- Milestone changed from 2.9.1 to 2.9.2
- Resolution fixed deleted
- Status changed from closed to reopened
To fix this properly, you need to add the UTF-8 modifier to the preg_replace() pattern; otherwise it will always fail. Don't fix things at the wrong end even if you think the results are pleasing. Instead, you should know what is actually broken and what your change does.
About the PCRE-u-Modifier:
u (PCRE8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
Pattern Modifiers in the PHP Manual
To check whether or not a string is a UTF-8 string, the best you can (currently) do with WP core code is to use the seems_utf8() function, which has some deficiencies compared to the UTF-8 standard (RFC). Proper functions are suggested in another ticket/patch located here: #5998 / 5998.2.patch (for reference).
The proper fix would therefore be to keep the \s character class but add the u modifier to it. preg_replace() will set a regex error and return NULL (not a string) if there is a problem with the encoding.
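The u-modifier approach could look roughly like this. It is a sketch under assumptions, not the patch itself: the helper name collapse_whitespace_utf8() and the fallback to an ASCII-only class for malformed input are illustrative choices.

```php
<?php
// Hypothetical helper; the name and the fallback policy are assumptions.
function collapse_whitespace_utf8( $text ) {
	// With the u modifier the subject is decoded as UTF-8, so D0 A0 is one character.
	$result = preg_replace( '/\s+/u', ' ', $text );

	if ( null === $result ) {
		// Malformed UTF-8: preg_last_error() reports PREG_BAD_UTF8_ERROR here.
		// Fall back to an ASCII-only class instead of returning nothing.
		$result = preg_replace( '/[\r\n\t ]+/', ' ', $text );
	}

	return trim( $result );
}

$er  = "\xD0\xA0"; // "Р" as raw UTF-8 bytes
$ok  = collapse_whitespace_utf8( $er . "  \t word " );
$bad = collapse_whitespace_utf8( "\xD0  word" ); // truncated sequence
```

Checking for NULL matters because passing the failed result straight to trim() would silently turn invalid input into an empty string.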
Unfortunately this patch went into 2.9.1 without a proper review, so I will reopen the ticket and I suggest 2.9.2 as the next milestone. This issue is related to charset / encoding.
The same problem is mentioned in the preg_replace() comments section. It turns out A0 is actually the non-breaking space (&nbsp;) character, which is stripped by \s.