Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#11528 closed defect (bug) (fixed)

sanitize_text_field() issue with UTF-8 characters

Reported by: SergeyBiryukov Owned by:
Milestone: 2.9.1 Priority: normal
Severity: major Version: 2.9
Component: Formatting Keywords:
Focuses: Cc:


sanitize_text_field() is a new function in /wp-includes/formatting.php that sanitizes a string from user input or from the database.

The following line of the function is not fully compatible with UTF-8:

$filtered = trim( preg_replace('/\s+/', ' ', $filtered) );

It creates problems with characters like Р (capital Cyrillic R), which is encoded as D0 A0 (hexadecimal) in UTF-8 and becomes D0 20 after the replacement. To reproduce the issue, try to create a category named оРангутанг or САПР. The rest of the word after Р is not displayed, and the slug is incorrect too. If a title starts with Р, it is not displayed at all.
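The byte-level effect can be sketched as follows. Whether \s matches the byte A0 depends on the PCRE build and locale, so this sketch lists \xA0 explicitly in the character class to make the behaviour reproducible. The helper name collapse_whitespace_bytewise() is hypothetical and is not part of WordPress.

```php
<?php
// Hypothetical helper: mimics an environment where \s also matches byte 0xA0.
// There is no "u" modifier, so the pattern operates on single bytes.
function collapse_whitespace_bytewise( string $text ): string {
    return trim( (string) preg_replace( '/[\s\xA0]+/', ' ', $text ) );
}

// "оРа" (the first three letters of оРангутанг) encoded as UTF-8.
$name   = "\xD0\xBE\xD0\xA0\xD0\xB0";
$result = collapse_whitespace_bytewise( $name );

echo bin2hex( $name ), "\n";   // d0bed0a0d0b0
echo bin2hex( $result ), "\n"; // d0bed020d0b0 (second byte of Р became a space)
```

The lone D0 byte left behind is no longer valid UTF-8, which is consistent with the rest of the word not being displayed.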

The problem was reported on the Russian support forums soon after the release. Currently a filter is included in local files to avoid this replacement; however, I think the issue is relevant to other languages using the Cyrillic alphabet.

Change History (12)

#1 @SergeyBiryukov
14 years ago

The same problem is mentioned in the comments section of the preg_replace() manual page. It turns out A0 is actually the &nbsp; (non-breaking space) character, which is matched by \s.

#2 @azaozz
14 years ago

Yes, A0 (Unicode U+00A0) is the &nbsp; character. Can you try replacing that line with:

$filtered = trim( preg_replace('/[\r\n\t ]+/', ' ', $filtered) );

to see if it fixes this.
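A minimal sketch of why the narrower character class is safe for UTF-8 input: it lists only ASCII bytes, so it can never match part of a multi-byte sequence. The function name collapse_ascii_whitespace() is hypothetical and used only for illustration.

```php
<?php
// Hypothetical wrapper around the suggested expression.
// Only CR, LF, tab and the ASCII space are collapsed; bytes >= 0x80 are never touched.
function collapse_ascii_whitespace( string $text ): string {
    return trim( (string) preg_replace( '/[\r\n\t ]+/', ' ', $text ) );
}

// "САПР" encoded as UTF-8; the last two bytes (D0 A0) are the letter Р.
$name = "\xD0\xA1\xD0\x90\xD0\x9F\xD0\xA0";

var_dump( collapse_ascii_whitespace( $name ) === $name );  // bool(true)
var_dump( collapse_ascii_whitespace( " a \t\r\n b " ) );   // string(3) "a b"
```

The trade-off is that a real non-breaking space (C2 A0 in UTF-8) is left in place rather than collapsed.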

#3 @SergeyBiryukov
14 years ago

Yes, every Cyrillic character is displayed correctly with the new expression.

#4 @azaozz
14 years ago

  • Milestone changed from Unassigned to 2.9.1

#6 @azaozz
14 years ago

  • Resolution set to fixed
  • Status changed from new to closed

#7 @westi
14 years ago

I've added these examples to a set of new unit tests for sanitize_text_field() as well.

#8 @SergeyBiryukov
14 years ago

Thanks a lot!

#9 @hakre
14 years ago

Related: #11619

#10 @hakre
14 years ago

  • Milestone changed from 2.9.1 to 2.9.2
  • Resolution fixed deleted
  • Status changed from closed to reopened

To fix this properly, you need to add the UTF-8 modifier to preg_replace; otherwise this will keep failing. Don't fix the wrong end just because the results look pleasing. Instead, you should understand what is actually broken and what your change does.

About the PCRE-u-Modifier:

u (PCRE_UTF8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

Pattern Modifiers in the PHP Manual

To check whether or not a string is UTF-8, the best you can currently do with WP core code is to use the seems_utf8() function, which has some deficiencies compared to the UTF-8 standard (RFC). Proper functions are suggested in another ticket/patch, located here: #5998 / 5998.2.patch (for reference).

The proper check would therefore be to keep the \s character class but add the u modifier to it. preg_replace() will then set a regex error and return NULL if there is a problem with the encoding.
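The approach described here could look roughly like the sketch below. It assumes the caller wants to see the failure, so it passes the NULL that preg_replace() returns on an encoding error straight through. collapse_whitespace_utf8() is a hypothetical name, not a WordPress function.

```php
<?php
// Hypothetical helper: \s with the "u" modifier, so matching works on
// UTF-8 characters instead of single bytes.
function collapse_whitespace_utf8( string $text ): ?string {
    $result = preg_replace( '/\s+/u', ' ', $text );
    if ( null === $result ) {
        // Encoding problem; preg_last_error() tells the caller what went wrong.
        return null;
    }
    return trim( $result );
}

$valid   = "\xD0\xBE\xD0\xA0\xD0\xB0"; // "оРа" in UTF-8
$invalid = "\xD0\x20";                 // truncated sequence, not valid UTF-8

var_dump( collapse_whitespace_utf8( $valid ) === $valid ); // bool(true)
var_dump( collapse_whitespace_utf8( $invalid ) );          // NULL
var_dump( preg_last_error() === PREG_BAD_UTF8_ERROR );     // bool(true)
```

A caller would still have to decide what to do with input that is not valid UTF-8, for example rejecting it or falling back to a byte-safe pattern.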

Unfortunately, this patch went into 2.9.1 without a proper review, so I am reopening the ticket and suggest 2.9.2 as the next milestone. This issue is related to charset / encoding.

#11 @westi
14 years ago

  • Milestone changed from 2.9.2 to 2.9.1
  • Resolution set to fixed
  • Status changed from reopened to closed

Please do not re-open tickets which were closed against a released milestone.

Please open a new ticket instead.

#12 @hakre
14 years ago

There you go: #11738
