#11528 closed defect (bug) (fixed)

sanitize_text_field() issue with UTF-8 characters

Reported by:	SergeyBiryukov	Owned by:
Milestone:	2.9.1	Priority:	normal
Severity:	major	Version:	2.9
Component:	Formatting	Keywords:
Focuses:		Cc:

Description

sanitize_text_field() is a new function in /wp-includes/formatting.php that sanitizes a string from user input or from the database.

The following line of the function is not fully compatible with UTF-8:

$filtered = trim( preg_replace('/\s+/', ' ', $filtered) );

It creates problems with characters like Р (capital Cyrillic R), which is encoded as D0 A0 (hexadecimal) in UTF-8 and becomes D0 20 after the replacement. To reproduce the issue, try to create a category named оРангутанг or САПР. The rest of the word after Р is not displayed, and the slug is incorrect too. If a title starts with Р, it is not displayed at all.

The problem was reported on the Russian support forums soon after the release. Currently a filter is included in local files to avoid this replacement; however, I think the issue is relevant to other languages using the Cyrillic alphabet.
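The byte-level damage described above can be reproduced in isolation. This is a minimal sketch, not the WordPress code: it simulates a byte-oriented whitespace match with str_replace(), because whether \s actually matches 0xA0 depends on the PCRE build and locale. The helper is_valid_utf8() is a hypothetical name introduced only for this illustration.

```php
<?php
// Hypothetical helper: an empty pattern with the u modifier matches any
// valid UTF-8 string and fails (returns false) on malformed input.
function is_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1;
}

// "Р" (U+0420) is two bytes in UTF-8: D0 A0.
$r = "\xD0\xA0";
echo bin2hex($r), "\n";              // d0a0

// Simulate a byte-wise match that treats 0xA0 as whitespace
// and replaces it with an ASCII space (0x20).
$damaged = str_replace("\xA0", ' ', $r);
echo bin2hex($damaged), "\n";        // d020

// The lead byte D0 is now followed by 0x20, so the sequence is malformed.
var_dump(is_valid_utf8($r));         // bool(true)
var_dump(is_valid_utf8($damaged));   // bool(false)
```

The orphaned lead byte is the likely reason the rest of the word disappears when the string is later treated as UTF-8.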

Change History (12)

#1 @SergeyBiryukov
16 years ago

The same problem is mentioned in the comments section of the preg_replace() manual page. It turns out A0 is actually the non-breaking space (&nbsp;) character, which is matched by \s.

#2 @azaozz
16 years ago

Yes, A0 (Unicode U+00A0) is the non-breaking space (&nbsp;) character. Can you try replacing that line with:

$filtered = trim( preg_replace('/[\r\n\t ]+/', ' ', $filtered) );

to see if it fixes this.
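A quick way to check the proposed expression outside WordPress is sketched below. The function name collapse_ascii_whitespace() is hypothetical and only wraps the suggested line; the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical wrapper around the suggested line; not sanitize_text_field() itself.
function collapse_ascii_whitespace(string $text): string {
    // Only CR, LF, tab and the ASCII space are collapsed, so the byte 0xA0
    // inside a multi-byte UTF-8 sequence can never match.
    return trim(preg_replace('/[\r\n\t ]+/', ' ', $text));
}

$name  = "  САПР \t\n оРангутанг ";
$clean = collapse_ascii_whitespace($name);
echo $clean, "\n"; // САПР оРангутанг
```

Note the trade-off: because the class is limited to ASCII whitespace, a real non-breaking space (U+00A0, bytes C2 A0) is left in place rather than collapsed.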

#3 @SergeyBiryukov
16 years ago

Yes, every Cyrillic character is displayed correctly with the new expression.

#4 @azaozz
16 years ago

Milestone changed from Unassigned to 2.9.1

#5 @azaozz
16 years ago

Fixed in [12499], [12501].

#6 @azaozz
16 years ago

Resolution set to fixed
Status changed from new to closed

#7 @westi
16 years ago

I've added these examples to a set of new unit tests for sanitize_text_field() as well.

#8 @SergeyBiryukov
16 years ago

Thanks a lot!

#9 @hakre
16 years ago

Related: #11619

#10 @hakre
16 years ago

Milestone changed from 2.9.1 to 2.9.2
Resolution fixed deleted
Status changed from closed to reopened

To fix this properly, you need to add the UTF-8 modifier to the preg_replace() pattern, otherwise this will always fail. Don't fix it on the wrong end, even if you think the results are pleasing. Instead, you should know what is actually broken and what you are doing.

About the PCRE-u-Modifier:

u (PCRE8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

Pattern Modifiers in the PHP Manual

To check whether or not a string is a UTF-8 string, the best you can (currently) do with WP core code is to use the seems_utf8() function, which has only some deficiencies compared to the UTF-8 standard (RFC). Proper functions are suggested in another ticket/patch located here: #5998 / 5998.2.patch (for reference).

A proper fix would therefore be to keep using the \s character class but add the u modifier to the pattern. preg_replace() will then set a regex error and return NULL if there is a problem with the encoding.

Unfortunately this patch went into 2.9.1 without a proper review, so I will reopen the ticket, and I suggest 2.9.2 as the next milestone. This issue is related to charset / encoding.
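The approach suggested here could look roughly like the sketch below. The function name collapse_whitespace_utf8() is hypothetical, and the decision to hand null back to the caller on malformed input is an assumption made for illustration, not what was committed. According to the PHP manual, preg_replace() returns null on error, and preg_last_error() reports the cause.

```php
<?php
// Hypothetical helper: collapse whitespace in UTF-8 mode and surface encoding errors.
function collapse_whitespace_utf8(string $text): ?string {
    // With the u modifier, subject and pattern are treated as UTF-8,
    // so \s is matched against whole characters rather than single bytes.
    $result = preg_replace('/\s+/u', ' ', $text);
    if ($result === null) {
        // Malformed UTF-8 input: the caller decides how to recover.
        return null;
    }
    return trim($result);
}

var_dump(collapse_whitespace_utf8("оРангутанг \t САПР")); // string "оРангутанг САПР"

// D0 followed by 20 is not a valid UTF-8 sequence.
var_dump(collapse_whitespace_utf8("\xD0\x20"));            // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR);       // bool(true)
```

A caller would still need a policy for the null case, for example falling back to a byte-safe expression or rejecting the input.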

#11 @westi
16 years ago

Milestone changed from 2.9.2 to 2.9.1
Resolution set to fixed
Status changed from reopened to closed

Please do not re-open tickets which were closed against a released milestone.

Please open a new ticket instead.

#12 @hakre
16 years ago

There you go: #11738
