Make WordPress Core

Opened 15 years ago

Closed 10 years ago

Last modified 10 years ago

#11738 closed defect (bug) (duplicate)

sanitize_text_field() issue with UTF-8 characters

Reported by: hakre's profile hakre Owned by: hakre's profile hakre
Milestone: Priority: normal
Severity: normal Version: 2.9.1
Component: Charset Keywords:
Focuses: Cc:

Description

See Description #11528

Suggested patch does not take UTF-8 properly into account. Function has been degraded in the commit(s) [12499], [12501] to not support shift-spaces any longer. Details about preg_replace and UTF-8 as well as a proper suggestion to fix w/o degration here.

Change History (19)

#1 follow-up: @azaozz
15 years ago

PCRE UTF-8 (the "u" modifier) is not supported everywhere. For reference see wp_check_invalid_utf8(). We probably could set a global instead of the static there and use that for the three functions since they are usually called multiple times.

In any case this will need testing in an affected locale/installation.

#2 follow-up: @westi
15 years ago

Also you assume that the string is UTF8 by using the "u" modifier what about people using a different charset?

#3 follow-up: @Denis-de-Bernardy
15 years ago

per the php doc:

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE8)

This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.

So it's available all the time given the current requirements. And we'd even get validity if we bump the requirement to php 4.3.5 or php 4.4.

#4 in reply to: ↑ 2 @hakre
15 years ago

Replying to westi:

Also you assume that the string is UTF8 by using the "u" modifier what about people using a different charset?

I suggested in the other ticket to check the encoding prior to apply the replace operation on it. we have a seems_utf8() function that could do the job and I referenced more code that even does the job properly to validate if input is properly utf8 encoded. For any other encoding, we can handle it as we did: just with a best guess and a preg_replace w/o the u-modifier.

Please do not read that I want to speak against other encodings, but factually it's not taken much care about input encoding in the wp codebase.

#5 in reply to: ↑ 1 @hakre
15 years ago

Replying to azaozz:

PCRE UTF-8 (the "u" modifier) is not supported everywhere. For reference see wp_check_invalid_utf8().
We probably could set a global instead of the static there and use that for the three functions since they are usually called multiple times.

wp_check_invalid_utf8() has it's own problems, I know. Please see the PHP documentation I pointed to in this tickets description and Denis did in a comment right in this ticket regarding a clear statement since which version the u-modifier is actually available. It is matching with WordPress current system requirements, so that function can benefit from a refactoring anyway.

Setting a static and/or global does not help since on each function call the input might have a different encoding. We have functions that are working independently from php extenstions like seems_utf8() for example. In another patch I offer a fallback save implementation as is_valid_utf8() that does the job in any case even if the preg functions do not support any u-modifier. Something the current code is currently missing. Please see that code, look for is_valid_utf8(). You can find additional documentation on my codex page regarding utf8 and php.

In any case this will need testing in an affected locale/installation.

Afaik the current implementation fails with shift-spaces. In the other ticket there is the test-case this function needs to cope with, those russian letters in UTF8. Prior to commit of the last patch that was the only thing "tested" against. No further review of the patch nor further tests.

#6 @hakre
15 years ago

  • Component changed from General to Formatting
  • Severity changed from normal to major

#7 @Denis-de-Bernardy
15 years ago

  • Component changed from Formatting to Charset
  • Owner set to hakre

#8 in reply to: ↑ 3 @azaozz
15 years ago

Replying to Denis-de-Bernardy:

per the php doc:

http://php.net/manual/en/reference.pcre.pattern.modifiers.php

u (PCRE8)

...
So it's available all the time given the current requirements. And we'd even get validity if we bump the requirement to php 4.3.5 or php 4.4.

It seems it should be but as far as I remember we had some issues with it in the past. Also PCRE (the library) can be build without UTF-8 support, in fact it seems it's disabled by default:

"If you want to make use of the support for UTF-8 Unicode character strings in
PCRE, you must add --enable-utf8 to the "configure" command. Without it, the
code for handling UTF-8 is not included in the library. Even when included,
it still has to be enabled by an option at run time. When PCRE is compiled
with this option, its input can only either be ASCII or UTF-8..."

http://www.pcre.org/readme.txt

#9 @Denis-de-Bernardy
15 years ago

Fair enough... looks like we're not the only ones who had issues, either:

http://forums.freebsd.org/showthread.php?t=656

#10 follow-up: @Denis-de-Bernardy
15 years ago

That said, I'm curious to understand why the ticket started by a discussion on the /u modifier. The functions from the related patch don't seem to use it:

http://core.trac.wordpress.org/attachment/ticket/5998/5998.2.patch

there is one issue with the regexp from is_valid_utf8_preg5998(), since the latter lacks delimiters and it could arguably start with "(?:" instead of "(" to avoid backtrace issues. But appart

#11 follow-up: @azaozz
15 years ago

  • Milestone changed from 2.9.2 to 3.0
  • Severity changed from major to normal

Replying to hakre:

Setting a static and/or global does not help since on each function call the input might have a different encoding.

Don't think so. This filters text coming either from the browser (wp-admin) or the db. In both cases we set the encoding according to get_option('blog_charset'). If the encoding doesn't match, that would mean the text is not coming from a "proper" place and it would probably fail the wp_check_invalid_utf8() test which is run on the same string.

We have functions that are working independently from php extenstions like seems_utf8() for example. In another patch I offer a fallback save implementation as is_valid_utf8() that does the job in any case even if the preg functions do not support any u-modifier.

So in some functions we would insist on using PCRE UTF-8 while in other we would bypass it and use a fallback?

In the other ticket there is the test-case this function needs to cope with, those russian letters in UTF8. Prior to commit of the last patch that was the only thing "tested" against. No further review of the patch nor further tests.

No, I've tested it with several UTF-8 locales, mostly Asian languages and since it was a regression the choices were either to fix it reliably or remove it. Reported cases: #11528, #11669, #11619.

Another possible inconsistency: seems that "\s" can match slightly different characters on different systems or versions of PCRE.

In that terms forcing \s with the "u" modifier doesn't look right. If the blog charset is UTF-8 we can detect whether PCRE supports it (like in wp_check_invalid_utf8) and use it but will need to have a fallback that matches specific characters as it is currently.

Having a global to store this and replace the static currently used in wp_check_invalid_utf8() seems the proper approach. Don't see why we should run the test whether PCRE UTF-8 is available every time one of these functions is called.

#12 in reply to: ↑ 11 @hakre
15 years ago

Replying to azaozz:

It seems it should be but as far as I remember we had some issues with it in the past.

Pleae provide references to them, this might be related to PHP version < 4.2.

Also PCRE (the library) can be build without UTF-8 support, in fact it seems it's disabled by default:

"If you want to make use of the support for UTF-8 Unicode character strings in
PCRE, you must add --enable-utf8 to the "configure" command. (...)

http://www.pcre.org/readme.txt

PHP preg_ functions are not the original PCRE. Please refer to PHP and it's documentation and not to PERL and it's documentation regarding functionality incl. modifiers, especially the /u modifier. There are differences between PCRE in Perl and the PCRE bundeled with PHP.

---

Replying to azaozz:

Replying to hakre:

We have functions that are working independently from php extenstions like seems_utf8() for example. In another patch I offer a fallback save implementation as is_valid_utf8() that does the job in any case even if the preg functions do not support any u-modifier.

So in some functions we would insist on using PCRE UTF-8 while in other we would bypass it and use a fallback?

Just wanted to give an example that it's possible to write regex that can handle utf-8 validation w/o the /u modifier properly (even for shift-space and the like). For this case here I foremost thought the /u modifier does make the pattern more simple to read. It can be done w/o the /u modifier properly as well.

In the other ticket there is the test-case this function needs to cope with, those russian letters in UTF8. Prior to commit of the last patch that was the only thing "tested" against. No further review of the patch nor further tests.

No, I've tested it with several UTF-8 locales, mostly Asian languages and since it was a regression the choices were either to fix it reliably or remove it. Reported cases: #11528, #11669, #11619.

Thats good to know, can you please reference / attach those tests so they can be used to test an alternative approach to #11528 / [12499].

Another possible inconsistency: seems that "\s" can match slightly different characters on different systems or versions of PCRE.

What \s matches is describben here. \s is matching any whitespace character (which includes xA0 / 160 that's why we have the bugreports in the other tickets).

In that terms forcing \s with the "u" modifier doesn't look right. If the blog charset is UTF-8 we can detect whether PCRE supports it (like in wp_check_invalid_utf8) and use it but will need to have a fallback that matches specific characters as it is currently.

\s with the /u modifier looks pretty right. Whitespace UTF-8 save. Done.

Well with PHP 4.3 /u modifier is there - anytime (if not PHP was compiled --without-pcre-regex but then the ascii 7bit centric "fix" won't work either). But as I've already written, there is no need for /u, it can be propperly handeled without /u but also taking care of valid UTF8 multibyte sequences to remove whitespaces.

Having a global to store this and replace the static currently used in wp_check_invalid_utf8() seems the proper approach. Don't see why we should run the test whether PCRE UTF-8 is available every time one of these functions is called.

That's right, I've been a bit shortsighted here. A variable can save a decision for the length of a request, true. The options I have here are (no specific order):

  • native PHP code
  • preg_ functions (enabled per default since 4.2.0)
  • mb_ functions (non-default extenstion)
  • iconv (non-default extenstion)

A static per function might do the best job since depending on what to do (validating, filtering) there might be different priorities of functions to use and as long this is on a beginning level of implementation I do not want to mess around with globals that much.

#13 @hakre
15 years ago

Normally isspace() matches:

  • space " " / x20 / 32
  • formfeed "\f" / xOC / 12
  • newline "\n" / x0A / 10
  • carriage return "\r" / x0D / 13
  • horizontal tab "\t" / x09 / 9
  • vertical tab "\v" / x0B / 11

This is the same like on the isspace() Linux Manpage.

Within SGML and HTML I could not find a concrete definition of whitspace (saying, xA0 in or not). Since users reported problems with double-byte chars containing xA0 and the degrade to ascii 7-bit did the fix, we must assume that \s w/o the /u modifier conains xA0 as well.

On a system where that was the case, a test should be run with \s and the /u modifier to check wether it helps or not. I did run that on my testbed and the /u modifier does solve the problem (test-code posted on the other ticket).

So IMHO the /u modifier is a good to go at least for a first run.

#14 in reply to: ↑ 10 ; follow-up: @hakre
15 years ago

Replying to Denis-de-Bernardy:

there is one issue with the regexp from is_valid_utf8_preg5998(), since the latter lacks delimiters and it could arguably start with "(?:" instead of "(" to avoid backtrace issues.

There is another issue with it, it just does not properly work because PHP itself reports problems here on some testcases I've been running:

Warning: preg_replace() [function.preg-replace]: Compilation failed: range out of order in character class at offset 6

#15 in reply to: ↑ 14 @hakre
15 years ago

Replying to hakre:

Replying to Denis-de-Bernardy:

there is one issue with the regexp from is_valid_utf8_preg5998(), since the latter lacks delimiters and it could arguably start with "(?:" instead of "(" to avoid backtrace issues.

There is another issue with it, it just does not properly work because PHP itself reports problems here on some testcases I've been running:

Warning: preg_replace() [function.preg-replace]: Compilation failed: range out of order in character class at offset 6

Actually that was a typo I made in the code there. Just ignore that error. That suggested function needs and update though to fix that problem (start of regex was wrongly copied over).

#16 @ryan
14 years ago

  • Milestone changed from 3.0 to 3.1

#17 @nacin
14 years ago

  • Milestone changed from Awaiting Triage to Future Release

#18 @chriscct7
10 years ago

  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #29717.

#19 @DrewAPicture
10 years ago

  • Keywords needs-patch removed
  • Milestone Future Release deleted
Note: See TracTickets for help on using tickets.