#27733 closed defect (bug) (invalid)
wpautop(): \s in regex destroys some UTF-8 characters
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Milestone: | Priority: | normal | |
| Severity: | major | Version: | 0.71 |
| Component: | Formatting | Keywords: | needs-patch needs-unit-tests wpautop |
| Focuses: | Cc: |
Description
\s in preg_replace() incorrectly targets some UTF-8 characters.
Steps to reproduce:
- Create a post with
ム new line
as a content.
- It will be output as
<p>�<br> new line</p>
Quick Test:
$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
echo $pee; // outputs <p>�<br />\n
Solution:
Use [\r\n\t ] rather than \s.
Attachments (1)
Change History (19)
#2
@
12 years ago
- Keywords 4.0-early added
- Milestone changed from Awaiting Review to Future Release
- Version set to 0.71
#5
follow-up:
↓ 8
@
12 years ago
For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm
Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. This problem also will not affect all servers as it is related to the code page referenced by PCRE.
#6
@
12 years ago
- Keywords needs-patch needs-unit-tests added; has-patch removed
As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.
#8
in reply to:
↑ 5
@
11 years ago
Replying to miqrogroove:
Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.
I can still reproduce the issue in the latest trunk.
#11
@
11 years ago
- Severity changed from normal to major
Happened to me on 4.2.2. with character Š (U+0160). This issue prevents further editing of the content/page because the editor does not load any content with invalid characters. Empty editor window is displayed without any errors. This means it is a major issue for me. It is also difficult to troubleshoot as it is locale specific.
I would suggest to fix this by adding explicit UTF-8 pattern modifier for UTF-8 content.
if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
$pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}
#13
@
4 months ago
- Resolution set to wontfix
- Status changed from new to closed
the problem here is probably not really that we have \s but rather that we’re mixing encodings, right?
on a system whose internal encoding is something like latin1 we may get U+00A0 encoded as 0xA0, which is what the PCRE pattern will incorporate as a no-break space. Adding an arbitrarily limited set of space characters appears to resolve this problem because that particular offending byte is no longer caught, but there are a thousand other places different bytes will trip up.
on systems with UTF-8 as their internal encoding, however ,the no-break space will be encoded as 0xC2 0xA0 and the PCRE pattern will look for that. it won’t mangle the ム.
given that UTF-8 is the default internal encoding in PHP and has been for years I’m inclined to close this as it shouldn’t practically be an issue any more. if we wanted to resolve it fully we’d have to check every place we call string-related functions for which encoding is going in and which is set as the default. this is an unfeasible task.
for that I think it would fall nicely as a duplicate of #62172. if we acknowledge that UTF-8 is the only actual supported encoding, this bug cannot appear. it’s really the obligation of whoever is integrating the database, server, PHP code, and plugins to ensure proper harmony between various text encodings.
going to mark as wontfix for now. feel free to re-open if you disagree.
#14
@
4 months ago
given that UTF-8 is the default internal encoding in PHP and has been for years
I think we determined this bug was unrelated to the PHP charset. PCRE relies on the system locale. See "Generic character types" at https://pcre.org/pcre.txt
#16
@
4 months ago
Again, nothing to do with the PHP charset.
For Windows binaries, this behavior changed with v7.0 in 2015.
WordPress currently advertises PHP 7.2.24 and up so I see no concerns.
#17
@
4 months ago
- Resolution changed from wontfix to fixed
Thanks for the investigation @miqrogroove
In my testing I was unable to get 1 === preg_match( '/\s/', "one\xA0two" ) regardless of my LC_CTYPE, LC_ALL, and other charset-related ENV values or php.ini settings. I think we are in agreement that the PCRE functions simply don’t do anything special.
(I was able to get it to match when adding the u flag as long as I updated the bytes to \xC2\xA0 for the proper UTF-8 encoding of the NO-BREAK SPACE).
It would be nice to know for sure that they are operating simply on bytes (regardless of encoding) or on UTF-8 if provided the UTF-8 flag.
This makes me think that WordPress in all of its supported environments will not and cannot create this scenario. Does that sound right? If so, I believe that wontfix is still fine, or maybe invalid would be more appropriate.
Replace All \s with [\r\n\t ]