Opened 11 years ago
Last modified 3 years ago
#27733 new defect (bug)
wpautop(): \s in regex destroys some UTF-8 characters
Reported by: | tenpura | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | major | Version: | 0.71 |
Component: | Formatting | Keywords: | needs-patch needs-unit-tests wpautop |
Focuses: | | Cc: | |
Description
\s in preg_replace() incorrectly matches bytes that belong to some multibyte UTF-8 characters.
Steps to reproduce:
- Create a post with
ム new line
as the content.
- It will be output as
<p>�<br> new line</p>
Quick Test:
$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
echo $pee; // outputs <p>�<br />\n
Solution:
Use [\r\n\t ] rather than \s.
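The proposed replacement can be sketched as follows. This is a minimal illustration, not the attached patch: the function name sketch_newline_to_br() is hypothetical and does not exist in WordPress. It applies the same pattern as the quick test, but with the explicit ASCII class [\r\n\t ] in place of \s, so no byte above 0x7F can be consumed regardless of locale.

```php
<?php
// Hypothetical helper for illustration only; not part of WordPress core.
// Same pattern as the quick test, with \s swapped for an explicit ASCII class.
function sketch_newline_to_br( $pee ) {
	// [\r\n\t ] lists only ASCII whitespace, so the trailing 0xA0 byte
	// of a multibyte character can never be part of the match.
	return preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );
}

echo sketch_newline_to_br( "<p>ム\n" );
```

One trade-off of this approach is that other whitespace characters that \s may match, such as vertical tab and form feed, are no longer stripped before the newline.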
Attachments (1)
Change History (13)
#2
@
11 years ago
- Keywords 4.0-early added
- Milestone changed from Awaiting Review to Future Release
- Version set to 0.71
#5
follow-up:
↓ 8
@
10 years ago
For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm
Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. This problem also will not affect all servers as it is related to the code page referenced by PCRE.
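To make the 0xA0 observation concrete, the sketch below prints the UTF-8 bytes of the two characters mentioned in this ticket and shows that a byte-wise pattern can match inside a multibyte character, while a pattern with the u modifier cannot. The helper name sketch_utf8_bytes() is hypothetical, and the "\u{...}" string escapes assume PHP 7.0 or later.

```php
<?php
// Hypothetical helper for illustration only: returns the bytes of a
// string as space-separated uppercase hex pairs.
function sketch_utf8_bytes( $char ) {
	return strtoupper( implode( ' ', str_split( bin2hex( $char ), 2 ) ) );
}

echo sketch_utf8_bytes( "\u{30E0}" ), "\n"; // ム: E3 83 A0
echo sketch_utf8_bytes( "\u{0160}" ), "\n"; // Š: C5 A0

// Without the u modifier PCRE works on bytes, so the single byte 0xA0
// is found at the end of ム.
var_dump( preg_match( '/\xA0/', "\u{30E0}" ) );

// With the u modifier PCRE works on code points, and U+00A0 is not
// present in the string.
var_dump( preg_match( '/\x{A0}/u', "\u{30E0}" ) );
```

Both characters end in the byte 0xA0, which is the no-break space in single-byte encodings such as ISO-8859-1. That is consistent with the report that the problem depends on the character tables PCRE uses on a given server.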
#6
@
10 years ago
- Keywords needs-patch needs-unit-tests added; has-patch removed
As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.
#8
in reply to:
↑ 5
@
10 years ago
Replying to miqrogroove:
Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.
I can still reproduce the issue in the latest trunk.
#11
@
9 years ago
- Severity changed from normal to major
This happened to me on 4.2.2 with the character Š (U+0160). The issue prevents further editing of the content/page because the editor does not load any content that contains invalid characters. An empty editor window is displayed without any errors, which makes this a major issue for me. It is also difficult to troubleshoot because it is locale-specific.
I would suggest fixing this by adding the explicit UTF-8 pattern modifier (u) for UTF-8 content.
if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
    $pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
    $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}
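A possible variation on the u modifier suggestion, sketched under the assumption that the mbstring extension may not be available: preg_replace() returns null when the subject is not valid UTF-8 and the u modifier is set, so that return value can drive the fallback instead of mb_detect_encoding(). The function name sketch_br_with_u_fallback() is hypothetical, and the fallback branch borrows the explicit [\r\n\t ] class proposed in the ticket description rather than the original \s.

```php
<?php
// Hypothetical helper for illustration only; not part of WordPress core.
function sketch_br_with_u_fallback( $pee ) {
	// First attempt: treat the subject as UTF-8 so that \s is tested
	// against whole code points rather than individual bytes.
	$result = preg_replace( '|(?<!<br />)\s*\n|u', "<br />\n", $pee );

	if ( null === $result ) {
		// preg_replace() returns null on error, for example when the
		// subject is not valid UTF-8. Fall back to a byte-wise pattern
		// that only lists ASCII whitespace.
		$result = preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );
	}

	return $result;
}

echo sketch_br_with_u_fallback( "<p>\u{0160}\n" );
```

Whether the u modifier is acceptable for wpautop() would still need to be weighed against the concerns raised earlier about how each kind of space is parsed and about the JS counterpart.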
Replace All \s with [\r\n\t ]