Opened 4 years ago

Last modified 3 years ago

#27733 new defect (bug)

wpautop(): \s in regex destroys some UTF-8 characters

Reported by: tenpura
Owned by:
Milestone: Future Release
Priority: normal
Severity: major
Version: 0.71
Component: Formatting
Keywords: 4.0-early needs-patch needs-unit-tests wpautop
Focuses:
Cc:

Description

\s in the preg_replace() calls in wpautop() incorrectly matches a byte inside some UTF-8 characters, which corrupts them.

Steps to reproduce:

  1. Create a post with the following content:
    ム
    new line
    
  2. It will be output as:
    <p>�<br>
    new line</p>
    

Quick Test:

$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); 
echo $pee; // outputs <p>�<br />\n

Solution: Use [\r\n\t ] rather than \s.
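
A minimal sketch of that substitution applied to the quick test above. The helper name `autop_br_sketch` is hypothetical and is not a WordPress function. The character is written as the escape `\u{30E0}` (which assumes PHP 7 or later) so the result does not depend on how the file itself is encoded.

```php
<?php
// Hypothetical helper (not part of WordPress): the line-break step from the
// quick test, with \s replaced by an explicit ASCII whitespace class.
function autop_br_sketch( $pee ) {
    // [\r\n\t ] lists single ASCII bytes only, so the 0xA0 byte that ends
    // some UTF-8 sequences cannot be consumed, whatever the locale is.
    return preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );
}

$pee = "<p>\u{30E0}\n"; // U+30E0 is ム, bytes E3 83 A0 in UTF-8
echo autop_br_sketch( $pee ); // the three bytes of ム are left intact
```

Trailing ASCII whitespace before the newline is still stripped, so `"a \t\nb"` becomes `"a<br />\nb"`, and a line that already ends in `<br />` is left alone.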

Attachments (1)

27733.diff (4.1 KB) - added by tenpura 4 years ago.
Replace All \s with [\r\n\t ]

Change History (12)

@tenpura
4 years ago

Replace All \s with [\r\n\t ]

#1 @tenpura
4 years ago

  • Keywords has-patch added

#2 @SergeyBiryukov
4 years ago

  • Keywords 4.0-early added
  • Milestone changed from Awaiting Review to Future Release
  • Version set to 0.71

Confirmed. Added in [13], modified in [106].

#3 @tenpura
4 years ago

Related: #26842

#4 @miqrogroove
4 years ago

See related [28708]

#5 follow-up: @miqrogroove
4 years ago

For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. This problem also will not affect all servers as it is related to the code page referenced by PCRE.
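
For anyone checking other characters, here is a small sketch that shows which ones end in the byte 0xA0 when encoded as UTF-8. The helper name `ends_with_a0` is hypothetical. It only inspects bytes and says nothing about how a given server's PCRE tables classify 0xA0. The `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// Hypothetical helper: true when the last byte of the string is 0xA0.
function ends_with_a0( $char ) {
    return substr( $char, -1 ) === "\xA0";
}

// U+30E0 (ム) is the character from this ticket, U+0160 (Š) is the one
// reported in comment 11, and U+30A2 (ア) is included for contrast.
foreach ( array( "\u{30E0}", "\u{0160}", "\u{30A2}" ) as $char ) {
    // bin2hex() shows the raw UTF-8 bytes of each character.
    printf(
        "%s %s %s\n",
        $char,
        bin2hex( $char ),
        ends_with_a0( $char ) ? 'ends with 0xA0' : 'does not end with 0xA0'
    );
}
```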

#6 @miqrogroove
4 years ago

  • Keywords needs-patch needs-unit-tests added; has-patch removed

As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.

#7 @miqrogroove
4 years ago

  • Keywords wpautop added

#8 in reply to: ↑ 5 @SergeyBiryukov
4 years ago

Replying to miqrogroove:

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.

I can still reproduce the issue in the latest trunk.

#9 @miqrogroove
3 years ago

#28302 was marked as a duplicate.

#10 @miqrogroove
3 years ago

#28937 was marked as a duplicate.

#11 @pavelxk
3 years ago

  • Severity changed from normal to major

This happened to me on 4.2.2 with the character Š (U+0160). The issue prevents further editing of the content/page, because the editor does not load any content that contains invalid characters: an empty editor window is displayed without any errors. That makes it a major issue for me. It is also difficult to troubleshoot, as it is locale-specific.

I would suggest fixing this by adding the explicit UTF-8 pattern modifier (u) for UTF-8 content:

if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
  $pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
  $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}
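
One thing to watch with the u modifier: preg_replace() returns null when the subject is not valid UTF-8, so the result needs checking. Below is a rough sketch of the same idea without mb_detect_encoding(), falling back to the explicit class from the ticket description. The function name is hypothetical, and this is an illustration under those assumptions rather than a tested patch.

```php
<?php
// Hypothetical helper, not a WordPress function.
function autop_br_utf8_sketch( $pee ) {
    // With the u modifier PCRE reads whole UTF-8 characters, so \s cannot
    // match a lone continuation byte such as 0xA0.
    $result = preg_replace( '|(?<!<br />)\s*\n|u', "<br />\n", $pee );

    if ( null === $result ) {
        // preg_replace() returns null on error, for example invalid UTF-8.
        // Fall back to a byte-wise pattern with an explicit ASCII class.
        $result = preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );
    }

    return $result;
}
```

Under the u modifier the set of characters that \s matches may differ from the byte-wise pattern, which ties in with the concern in comment 6 about each kind of space being parsed as expected.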