Make WordPress Core

Opened 11 years ago

Last modified 3 years ago

#27733 new defect (bug)

wpautop(): \s in regex destroys some UTF-8 characters

Reported by: tenpura Owned by:
Milestone: Priority: normal
Severity: major Version: 0.71
Component: Formatting Keywords: needs-patch needs-unit-tests wpautop
Focuses: Cc:

Description

\s in preg_replace() incorrectly matches individual bytes inside some multibyte UTF-8 characters, which corrupts them.

Steps to reproduce:

  1. Create a post with
    ム
    new line
    
    as the content.
  2. It will be output as
    <p>�<br>
    new line</p>
    

Quick Test:

$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); 
echo $pee; // outputs <p>�<br />\n

Solution:
Use [\r\n\t ] rather than \s.
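
A minimal sketch of the proposed change, applied to the quick test above. It assumes the source file is saved as UTF-8 and only illustrates the one pattern from the quick test, not the full attached patch:

```php
<?php
// Sketch only: the same replacement as the quick test, with \s swapped for an
// explicit ASCII whitespace class so that the byte 0xA0 can never match.
$pee = "<p>ム\n";
$pee = preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );

// The three bytes of ム (E3 83 A0) should pass through unchanged.
echo bin2hex( $pee ), "\n";
```

Because the character class lists only ASCII whitespace, the result does not depend on the server's locale.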

Attachments (1)

27733.diff (4.1 KB) - added by tenpura 11 years ago.
Replace All \s with [\r\n\t ]


Change History (13)

@tenpura
11 years ago

Replace All \s with [\r\n\t ]

#1 @tenpura
11 years ago

  • Keywords has-patch added

#2 @SergeyBiryukov
11 years ago

  • Keywords 4.0-early added
  • Milestone changed from Awaiting Review to Future Release
  • Version set to 0.71

Confirmed. Added in [13], modified in [106].

#3 @tenpura
11 years ago

Related: #26842

#4 @miqrogroove
10 years ago

See related [28708]

#5 follow-up: @miqrogroove
10 years ago

For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. This problem will also not affect every server, as it depends on the code page referenced by PCRE.
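
For anyone debugging this, a small sketch that shows the byte layout described above. It assumes the file is saved as UTF-8. Whether the byte-mode match succeeds depends on the server, so no result is stated for that line:

```php
<?php
// Sketch: inspect the UTF-8 bytes of the character from the ticket description.
$char = "ム"; // U+30E0
echo bin2hex( $char ), "\n"; // e383a0: the last byte is 0xA0

// In byte mode (no /u), PCRE tests each byte on its own. Whether 0xA0 counts
// as whitespace is environment-dependent, so this may print 0 or 1.
echo preg_match( '/\s/', "\xA0" ), "\n";

// In UTF-8 mode the three bytes are treated as one character, and U+30E0 is
// not a whitespace character, so this does not match.
echo preg_match( '/\s/u', $char ), "\n"; // 0
```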


#6 @miqrogroove
10 years ago

  • Keywords needs-patch needs-unit-tests added; has-patch removed

As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.

#7 @miqrogroove
10 years ago

  • Keywords wpautop added

#8 in reply to: ↑ 5 @SergeyBiryukov
10 years ago

Replying to miqrogroove:

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.

I can still reproduce the issue in the latest trunk.

#9 @miqrogroove
9 years ago

#28302 was marked as a duplicate.

#10 @miqrogroove
9 years ago

#28937 was marked as a duplicate.

#11 @pavelxk
9 years ago

  • Severity changed from normal to major

This happened to me on 4.2.2 with the character Š (U+0160). The issue prevents further editing of the content/page, because the editor does not load any content that contains invalid characters: an empty editor window is displayed without any errors. That makes it a major issue for me. It is also difficult to troubleshoot, as it is locale specific.

I would suggest fixing this by adding the explicit UTF-8 pattern modifier for UTF-8 content.

if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
  $pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
  $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}
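
A possible variation on the snippet above, offered as a sketch rather than a tested patch. With the u modifier, preg_replace() returns null when the subject is not valid UTF-8, so that return value can be used for the check instead of mb_detect_encoding(). The function name autop_br_sketch() is hypothetical and not part of WordPress, and the byte-mode fallback borrows the [\r\n\t ] class from the ticket description:

```php
<?php
// Hypothetical helper (not part of WordPress): try the /u pattern first and
// fall back to a byte-safe pattern when PCRE rejects the subject as invalid UTF-8.
function autop_br_sketch( $pee ) {
	$result = preg_replace( '|(?<!<br />)\s*\n|u', "<br />\n", $pee );

	if ( null === $result ) {
		// With /u, an invalid UTF-8 subject makes preg_replace() return null.
		// Use an explicit ASCII whitespace class so no high byte can match.
		$result = preg_replace( '|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee );
	}

	return $result;
}

echo bin2hex( autop_br_sketch( "<p>ム\n" ) ), "\n";
```

This avoids a dependency on the mbstring extension, at the cost of running the replacement twice for content that is not valid UTF-8.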

#12 @johnbillion
3 years ago

  • Keywords 4.0-early removed