Opened 5 years ago

Closed 3 weeks ago

#27733 closed defect (bug) (wontfix)

wpautop(): \s in regex destroys some UTF-8 characters

Reported by: tenpura
Owned by:
Milestone:
Priority: normal
Severity: major
Version: 0.71
Component: Formatting
Keywords: 4.0-early needs-patch needs-unit-tests wpautop
Focuses:
Cc:

Description

Without the u modifier, \s in preg_replace() can match a single byte inside some multibyte UTF-8 characters, which corrupts them.

Steps to reproduce:

  1. Create a post with
    ム
    new line

    as the content.
  2. It will be output as
    <p>�<br>
    new line</p>


Quick Test:

$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); 
echo $pee; // outputs <p>�<br />\n

Solution: Use [\r\n\t ] rather than \s.
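
As a minimal sketch of the proposed solution, the Quick Test can be rerun with the explicit class in place of \s. The function name autop_br_explicit_class() is hypothetical and not part of WordPress; the "\u{30E0}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not a WordPress function): the line-break step from
// the Quick Test, with \s swapped for the explicit ASCII class [\r\n\t ].
function autop_br_explicit_class(string $pee): string
{
    // No /u modifier: PCRE still works byte by byte, but no byte of a
    // multibyte UTF-8 character can equal \r, \n, \t, or a space.
    $result = preg_replace('|(?<!<br />)[\r\n\t ]*\n|', "<br />\n", $pee);

    // preg_replace() returns null on error; fall back to the input then.
    return $result ?? $pee;
}

$input  = "<p>\u{30E0}\n"; // "\u{30E0}" is ム
$output = autop_br_explicit_class($input);

echo $output; // the ム stays intact in front of the inserted <br />
```

One trade-off to keep in mind: \s also covers form feed, which [\r\n\t ] does not, so the two patterns are not strictly equivalent for ASCII input.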

Attachments (1)

27733.diff (4.1 KB) - added by tenpura 5 years ago.
Replace All \s with [\r\n\t ]

Change History (13)

@tenpura
5 years ago

Replace All \s with [\r\n\t ]

#1 @tenpura
5 years ago

  • Keywords has-patch added

#2 @SergeyBiryukov
5 years ago

  • Keywords 4.0-early added
  • Milestone changed from Awaiting Review to Future Release
  • Version set to 0.71

Confirmed. Added in [13], modified in [106].

#3 @tenpura
5 years ago

Related: #26842

#4 @miqrogroove
5 years ago

See related [28708]

#5 follow-up: @miqrogroove
5 years ago

For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm

Note that the UTF-8 encoding of this character (E3 83 A0) ends with the byte 0xA0. This problem should be partly fixed now in trunk, as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. It also will not affect all servers, as it depends on the code page referenced by PCRE.
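
For anyone debugging this, the byte layout can be checked with a short sketch like the one below. It only prints bytes and makes no claim about how a given server's PCRE tables classify 0xA0; the array keys are illustrative labels, and the "\u{...}" escapes assume PHP 7 or later.

```php
<?php
// Print the UTF-8 bytes of the reported character next to a real
// no-break space, to show that both end in the same byte.
$chars = [
    'U+30E0' => "\u{30E0}", // ム, the character from the report
    'U+00A0' => "\u{00A0}", // no-break space, for comparison
];

foreach ($chars as $label => $char) {
    $hex       = bin2hex($char);     // hex dump of the UTF-8 bytes
    $last_byte = substr($hex, -2);   // final byte as two hex digits
    printf("%s: %s (last byte 0x%s)\n", $label, $hex, strtoupper($last_byte));
}
```

In byte mode PCRE sees only the individual bytes, so a trailing 0xA0 can look like a standalone no-break space wherever the active character tables treat that byte as whitespace.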

#6 @miqrogroove
5 years ago

  • Keywords needs-patch needs-unit-tests added; has-patch removed

As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.

#7 @miqrogroove
5 years ago

  • Keywords wpautop added

#8 in reply to: ↑ 5 @SergeyBiryukov
4 years ago

Replying to miqrogroove:

Note that the UTF-8 encoding of this character (E3 83 A0) ends with the byte 0xA0. This problem should be partly fixed now in trunk, as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.

I can still reproduce the issue in the latest trunk.

#9 @miqrogroove
4 years ago

#28302 was marked as a duplicate.

#10 @miqrogroove
4 years ago

#28937 was marked as a duplicate.

#11 @pavelxk
4 years ago

  • Severity changed from normal to major

Happened to me on 4.2.2 with the character Š (U+0160). This issue prevents further editing of the content/page because the editor does not load any content with invalid characters: an empty editor window is displayed without any errors. That makes it a major issue for me. It is also difficult to troubleshoot, as it is locale-specific.

I would suggest fixing this by adding the explicit UTF-8 pattern modifier (u) when the content is UTF-8.

if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
  $pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
  $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}
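
A possible variant of the same idea, sketched under assumptions: instead of detecting the encoding up front, try the u pattern first and fall back to the byte-mode pattern only when PCRE rejects the subject. The function name autop_br_utf8_aware() is hypothetical and not part of WordPress; the "\u{0160}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not a WordPress function): try the /u pattern first
// and fall back to the byte-mode pattern only if PCRE reports an error.
function autop_br_utf8_aware(string $pee): string
{
    // With /u, PCRE reads the subject as UTF-8 code points, so \s cannot
    // match a lone 0xA0 byte inside a multibyte character.
    $result = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);

    // preg_replace() returns null on error, for example when the subject
    // is not valid UTF-8 (preg_last_error() === PREG_BAD_UTF8_ERROR).
    if ($result === null) {
        $result = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
    }

    // If both attempts fail, return the input unchanged.
    return $result ?? $pee;
}

echo autop_br_utf8_aware("<p>\u{0160}\n"); // Š stays intact before <br />
```

Note that under the u modifier \s may also match Unicode whitespace such as U+00A0, so the result can differ from an explicit [\r\n\t ] class for content that contains those characters.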

#12 @iseulde
3 weeks ago

  • Milestone Future Release deleted
  • Resolution set to wontfix
  • Status changed from new to closed

This ticket has not seen any activity in over *two* years, so I'm closing it as "wontfix".

The ticket may lack decisiveness, may have become irrelevant, or may not have gathered enough interest.

If you think this ticket does deserve some attention again, feel free to reopen.

For bugs, it would be great if you could provide updated steps to reproduce against the latest version of WordPress (5.0.2 at the time of writing). Remember that images or a video can explain a problem better than text. At the very least, quickly test again to make sure the bug still exists.

If it’s an enhancement or feature, some extra motivation may help.

Thank you for your contributions to WordPress! <3