Make WordPress Core

Opened 12 years ago

Closed 4 months ago

Last modified 4 months ago

#27733 closed defect (bug) (invalid)

wpautop(): \s in regex destroys some UTF-8 characters

Reported by: tenpura's profile tenpura Owned by:
Milestone: Priority: normal
Severity: major Version: 0.71
Component: Formatting Keywords: needs-patch needs-unit-tests wpautop
Focuses: Cc:

Description

\s in preg_replace() incorrectly targets some UTF-8 characters.

Steps to reproduce:

  1. Create a post with
    ム
    new line
    
    as a content.
  1. It will be output as
    <p>�<br>
    new line</p>
    

Quick Test:

$pee = "<p>ム\n";
$pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee); 
echo $pee; // outputs <p>�<br />\n

Solution:
Use [\r\n\t ] rather than \s.

Attachments (1)

27733.diff (4.1 KB) - added by tenpura 12 years ago.
Replace All \s with [\r\n\t ]

Download all attachments as: .zip

Change History (19)

@tenpura
12 years ago

Replace All \s with [\r\n\t ]

#1 @tenpura
12 years ago

  • Keywords has-patch added

#2 @SergeyBiryukov
12 years ago

  • Keywords 4.0-early added
  • Milestone changed from Awaiting Review to Future Release
  • Version set to 0.71

Confirmed. Added in [13], modified in [106].

#3 @tenpura
12 years ago

Related: #26842

#4 @miqrogroove
12 years ago

See related [28708]

#5 follow-up: @miqrogroove
12 years ago

For debugging: http://www.fileformat.info/info/unicode/char/30e0/index.htm

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns. This problem also will not affect all servers as it is related to the code page referenced by PCRE.

Last edited 12 years ago by miqrogroove (previous) (diff)

#6 @miqrogroove
12 years ago

  • Keywords needs-patch needs-unit-tests added; has-patch removed

As we are having a good chat about ticket management in IRC, I'm taking the time to note that the existing patch here needs to consider implications from related tickets and changes. One of the main concerns with modifying wpautop() is that we have to ensure each kind of space is parsed as expected. In the post Editor, spaces sometimes serve as invisible placeholders, and as noted in #26842, this function also has a JS counterpart to consider.

#7 @miqrogroove
12 years ago

  • Keywords wpautop added

#8 in reply to: ↑ 5 @SergeyBiryukov
11 years ago

Replying to miqrogroove:

Note this character terminates with 0xA0. This problem should be partly fixed now in trunk as wp_spaces_regexp() has been implemented for compatibility with smilies, shortcodes, and various wptexturize patterns.

I can still reproduce the issue in the latest trunk.

#9 @miqrogroove
11 years ago

#28302 was marked as a duplicate.

#10 @miqrogroove
11 years ago

#28937 was marked as a duplicate.

#11 @pavelxk
11 years ago

  • Severity changed from normal to major

Happened to me on 4.2.2. with character Š (U+0160). This issue prevents further editing of the content/page because the editor does not load any content with invalid characters. Empty editor window is displayed without any errors. This means it is a major issue for me. It is also difficult to troubleshoot as it is locale specific.

I would suggest to fix this by adding explicit UTF-8 pattern modifier for UTF-8 content.

if (mb_detect_encoding($pee, 'UTF-8', true) === 'UTF-8') {
  $pee = preg_replace('|(?<!<br />)\s*\n|u', "<br />\n", $pee);
} else {
  $pee = preg_replace('|(?<!<br />)\s*\n|', "<br />\n", $pee);
}

#12 @johnbillion
4 years ago

  • Keywords 4.0-early removed

#13 @dmsnell
4 months ago

  • Resolution set to wontfix
  • Status changed from new to closed

the problem here is probably not really that we have \s but rather that we’re mixing encodings, right?

on a system whose internal encoding is something like latin1 we may get U+00A0 encoded as 0xA0, which is what the PCRE pattern will incorporate as a no-break space. Adding an arbitrarily limited set of space characters appears to resolve this problem because that particular offending byte is no longer caught, but there are a thousand other places different bytes will trip up.

on systems with UTF-8 as their internal encoding, however ,the no-break space will be encoded as 0xC2 0xA0 and the PCRE pattern will look for that. it won’t mangle the .

given that UTF-8 is the default internal encoding in PHP and has been for years I’m inclined to close this as it shouldn’t practically be an issue any more. if we wanted to resolve it fully we’d have to check every place we call string-related functions for which encoding is going in and which is set as the default. this is an unfeasible task.

for that I think it would fall nicely as a duplicate of #62172. if we acknowledge that UTF-8 is the only actual supported encoding, this bug cannot appear. it’s really the obligation of whoever is integrating the database, server, PHP code, and plugins to ensure proper harmony between various text encodings.

going to mark as wontfix for now. feel free to re-open if you disagree.

#14 @miqrogroove
4 months ago

given that UTF-8 is the default internal encoding in PHP and has been for years

I think we determined this bug was unrelated to the PHP charset. PCRE relies on the system locale. See "Generic character types" at https://pcre.org/pcre.txt

#15 @miqrogroove
4 months ago

Edit: Doing some tests on Windows.

Last edited 4 months ago by miqrogroove (previous) (diff)

#16 @miqrogroove
4 months ago

Again, nothing to do with the PHP charset.

For Windows binaries, this behavior changed with v7.0 in 2015.

WordPress currently advertises PHP 7.2.24 and up so I see no concerns.

#17 @dmsnell
4 months ago

  • Resolution changed from wontfix to fixed

Thanks for the investigation @miqrogroove

In my testing I was unable to get 1 === preg_match( '/\s/', "one\xA0two" ) regardless of my LC_CTYPE, LC_ALL, and other charset-related ENV values or php.ini settings. I think we are in agreement that the PCRE functions simply don’t do anything special.

(I was able to get it to match when adding the u flag as long as I updated the bytes to \xC2\xA0 for the proper UTF-8 encoding of the NO-BREAK SPACE).

It would be nice to know for sure that they are operating simply on bytes (regardless of encoding) or on UTF-8 if provided the UTF-8 flag.


This makes me think that WordPress in all of its supported environments will not and cannot create this scenario. Does that sound right? If so, I believe that wontfix is still fine, or maybe invalid would be more appropriate.

#18 @dmsnell
4 months ago

  • Resolution changed from fixed to invalid

Sorry: it was not my intention to change the status of this ticket, and I think I must have mistakenly done that. I have reset it back to invalid instead of wontfix because I think it’s no longer possible to reproduce.

Note: See TracTickets for help on using tickets.