Make WordPress Core

Opened 8 years ago

Last modified 7 years ago

#35022 assigned defect (bug)

WP allows Unicode 0x00a0 spaces in editor but shortcode parser can't handle them

Reported by: steevithak's profile steevithak Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.4
Component: Shortcodes Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Since the release of WP 4.4, we have complaints on multiple sites that shortcodes are "mysteriously" failing despite being syntactically correct. Users claim merely editing and re-saving articles breaks the shortcodes. After several days of research, I discovered the problem with the non-working shortcodes is that they contain 0x00a0 Unicode non-breaking space characters. I've been unable to determine how the space characters are being created or why it has started generating complaints for us since the release of 4.4 but I have looked at the code in /wp-includes/shortcodes.php and verified that it will fail on shortcodes containing 0x00a0 characters in white space despite the editor apparently allowing them as legal white space characters.

Part of the problem is this line of code which can't identify a shortcode tag if it is followed by a 0x00a0 instead of a 0x20 space character. There are further problems parsing the shortcode parameters. (adding \xc2\xa0 to the regex seems to fix this line).

// Find all registered tag names in $content.
preg_match_all( '@\[([^<>&/\[\]\x00-\x20]++)@', $content, $matches

Attachments (8)

35022-test.diff (575 bytes) - added by boonebgorges 8 years ago.
35022.patch (1.6 KB) - added by steevithak 8 years ago.
Fix parsing of shortcodes that contain Unicode non-breaking spaces or Unicode quotes
35022-test.2.diff (578 bytes) - added by gitlost 8 years ago.
\xA0 -> \xC0\xA0
35022.2.patch (2.2 KB) - added by gitlost 8 years ago.
Change shortcode regex in wptexturize() as well.
35022.3.patch (3.6 KB) - added by gitlost 8 years ago.
Use single-byte mode. Unit tests included.
35022.4.patch (12.4 KB) - added by gitlost 8 years ago.
Refactoring to deal with seg faults in old versions of PCRE when using grouped expressions.
35022.5.patch (12.9 KB) - added by gitlost 7 years ago.
Refresh after [38877] (#37767).
35022.6.patch (13.0 KB) - added by gitlost 7 years ago.
Refresh after [39351] (#38914).

Download all attachments as: .zip

Change History (33)

#1 @Otto42
8 years ago

In the Visual editor, if you insert more than one space in a row, presumably in an attempt to space something out vertically, the Visual editor will make every other character a non-breaking space (A0). This prevents that vertical space from collapsing back to a single space when you switch between Text mode and Visual mode. It also prevents the HTML rendering of the data from collapsing the extra space as well.

This appears to be intentional and part of TinyMCE, which means we should probably recognize A0's for shortcodes as well.

#2 @boonebgorges
8 years ago

  • Keywords needs-unit-tests added

I just came across the same issue on a client site.

#3 @boonebgorges
8 years ago

  • Keywords needs-unit-tests removed
  • Milestone changed from Awaiting Review to 4.4.3

I bisected the behavior back to [34747], as demonstrated by 35022-test.diff. By adding \xA0 to the do_shortcode() regex pattern as suggested by @steevithak, I was able to get the first assertion in my new test to pass.

Interestingly, it appears that attributes following an A0 non-breaking space have never been parsed correctly. So the second assertion in 35022-test.diff fails. A bit of debugging shows that $text = preg_replace("/[\x{00a0}\x{200b}]+/u", " ", $text); in shortcode_parse_args() is resulting in a PREG_BAD_UTF8_ERROR. It could be that this is a result of my local config?

This ticket was mentioned in Slack in #core by chriscct7. View the logs.


8 years ago

#5 @kirasong
8 years ago

  • Owner set to boonebgeorges
  • Status changed from new to assigned

#6 @kirasong
8 years ago

  • Owner changed from boonebgeorges to boonebgorges

#7 @boonebgorges
8 years ago

  • Owner boonebgorges deleted

I would like to see this fixed but I don't feel confident judging a patch.

#8 @steevithak
8 years ago

I'd offer to help but I guess you need someone with commit access to do it? I'm not sure if there's anything I can do. I could take a shot at writing a patch for it. Are there any guidelines for writing patches (e.g. do you want patches against a stable version or a trunk or daily version of the code, etc.)

#9 @johnbillion
8 years ago

@steevithak Thanks for offering! Anyone can contribute a patch, definitely not just committers. Check out the guidelines in the Core Handbook, but the gist is that a patch should be created from the root of a checkout of the trunk version of WordPress and then uploaded to the ticket using the ticket number as the file name (ie. 35022.patch).

#10 @steevithak
8 years ago

@johnbillion OK, thanks for the info. I've added it to my ToDo list and can maybe get to it later this week.

This ticket was mentioned in Slack in #core by chriscct7. View the logs.


8 years ago

#12 @chriscct7
8 years ago

  • Keywords needs-patch added

@steevithak we're going to punt this ticket out of the next release, but if you can get a patch ready for review this week, ping me on trac using @chriscct7 and I'll re-milestone it back in.

#13 @chriscct7
8 years ago

  • Milestone changed from 4.4.3 to Future Release

#14 @steevithak
8 years ago

@chriscct7 @johnbillion I've grabbed the trunk source and started working on this. Looking back at prior versions of WP, I'm a bit baffled by some of the current code in shortcodes.php. Just a few versions back, say in 4.2.2, the code is very elegant and compact with a single regex that identifies shortcodes. That one regex in get_shortcode_regex() is loaded by other functions in shortcodes.php that need to identify or parse shortcodes. But in the latest shortcodes.php, there several different methods being used to parse short codes. Some functions still use get_shortcode_regex() but other functions now use their own, slightly different regexes. (one of the functions already handles unicode non-breaking spaces indirectly by replacing them with 0x20 spaces prior to running the regex.) Where can I get more information on the rational behind these changes? Is there a developer mailing list or forum thread someone can point me to that might clue me in to why these changes were made in recent versions? There must be some reason for using multiple different methods of parsing shortcodes in different places and without understanding why, I'm afraid I may break something else while fixing the non-breaking space issue. :(

Last edited 8 years ago by steevithak (previous) (diff)

#15 @steevithak
8 years ago

Ok, I think I answered my own question. Ticket #33517 identified performance problems with WP posts containing massive numbers of shortcodes. If I'm understanding what happened correctly, a patch was applied that greatly improved performance by allowing the various parsing functions to make quickie passes through the post content, pulling out a list of shortcode tags using simplified regexes (that, for example, don't bother to properly identify every possible type of space character, just the most common ones). The patch for that bug appears to coincide with both the creation of this bug and the proliferation of alternate regexes I'm seeing. Given that we're worried about regex performance issues here, maybe the best we can hope for is to extend the hack that's already in shortcode_parse_atts() and just do a global replace of all other space chars with an 0x20 prior to running the regex. That seems less likely to bring back performance issues for people who have many hundreds of shortcodes than what I proposed in the original bug report.

Last edited 8 years ago by steevithak (previous) (diff)

#16 @steevithak
8 years ago

Great, this just got more complicated. I found the problem causing the mis-parsing of the first attribute following a shortcode with a non-breaking space (the problem @boonebgorges refers to in comment:3). It looks like some other part of Wordpress (not shortcodes.php) is unable to identify the shortcodes as shortcodes, so it's doing some sort conversion of quote characters into fancy Unicode quote characters. So instead of [shortcode foo="1,2,3"], it gets corrupted into [shortcode foo=”1,2,3″]. Once that happens, the regexes in shortcodes.php mis-parse the parameters list in a way that causes at least the first one to be lost. Even with code added in shortcodes.php to remove the non-breaking space within the shortcode, it's still broken because of the quotes. I don't know where in WP the quote-mangling code might be found, so the best I can think of is to modify the shortcode parser to handle Unicode as well as ASCII quote characters. Any other ideas?

@steevithak
8 years ago

Fix parsing of shortcodes that contain Unicode non-breaking spaces or Unicode quotes

#17 @steevithak
8 years ago

Ok, first attempt at a patch is uploaded. To fix the parsing of shortcodes with Unicode non-breaking spaces, I modified the regex currently in use by replacing the list of ASCII whitespace chars (\x00-\x20) with the \s whitespace match and then converting the regex to Unicode mode by adding the 'u' modifier. In theory this means any legal white space character ASCII or Unicode should now work.

To fix the problem with corrupted quote characters that causes the first shortcode attribute to be dropped, I add a work around in which the Unicode quote entity values are replaced with conventional ASCII quote chars prior to running the shortcode attribute parser. This allowed a simple one-line fix that doesn't require messing with the complex attribute parsing regex. The downside is that this is really just a work-around because I don't know where in WP the quote characters are being corrupted. This one-line change could be reverted later if someone figures out the real source of this problem.

Dislaimer: this works for me but hasn't been tested extensively. Hopefully some core developers can take a look and make sure it doesn't break anything.

Version 0, edited 8 years ago by steevithak (next)

#18 @gitlost
8 years ago

@boonebgorges the non-breaking space should be in UTF-8, \xC2\xA0, is why the preg_replace() in shortcode_parse_atts() is failing.

@steevithak wptexturize() in "wp-includes/formatting.php" is subbing the quote characters, as it uses its own copy of the shortcode tagname regex so isn't recognizing the shortcode.

I'll upload versions of the patches with those changes, using a define to put the terminators in one place. Also I think it might be better just to add \x{00A0} explicitly rather than use \s, to be conservative. There might also be a case to also add zero width space \x{200B} which is mentioned in other tests in "tests/phpunit/tests/shortcode.php" (and isn't covered by \s).

An alternative to changing to UTF-8 mode would be to stay in single-byte mode and add \xC2 to the terminators. This would rule out all characters from U+0080 to U+00BF (http://www.fileformat.info/info/charset/UTF-8/list.htm) but they seem to be controls and/or symbols so probably unlikely to be used in shortcode tagnames.

@gitlost
8 years ago

\xA0 -> \xC0\xA0

@gitlost
8 years ago

Change shortcode regex in wptexturize() as well.

#19 @steevithak
8 years ago

Sounds good, I like the idea of consolidating the char list in one place as a define. I chose \s because it seemed to cover the most cases in the smallest space. However, in addition to 200B, I found this list of quasi-whitespace Unicode chars that are not recognized \s:

http://stackoverflow.com/questions/3469080/match-whitespace-but-not-newlines-perl

Maybe an alternate/easier way is to consider a shortcode terminator character to be *any* character that's not legal for use in the shortcode name? Maybe something like ^\w+ (or whatever the appropriate list is if not \w)?

Last edited 8 years ago by steevithak (previous) (diff)

#20 @gitlost
8 years ago

Hi, I think a big issue here is backward compatibility. If one was specing shortcodes now it would make sense to restrict names drastically, similar to PHP variable names or whatever. But until this change (see also the Shortcode Road Map Draft Two New Restriction on Shortcode Names and the codex Shortcode API Names) there wasn't any formal restriction on what characters could be used in shortcode names - things just didn't (or did) work.

The restrictions introduced were pretty minimal, and it seems more in keeping to continue this minimalism, hence the suggestion just to add \x{00A0} (and perhaps \x{200B}) rather than use \s. (Note that the list of chars you linked to are actually matched by \s in UTF-8 mode, along with LF, VT, FF and CR. The quasi-whitespaces that aren't that one might think should be are U+200B,U+200C, U+200D, U+2060 and U+FEFF - https://en.wikipedia.org/wiki/Whitespace_character.)

Similarly with switching to use \w, which also in UTF-8 mode could have performance issues in needing to look up large Unicode tables.

Plus the particular use-case here is TinyMCE inserting U+00A0 into content, so adding other stuff could be seen as creep. And besides personally I use the Mongolian Vowel Separator in all my shortcode names.

#21 @steevithak
8 years ago

lol, good point, the main thing is to fix that 00a0 character. As long as that gets done my users will be happy. So any version of the fix that will allow the patch to be accepted is OK with me! :)

@gitlost
8 years ago

Use single-byte mode. Unit tests included.

#22 @gitlost
8 years ago

Discovered one can't rely on PCRE being installed with UTF-8 enabled. Also the check for U+00A0 should only happen when the charset is UTF-8. (Apart from that the previous patch was fine.)

So the simplest thing I think is just to use PCRE in single-byte mode and to only use the extended check when the blog charset is UTF-8, which the above patch does with conditional defines (it now needs two each for a positive and a negative match). Could add extra defines for particular legacy charsets like latin1 if one wanted.

(I spent quite a bit of time trying to track down where TinyMCE was adding the \x00a0's, and (as any schoolboy knows) it turns out it doesn't - it's the browser automatically adding &nbsp;'s to ContentEditable divs, which TinyMCE then encodes into \x00a0's. Chrome adds alternate "space &nbsp; space &nbsp;" etc while Firefox does "&nbsp; &nbsp;... space". IE does something similar to Firefox, but more microsofty. So there you go...)

@gitlost
8 years ago

Refactoring to deal with seg faults in old versions of PCRE when using grouped expressions.

#23 @gitlost
8 years ago

Unfortunately it turns out that using a grouped expression with unlimited repetition can trigger seg faults or match failures in versions of PCRE <= 8.12 (PHP <= 5.4.8, 5.3.18, 5.2) on large (~20K) matches.

So what was a simple patch turns into something else...

The new patch does some refactoring, putting the search for tagnames into its own function get_shortcode_tagnames() and putting the shortcode terminators into their own function shortcode_name_terminators(). The PCRE <= 8.12 restriction is got around by searching with the simple character class regex first and then post-processing. The good news is that this DRYs up some code, makes terminator matching consistent and reduces references to the $shortcode_tags global. The bad news we hold to be self-evident.

@gitlost
7 years ago

Refresh after [38877] (#37767).

#24 @gitlost
7 years ago

  • Keywords has-patch has-unit-tests added; needs-patch removed

#25 @swissspidy
7 years ago

#35770 was marked as a duplicate.

@gitlost
7 years ago

Refresh after [39351] (#38914).

Note: See TracTickets for help on using tickets.