WordPress.org

Make WordPress Core

Opened 3 months ago

Last modified 2 weeks ago

#47014 reviewing defect (bug)

Tag balancing corrupts Custom Elements

Reported by: westonruter Owned by: flixos90
Milestone: 5.3 Priority: normal
Severity: normal Version: 2.0.4
Component: Formatting Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

When working with Custom Elements (for Web Components), I noticed that force_balance_tags() corrupts them.

For example, given <my-element>test</my-element>, this gets incorrectly mutated by force_balance_tags() as follows:

wp> force_balance_tags('<my-element>test</my-element>')
=> string(22) "<my -element>test</my>"

Notice the space being added after my in the start tag and then -element being dropped in the end tag. So it's clear that -element is being incorrectly interpreted as an attribute.

The issue is that the hyphen is not included in the characters expected to be in an element name. So the fix seems merely to be:

  • src/wp-includes/formatting.php

    a b function force_balance_tags( $text ) { 
    24652465        // WP bug fix for LOVE <3 (and other situations with '<' before a number)
    24662466        $text = preg_replace( '#<([0-9]{1})#', '&lt;$1', $text );
    24672467
    2468         while ( preg_match( '/<(\/?[\w:]*)\s*([^>]*)>/', $text, $regex ) ) {
     2468        while ( preg_match( '/<(\/?[\w:\-]*)\s*([^>]*)>/', $text, $regex ) ) {
    24692469                $newtext .= $tagqueue;
    24702470
    24712471                $i = strpos( $text, $regex[0] );

This issue has come up at least twice in supporting the AMP plugin:

Attachments (6)

47014.0.diff (1.7 KB) - added by westonruter 3 months ago.
Fix with tests
47014.diff (8.9 KB) - added by flixos90 7 weeks ago.
47014-1.diff (5.9 KB) - added by dmsnell 7 weeks ago.
Updated the RegExp to move all parsing into the pattern and use helper variables to clarify logic
47014-2.diff (10.9 KB) - added by dmsnell 6 weeks ago.
Fix comment grammar, improve custom element matcher, expand tests
47014-3.diff (10.9 KB) - added by dmsnell 6 weeks ago.
Fix whitespace
47014-4.diff (11.0 KB) - added by dmsnell 6 weeks ago.
Fix comment grammar - sorry again - this is the last upload for now ;-)

Download all attachments as: .zip

Change History (18)

@westonruter
3 months ago

Fix with tests

#1 @westonruter
3 months ago

  • Keywords has-patch has-unit-tests added

#2 @swissspidy
3 months ago

  • Milestone changed from Awaiting Review to Future Release

#3 @flixos90
3 months ago

  • Milestone changed from Future Release to 5.3
  • Owner set to flixos90
  • Status changed from new to reviewing

Good catch! That should be fixed - web components are a standard and thus need to be supported by this.

#4 @dmsnell
3 months ago

Here are some thoughts from a parsing standpoint; these are my interpretations of the standards and may be wrong.

  • Custom elements are not HTML5 tags as the spec only allows alphanumeric characters in a tag name.
  • Custom elements are not a standard yet but rather a draft at this point still
  • The Custom Element draft spec is significantly more permissive for names than HTML tag names

This patch supports some cases of custom element names but leaves most still unsupported. For example, using the example from the draft spec <emotion-😍> fools the pattern into thinking that the custom element name is emotion- and its attributes are 😍. I believe that this will result in creating a closing tag </emotion-> which is also incorrect.

It appears that this code is trying to find tag-like entities in the document and loop over them, performing semantic checks at each opening or closing. It might be a good time to not only patch over a single instance where this fails, but to consider improving the overall strength of the matching pattern to eliminate a number of other errors involved. For the purposes of force_balance_tags() it seems safe to treat custom elements the same way as HTML tags.

Note that you exposed another glaring bug in the original pattern, only appearing now because we're relaxing the restrictions on the tag name: the \s* matches even when there are no spaces separating the element name from the attribute list.

While ugly, this pattern will match custom element names as currently proposed:

# See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-name
$tag_like_element = '#<(\/?[-.0-9_a-z\xb7\xc0-\xd6\xd8-\xf6\xf8-\x{37d}\x{37f}-\x{1fff}\x{200c}-\x{200d}\x{203f}-\x{2040}\x{2070}-\x{218f}\x{2c00}-\x{2fef}\x{3001}-\x{d7ff}\x{f900}-\x{fdcf}\x{fdf0}-\x{fffd}\x{10000}-\x{effff}]*)\s*([^>]*)>#iu';

There's a small problem here and that is that it now breaks HTML5 tags, as custom elements are forbidden from having uppercase letters in their name while HTML tags aren't.

They do not contain any uppercase ASCII letters, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively. Notes in draft spec

As a practical step it seems valuable to expand the allowable character set beyond the basic - since we expect in the wild to find custom elements with names that take advantage of their allowances. Any change here, even the proposed change, does incur some breakage on existing HTML tag names but the risk is that we'll allow names that are supposed to be invalid, possibly treating something that isn't really a tag or element as if it were.

some quick examples on regex101.com


For real fun we might consider creating a more robust parser that can differentiate and prevent the bugs, but I doubt anyone is going to want to do that here and now. 😉

#5 @flixos90
7 weeks ago

@dmsnell Thanks for these detailed observations. There are indeed several shortcomings in the current parser, but fixing all of them goes beyond the scope of this ticket. I also think only covering regular word characters in this iteration is a valuable enough change.

I agree though that we must make sure to not accidentally allow invalid tags (e.g. custom element tags including capital letters as you mention). Something I'm also wondering about: Is it actually valid to have a tag like <--->? According to the spec it seems so.

@flixos90
7 weeks ago

#6 @flixos90
7 weeks ago

47014.diff makes the following updates:

  • Consider capitalized tags only if they don't include a dash (i.e. allow dashes only in combination with [a-z0-9]).
  • Update the regex so that we can check whether spaces were provided after what we assume to be the tag or not.
  • If there were no spaces provided and the first following character is a dash, we are in a tag that uses capitalization in combination with dashes, which is invalid - so the part is kept unchanged. That is not a super-elegant solution, but what I came up with that doesn't require tons of changes.
  • Add more tests, for invalid custom elements and for a capitalized regular tag to make sure those continue to work.

I noticed, while this patch works completely for all the balanceTags() tests, it seems to cause breakage in three other tests for wp_targeted_link_rel(), which I'm not sure about why they're happening - would be great if one of y'all could verify if that happens for you too. Either way, I wanted to upload this patch already as possible iteration.

@dmsnell
7 weeks ago

Updated the RegExp to move all parsing into the pattern and use helper variables to clarify logic

#7 follow-up: @dmsnell
7 weeks ago

@flixos90 for your perusal I also submitted a patch based on your work. in it I've taken the expanded tag name rules you provided but then moved the parsing logic mostly back into the RegExp. In the process I created a series of helper variables to make the logic clearer for me to understand.

Concerning the other tests I was curious why you removed the checks that specifically handled those cases? Did you intend on removing those in your patch? the is_serialized() test and changes in wp_targeted_link_rel_callback and such?

#8 in reply to: ↑ 7 @flixos90
7 weeks ago

Replying to dmsnell:

@flixos90 for your perusal I also submitted a patch based on your work. in it I've taken the expanded tag name rules you provided but then moved the parsing logic mostly back into the RegExp. In the process I created a series of helper variables to make the logic clearer for me to understand.

Awesome, those changes look good to me.

Concerning the other tests I was curious why you removed the checks that specifically handled those cases? Did you intend on removing those in your patch? the is_serialized() test and changes in wp_targeted_link_rel_callback and such?

So I really just noticed these changes actually. I swear I didn't make them (at least not on purpose), I don't know how they got into this patch. Guess it now makes sense why I was completely confused about those other tests failing. 🙈 Long story short, thanks for bringing them back.

#9 @birgire
6 weeks ago

I really like the effort made by @dmsnell in 47014-1.diff to expand the regex pattern with comments, it's really helpful. I wish this would be a standard in core :-)

I think it would also be nice to have the stripped regex pattern available, either as part of an inline comment or vice versa, with the expanded version as a comment.

What do you think about capitalizing the regex comments and end each line with a dot, when possible?

I would suggest adding to the test cases in data_custom_elements(), to cover all the capturing groups in the regex. Here's an example:

<my-custom-element data-attribute="value"/>

to cover the 4th and the 5th capturing groups.

And e.g. <my-custom-element/> and <my-custom-element /> to the existing cases for the first three capturing groups.

Can this be a strict comparison:

$is_single_tag = in_array( $tag, $single_tags );

as flagged by the coding standard check?

#10 @dmsnell
6 weeks ago

So I really just noticed these changes actually. I swear I didn't make them (at least not on purpose), I don't know how they got into this patch.

The same thing just happened to me on another ticket but with different changes. I don't know how those got there either.

it's really helpful. I wish this would be a standard in core :-)

thanks for the kind words @birgire - I wasn't sure how it would be received. RegExp patterns are source code and I like to treat them that way beyond trivial matches: indentation, comments, etc…

have the stripped regex pattern available, either as part of an inline comment or vice versa, with the expanded version as a comment

at this point I have a hard time knowing how to feel. if we add the description as a comment it's bound to be wrong as soon as the first change to the pattern occurs 🙃

before I submitted I thought about including the full pattern by itself in a comment to make it easier for people to copy and paste into something like regex101.com. I came across the same problem though which was that keeping the comment and the actual pattern in sync was hard - even while developing it.

Having the pattern written as code made it easier for me to adjust it as I caught a few bugs in the tests - easier than the pattern without the structure and comments was. However, leaving that in as a full comment feels more like a teaching tool than code and we lose the benefit to maintaining it.

Sorry, this is me rambling on. I'll lean on the preferences people share here, but my experience tells me that it's valuable to have only one source of it based on explanatory RegExp comments that I've seen that described different RegExp patterns than they said they did 😛

What do you think about capitalizing the regex comments and end each line with a dot, when possible?

Will update the patch - I'm not opposed at all, just in a different habit.

Can this be a strict comparison:

Thanks! Good catch.

I would suggest adding to the test cases in data_custom_elements()

Also a good idea - I'll try and add those in today.

@dmsnell
6 weeks ago

Fix comment grammar, improve custom element matcher, expand tests

@dmsnell
6 weeks ago

Fix whitespace

@dmsnell
6 weeks ago

Fix comment grammar - sorry again - this is the last upload for now ;-)

#11 @birgire
6 weeks ago

Good point @dmsnell regarding the extra maintenance burden of having the regex pattern duplicated.

Having an easy way to remove the comments from the regex pattern, e.g. from the PHP shell example you provided, sounds good to me, if one needs to test it further with regex tools.

For the tests in 47014-4.diff I would suggest :

  • adding @ticket 47014 annotation to new test_ methods
  • @since 5.3.0 annotation to new non test_ methods.
  • adding some missing end of line dots (also in tests)
  • e.g. use @see annotation in the comment: // see http....

PS: also few missing end of line dots, both in code comments and tests comments.

All the best

#12 @pento
2 weeks ago

#36445 was marked as a duplicate.

Note: See TracTickets for help on using tickets.