#47014 closed defect (bug) (fixed)
Tag balancing corrupts Custom Elements
Reported by: | westonruter | Owned by: | flixos90 |
---|---|---|---|
Milestone: | 5.3 | Priority: | normal |
Severity: | normal | Version: | 2.0.4 |
Component: | Formatting | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
When working with Custom Elements (for Web Components), I noticed that force_balance_tags()
corrupts them.
For example, given <my-element>test</my-element>
, this gets incorrectly mutated by force_balance_tags()
as follows:
wp> force_balance_tags('<my-element>test</my-element>') => string(22) "<my -element>test</my>"
Notice the space being added after my
in the start tag and then -element
being dropped in the end tag. So it's clear that -element
is being incorrectly interpreted as an attribute.
The issue is that the hyphen is not included in the characters expected to be in an element name. So the fix seems merely to be:
-
src/wp-includes/formatting.php
a b function force_balance_tags( $text ) { 2465 2465 // WP bug fix for LOVE <3 (and other situations with '<' before a number) 2466 2466 $text = preg_replace( '#<([0-9]{1})#', '<$1', $text ); 2467 2467 2468 while ( preg_match( '/<(\/?[\w: ]*)\s*([^>]*)>/', $text, $regex ) ) {2468 while ( preg_match( '/<(\/?[\w:\-]*)\s*([^>]*)>/', $text, $regex ) ) { 2469 2469 $newtext .= $tagqueue; 2470 2470 2471 2471 $i = strpos( $text, $regex[0] );
This issue has come up at least twice in supporting the AMP plugin:
Attachments (7)
Change History (22)
#3
@
5 years ago
- Milestone changed from Future Release to 5.3
- Owner set to flixos90
- Status changed from new to reviewing
Good catch! That should be fixed - web components are a standard and thus need to be supported by this.
#4
@
5 years ago
Here are some thoughts from a parsing standpoint; these are my interpretations of the standards and may be wrong.
- Custom elements are not HTML5 tags as the spec only allows alphanumeric characters in a tag name.
- Custom elements are not a standard yet but rather a draft at this point still
- The Custom Element draft spec is significantly more permissive for names than HTML tag names
This patch supports some cases of custom element names but leaves most still unsupported. For example, using the example from the draft spec <emotion-😍>
fools the pattern into thinking that the custom element name is emotion-
and its attributes are 😍
. I believe that this will result in creating a closing tag </emotion->
which is also incorrect.
It appears that this code is trying to find tag-like entities in the document and loop over them, performing semantic checks at each opening or closing. It might be a good time to not only patch over a single instance where this fails, but to consider improving the overall strength of the matching pattern to eliminate a number of other errors involved. For the purposes of force_balance_tags()
it seems safe to treat custom elements the same way as HTML tags.
Note that you exposed another glaring bug in the original pattern, only appearing now because we're relaxing the restrictions on the tag name: the \s*
matches even when there are no spaces separating the element name from the attribute list.
While ugly, this pattern will match custom element names as currently proposed:
# See https://w3c.github.io/webcomponents/spec/custom/#valid-custom-element-name
$tag_like_element = '#<(\/?[-.0-9_a-z\xb7\xc0-\xd6\xd8-\xf6\xf8-\x{37d}\x{37f}-\x{1fff}\x{200c}-\x{200d}\x{203f}-\x{2040}\x{2070}-\x{218f}\x{2c00}-\x{2fef}\x{3001}-\x{d7ff}\x{f900}-\x{fdcf}\x{fdf0}-\x{fffd}\x{10000}-\x{effff}]*)\s*([^>]*)>#iu';
There's a small problem here and that is that it now breaks HTML5 tags, as custom elements are forbidden from having uppercase letters in their name while HTML tags aren't.
They do not contain any uppercase ASCII letters, ensuring that the user agent can always treat HTML elements ASCII-case-insensitively. Notes in draft spec
As a practical step it seems valuable to expand the allowable character set beyond the basic -
since we expect in the wild to find custom elements with names that take advantage of their allowances. Any change here, even the proposed change, does incur some breakage on existing HTML tag names but the risk is that we'll allow names that are supposed to be invalid, possibly treating something that isn't really a tag or element as if it were.
some quick examples on regex101.com
For real fun we might consider creating a more robust parser that can differentiate and prevent the bugs, but I doubt anyone is going to want to do that here and now. 😉
#5
@
5 years ago
@dmsnell Thanks for these detailed observations. There are indeed several shortcomings in the current parser, but fixing all of them goes beyond the scope of this ticket. I also think only covering regular word characters in this iteration is a valuable enough change.
I agree though that we must make sure to not accidentally allow invalid tags (e.g. custom element tags including capital letters as you mention). Something I'm also wondering about: Is it actually valid to have a tag like <--->
? According to the spec it seems so.
#6
@
5 years ago
47014.diff makes the following updates:
- Consider capitalized tags only if they don't include a dash (i.e. allow dashes only in combination with
[a-z0-9]
). - Update the regex so that we can check whether spaces were provided after what we assume to be the tag or not.
- If there were no spaces provided and the first following character is a dash, we are in a tag that uses capitalization in combination with dashes, which is invalid - so the part is kept unchanged. That is not a super-elegant solution, but what I came up with that doesn't require tons of changes.
- Add more tests, for invalid custom elements and for a capitalized regular tag to make sure those continue to work.
I noticed, while this patch works completely for all the balanceTags()
tests, it seems to cause breakage in three other tests for wp_targeted_link_rel()
, which I'm not sure about why they're happening - would be great if one of y'all could verify if that happens for you too. Either way, I wanted to upload this patch already as possible iteration.
@
5 years ago
Updated the RegExp to move all parsing into the pattern and use helper variables to clarify logic
#7
follow-up:
↓ 8
@
5 years ago
@flixos90 for your perusal I also submitted a patch based on your work. in it I've taken the expanded tag name rules you provided but then moved the parsing logic mostly back into the RegExp. In the process I created a series of helper variables to make the logic clearer for me to understand.
Concerning the other tests I was curious why you removed the checks that specifically handled those cases? Did you intend on removing those in your patch? the is_serialized()
test and changes in wp_targeted_link_rel_callback
and such?
#8
in reply to:
↑ 7
@
5 years ago
Replying to dmsnell:
@flixos90 for your perusal I also submitted a patch based on your work. in it I've taken the expanded tag name rules you provided but then moved the parsing logic mostly back into the RegExp. In the process I created a series of helper variables to make the logic clearer for me to understand.
Awesome, those changes look good to me.
Concerning the other tests I was curious why you removed the checks that specifically handled those cases? Did you intend on removing those in your patch? the
is_serialized()
test and changes inwp_targeted_link_rel_callback
and such?
So I really just noticed these changes actually. I swear I didn't make them (at least not on purpose), I don't know how they got into this patch. Guess it now makes sense why I was completely confused about those other tests failing. 🙈 Long story short, thanks for bringing them back.
#9
@
5 years ago
I really like the effort made by @dmsnell in 47014-1.diff to expand the regex pattern with comments, it's really helpful. I wish this would be a standard in core :-)
I think it would also be nice to have the stripped regex pattern available, either as part of an inline comment or vice versa, with the expanded version as a comment.
What do you think about capitalizing the regex comments and end each line with a dot, when possible?
I would suggest adding to the test cases in data_custom_elements()
, to cover all the capturing groups in the regex. Here's an example:
<my-custom-element data-attribute="value"/>
to cover the 4th and the 5th capturing groups.
And e.g. <my-custom-element/>
and <my-custom-element />
to the existing cases for the first three capturing groups.
Can this be a strict comparison:
$is_single_tag = in_array( $tag, $single_tags );
as flagged by the coding standard check?
#10
@
5 years ago
So I really just noticed these changes actually. I swear I didn't make them (at least not on purpose), I don't know how they got into this patch.
The same thing just happened to me on another ticket but with different changes. I don't know how those got there either.
it's really helpful. I wish this would be a standard in core :-)
thanks for the kind words @birgire - I wasn't sure how it would be received. RegExp patterns are source code and I like to treat them that way beyond trivial matches: indentation, comments, etc…
have the stripped regex pattern available, either as part of an inline comment or vice versa, with the expanded version as a comment
at this point I have a hard time knowing how to feel. if we add the description as a comment it's bound to be wrong as soon as the first change to the pattern occurs 🙃
before I submitted I thought about including the full pattern by itself in a comment to make it easier for people to copy and paste into something like regex101.com. I came across the same problem though which was that keeping the comment and the actual pattern in sync was hard - even while developing it.
Having the pattern written as code made it easier for me to adjust it as I caught a few bugs in the tests - easier than the pattern without the structure and comments was. However, leaving that in as a full comment feels more like a teaching tool than code and we lose the benefit to maintaining it.
Sorry, this is me rambling on. I'll lean on the preferences people share here, but my experience tells me that it's valuable to have only one source of it based on explanatory RegExp comments that I've seen that described different RegExp patterns than they said they did 😛
What do you think about capitalizing the regex comments and end each line with a dot, when possible?
Will update the patch - I'm not opposed at all, just in a different habit.
Can this be a strict comparison:
Thanks! Good catch.
I would suggest adding to the test cases in data_custom_elements()
Also a good idea - I'll try and add those in today.
#11
@
5 years ago
Good point @dmsnell regarding the extra maintenance burden of having the regex pattern duplicated.
Having an easy way to remove the comments from the regex pattern, e.g. from the PHP shell example you provided, sounds good to me, if one needs to test it further with regex tools.
For the tests in 47014-4.diff I would suggest :
- adding
@ticket 47014
annotation to newtest_
methods @since 5.3.0
annotation to new nontest_
methods.- adding some missing end of line dots (also in tests)
- e.g. use
@see
annotation in the comment:// see http...
.
PS: also few missing end of line dots, both in code comments and tests comments.
All the best
#13
@
5 years ago
47014.2.diff makes the suggested doc improvements and fixes some related WPCS violations. This has extensive test coverage and should be good to go.
Fix with tests