#60385 closed defect (bug) (fixed)
HTML API: Text nodes may be incorrectly split
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | 6.5 | Priority: | normal |
Severity: | normal | Version: | 6.5 |
Component: | HTML API | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
Certain inputs cause WP_HTML_Tag_Processor::next_token()
to split text nodes that should be a single text node.
Example input: test< /A>
This should produce a single text node of test< /A>
(via ::get_modifiable_text()
). Instead it finds two adjacent text nodes of test
immediately followed by < /A>
.
See the single text node parsed here.
This was found thanks to the external test suite proposed in #60227.
Change History (7)
#2
@
14 months ago
- Keywords has-patch has-unit-tests added
- Milestone changed from Awaiting Review to 6.5
- Version set to trunk
This ticket was mentioned in PR #5976 on WordPress/wordpress-develop by @jonsurrell.
14 months ago
#3
Fix an issue where a <
character will always break a text node.
It should not break a text node if we will not start another node type, i.e. we find <
followed by !
, /
, ?
or an ascii alpha character.
Trac ticket: https://core.trac.wordpress.org/ticket/60385
#4
@
14 months ago
- Resolution fixed deleted
- Status changed from closed to reopened
@dmsnell noticed that there's a remaining bug if <
is followed by:
[
\
]
^
_
- ` (backtick)
I'll work on a fix.
This ticket was mentioned in PR #6041 on WordPress/wordpress-develop by @dmsnell.
14 months ago
#5
A fix was introduced to the Tag Processor to ensure that contiguous text in an HTML document emerges as a single text node spanning the full sequence. Unfortunately, that patch was marginally over-zealous in checking if a "<" started a syntax token or not. It used the following:
if ( 'A' <= $c && 'z' >= $c ) { ... }
This was based on the assumption that the A-Z and a-z letters are contiguous in the ASCII range; they aren't, and there's a gap of several characters in between. The result of this is that in some cases the parser created a text boundary when it didn't need to. Text boundaries can be surprising and can be created when reaching invalid syntax, HTML comments, and more hidden elements, so semantically this wasn't a major bug, but it was an aesthetic challenge.
In this patch the check is properly compared for both upper- and lower-case variants that could potentially form tag names.
if ( ( 'A' <= $c && 'Z' >= $c ) || ( 'a' <= $c && 'z' >= $c ) ) { ... }
This solves the problem and ensures that contiguous text appears as a single text node when scanning tokens.
Follow-up to [57489]
Props dmsnell, jonsurrell
See Core-60385
In 57489: