#60385 closed defect (bug) (fixed)
HTML API: Text nodes may be incorrectly split
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 6.5 | Priority: | normal |
| Severity: | normal | Version: | 6.5 |
| Component: | HTML API | Keywords: | has-patch has-unit-tests |
| Focuses: | Cc: |
Description
Certain inputs cause WP_HTML_Tag_Processor::next_token() to split text nodes that should be a single text node.
Example input: test< /A>
This should produce a single text node of test< /A> (via ::get_modifiable_text()). Instead it finds two adjacent text nodes of test immediately followed by < /A>.
See the single text node parsed here.
This was found thanks to the external test suite proposed in #60227.
Change History (7)
#2
@
23 months ago
- Keywords has-patch has-unit-tests added
- Milestone changed from Awaiting Review to 6.5
- Version set to trunk
This ticket was mentioned in PR #5976 on WordPress/wordpress-develop by @jonsurrell.
23 months ago
#3
Fix an issue where a < character will always break a text node.
It should not break a text node if we will not start another node type, i.e. we find < followed by !, /, ? or an ascii alpha character.
Trac ticket: https://core.trac.wordpress.org/ticket/60385
#4
@
23 months ago
- Resolution fixed deleted
- Status changed from closed to reopened
@dmsnell noticed that there's a remaining bug if < is followed by:
[\]^_- ` (backtick)
I'll work on a fix.
This ticket was mentioned in PR #6041 on WordPress/wordpress-develop by @dmsnell.
23 months ago
#5
A fix was introduced to the Tag Processor to ensure that contiguous text in an HTML document emerges as a single text node spanning the full sequence. Unfortunately, that patch was marginally over-zealous in checking if a "<" started a syntax token or not. It used the following:
if ( 'A' <= $c && 'z' >= $c ) { ... }
This was based on the assumption that the A-Z and a-z letters are contiguous in the ASCII range; they aren't, and there's a gap of several characters in between. The result of this is that in some cases the parser created a text boundary when it didn't need to. Text boundaries can be surprising and can be created when reaching invalid syntax, HTML comments, and more hidden elements, so semantically this wasn't a major bug, but it was an aesthetic challenge.
In this patch the check is properly compared for both upper- and lower-case variants that could potentially form tag names.
if ( ( 'A' <= $c && 'Z' >= $c ) || ( 'a' <= $c && 'z' >= $c ) ) { ... }
This solves the problem and ensures that contiguous text appears as a single text node when scanning tokens.
Follow-up to [57489]
Props dmsnell, jonsurrell
See Core-60385
In 57489: