Make WordPress Core

Opened 14 months ago

Closed 14 months ago

Last modified 14 months ago

#60385 closed defect (bug) (fixed)

HTML API: Text nodes may be incorrectly split

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 6.5 Priority: normal
Severity: normal Version: 6.5
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Certain inputs cause WP_HTML_Tag_Processor::next_token() to split text nodes that should be a single text node.

Example input: test< /A>

This should produce a single text node of test< /A> (via ::get_modifiable_text()). Instead it finds two adjacent text nodes of test immediately followed by < /A>.

See the single text node parsed here.

This was found thanks to the external test suite proposed in #60227.

Change History (7)

#1 @dmsnell
14 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 57489:

HTML API: Fix splitting single text node.

When next_token() was introduced, it brought a subtle bug. When encountering a < in the HTML stream which did not lead to a tag or comment or other token, it was treating the full text span to that point as one text node, and the following span another text node.

The entire span should be one text node.

In this patch the Tag Processor properly detects this scenario and combines the spans into one text node.

Follow-up to [57348]

Props jonsurrell
Fixes #60385

#2 @dmsnell
14 months ago

  • Keywords has-patch has-unit-tests added
  • Milestone changed from Awaiting Review to 6.5
  • Version set to trunk

This ticket was mentioned in PR #5976 on WordPress/wordpress-develop by @jonsurrell.


14 months ago
#3

Fix an issue where a < character will always break a text node.

It should not break a text node if we will not start another node type, i.e. we find < followed by !, /, ? or an ascii alpha character.

Trac ticket: https://core.trac.wordpress.org/ticket/60385

#4 @jonsurrell
14 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

@dmsnell noticed that there's a remaining bug if < is followed by:

  • [
  • \
  • ]
  • ^
  • _
  • ` (backtick)

I'll work on a fix.

This ticket was mentioned in PR #6041 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#5

A fix was introduced to the Tag Processor to ensure that contiguous text in an HTML document emerges as a single text node spanning the full sequence. Unfortunately, that patch was marginally over-zealous in checking if a "<" started a syntax token or not. It used the following:

if ( 'A' <= $c && 'z' >= $c ) { ... }

This was based on the assumption that the A-Z and a-z letters are contiguous in the ASCII range; they aren't, and there's a gap of several characters in between. The result of this is that in some cases the parser created a text boundary when it didn't need to. Text boundaries can be surprising and can be created when reaching invalid syntax, HTML comments, and more hidden elements, so semantically this wasn't a major bug, but it was an aesthetic challenge.

In this patch the check is properly compared for both upper- and lower-case variants that could potentially form tag names.

if ( ( 'A' <= $c && 'Z' >= $c ) || ( 'a' <= $c && 'z' >= $c ) ) { ... }

This solves the problem and ensures that contiguous text appears as a single text node when scanning tokens.

Follow-up to [57489]
Props dmsnell, jonsurrell
See Core-60385

#6 @dmsnell
14 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 57542:

HTML API: Join text nodes on invalid-tag-name boundaries.

A fix was introduced to the Tag Processor to ensure that contiguous text
in an HTML document emerges as a single text node spanning the full
sequence. Unfortunately, that patch was marginally over-zealous in
checking if a "<" started a syntax token or not. It used the following:

<?php
if ( 'A' <= $c && 'z' >= $c ) { ... }

This was based on the assumption that the A-Z and a-z letters are
contiguous in the ASCII range; they aren't, and there's a gap of
several characters in between. The result of this is that in some
cases the parser created a text boundary when it didn't need to.
Text boundaries can be surprising and can be created when reaching
invalid syntax, HTML comments, and more hidden elements, so
semantically this wasn't a major bug, but it was an aesthetic
challenge.

In this patch the check is properly compared for both upper- and
lower-case variants that could potentially form tag names.

<?php
if ( ( 'A' <= $c && 'Z' >= $c ) || ( 'a' <= $c && 'z' >= $c ) ) { ... }

This solves the problem and ensures that contiguous text appears
as a single text node when scanning tokens.

Developed in https://github.com/WordPress/wordpress-develop/pull/6041
Discussed in https://core.trac.wordpress.org/ticket/60385

Follow-up to [57489]
Props dmsnell, jonsurrell
Fixes #60385

Note: See TracTickets for help on using tickets.