Make WordPress Core

Changeset 57542


Timestamp:
02/06/2024 07:21:36 PM
Author:
dmsnell
Message:

HTML API: Join text nodes on invalid-tag-name boundaries.

A fix was introduced to the Tag Processor to ensure that contiguous text
in an HTML document emerges as a single text node spanning the full
sequence. Unfortunately, that patch was marginally over-zealous in
checking if a "<" started a syntax token or not. It used the following:

<?php
if ( 'A' <= $c && 'z' >= $c ) { ... }

This was based on the assumption that the letters A-Z and a-z are
contiguous in ASCII. They aren't: six characters ("[", "\", "]", "^",
"_", and "`") sit between "Z" and "a". As a result, a "<" followed by
one of those characters ended the text node even though it cannot
start a tag, so the parser created a text boundary it didn't need.
Text boundaries can already be surprising, since they also arise at
invalid syntax, HTML comments, and other hidden elements, so
semantically this wasn't a major bug, but it was an aesthetic
problem.
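
The gap can be enumerated directly. The following is a minimal sketch
and is not code from the patch; the helper name
non_letters_in_loose_range() is hypothetical.

```php
<?php
// Hypothetical helper (not part of WordPress): collect the ASCII characters
// that pass the loose check 'A' <= $c && $c <= 'z' without being letters.
function non_letters_in_loose_range(): array {
	$extra = array();
	for ( $code = ord( 'A' ); $code <= ord( 'z' ); $code++ ) {
		$c = chr( $code );
		// Same comparison as the original check, applied to each candidate.
		if ( 'A' <= $c && $c <= 'z' && ! ctype_alpha( $c ) ) {
			$extra[] = $c;
		}
	}
	return $extra;
}

// Prints the six characters between "Z" and "a": [ \ ] ^ _ `
echo implode( ' ', non_letters_in_loose_range() ), "\n";
```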

In this patch the check compares against the upper- and lower-case
ranges separately, so only letters that could start a tag name match.

<?php
if ( ( 'A' <= $c && 'Z' >= $c ) || ( 'a' <= $c && 'z' >= $c ) ) { ... }

This solves the problem and ensures that contiguous text appears
as a single text node when scanning tokens.
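
The effect on text node boundaries can be illustrated with a simplified
model of the scan. This is a sketch under stated assumptions, not the
Tag Processor's implementation: the function names loose_check(),
strict_check(), and text_node_length() are hypothetical, the scan always
starts at offset 0, and a "<" at the very end of the input is assumed to
be plain text.

```php
<?php
// Simplified model only; NOT the WordPress implementation.
// All function names below are hypothetical.

// The original check: also matches [ \ ] ^ _ and `.
function loose_check( string $c ): bool {
	return 'A' <= $c && $c <= 'z';
}

// The corrected check: matches ASCII letters only.
function strict_check( string $c ): bool {
	return ( 'A' <= $c && $c <= 'Z' ) || ( 'a' <= $c && $c <= 'z' );
}

/**
 * Returns the length of the text node that starts at offset 0, i.e. the
 * offset of the first "<" that could open a syntax token, or the full
 * length of the input if no such "<" exists.
 */
function text_node_length( string $html, callable $is_tag_name_start ): int {
	$at = 0;
	while ( false !== ( $at = strpos( $html, '<', $at ) ) ) {
		if ( strlen( $html ) > $at + 1 ) {
			$next = $html[ $at + 1 ];
			if (
				'!' === $next ||
				'/' === $next ||
				'?' === $next ||
				$is_tag_name_start( $next )
			) {
				return $at; // A token may start here; the text node ends.
			}
		}
		++$at; // Treat this "<" as plain text and keep searching.
	}
	return strlen( $html );
}

var_dump( text_node_length( 'a <_ b', 'loose_check' ) );  // int(2): text is split at "<_"
var_dump( text_node_length( 'a <_ b', 'strict_check' ) ); // int(6): one contiguous text node
```

With the loose check the text node stops at the "<" before "_", while
the strict check lets the whole input form a single text node.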

Developed in https://github.com/WordPress/wordpress-develop/pull/6041
Discussed in https://core.trac.wordpress.org/ticket/60385

Follow-up to [57489]
Props dmsnell, jonsurrell
Fixes #60385

File:
1 edited

  • trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php

    --- trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php (r57527)
    +++ trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php (r57542)
    @@ -1529,8 +1529,12 @@
                 if ( $at > $was_at ) {
                     /*
    -                 * A "<" has been found in the document. That may be the start of another node, or
    -                 * it may be an "ivalid-first-character-of-tag-name" error. If this is not the start
    -                 * of another node the "<" should be included in this text node and another
    -                 * termination point should be found for the text node.
    +                 * A "<" normally starts a new HTML tag or syntax token, but in cases where the
    +                 * following character can't produce a valid token, the "<" is instead treated
    +                 * as plaintext and the parser should skip over it. This avoids a problem when
    +                 * following earlier practices of typing emoji with text, e.g. "<3". This
    +                 * should be a heart, not a tag. It's supposed to be rendered, not hidden.
    +                 *
    +                 * At this point the parser checks if this is one of those cases and if it is
    +                 * will continue searching for the next "<" in search of a token boundary.
                      *
                      * @see https://html.spec.whatwg.org/#tag-open-state
    @@ -1538,9 +1542,11 @@
                     if ( strlen( $html ) > $at + 1 ) {
                         $next_character  = $html[ $at + 1 ];
    -                    $at_another_node =
    +                    $at_another_node = (
                             '!' === $next_character ||
                             '/' === $next_character ||
                             '?' === $next_character ||
    -                        ( 'A' <= $next_character && $next_character <= 'z' );
    +                        ( 'A' <= $next_character && $next_character <= 'Z' ) ||
    +                        ( 'a' <= $next_character && $next_character <= 'z' )
    +                    );
                         if ( ! $at_another_node ) {
                             ++$at;