Make WordPress Core

Opened 5 days ago

Closed 5 days ago

Last modified 5 days ago

#61974 closed enhancement (fixed)

HTML API: Add method to split text nodes by null or empty prefixes.

Reported by: dmsnell Owned by: dmsnell
Milestone: 6.7 Priority: normal
Severity: normal Version: trunk
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

There are places in the HTML Processor that need to parse differently according to whether text content is a sequence of NULL bytes or whitespace characters after decoding. It's awkward and inefficient to do this within the HTML Processor, however, as it requires eagerly decoding text nodes.

The Tag Processor could expose a method to efficiently split apart a text node when needed, and then classify it, to aid in the parsing. This method could further be used to identify inter-element whitespace, which is usually ignored when rendering HTML.
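To illustrate the idea of splitting off and classifying a prefix, here is a minimal standalone sketch. The function name `split_text_prefix()` and the classification labels are hypothetical and are not part of the HTML API. It works on raw bytes only and, as a simplifying assumption, ignores character references such as `&#x20;` that decode to whitespace.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: splits raw text into a leading
 * run of NULL bytes or HTML whitespace and the text that follows it.
 *
 * Simplification: character references (e.g. "&#x20;") are not recognized.
 *
 * @param string $text Raw, undecoded text.
 * @return array Classification label, prefix, and remaining text.
 */
function split_text_prefix( string $text ): array {
	// A leading run of NULL bytes forms its own sub-region.
	$null_length = strspn( $text, "\x00" );
	if ( $null_length > 0 ) {
		return array( 'null-sequence', substr( $text, 0, $null_length ), substr( $text, $null_length ) );
	}

	// HTML whitespace: space, tab, line feed, form feed, carriage return.
	$whitespace_length = strspn( $text, " \t\n\f\r" );
	if ( $whitespace_length > 0 ) {
		return array( 'whitespace', substr( $text, 0, $whitespace_length ), substr( $text, $whitespace_length ) );
	}

	// Anything else is left whole, with an empty prefix.
	return array( 'generic', '', $text );
}

list( $kind, $prefix, $rest ) = split_text_prefix( "\x00\x00 \tStuff" );
printf( "%s: %d byte(s) split off, %d byte(s) remain\n", $kind, strlen( $prefix ), strlen( $rest ) );
```

Calling the helper again on the remainder would then split off the whitespace run, leaving `Stuff` as generic text.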

Change History (4)

This ticket was mentioned in PR #7236 on WordPress/wordpress-develop by @dmsnell.


5 days ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-61974

HTML parsing rules at times differentiate character tokens that are all null bytes, all whitespace, or other content. This patch introduces a new function which may be used to classify text node sub-regions and lead to more efficient application of these parsing rules.

Further, when classified in this way, application code may skip some rules and decoding entirely, improving performance.
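As a rough illustration of how classification can avoid decoding, the sketch below returns whitespace-only text untouched and decodes everything else. The helper name `decode_unless_plain_whitespace()` is hypothetical, and PHP's built-in `html_entity_decode()` stands in for the HTML API's own decoder; this is an assumption for the sake of a self-contained example, not how the patch is implemented.

```php
<?php
/**
 * Hypothetical helper: decodes raw text only when decoding could change it.
 *
 * @param string $raw          Raw, undecoded text.
 * @param int    $decode_calls Counter incremented each time decoding runs.
 * @return string Decoded text.
 */
function decode_unless_plain_whitespace( string $raw, int &$decode_calls ): string {
	// If every byte is HTML whitespace there is no "&" to decode,
	// so the raw bytes are already the decoded text.
	if ( strspn( $raw, " \t\n\f\r" ) === strlen( $raw ) ) {
		return $raw;
	}

	++$decode_calls;
	// Stand-in decoder; the HTML API uses its own decoding logic.
	return html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
}

$decode_calls = 0;
$chunks       = array( "\n\t", 'a &amp; b', ' ' );
$decoded      = array();
foreach ( $chunks as $chunk ) {
	$decoded[] = decode_unless_plain_whitespace( $chunk, $decode_calls );
}

printf( "Decoded %d of %d chunks\n", $decode_calls, count( $chunks ) );
```

Only the chunk containing a character reference reaches the decoder; the two whitespace-only chunks, which resemble inter-element whitespace, take the fast path.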

## Example script

<?php

require_once __DIR__ . '/src/wp-load.php';

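// Anonymous subclass that exposes the byte length of the current token, for display only.
$p = new class( "\x00\n\r&#x000000020;\n\x00\x00 \tStuff\x00\f<p>&#x13; &#9;\ntext<p>a \x00b<p>\x00\x00" ) extends WP_HTML_Tag_Processor {
        public function get_token_length() {
                // Bookmarking the current token records its span in the source HTML.
                $this->set_bookmark( 'here' );
                return $this->bookmarks['here']->length;
        }
};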

while ( $p->next_token() ) {
        if ( '#text' !== $p->get_token_name() ) {
                echo "\e[3;90mSkipping \e[0;2;32m{$p->get_token_name()}\e[m\n";
                continue;
        }

        $did_split = $p->subdivide_text_appropriately();
        $text = $p->get_modifiable_text();
        // Show control characters as visible symbols; NULL bytes and newlines get distinct glyphs.
        $text = str_replace( [ "\x00", "\t", "\f", "\r", "\n" ], [ '␀', '␉', '␌', '␍', '␤' ], $text );
        $after = $did_split ? " \e[3mafter splitting" : '';
        $length = $p->get_token_length();
        echo "\e[90mFound (\e[33m{$length}\e[90m) '\e[34m{$text}\e[90m'{$after}\e[m\n";
}

https://github.com/user-attachments/assets/84d1c11d-939c-4662-b5af-31e43221d914

## html5lib tests

-Tests: 2433, Assertions: 4116, Skipped: 421.
+Tests: 2433, Assertions: 4112, Skipped: 425.

@dmsnell commented on PR #7236:


5 days ago
#2

Sorry for the late ticket creation, but I thought it was best to track this separately as a feature enhancement. It's mostly an internal method, but since application code could potentially use it, it can have its own ticket.

#3 @dmsnell
5 days ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 58970:

HTML API: Allow subdividing text nodes by meaningful prefixes.

HTML parsing rules at times differentiate character tokens that are all null bytes, all whitespace, or other content. This patch introduces a new function which may be used to classify text node sub-regions and lead to more efficient application of these parsing rules.

Further, when classified in this way, application code may skip some rules and decoding entirely, improving performance. For example, this can be used to ease the implementation of skipping inter-element whitespace, which is usually not rendered.

Developed in https://github.com/WordPress/wordpress-develop/pull/7236
Discussed in https://core.trac.wordpress.org/ticket/61974

Props dmsnell, jonsurrell.
Fixes #61974.
