#61974 closed enhancement (fixed)

HTML API: Add method to split text nodes by null or empty prefixes.

Reported by:	dmsnell	Owned by:	dmsnell
Milestone:	6.7	Priority:	normal
Severity:	normal	Version:	6.7
Component:	HTML API	Keywords:	has-patch has-unit-tests
Focuses:		Cc:

Description

There are places in the HTML Processor that need to parse differently according to whether text content is a sequence of NULL bytes or whitespace characters after decoding. It's awkward and inefficient to do this within the HTML Processor, however, as it requires eagerly decoding text nodes.

The Tag Processor could expose a method to efficiently split apart a text node when needed, and then classify it, to aid in the parsing. This method could further be used to identify inter-element whitespace, which is usually ignored when rendering HTML.

Change History (4)

This ticket was mentioned in PR #7236 on WordPress/wordpress-develop by @dmsnell.

16 months ago #1

Keywords has-patch has-unit-tests added

Trac ticket: Core-61974

HTML parsing rules at times differentiate character tokens that are all null bytes, all whitespace, or other content. This patch introduces a new function which may be used to classify text node sub-regions and lead to more efficient application of these parsing rules.

Further, when classified in this way, application code may skip some rules and decoding entirely, improving performance.
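To make the classification idea concrete, here is a minimal standalone sketch. It is not the code from the patch: `sketch_classify_text_prefix()` and its return labels are hypothetical names invented for illustration, and it only understands raw bytes plus numeric character references (named references are ignored). It shows how a leading run of NULL bytes or whitespace can be measured without decoding the whole text node.

```php
<?php
/**
 * Hypothetical helper, not part of WordPress: reports how a raw text
 * span begins and how many bytes that leading run covers.
 *
 * @return array{0: string, 1: int} Classification label and prefix byte length.
 */
function sketch_classify_text_prefix( string $raw ): array {
	$length = strlen( $raw );

	// A run of raw NULL bytes is checked first; "&#0;" is deliberately not counted.
	$nulls = strspn( $raw, "\x00" );
	if ( $nulls > 0 ) {
		return array( 'null-sequence', $nulls );
	}

	$at = 0;
	while ( $at < $length ) {
		// Skip raw HTML whitespace bytes without decoding anything.
		$at += strspn( $raw, " \t\f\r\n", $at );

		// Inspect a numeric character reference only far enough to learn
		// whether it stands for one of the whitespace characters.
		if ( 1 === preg_match( '/\G&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));?/', $raw, $m, 0, $at ) ) {
			$code_point = '' !== $m[1] ? hexdec( $m[1] ) : (int) $m[2];
			if ( in_array( $code_point, array( 0x09, 0x0A, 0x0C, 0x0D, 0x20 ), true ) ) {
				$at += strlen( $m[0] );
				continue;
			}
		}
		break;
	}

	// No whitespace prefix means the whole span is treated as generic text.
	return $at > 0 ? array( 'whitespace', $at ) : array( 'generic', $length );
}
```

A caller could split the text node at the returned byte offset and apply the NULL-byte or whitespace rules to the prefix alone, leaving the remainder undecoded until it is actually needed.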

## Example script

<?php

require_once __DIR__ . '/src/wp-load.php';

$p = new class( "\x00\n\r&#x000000020;\n\x00\x00 \tStuff\x00\f<p>&#x13; &#9;\ntext<p>a \x00b<p>\x00\x00" ) extends WP_HTML_Tag_Processor {
        public function get_token_length() {
                $this->set_bookmark('here');
                return $this->bookmarks['here']->length;
        }
};

while ( $p->next_token() ) {
        if ( '#text' !== $p->get_token_name() ) {
                echo "\e[3;90mSkipping \e[0;2;32m{$p->get_token_name()}\e[m\n";
                continue;
        }

        $did_split = $p->subdivide_text_appropriately();
        $text = $p->get_modifiable_text();
        $text = str_replace( [ "\x00", "\t", "\f", "\r", "\n" ], [ '␀', '␉', '␌', '␍', '␤' ], $text );
        $after = $did_split ? " \e[3mafter splitting" : '';
        $length = $p->get_token_length();
        echo "\e[90mFound (\e[33m{$length}\e[90m) '\e[34m{$text}\e[90m'{$after}\e[m\n";
}

## html5lib tests

-Tests: 2433, Assertions: 4116, Skipped: 421.
+Tests: 2433, Assertions: 4112, Skipped: 425.

@dmsnell commented on PR #7236:

16 months ago #2

Sorry for the late ticket creation, but I thought it was best to separate this as a feature enhancement. It's mostly an internal method, but since it has potential use to application code, it can retain its own ticket.

#3 @dmsnell
16 months ago

Owner set to dmsnell
Resolution set to fixed
Status changed from new to closed

In 58970:

HTML API: Allow subdividing text nodes by meaningful prefixes.

Further, when classified in this way, application code may skip some rules and decoding entirely, improving performance. For example, this can be used to ease the implementation of skipping inter-element whitespace, which is usually not rendered.

Developed in https://github.com/WordPress/wordpress-develop/pull/7236
Discussed in https://core.trac.wordpress.org/ticket/61974

Props dmsnell, jonsurrell.
Fixes #61974.

@dmsnell commented on PR #7236:

16 months ago #4

Merged in [58970]
https://github.com/wordpress/wordpress-develop/commit/95eb879c47c52413f33c3a62f3006262cc2b0062
