Make WordPress Core

Opened 15 months ago

Closed 11 months ago

Last modified 11 months ago

#60170 closed enhancement (fixed)

HTML API: Scan every token in an HTML document.

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.5 Priority: normal
Severity: normal Version: 6.5
Component: HTML API Keywords: has-patch has-unit-tests has-dev-note
Focuses: Cc:

Description

The Tag Processor currently visits every syntax element in an HTML document but it only exposes inspecting and modifying tags. In order to build certain features it needs to support visiting every kind of syntax token. In this patch the HTML Tag Processor exposes a new system for doing that.

Adding full token-scanning opens up new opportunities, including but not limited to the following list:

  • A version of wp_truncate_html() that only parses as much HTML as is necessary and which provides the ability to preserve the HTML tags while ignoring their contribution to the HTML string length. gist
  • Improve the reliability (and potentially the performance of) existing functions like wp_strip_all_tags().
  • Introduce new filtering pipelines to coalesce multiple filters into a single pass that only operates on text nodes, skipping multiple iterations of multiple buggy regular-expression-based transforms.
  • Rendering HTML as text.

This patch introduces the concept of modifiable text, which represents plain text inside an HTML document which cannot contain other tags or syntax tokens. #text nodes are modifiable text, as are the contents of a SCRIPT or STYLE element, and also the text inside an HTML comment is modifiable text. Modifiable text can be modified without changing the structure of the HTML document.

This patch introduces new methods on the Tag Processor:

  • get_token_type() reports what kind of token is currently matched, e.g. a tag or a comment.
  • get_token_name() roughly reports the DOM name that would be assigned to the node in a browser.
  • get_modifiable_text() returns the modifiable text for a node..

There is no ability provided yet to modify the modifiable text. See the Github PR for more elaborate descriptions of the new modifiable text concept.

Change History (22)

#1 @dmsnell
14 months ago

  • Milestone changed from Awaiting Review to 6.5

#2 @dmsnell
14 months ago

Pending any blockers, I'd like to merge this in the next few days and iterate after merge.

There is a known issue handling CDATA sections within SVG elements whose resolution is not clear, particularly because the browsers handle things differently from each other and from the HTML specification. The approach taken at the moment is conservative and will avoid trashing a post in one of the rare cases that a CDATA section in an SVG element containing a > character is encountered.

#3 @dmsnell
14 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 57348:

HTML API: Scan all syntax tokens in a document, read modifiable text.

Since its introduction in WordPress 6.2 the HTML Tag Processor has
provided a way to scan through all of the HTML tags in a document and
then read and modify their attributes. In order to reliably do this, it
also needed to be aware of other kinds of HTML syntax, but it didn't
expose those syntax tokens to consumers of the API.

In this patch the Tag Processor introduces a new scanning method and a
few helper methods to read information about or from each token. Most
significantly, this introduces the ability to read #text nodes in the
document.

What's new in the Tag Processor?
================================

  • next_token() visits every distinct syntax token in a document.
  • get_token_type() indicates what kind of token it is.
  • get_token_name() returns something akin to DOMNode.nodeName.
  • get_modifiable_text() returns the text associated with a token.
  • get_comment_type() indicates why a token represents an HTML comment.

Example usage.
==============

<?php
function strip_all_tags( $html ) {
        $text_content = '';
        $processor    = new WP_HTML_Tag_Processor( $html );

        while ( $processor->next_token() ) {
                if ( '#text' !== $processor->get_token_type() ) {
                        continue;
                }

                $text_content .= $processor->get_modifiable_text();
        }

        return $text_content;
}

What changes in the Tag Processor?
==================================

Previously, the Tag Processor would scan the opening and closing tag of
every HTML element separately. Now, however, there are special tags
which it only visits once, as if those elements were void tags without
a closer.

These are special tags because their content contains no other HTML or
markup, only non-HTML content.

  • SCRIPT elements contain raw text which is isolated from the rest of the HTML document and fed separately into a JavaScript engine. There are complicated rules to avoid escaping the script context in the HTML. The contents are left verbatim, and character references are not decoded.
  • TEXTARA and TITLE elements contain plain text which is decoded before display, e.g. transforming &amp; into &. Any markup which resembles tags is treated as verbatim text and not a tag.
  • IFRAME, NOEMBED, NOFRAMES, STYLE, and XMP elements are similar to the textarea and title elements, but no character references are decoded. For example, &amp; inside a STYLE element is passed to the CSS engine as the literal string &amp; and _not_ as &.

Because it's important not treat this inner content separately from the
elements containing it, the Tag Processor combines them when scanning
into a single match and makes their content available as modifiable
text (see below).

This means that the Tag Processor will no longer visit a closing tag for
any of these elements unless that tag is unexpected.

    <title>There is only a single token in this line</title>
    <title>There are two tokens in this line></title></title>
    </title><title>There are still two tokens in this line></title>

What are tokens?
================

The term "token" here is a parsing term, which means a primitive unit in
HTML. There are only a few kinds of tokens in HTML:

  • a tag has a name, attributes, and a closing or self-closing flag.
  • a text node, or #text node contains plain text which is displayed in a browser and which is decoded before display.
  • a DOCTYPE declaration indicates how to parse the document.
  • a comment is hidden from the display on a page but present in the HTML.

There are a few more kinds of tokens that the HTML Tag Processor will
recognize, some of which don't exist as concepts in HTML. These mostly
comprise XML syntax elements that aren't part of HTML (such as CDATA and
processing instructions) and invalid HTML syntax that transforms into
comments.

What is a funky comment?
========================

This patch treats a specific kind of invalid comment in a special way.
A closing tag with an invalid name is considered a "funky comment." In
the browser these become HTML comments just like any other, but their
syntax is convenient for representing a variety of bits of information
in a well-defined way and which cannot be nested or recursive, given
the parsing rules handling this invalid syntax.

  • </1>
  • </%avatar_url>
  • </{"wp_bit": {"type": "post-author"}}>
  • </[post-author]>
  • </__( 'Save Post' );>

All of these examples become HTML comments in the browser. The content
inside the funky content is easily parsable, whereby the only rule is
that it starts at the < and continues until the nearest >. There
can be no funky comment inside another, because that would imply having
a > inside of one, which would actually terminate the first one.

What is modifiable text?
========================

Modifiable text is similar to the innerText property of a DOM node.
It represents the span of text for a given token which may be modified
without changing the structure of the HTML document or the token.

There is currently no mechanism to change the modifiable text, but this
is planned to arrive in a later patch.

Tags
====

Most tags have no modifiable text because they have child nodes where
text nodes are found. Only the special tags mentioned above have
modifiable text.

    <div class="post">Another day in HTML</div>
    └─ tag ──────────┘└─ text node ─────┘└────┴─ tag
    <title>Is <img> &gt; <image>?</title>
    │      └ modifiable text ───┘       │ "Is <img> > <image>?"
    └─ tag ─────────────────────────────┘

Text nodes
==========

Text nodes are entirely modifiable text.

    This HTML document has no tags.
    └─ modifiable text ───────────┘

Comments
========

The modifiable text inside a comment is the portion of the comment that
doesn't form its syntax. This applies for a number of invalid comments.

    <!-- this is inside a comment -->
    │   └─ modifiable text ──────┘  │
    └─ comment token ───────────────┘
    <!-->
    This invalid comment has no modifiable text.
    <? this is an invalid comment -->
    │ └─ modifiable text ────────┘  │
    └─ comment token ───────────────┘
    <[CDATA[this is an invalid comment]]>
    │       └─ modifiable text ───────┘ │
    └─ comment token ───────────────────┘

Other token types also have modifiable text. Consult the code or tests
for further information.

Developed in https://github.com/WordPress/wordpress-develop/pull/5683
Discussed in https://core.trac.wordpress.org/ticket/60170

Follows [57575]

Props bernhard-reiter, dlh, dmsnell, jonsurrell, zieladam
Fixes #60170

#4 @westonruter
14 months ago

It seems that r57348 introduced a regression where not all closing tags are visited. See #60392.

#5 @dmsnell
14 months ago

@westonruter check out the section above titled What changes in the Tag Processor? I've also left a note in the linked ticket. These special tags are now handled atomically in order to guard the fact that contents inside of them is special and needs special rules for escaping and parsing.

#6 @dmsnell
14 months ago

  • Keywords needs-dev-note added

This ticket was mentioned in PR #6021 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#7

When parser states were introduced in WordPress/wordpress-develop#5683, nothing in the seek() method reset the parser state. This is problematic because it could leave the parser in the wrong state.

In this patch the parser state is reset so that it get's properly adjusted on the successive call to next_token().

Follows [57348]
See Core-60170

Props @kevin940726 for finding and reporting.

This ticket was mentioned in PR #6021 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#8

  • Keywords has-unit-tests added

When parser states were introduced in WordPress/wordpress-develop#5683, nothing in the seek() method reset the parser state. This is problematic because it could leave the parser in the wrong state.

In this patch the parser state is reset so that it get's properly adjusted on the successive call to next_token().

Follows [57348]
See Trac ticket 60170

Props @kevin940726 for finding and reporting.

This ticket was mentioned in PR #6021 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#9

Trac ticket: Core-60428

When parser states were introduced in WordPress/wordpress-develop#5725, nothing in the seek() method reset the parser state. This is problematic because it could leave the parser in the wrong state.

In this patch the parser state is reset so that it get's properly adjusted on the successive call to next_token().

Follows [57211]
See Trac ticket 60170

Props @kevin940726 for finding and reporting.

This ticket was mentioned in PR #6021 on WordPress/wordpress-develop by @dmsnell.


14 months ago
#10

Trac ticket: Core-60428

When parser states were introduced in WordPress/wordpress-develop#5725, nothing in the seek() method reset the parser state. This is problematic because it could leave the parser in the wrong state.

In this patch the parser state is reset so that it get's properly adjusted on the successive call to next_token().

Follows [57211]
See Trac ticket 60170

Props @kevin940726 for finding and reporting.

#12 @sabernhardt
13 months ago

  • Keywords has-dev-note added; needs-dev-note removed

This ticket was mentioned in PR #6412 on WordPress/wordpress-develop by @dmsnell.


11 months ago
#13

Fixes Core-60170

Funky comments must be at least one character long, but we were only checking _after_ the first character for where it ends. This patch fixes that.

#14 @dmsnell
11 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Found an issue with short "funky comments" whereby a closing tag with an invalid tag name of a single character length would skip past the actual closing > and consume another tag.

For instance, </!><p> turns into a single funky comment with modifiable text !><p

@dmsnell commented on PR #6412:


11 months ago
#15

cc: @cbravobernal @darerodz I've removed one of the Interactivity API tests because it was accidentally passing when it shouldn't have been. it was testing whether the Server Directive Processor bails on unbalanced content, except it was testing with a </ > which isn't a tag closer but rather a funky comment. That particular funky comment wasn't properly detected before, and fixing it broke the test.

@gziolo commented on PR #6412:


11 months ago
#16

@dmsnell, I assume you would like to include this fix in 6.5.3 as you reopened the ticket, right?

@dmsnell commented on PR #6412:


11 months ago
#17

would like to include this fix in 6.5.3 as you reopened the ticket, right?

@gziolo this could have been a mistake on my part, but I was told something that led me to believe I need to reopen a ticket if I'm making a bug-fix for it. maybe I misunderstood. people wanted more of the work to be consolidated on fewer Trac tickets, so I was trying to do that. should I re-close the ticket?

#18 @dmsnell
11 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 58040:

HTML API: Fix detection of single-length funky comments.

Since [60428] the Tag Processor has been misidentifying single-character
funky comments. It has been asserting that the full token-length for a
funky comment must be at least three characters after the opening (e.g.
</1>), but it has been starting to look for the closing > after
those same three characters. This means that it has been skipping the
actual close of these funky comments and swallowing up the next syntax
until it finds a >, often consuming the next tag in the process.

This patch fixes the detector and restores finding the following token.

Developed in https://github.com/WordPress/wordpress-develop/pull/6412
Discussed in https://core.trac.wordpress.org/ticket/60170

Follow-up to [60428].
Fixes #60170.
Props dmsnell, gziolo, jonsurrell.

@gziolo commented on PR #6412:


11 months ago
#20

@aaronjorbin, if I am not mistaken, you are leading 6.5.x releases, can you help clarify the workflow that @dmsnell explained in https://github.com/WordPress/wordpress-develop/pull/6412#issuecomment-2074280668.

this could have been a mistake on my part, but I was told something that led me to believe I need to reopen a ticket if I'm making a bug-fix for it. maybe I misunderstood. people wanted more of the work to be consolidated on fewer Trac tickets, so I was trying to do that. should I re-close the ticket?

@jorbin commented on PR #6412:


11 months ago
#21

In general, tickets on closed milestones shouldn't be reopened but the ticket should still be referenced on the follow up commit so that when someone is later looking at the feature, they can see the followup work.

The doc on commit messages talks a bit about how to do this reference, but here is an example

To have a change considered for backport, the process should be to create a ticket on trac and put it in the 6.5.3 milestone. Once there, it follows a workflow with the fixed-major keyword to signify that code has been committed to trunk, dev-feedback to indicate that a second committer should sign off on it and dev-reviewed to show that the second committer has approved it.

@dmsnell commented on PR #6412:


11 months ago
#22

Thanks @aaronjorbin. I don't think this needs to be back-ported, as it's probably a fairly rare bug in practice. Next time I'll leave the ticket closed.

Note: See TracTickets for help on using tickets.