Context Navigation

← Previous Ticket
Next Ticket →

#62036 closed enhancement (fixed)

HTML API: Introduce normalization methods.

Reported by:	dmsnell	Owned by:	dmsnell
Milestone:	6.7	Priority:	normal
Severity:	normal	Version:	6.7
Component:	HTML API	Keywords:	has-patch has-unit-tests
Focuses:		Cc:

Description

Since the HTML Processor knows how to parse HTML, it should provide a method for normalizing HTML, returning a fully-well-formed equivalent of what it sent its way.

<?php
echo WP_HTML_Processor::normalize('<a href=#anchor v=5 href="/" enabled>One</a another v=5><!--');
// <a href="#anchor" v="5" enabled>One</a>

Such a method can power a safe implementation of set_inner_html() which prevents problems typical of slicing and concatenating HTML.

Additionally, this approach of serializing HTML in order to normalize it provides a more robust interface for what force_balance_tags() attempts to do. In fact, normalizing HTML at the start of any processing pipeline will ensure that naive HTML processing code down the line won't get confused by HTML's syntax quirks.

Change History (11)

This ticket was mentioned in PR #7331 on WordPress/wordpress-develop by @dmsnell.

3 months ago #1

Keywords has-patch added

Trac ticket: Core-62036.

The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.

This patch introduces WP_HTML_Processor::normalize( $html ) as a
method which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.

@jonsurrell commented on PR #7331:

3 months ago #2

There are some known issues from HTML5lib tests similar to the PI problems mentioned here: https://github.com/WordPress/wordpress-develop/pull/7331#discussion_r1755229824

https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php#L28-L30

There's no good way to read the comment under some circumstances and something like a get_raw_comment_content() method would be helpful.

https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat#L156
https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat#L170
https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/html5test-com.dat#L130

Each of these does not satisfy the PI constraint (missing ? before the > closer) so they're treated as invalid HTML comments. The initial ? isn't accessible through get_modifiable_text(), modifying that character could change the token to something completely different.

There are a couple of cases like this, I think they're all <? or <!-started strings triggering the bogus comment state.

Input	Expected	Actual
`<?xml foo >`
`<!>`

@jonsurrell commented on PR #7331:

3 months ago #3

We'd talked about a method to really inspect different types of comment text content. I've proposed a method in https://github.com/WordPress/wordpress-develop/pull/7342. That would be helpful here.

This ticket was mentioned in PR #7342 on WordPress/wordpress-develop by @jonsurrell.

3 months ago #4

Keywords has-unit-tests added

Trac ticket: Core-62036

There are certain circumstances in the HTML where the full contents of a comment node as it would be in the browser cannot be known without inspecting internal state of the HTML API classes.

See https://github.com/WordPress/wordpress-develop/pull/7331#issuecomment-2346401233 for an example and details.

In short, HTML parsing enters "bogus comment state" which may be represented several ways in the HTML processor, but comments like <!c> and <?c> have _no way_ of knowing whether the comment would be equivalent to or . They're both apparently (if we use get_modifiable_text() to inspect comment text), although in fact <!c> becomes while <?c> becomes <--?c-->.

Additionally, it makes it clear what "lookalike" comment types would have as their comment text, so CDATA and Processing Instruction lookalike comments can also be queried simply:

<![CDATA[foo]]> becomes
<?pi bar?> becomes

This is useful for the html5lib tests or anyone seeking to understand the comment text content as a browser would.

This should be helpful for normalization: https://github.com/WordPress/wordpress-develop/pull/7331

This fixes 3 tests in the HTML5lib test suite with known failures (due to being unable to determine the comment contents described above).

Trac ticket:

#5 @dmsnell
3 months ago

In 59075:

HTML API: Add get_full_comment_text() method.

Previously, there were a few cases where the modifiable text read from an HTML comment differs slightly from the parsed value of its inner text in a browser. This is due to the specific way that invalid HTML syntax tokens become "bogus comments."

This patch introduces a new method to the Tag Processor to allow differentiating these specific cases, such as when copying or serializing HTML from one source to another. Similar code has already been in use in the html5lib tests, and this patch simplifies the test runner, evidencing the fact that this method was already needed.

Developed in https://github.com/wordpress/wordpress-develop/pull/7342
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell.
See #62036.

@dmsnell commented on PR #7342:

3 months ago #6

Merged in [59075]
https://github.com/wordpress/wordpress-develop/commit/675a1aa5e5583a465e452c13d4d631666a486b5f

#7 @dmsnell
3 months ago

Owner set to dmsnell
Resolution set to fixed
Status changed from new to closed

In 59076:

HTML API: Add normalize() to give us the HTML we always wanted.

HTML often appears in ways that are unexpected. It may be missing implicit tags, may have unquoted, single-quoted, or double-quoted attributes, may contain duplicate attributes, may contain unescaped text content, or any number of other possible invalid constructions. The HTML API understands all fo these inputs, but downline parsers may not, and HTML snippets which are safe on their own may introduce problems when joined with other HTML snippets.

This patch introduces the serialize() method on the HTML Processor, which prints a fully-normative HTML output, eliminating invalid markup along the way. It produces a string which contains every missing tag, double-quoted attributes, and no duplicates. A normalize() static method on the HTML Processor provides a convenient wrapper for constructing a fragment parser and immediately serializing.

Subclasses relying on the serialize_token() method may perform structural HTML modifications with as much security as the upcoming \Dom\HTMLDocument() parser will, though these are not
able to provide the full safety that will eventually appear with set_inner_html().

Further work may explore serializing to XML (which involves a number of other important transformations) and adding constraints to serialization (such as only allowing inline/flow/formatting elements and text).

Developed in https://github.com/wordpress/wordpress-develop/pull/7331
Discussed in https://core.trac.wordpress.org/ticket/62036

Props dmsnell, jonsurrell, westonruter.
Fixes #62036.