#62036 closed enhancement (fixed)
HTML API: Introduce normalization methods.
Reported by: | dmsnell | Owned by: | dmsnell |
---|---|---|---|
Milestone: | 6.7 | Priority: | normal |
Severity: | normal | Version: | 6.7 |
Component: | HTML API | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
Since the HTML Processor knows how to parse HTML, it should provide a method for normalizing HTML, returning a fully-well-formed equivalent of what it sent its way.
<?php echo WP_HTML_Processor::normalize('<a href=#anchor v=5 href="/" enabled>One</a another v=5><!--'); // <a href="#anchor" v="5" enabled>One</a>
Such a method can power a safe implementation of set_inner_html()
which prevents problems typical of slicing and concatenating HTML.
Additionally, this approach of serializing HTML in order to normalize it provides a more robust interface for what force_balance_tags()
attempts to do. In fact, normalizing HTML at the start of any processing pipeline will ensure that naive HTML processing code down the line won't get confused by HTML's syntax quirks.
Change History (11)
This ticket was mentioned in PR #7331 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#1
- Keywords has-patch added
@jonsurrell commented on PR #7331:
3 months ago
#2
There are some known issues from HTML5lib tests similar to the PI problems mentioned here: https://github.com/WordPress/wordpress-develop/pull/7331#discussion_r1755229824
There's no good way to read the comment under some circumstances and something like a get_raw_comment_content()
method would be helpful.
https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat#L156
https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/comments01.dat#L170
https://github.com/WordPress/wordpress-develop/blob/47122105f20624e5a6b8ed151d732157f2f2605c/tests/phpunit/data/html5lib-tests/tree-construction/html5test-com.dat#L130
Each of these does not satisfy the PI constraint (missing ?
before the >
closer) so they're treated as invalid HTML comments. The initial ?
isn't accessible through get_modifiable_text()
, modifying that character could change the token to something completely different.
There are a couple of cases like this, I think they're all <?
or <!
-started strings triggering the bogus comment state.
Input | Expected | Actual | |
---|---|---|---|
<?xml foo > | |
| |
<!> | | |
|
@jonsurrell commented on PR #7331:
3 months ago
#3
We'd talked about a method to really inspect different types of comment text content. I've proposed a method in https://github.com/WordPress/wordpress-develop/pull/7342. That would be helpful here.
This ticket was mentioned in PR #7342 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#4
- Keywords has-unit-tests added
Trac ticket: Core-62036
There are certain circumstances in the HTML where the full contents of a comment node as it would be in the browser cannot be known without inspecting internal state of the HTML API classes.
See https://github.com/WordPress/wordpress-develop/pull/7331#issuecomment-2346401233 for an example and details.
In short, HTML parsing enters "bogus comment state" which may be represented several ways in the HTML processor, but comments like <!c>
and <?c>
have _no way_ of knowing whether the comment would be equivalent to or
. They're both apparently
(if we use
get_modifiable_text()
to inspect comment text), although in fact <!c>
becomes while
<?c>
becomes <--?c-->
.
Additionally, it makes it clear what "lookalike" comment types would have as their comment text, so CDATA and Processing Instruction lookalike comments can also be queried simply:
<![CDATA[foo]]>
becomes<?pi bar?>
becomes
This is useful for the html5lib tests or anyone seeking to understand the comment text content as a browser would.
This should be helpful for normalization: https://github.com/WordPress/wordpress-develop/pull/7331
This fixes 3 tests in the HTML5lib test suite with known failures (due to being unable to determine the comment contents described above).
Trac ticket:
#7
@
3 months ago
- Owner set to dmsnell
- Resolution set to fixed
- Status changed from new to closed
In 59076:
This ticket was mentioned in PR #7580 on WordPress/wordpress-develop by @jonsurrell.
2 months ago
#9
Possesive "it's" should be "its."
Trac ticket: https://core.trac.wordpress.org/ticket/62036
@cbravobernal commented on PR #7580:
2 months ago
#11
Committed in r59245
Trac ticket: Core-62036.
The HTML Processor understands HTML regardless of how it's written, but
many other functions are unable to do so. There are all sorts of syntax
peculiarities and semantics that would be helpful to eliminate using the
knowledge contained in the HTML Processor.
This patch introduces
WP_HTML_Processor::normalize( $html )
as amethod which takes a fragment of HTML as input and then returns a
serialized version of the input, "cleaning it up" by balancing all
tags, providing all missing optional tags, re-encoding all text,
removing all duplicate attributes, and double-quote-escaping all
attribute values.