Make WordPress Core

Opened 3 months ago

Closed 2 weeks ago

#61576 closed enhancement (fixed)

HTML API: Improved spec support in 6.7

Reported by: dmsnell's profile dmsnell Owned by:
Milestone: 6.7 Priority: normal
Severity: normal Version: 6.6
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

The HTML Processor, introduced in #58517, remains a work in progress until it fully supports the HTML5 specification, or until all that will be supported is (there are a few corners of the specification that don't obviously fit well into the paradigm, specifically foster parenting and some parts of the adoption agency algorithm).

However, during WordPress 6.7's development cycle it is hoped to rapidly add whatever remaining tag and algorithm support as is able, having the refactor during 6.6 (visiting all nodes, real and virtual, from #61348) which unclocks most of the remaining rules from a design standpoint.

This is a tracking ticket for that work during this release cycle.


Notable Support

Change History (163)

This ticket was mentioned in PR #6968 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#1

  • Keywords has-patch added

Trac ticket: Core-61576

#2 @dmsnell
3 months ago

In 58676:

HTML API: Add current_node_is() helper method to stack of open elements.

As part of work to add more spec support to the HTML API, this new method
will make it easier to implement the logic when in the SELECT and TABLE
insertion modes.

Developed in https://github.com/WordPress/wordpress-develop/pull/6968
Discussed in https://core.trac.wordpress.org/ticket/51576

Props dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #5908 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#4

  • Keywords has-unit-tests added

Trac ticket: Core-61576

A start tag whose tag name is "select"
Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Set the frameset-ok flag to "not ok".

If the insertion mode is one of "in table", "in caption", "in table body", "in row", or "in cell", then switch the insertion mode to "in select in table". Otherwise, switch the insertion mode to "in select".

Skip fewer of the html5lib-tests:

OK, but incomplete, skipped, or risky tests!
-Tests: 607, Assertions: 174, Skipped: 433.
+Tests: 607, Assertions: 184, Skipped: 423.

#5 @dmsnell
3 months ago

In 58677:

HTML API: Support SELECT insertion mode.

As part of work to add more spec support to the HTML API, this patch adds
support for the SELECT, OPTION, and OPTGROUP elements, including the
requisite support for the IN SELECT insertion mode.

Developed in https://github.com/WordPress/wordpress-develop/pull/5908
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #6972 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#7

Trac ticket: Core-61576

## Summary

  • [ ] There are some failing tests that need to pass.
  • [ ] This needs a _full_ and careful audit.

## Description

As part of work to add more spec support to the HTML API, this patch adds support for the remaining missing tags in the IN BODY insertion mode. Not all of the added tags are supported, because in some cases they reset the insertion mode and are reprocessed where they will be rejected.

html5lib tests

- Tests: 607, Assertions: 174, Skipped: 433.
+ Tests: 607, Assertions: 224, Errors: 2, Failures: 17, Skipped: 381.

This ticket was mentioned in PR #6973 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#8

Trac ticket: Core-61576

## Description

As part of work to add more spec support to the HTML API, this patch adds stubs for all of the remaining parser insertion modes. These modes are not all supported, but they will be necessary to continue adding support for other tags and markup.

#9 @dmsnell
3 months ago

In 58679:

HTML API: Stub out remaining insertion modes in the HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
stubs for all of the remaining parser insertion modes in the HTML Processor.
These modes are not all supported yet, but they will be necessary to continue
adding support for other tags and markup.

Developed in https://github.com/WordPress/wordpress-develop/pull/6973
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#11 @TobiasBg
3 months ago

@dmsnell: There's a single @since 6.4.0 in [58679] that probably also should be a @since 6.7.0, no?

#12 @dmsnell
3 months ago

In 58680:

HTML API: Fix wrong @since tag.

When the remaining insertion modes were stubbed in the HTML Processor,
a @since tag was mistakenly copied with 6.4.0 instead of 6.7.0.

This patch fixes the invalid tag.

Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58679].

Props tobiasbg.
See #61576.

#13 @dmsnell
3 months ago

@TobiasBg you have an incredible ability to notice details! thanks for the comment. I've fixed this in the commit above. Much appreciated! 🙇‍♂️

This ticket was mentioned in PR #6977 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#14

Trac ticket: Core-61576

## Description

As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY.

Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses.

This ticket was mentioned in PR #6981 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#15

Trac ticket: Core-61576

Since the HTML Processor started visiting all nodes in a document, both real and virtual, the breadcrumb accounting became a bit complicated and it's not entirely clear that it is fully reliable.

In this patch the breadcrumbs are rebuilt separately from the stack of open elements in order to eliminate the problem of the stateful stack interactions and the post-hoc event queue.

Breadcrumbs are greatly simplified as a result, and more verifiably correct, in this construction.

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#16

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#17

This ticket was mentioned in PR #6984 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#18

Trac ticket: Core-61576

Many tests from the html5lib test suite fail because of differences in
text handling between a DOM API and the HTML API, even though the
semantics of the parse are equivalent. For example, it's possible in
the HTML API to read multiple successive text nodes when the tokens
between them are ignored.

The test suite didn't account for this and so was failing tests. This
patch improves the construction of the representation to compare
against the test suite so that those tests don't fail inaccurately.

This ticket was mentioned in PR #6988 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#19

Trac ticket: Core-61576

The generate_implied_end_tags() algorithm has been comparing the current node to a list of node names, which means that it won't ever pop any elements from the stack of open elements.

This patch corrects the mistake by comparing node name against the list, thus fixing the algorithm. This was noted in development work for the 6.7 release.

#20 @dmsnell
3 months ago

In 58702:

HTML API: Correct node name in generate_implied_end_tags().

The generate_implied_end_tags() algorithm has been comparing the
current node to a list of node names, which means that it won't ever
pop any elements from the stack of open elements.

This patch corrects the mistake by comparing node name against the
list, thus fixing the algorithm. This was noted in development work
for the 6.7 release.

Developed in https://github.com/WordPress/wordpress-develop/pull/6988
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#22 @dmsnell
3 months ago

In 58712:

HTML API: Join successive text nodes in html5lib test representation.

Many tests from the html5lib test suite fail because of differences in
text handling between a DOM API and the HTML API, even though the
semantics of the parse are equivalent. For example, it's possible in
the HTML API to read multiple successive text nodes when the tokens
between them are ignored.

The test suite didn't account for this and so was failing tests. This
patch improves the construction of the representation to compare
against the test suite so that those tests don't fail inaccurately.

Developed in https://github.com/WordPress/wordpress-develop/pull/6984
Discussed in https://core.trac.wordpress.org/ticket/61576

Props bernhard-reiter, dmsnell, jonsurrell.
See #61576.

#24 @dmsnell
3 months ago

In 58713:

HTML API: Simplify breadcrumb accounting.

Since the HTML Processor started visiting all nodes in a document, both
real and virtual, the breadcrumb accounting became a bit complicated
and it's not entirely clear that it is fully reliable.

In this patch the breadcrumbs are rebuilt separately from the stack of
open elements in order to eliminate the problem of the stateful stack
interactions and the post-hoc event queue.

Breadcrumbs are greatly simplified as a result, and more verifiably
correct, in this construction.

Developed in https://github.com/WordPress/wordpress-develop/pull/6981
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58590].

Props bernhard-reiter, dmsnell.
See #61576.

#27 @hellofromTonya
3 months ago

In 58733:

HTML API: Fix "${var} in strings" deprecation notice in html5lib test.

Changeset [58712] introduced the following compile time PHP deprecation notice on >= PHP 8.2 test runs:

Deprecated: Using ${var} in strings is deprecated, use {$var} instead in /var/www/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php on line 257
PHPUnit 9.6.20 by Sebastian Bergmann and contributors.

The ${ syntax for string interpolation was deprecated in PHP 8.2 and should not be used anymore.

Ref: https://wiki.php.net/rfc/deprecate_dollar_brace_string_interpolation

Follow-up to [58712].

Props jrf.
See #61530, #59654, #61576.

#28 @dmsnell
3 months ago

Thanks @hellofromTonya - this was a mistake on my part; never intended to add that in there.

@jonsurrell commented on PR #6040:


3 months ago
#29

@dmsnell This is ready for review.

This ticket was mentioned in PR #7041 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#30

Trac ticket: https://core.trac.wordpress.org/ticket/61576

This builds on TABLE support from https://github.com/WordPress/wordpress-develop/pull/6040 (merged here).

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests, (compared with https://github.com/WordPress/wordpress-develop/pull/6040):

-Tests: 614, Assertions: 217, Skipped: 397.
+Tests: 614, Assertions: 223, Skipped: 391.

Trac ticket:

This ticket was mentioned in PR #7043 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#32

Insertion modes may include instructions like "process the token in
another insertion mode." This means that the step_in_X method may be
called to process in the insertion mode _without_ changing the
state of the insertion mode.

This can result in unsupported errors that are incorrect.

The bail messages for each step_in_ method should explicitly
mention its insertion mode to ensure the error messages are
correct.

Trac ticket: https://core.trac.wordpress.org/ticket/61576

@dmsnell commented on PR #6972:


3 months ago
#35

@westonruter @sirreal I think all of the feedback has been addressed. additionally:

  • get_modifiable_text() in the Tag Processor has been updated to properly handle leading newlines and NULL bytes.
  • an additional test asserts this behavior for get_modifiable_text().
  • I've pulled in clear_up_to_last_marker() from #6040 (thanks @sirreal).
  • Fixed some mistakes/oversights after additional review.

I might merge this tomorrow, but I'd appreciate any reviews _in post_. This is so exciting: we're there - at the end of IN BODY.

@dmsnell commented on PR #6972:


3 months ago
#36

Thanks for the work on this.

This is a remaining issue where a newline is removed from a text node that is only a newline, resulting in a text node with no text. The text node should not be visited at all in this case:

<pre>&#x0A;</pre>

This PR is big enough and in a good place, I don't mind landing now and handling that known issue in a follow-up.

Thanks @sirreal. I'm puzzled on this one. While there's no child node in the DOM, I find it strange that anyone would consider there to be no text node there. Perhaps I would feel differently about the actual newline byte 0x0A, but intentionally encoding it makes it seem that from a syntactic perspective, someone wanted it to be there.

Let's continue to explore this in the follow-up.

#37 @dmsnell
3 months ago

In 58779:

HTML API: Add missing tags in IN BODY insertion mode to HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

This patch also improves the support of get_modifiable_text(), removing
a leading newline inside a LISTING, PRE, or TEXTAREA element.

Developed in https://github.com/WordPress/wordpress-develop/pull/6972
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell, westonruter.
See #61576.

#38 @dmsnell
3 months ago

In 58780:

HTML API: Remove empty test file after adding support for missing elements.

When support was added for the remaining tags in the IN BODY insertion mode, a test
file indicating that support was necessary for certain parts of the parser was
removed, but it wasn't removed from SVN when sending over the patch from git.

This patch removes that empty file so that the WPCS workflows pass.

Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58779].

See #61576.

#40 @dmsnell
3 months ago

In 58781:

HTML API: Fix unsupported insertion mode messages.

Insertion modes in an HTML parser may include instructions like "process
the token in the IN HEAD insertion mode." The rules do not change the
insertion mode of the parser, but the errors are triggered outside of the
rules for the current insertion mode. These will be misleading when
bailing on these instructions, because it will point someone to the wrong
place in the code to find the source of the error.

In this patch all of the bail-points due to lacking insertion mode support
are hard-coded to better orient someone to the section of the code lacking
support for handling the input HTML.

Developed in https://github.com/wordpress/wordpress-develop/pull/7043
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58679].

Props: dmsnell, jonsurrell.
See #61576.

@dmsnell commented on PR #6040:


3 months ago
#42

@sirreal I've merged trunk into this branch, and resolved conflicts. Please review and ensure I didn't accidentally break your changes.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#43

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#44

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#45

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#46

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#47

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#48

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#49

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#50

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#51

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

@jonsurrell commented on PR #6040:


3 months ago
#52

There's an error in the new form test with the change to auto-close the form where the form closer is not visited by next_token. It's fixed with this patch, although I haven't investigated why:

  • src/wp-includes/html-api/class-wp-html-processor.php

    diff --git before/src/wp-includes/html-api/class-wp-html-processor.php after/src/wp-includes/html-api/class-wp-html-processor.php
    index 1e6a18389b..0d1c7c1496 100644
    old new class WP_HTML_Processor extends WP_HTML_Tag_Processor { 
    23562356                                }
    23572357
    23582358                                // This FORM is special because it immediately closes and cannot have other children.
    2359                                 $this->state->current_token->has_self_closing_flag = true;
    2360 
    23612359                                $this->insert_html_element( $this->state->current_token );
    23622360                                $this->state->form_element = $this->state->current_token;
     2361                                $this->state->stack_of_open_elements->pop();
    23632362                                return true;
    23642363                }

I'll push that change and tests should be passing again. It's effectively reverting part of your updates so happy to discuss further.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#53

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

@jonsurrell commented on PR #6040:


3 months ago
#54

@dmsnell The changes you've pushed look good, thanks for catching a couple of bugs and I do like the method for creating virtual nodes.

The last remaining thing is to sort out the FORM element in table insertion mode that's immediately popped.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#55

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

@dmsnell commented on PR #6040:


3 months ago
#56

I'll push that change and tests should be passing again. It's effectively reverting part of your updates so happy to discuss further.

I was considering this because I wanted the code to function as the void tags function, but I can't figure out what's wrong. Along the way I found some bugs in next_token() I introduced when fixing breadcrumbs, but the fix is incoming and easy. We don't close out the remaining open stack elements at the end of the document _because they never get popped_.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#57

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

#58 @dmsnell
3 months ago

In 58806:

HTML API: Add TABLE support in HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
support for various table-related insertion modes. This includes support
for tables, table rows, table cells, table column groups, etc...

Developed in https://github.com/wordpress/wordpress-develop/pull/6040
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#60

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #7085 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#61

When the model of breadcrumb generation in the HTML Processor and node traversal was simplified, the change introduced a bug whereby unclosed nodes at the end of a document would remain unvisited and unclosed.

In this patch, a fix is applied to ensure that all open elements close while traversing a document. A couple of minor documentation typos are fixed in the patch as well.

Follow-up to [58713].

See #61576.

@gziolo commented on PR #7085:


3 months ago
#62

It would be helpful to add test coverage for this case. Something simple and similar to the existing tests here:

https://github.com/WordPress/wordpress-develop/blob/4acefacbf6f7c956f1cf30e406d3c9c04bc855f9/tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php#L291

@dmsnell commented on PR #7085:


3 months ago
#63

It would be helpful to add test coverage for this case. Something simple and similar to the existing tests here:

Your wish is my command!

This ticket was mentioned in PR #6006 on WordPress/wordpress-develop by @dmsnell.


3 months ago
#64

Trac ticket: Core-61576

## Status

I'm nervous sharing this because it's full of flaws, needs extensive review, and documented test cases.

### Earlier versions

  • First draft (ba2c8896f08ee36d3c61a500a86422b736a00923)
    • By implementing the rules at the top of Tree Construction/Dispatch, we don't need $skip_next_foreign_content_processing, because we pop nodes off of the stack of open elements until we're back at an html element, or an HTML integration point, or a MathML integration point. Returning to step() will choose _insertion mode_ instead of _foreign content_ and prevent an infinite loop.

### Notes

  • There's an interplay we have to figure out with casing of tag names. For all of our comparison functions we check the upper-case variants, but there's value in reporting the foreign content tag names as lower-cased or as mixed case.
  • There are some SVG tag names that are supposed to report in mixed case. For both SVG and MathML there are attributes which are supposed to report in mixed case. Unlike my first assumption, however, they still all parse with ASCII case insensitivity. That is, <circle clippath=1 clipPath=2 CLIPPATH=3 cliPPath=4> has a single attribute named clipPath and whose value is 1. This is convenient because we don't have to change the Tag Processor, but inconvenient when things are mostly based on lower-case attribute names.
  • We probably need to repeal the idea that tag names are upper-case and add another bit to communicate tag name vs. doctype declaration. Maybe? Can we test that math:mi or math:MI is in the math namespace? There can be no HTML tag named math:mi (it would be MATH:MI).
  • Could be handy to have something like base_class_get_tag() or private function comparison_tag_name() etc… to report an upper-case tag name, while preserving the case-variants required in foreign content to outside calls.
  • Tracking the namespace seems a bit fishy. When we're in the Tag Processor we can know that if we encounter an SVG or MATH tag when we're in the html namespace, that it should change the namespace to svg or math, respectively, and lower-case the tag names. However, the role of integration points and parsing things in the insertion mode is still vague.
  • Foreign content _is not_ an insertion mode. This is important when handling HTML tags inside HTML integration points. This is _very important_ for get_modifiable_text() which doesn't know if a text node inside foreign content is being processed _as foreign content_ or _in the insertion mode_, where it determines if NULL bytes should be replaced or removed.
  • It's probably necessary to start over again with these insights and more in mind.

## Description

We should reliable detect foreign content and we need to do it in the Tag Processor, specifically because of the rules for CDATA sections. The HTML Processor needs this as well to determine if things like self-closing flags for HTML elements should be respected.

Unlocks:

  • proper detection of CDATA
  • parsing of SVG and MathML
  • indicating if one should expect a closing tag or not

https://github.com/WordPress/wordpress-develop/assets/5431237/00ca7c32-d269-4da7-a9ce-04aff9640f8a

https://github.com/WordPress/wordpress-develop/assets/5431237/7a104953-9dbb-48e4-ae45-be40f3df9736

#65 @dmsnell
3 months ago

In 58828:

HTML API: Close all elements at the end of a document.

When the model of breadcrumb generation in the HTML Processor and node
traversal was simplified, the change introduced a bug whereby unclosed
nodes at the end of a document would remain unvisited and unclosed.

In this patch, a fix is applied to ensure that all open elements close
while traversing a document. A couple of minor documentation typos are
fixed in the patch as well.

Developed in https://github.com/wordpress/wordpress-develop/pull/7085
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58713].

Props: dmsnell, gziolo, jonsurrell.
See #61576.

#67 @dmsnell
3 months ago

In 58833:

HTML API: Add TEMPLATE and related support in HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
support for the IN TEMPLATE and IN HEAD insertion modes. These changes are
primarily about adding support for TEMPLATE elements in the HTML Processor,
but include support for other tags commonly found in the document head, such
as LINK, META, SCRIPT, STYLE, and TITLE.

Developed in https://github.com/wordpress/wordpress-develop/pull/7046
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell, westonruter.
See #61576.

#68 @dmsnell
2 months ago

In 58836:

HTML API: Introduce full parsing mode in HTML Processor.

The HTML Processor has only supported a specific kind of parsing mode
called _the fragment parsing mode_, where it behaves in the same way
that node.innerHTML = html does in the DOM. This mode assumes a
context node and doesn't support parsing an entire document.

As part of work to add more spec support to the HTML API, this patch
introduces a full parsing mode, which can parse a full HTML document
from start to end, including the doctype declaration and head tags.

Developed in https://github.com/wordpress/wordpress-develop/pull/6977
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

@dmsnell commented on PR #6006:


2 months ago
#70

@sirreal @westonruter I've pushed some updates that incorporate trunk and a few other fixes. notably, I had the wrong code for adjusted_current_node().

@jonsurrell commented on PR #6006:


2 months ago
#71

This is very tricky. I'll share some cases that are problematic right now:

### <svg><a/>Text after svg:a self closing tag.
Errors with Cannot run adoption agency when "any other end tag" is required.

### <math><mtext><b>
https://github.com/user-attachments/assets/ab38da64-cdc1-424c-b25a-c4d547bd5089

### <svg></svg><![CDATA[ bogus comment ]]>
https://github.com/user-attachments/assets/c426ad17-e2a1-4ce2-8bdf-46096c7399b7

### <svg><foreignObject><![CDATA[ bogus comment ]]>
https://github.com/user-attachments/assets/f7a953f2-c71e-4cbf-bd3e-7abb630e9dac

### <svg><title><div>
https://github.com/user-attachments/assets/932b45c8-b035-436f-b8af-23941e817481

### <svg><title><rect><div>
https://github.com/user-attachments/assets/c1d74e64-5dda-4882-b435-aff720873175

### <svg><title><svg><div>
https://github.com/user-attachments/assets/370a1b4a-4945-434d-b58c-22f7e88030be

### <svg></br><foo>
https://github.com/user-attachments/assets/33c56197-f15b-45f7-b304-3449319f77df

### <math><mi><div></div></mi><mi>
https://github.com/user-attachments/assets/236169af-22d1-4671-a554-7401bc85f596

### <math><mi><svg><foreignObject><div>
https://github.com/user-attachments/assets/fe4ce497-2ce5-45e4-bbf8-7edf3a8873b9

### <svg><foreignObject>\x00filler\x00text (null bytes)
https://github.com/user-attachments/assets/bc5a28eb-d407-418b-8be3-a321c1a4a292

#73 @dmsnell
2 months ago

In 58839:

HTML API: Add support for IN COLUMN GROUP parsing.

As part of work to add more spec support to the HTML API, this patch adds
support for the IN COLUMN GROUP insertion mode. This small section of the
spec handles rules for the <colgroup> element.

Developed in https://github.com/wordpress/wordpress-develop/pull/7042
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

@jonsurrell commented on PR #7041:


2 months ago
#75

I'd be fine with splitting that up.

#76 @dmsnell
2 months ago

In 58840:

HTML API: Add support for IN CAPTION parsing.

As part of work to add more spec support to the HTML API, this patch adds
support for the IN CAPTION insertion mode. This small section of the
spec handles rules for the <caption> element.

Developed in https://github.com/wordpress/wordpress-develop/pull/7041
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

#78 @dmsnell
2 months ago

In 58841:

HTML API: Add support for IN SELECT IN TABLE parsing.

As part of work to add more spec support to the HTML API, this patch adds
support for the IN SELECT IN TABLE insertion mode. This small section of the
spec handles rules for the <select> element and its children when found
inside of a <table>.

Developed in https://github.com/wordpress/wordpress-develop/pull/7044
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

@jonsurrell commented on PR #6006:


2 months ago
#80

I've made some progress on this fixing some of the issues mentioned above: https://github.com/WordPress/wordpress-develop/pull/6006#issuecomment-2262585496

Right now some self-closing tags in foreign content break. HTML like <svg><a/> seems to enter step_in_body, runs the adoption agency algorithm, and fails:

WP_HTML_Unsupported_Exception: Cannot run adoption agency when "any other end tag" is required. in /var/www/html/wp-includes/html-api/class-wp-html-processor.php:450
Stack trace:
#0 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(5145): WP_HTML_Processor->bail('Cannot run adop...')
#1 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(2423): WP_HTML_Processor->run_adoption_agency_algorithm()
#2 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(4108): WP_HTML_Processor->step_in_body()
#3 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(889): WP_HTML_Processor->step_in_foreign_content()
#4 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(635): WP_HTML_Processor->step()}}}

This ticket was mentioned in PR #7141 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#81

Trac-ticket: Core-61576.

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the <body> context. In this patch, any context node is allowed.

@dmsnell commented on PR #7141:


2 months ago
#82

@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:

  • we have to enter within a certain tokenizer state, but that's not how our parser is designed.
  • the DOM is able to store things like the text </textarea> inside a TEXTAREA node, but HTML is not able to do this. we have to determine what to do in these cases.

This ticket was mentioned in PR #7157 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#83

Trac ticket: Core-61576

As part of work to add more spec support to the HTML API, this patch adds support for the relevant foreign elements in the HTML algorithms within the stack of open elements. Although the HTML Processor cannot yet step into these elements, the format of how they will be represented was determined in the related work from which this patch is extracted.

This patch extracted from https://github.com/wordpress/wordpress-develop/pull/6006.

Developed in https://github.com/wordpress/wordpress-develop/pull/ Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #7158 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#84

Trac ticket: Core-61576

As part of work to add more spec support to the HTML API, this patch removes an insertion mode that was added but shouldn't have been. The rules for parsing foreign content are real, but they don't change the insertion mode for the parser - the interactions are more complex than this.

Follow-up to [58679].
Props dmsnell, jonsurrell.
See #61576.

@dmsnell commented on PR #7158:


2 months ago
#85

Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.

@dmsnell commented on PR #7157:


2 months ago
#86

Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.

@dmsnell commented on PR #7158:


2 months ago
#87

Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.

#88 @dmsnell
2 months ago

In 58867:

HTML API: Add support for SVG and MathML (Foreign content)

As part of work to add more spec support to the HTML API, this patch adds
support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and
introduce additional complexity into the processor, but is important in
order to avoid getting lost when inside these elements.

Developed in https://github.com/wordpress/wordpress-develop/pull/6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Props: dmsnell, jonsurrell, westonruter.
See #61576.

#90 @dmsnell
2 months ago

In 58868:

HTML API: Add support for SVG and MathML (Foreign content) (remove file)

As part of work to add more spec support to the HTML API, this patch adds support for SVG and MathML elements, or more generally, "foreign content."

The rules in foreign content are a mix of XML and HTML parsing rules and introduce additional complexity into the processor, but is important in order to avoid getting lost when inside these elements.

This patch follows the first by deleting the empty files, which were mistakenly left in during the initial merge.

Developed in https://github.com/wordpress/wordpress-develop/pull/6006
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props: dmsnell, jonsurrell, westonruter.
See #61576.

This ticket was mentioned in PR #7164 on WordPress/wordpress-develop by @jonsurrell.


2 months ago
#92

Trac ticket: https://core.trac.wordpress.org/ticket/61576

Follow-up to: [58868] (GitHub https://github.com/WordPress/wordpress-develop/pull/6006).

  • Add a unit test for a bug that was discovered in manual testing and fixed in [58868].
  • Modify behavior to follow-spec (resolves todo).

The spec states the following:

If the token has its self-closing flag set, then run the appropriate steps from the following list:

If the token's tag name is "script", and the new current node is in the SVG namespace
Acknowledge the token's self-closing flag, and then act as described in the steps for a "script" end tag below.

An end tag whose tag name is "script", if the current node is an SVG script element

Any other end tag

I believe the original implementation was incorrect and the intention is to move into the An end tag whose tag name is "script", if the current node is an SVG script element instructions and not the Any other end tag. Ultimately, the result was the same.

This also adds a missing return true from the script end tag block.

#93 @dmsnell
2 months ago

In 58870:

HTML API: expect_closer() should report false for self-closing foreign elements.

Previously, WP_HTML_Processor::expects_closer() would report true for self-closing foreign elements when called without supplying a node in question, but it should have been reporting true just as it does for HTML elements.

This patch adds a test case demonstrating the issue and a bugfix.

The html5lib test runner was relying on the incorrect behavior, accidentally working. This is also corrected and the html5lib test now relies on the correct behavior of expects_closer().

Developed in https://github.com/wordpress/wordpress-develop/pull/7162
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58868].

Props: dmsnell.
See #61576.

#95 @dmsnell
2 months ago

In 58871:

HTML API: Test and fix SVG script handling.

When support was added for foreign content, an ambiguity in the HTML specification led to code that followed the wrong path when encountering a self-closing SCRIPT element in the SVG namespace. Further, a fallthrough was discovered during manual testing.

This patch adds a new test to assert the proper behaviors and fixes these issues. In the case of the SCRIPT element, the outcome was the same with the wrong code path, making the defect benign. In the case of the fallthrough, the wrong behavior would occur.

The updates in this patch also resolves a todo relating to the spec ambiguity.

Developed in https://github.com/wordpress/wordpress-develop/pull/7164
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58868].

Props: dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #7165 on WordPress/wordpress-develop by @dmsnell.


2 months ago
#97

Trac ticket: Core-61576.

As part of work to add more spec support to the HTML API, this patch
adds support for the FRAMESET-related insertion modes, as well as the
set of missing _after_ insertion modes. These modes run at the end of
parsing a document, closing it and taking care of any lingering tags.

Developed in https://github.com/wordpress/wordpress-develop
Discussed in https://core.trac.wordpress.org/ticket/61576

See #61576.

@dmsnell commented on PR #7165:


2 months ago
#98

@sirreal there are failures in the tests that look like BODY isn't closing. If you have insight, I'd appreciate it.

@jonsurrell commented on PR #7165:


2 months ago
#99

I explored some ideas for dealing with those comments in https://github.com/dmsnell/wordpress-develop/pull/18.

For now, I'm going to push a change to this PR to switch the bail conditions in after body. It seems better to bail on comments outside of BODY and HTML nodes (we might even be able to skip this) and support a </body> tag in the middle of the document that has almost no impact if text or tags follow it.

@dmsnell commented on PR #7165:


2 months ago
#100

I'm not sure how best to handle this 😕 Reviewing the spec, I believe that only comments are inserted outside of the BODY and HTML nodes. I wonder if we could collect these in-place comments in a couple of lists then iterate through them then the document is finished.

This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and next_token() would pull them out when it sees the POP event for BODY and then for HTML.

From a performance standpoint, I don't see this being that troublesome, especially if all we stored in those lists were the WP_HTML_Token or bookmark name.

@jonsurrell commented on PR #7141:


2 months ago
#101

The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂

I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.

The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, input, a string to parse, and an optional boolean allowDeclarativeShadowRoots (default false). It returns a list of zero or more nodes.

@jonsurrell commented on PR #7165:


2 months ago
#102

I wonder if we could collect these in-place comments in a couple of lists then iterate through them then the document is finished.

This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and next_token() would pull them out when it sees the POP event for BODY and then for HTML.

I implemented that in this PR in a1304b5b9b77f93294fc832a5a102e8dcea40c3c. I reverted the change. I had issues getting the text for the comments outside of body because get_modifiable_text is a method on the Tag Processor and it expects to operate on the token it's parsing. It may be enough to seek to all the comments to get their text, I did not try that.

This ticket was mentioned in PR #7195 on WordPress/wordpress-develop by @jonsurrell.


2 months ago
#103

This change adds a new class to handle DOCTYPE token information according to the specification.

The Tag Processor handles parsing DOCTYPE Tokens in greater detail and on-demand in order to create instances of WP_HTML_Doctype_Info. This is useful because DOCTYPE tokens may appear in many places in HTML but are ignores in most situations. This allows the detailed parsing of DOCTYPE tokens to be handled on-demand when a DOCTYPE token is reached under the appropriate circumstances.

The WP_HTML_Doctype_Info class exposes a method to that handles the complex rules for determining quirks mode which involve inspecting the DOCTYPE token name, public identifier, system identifier, and force_quirks_flag.

Trac ticket: https://core.trac.wordpress.org/ticket/61576

---

<details><summary>Survey of existing DOCTYPE declarations</summary>

Download the DOCTYPE report and cat report-doctypes.txt to see the color output.

Here is a preview:

https://github.com/user-attachments/assets/6417f4b8-b295-421f-a496-ab7a00f35bb5

https://github.com/user-attachments/assets/84e6e661-0efd-4ea3-b291-87318376c97e

</details>

@dmsnell commented on PR #7141:


8 weeks ago
#104

@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially).

One case I was specifically thinking through is a <script> context. The idea is simple: parse the inner HTML of a SCRIPT element. The problem is when </script> appears and there is more content. In the DOM you can actually create a text node containing </script> but this is not possible in the HTML. The spec even notes situations like this where the DOM can create trees that cannot be expressed in HTML.

So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for _that_ function to call out something special for elements like SCRIPT so that it falls into the equivalent of ->skip_script_data().

Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable.

However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the svg or math namespace.

@jonsurrell commented on PR #7141:


8 weeks ago
#105

I'll share some context from a conversation.

---

As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags and better interfaces can be provided.

---

The specification for fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context. Rarely are full HTML documents considered. In order to continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, <body> provides a suitable context, but there could be other cases like <head>. Again, this seems relatively straightforward to support. However, it gets more complicated with things like <svg> as a context element, or especially <rect>, or even trickier a script tag in the SVG namespace!

The context element will be used to establish the correct insertion mode _and_ the context element's namespace and possibly attributes determine how tokens should be handled (using rules for foreign content or one of the HTML insertion modes).

It may make sense to maintain this interface as-is. <body> is likely a good context for most use-cases but in order to properly support fragments some advanced interface likely needs to support:

  • Passing a context element along with its tag name, namespace, and attributes.
  • The context element's document determines the quirks mode of the fragment parser.

@jonsurrell commented on PR #7195:


7 weeks ago
#106

Thank you! I've reviewed your changes and I'm happy with them.

This ticket was mentioned in PR #7232 on WordPress/wordpress-develop by @jonsurrell.


7 weeks ago
#107

This ticket was mentioned in PR #7230 on WordPress/wordpress-develop by @jonsurrell.


7 weeks ago
#108

#109 @dmsnell
7 weeks ago

In 58925:

HTML API: Parse DOCTYPE tokens and set HTML parser mode accordingly.

This patch adds until-now missing code to parse the structure of HTML DOCTYPE declarations. The DOCTYPE is mostly unused but can dictate the document compatability mode, which governs whether CSS class names match in a ASCII-case-insensitive way or not, and whether TABLE elements close an open P element.

The DOCTYPE information is made available through a new method on the Tag Processor, get_doctype_info().

Developed in https://github.com/wordpress/wordpress-develop/pull/7195
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#111 @dmsnell
7 weeks ago

In 58926:

HTML API: Add support for missing FRAMESET and "after" insertion modes.

As part of work to add more spec support to the HTML API, this patch adds support for the FRAMESET-related insertion modes, as well as the set of missing after insertion modes. These modes run at the end of parsing a document, closing it and taking care of any lingering tags.

Developed in https://github.com/wordpress/wordpress-develop/7165
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

@jonsurrell commented on PR #7141:


7 weeks ago
#113

in order to properly support fragments some advanced interface likely needs to support…

I had a thought that may be helpful.

I'd like to add an instance method to the HTML process like create_fragment_parser_at_node( string $html_fragment ). This method would return a new HTML processor, created as a fragment parser with the context element created from the current tag opener (or null). This would allow use to set the document compat type and the context element and its attributes correctly according to the specification.

Once we have this method, advanced fragment parser creation could be handled by that method:

$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( `<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …

I think internally this method would be used for things like set_inner_html that require a fragment parser.

#114 @dmsnell
7 weeks ago

In 58940:

HTML API: Fix a bug where the namespace was forced to 'html'

While working on other reviews and audits, a bug was discovered in the HTML API where the wrong namespace was being assigned to a token because the default value of 'html' was used. This patch fixes the bug by calling the parent::get_namespace() method instead of assuming 'html'.

Developed in https://github.com/wordpress/wordpress-develop/7232
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58868].

Props jonsurrell.
See #61576.

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


7 weeks ago
#116

@dmsnell commented on PR #6983:


7 weeks ago
#117

@sirreal mind sharing any thoughts on this? I don't think it adds much other than more descriptive bailing messages, but it does structure the algorithm more and include more comments. my thinking is that this might be useful to guide us when we spend more focus on trying to handle these situations.

@jonsurrell commented on PR #7230:


7 weeks ago
#118

A comment about html5lib-tests and this PR. There are currently 2 test failures on trunk. I believe a latent whitespace bug was exposed in #7232.

When removing all the related whitespace html5lib-test skips, there are a total of 8 failures on trunk. All of them are fixed by this PR:

- FAILURES!
- Tests: 1500, Assertions: 1082, Failures: 8, Skipped: 418.
+ OK, but incomplete, skipped, or risky tests!
+ Tests: 1500, Assertions: 1076, Skipped: 424.

@jonsurrell commented on PR #7230:


7 weeks ago
#119

A comment about html5lib-tests and this PR. There are currently 2 test failures on trunk. I believe a latent whitespace bug was exposed in #7232.

When removing all the related whitespace html5lib-test skips, there are a total of 8 failures on trunk. All of them are fixed by this PR:

- FAILURES!
- Tests: 1500, Assertions: 1082, Failures: 8, Skipped: 418.
+ OK, but incomplete, skipped, or risky tests!
+ Tests: 1500, Assertions: 1076, Skipped: 424.

@jonsurrell commented on PR #6982:


7 weeks ago
#120

I think the HTML5lib tests will help here. I'll try to make some progress on this.

FAILURES!
-Tests: 1494, Assertions: 1076, Failures: 2, Skipped: 418.
+Tests: 1494, Assertions: 1103, Failures: 11, Skipped: 391.

@jonsurrell commented on PR #6982:


7 weeks ago
#121

I'm exploring the Noah's Ark work to limit equivalent elements on the active formatting element stack in https://github.com/dmsnell/wordpress-develop/pull/19.

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


7 weeks ago
#122

@jonsurrell commented on PR #6983:


6 weeks ago
#123

This is difficult to interpret and interacts with active format reconstruction in interesting ways. I think my preference would be to work on that first (#6982) then come back to this.

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#124

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#125

@dmsnell commented on PR #6982:


6 weeks ago
#126

Note from conversation: can we simply bail if we have more than three of the same kind of tag name on the list of active formatting elements (up to the nearest marker) and then defer any format Noah's Ark processing?

this could potentially let us process the most common form of active format reconstruction without implementing the complicated and costly parts of the algorithm.

@dmsnell commented on PR #6982:


6 weeks ago
#127

Need to report attributes even on virtual reconstructed attributes. Prior to now we haven't actually recreated any formatting elements, but once we start doing so, we have to ensure we don't report that attributes are missing

#128 @dmsnell
6 weeks ago

In 58966:

HTML API: Fix logic bug in HTML Processor when opening A element.

A mistake in the original code handling opening A elements in the HTML Processor led to mistakes in parsing where the Processor would bail in situations when it could have proceeded. While this was errant behavior, it didn't violate the public contract since it would bail in these situations.

This patch fixes the mistake, which was to only break out of the innermost loop instead of breaking from the containing loop, which resolves the issue.

Developed in https://github.com/WordPress/wordpress-develop/pull/7281
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [56274].

Props jonsurrell.
See #61576.

#129 @dmsnell
6 weeks ago

In 58967:

HTML API: Add missing NOBR end tag handling to HTML Processor.

When the HTML Processor was introduced, an oversight led to a missing case for handling a closing NOBR tag. The NOBR element is a deprecated tag and should not be used.

This patch adds the missing case so that the deprecated NOBR end tag is appropriately handled.

Developed in https://github.com/WordPress/wordpress-develop/pull/7282
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [56274].

Props jonsurrell.
See #61576.

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#130

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#131

@jonsurrell commented on PR #7230:


6 weeks ago
#132

After reviewing this and text subdivision, my preference is to go the other direction. I've pushed a change to stop running subdivision on CDATA and let this change handle CDATA on its own.

This ticket was mentioned in PR #7286 on WordPress/wordpress-develop by @jonsurrell.


6 weeks ago
#133

@jonsurrell commented on PR #7286:


6 weeks ago
#134

I've collected other potential "fixes" to the same problem, but I haven't been able to verify incorrect behaviors yet: https://github.com/sirreal/wordpress-develop/pull/7

#135 @dmsnell
6 weeks ago

In 58977:

HTML API: Ensure that NULL and whitespace-only CDATA sections don't forbid FRAMESET.

When CDATA sections (which can only occur inside SVG and MathML content) consist only of NULL bytes or whitespace characters they should not clear the "frameset ok" flag. Previously they have always been clearing this flag, but in this patch the logic is updated to detect these sequences properly.

Developed in https://github.com/WordPress/wordpress-develop/pull/7230
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props dmsnell, jonsurrell.
See #61576.

#137 @dmsnell
6 weeks ago

In 58992:

HTML API: Only examine HTML nodes in pop_until() instack of open elements.

The pop_until( $tag_name ) method in the stack of open elements should only be examining HTML elements, but it has only been checking the tag name. This has led to closing the wrong tags when run from inside foreign content. A very specific situation where this may arise is when a TEMPLATE closer is found inside foreign content, inside another template.

HTML:template   SVG:template                 HTML:/template
<template><svg><template><foreignObject><div></template><div>
╰──< this outer TEMPLATE is closed by this one >───╯

This patch constains the method to checking for elements matching the tag name which are in the HTML namespace so that the proper detection occurs.

Developed in https://github.com/WordPress/wordpress-develop/pull/7286
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58867].

Props dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#139

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#140

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#141

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


6 weeks ago
#142

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


5 weeks ago
#143

This ticket was mentioned in PR #7312 on WordPress/wordpress-develop by @jonsurrell.


5 weeks ago
#144

Improve support for content appearing after BODY, HTML nodes where
possible. Some HTML may be representable by the HTML API without
introducing ordering problems.

When HTML moves to the "after-{body,html,…}" insertion modes, BODY and
HTML remain on the stack of open elements. Certain content is inserted
in place (comments), some content is inserted in BODY (white space
text), and most other content will switch back to the "in body"
insertion mode.

The difficulty with the HTML API is that the result in the tree may be
out-of-order compared to the positioning of the corresponding tokens in
the HTML input text. For example:

</body>FOO

Produces this tree where the ordering is the same even though a BODY
closer is reached:

[DOCUMENT]

└HTML

├HEAD
├BODY
│ └#text: FOO
└#comment: comment

Compare with this case:

</body>FOO

The tree is identical, the FOO text is _out of order_ in the tree
compared to the HTML string. This case continues to be invalid because
when the FOO text needs to be reached, the parser would have already
reached a BODY tag closer and content outside of the body:

[DOCUMENT]

└HTML

├HEAD
├BODY
│ └#text: FOO
└#comment: comment

This approach deviates from the explicit steps in the spec and will pop
the BODY and HTML tags from the stack of open elements when content
should appear outside of them. Flags are set for each "after-" mode to
prevent returning to BODY and then continuing to produce content
outside, which would produce the unsupported out-of-order behavior.

See https://html.spec.whatwg.org/#parsing-main-afterbody.

Trac ticket: Core-61576

@dmsnell commented on PR #7312:


5 weeks ago
#145

I'm wondering if here we could lean on some of the work I've explored for XML well-formedness error-recovery.

Imagine we start with "proper HTML parsing in tree-order node traversal" but you could opt-in to certain errors.

$processor->parser_tolerance( WP_HTML_Processor::VISIT_NODES_OUT_OF_ORDER, true );
$processor->parser_tolerance( WP_HTML_Processor::OVERLOOK_FOSTERED_NODES, true );

I'm concerned with patches like this because I think there's a real semantic challenge when thinking about operations like set_inner_html(). For a chunk of HTML where the content after BODY itnerleaves IN BODY and AFTER BODY content, can we provide a rational answer to the question: how should the document change when setting inner HTML of the BODY?

This ticket was mentioned in PR #7330 on WordPress/wordpress-develop by @dmsnell.


5 weeks ago
#146

The HTML specification indicates that an HTML tag with the name "IMAGE" should be renamed as "IMG" and handled as if it were an "IMG", but this only applies to elements in the HTML namespace.

In this patch the HTML Processor is updated to ensure that it doesn't remap the tag name when processing foreign content, such as SVG and MathML markup.

See #61576.

#147 @dmsnell
5 weeks ago

In 59014:

HTML API: Make WP_HTML_Processor::get_tag() namespace aware.

The HTML specification indicates that an HTML tag with the name "IMAGE"
should be renamed as "IMG" and handled as if it were an "IMG", but this
only applies to elements in the HTML namespace.

In this patch the HTML Processor is updated to ensure that it doesn't
remap the tag name when processing foreign content, such as SVG and
MathML markup.

Developed in https://github.com/wordpress/wordpress-develop/7330
Discussed in https://core.trac.wordpress.org/ticket/61656

Props dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #7332 on WordPress/wordpress-develop by @dmsnell.


5 weeks ago
#149

Trac ticket: Core-61576.

Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.

This patch changes the behavior accordingly. No tests are impacted by this change.

Follow-up to [58867].

@dmsnell commented on PR #7332:


5 weeks ago
#150

@sirreal I don't feel strongly about this, but I do think that if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break. I've expanded the docblock with a comparison to get_tag().

@jonsurrell commented on PR #7332:


5 weeks ago
#151

I've thought about this some more and I don't think we should make this change.

Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.

I'm not convinced this is the purpose of the method, although it depends what is meant by "printing and display."

Primarily, HTML API is for working with HTML input and HTML output. In HTML, the casing of tag names is irrelevant. There's no reason for svg tags to be "correctly" cased (<altGlyph> instead of <altglyph>), just like there's no reason to use upper or lower or any particular casing for HTML tag names. This method handles the correct casing for SVG element tag names, but that's not important for serializing and printing the svg tags in an HTML document.

It doesn't seem more correct to me to lowercase HTML tag names, MathML tag names, and then used some mixed casing on SVG tag names when the element name differs.

Here's my take on this method.

This method applies the rules from the specification on parsing foreign content:

If the adjusted current node is an element in the SVG namespace, and the token's tag name is one of the ones in the first column of the following table, change the tag name to the name given in the corresponding cell in the second column. (This fixes the case of SVG elements that are not all lowercase.)

| Tag name | Element name |
|:--- |:--- |
| altglyph | altGlyph |
| altglyphdef | altGlyphDef |
… | …

This seems to adjust for a difference between an HTML tag name (case insensitive) and an SVG element name. This adjustment is important as a consideration of tree construction where HTML tokens are transformed into elements in a tree.

At the moment, this roughly corresponds to `Node.nodeName` for elements (and `Element.tagName`). If we make this change, then it becomes an arbitrary decision.

[(An interesting note, qualified name _is_ a DOM concept and does seem to correspond to this method as implemented.)
](https://dom.spec.whatwg.org/#concept-element-qualified-name)

And to state the obvious, it should be trivial for consuming code to lowercase tag names if desired.

---

if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break

Definitely.

@dmsnell commented on PR #7332:


5 weeks ago
#152

It doesn't seem more correct to me to lowercase HTML tag names

It's less about being correct and more about expectations. I think if you survey a bunch of people and ask them how they feel about turning <p><a><span> into <P><A><SPAN> they will have opinions about that. Also, we can survey HTML in the wild and see what a sampling of global expectations might be, given the prevalence of the styling.

Here is a survey from my list of ~300k HTML pages.

Type of tag Count Percent
ALL UPPER 307,499 2.7%
all lower 12,388,675 96%
Mixed Case 221,307 1.7%

My point in sharing these numbers isn't to say they dictate what we do; just noting that the overwhelming majority of HTML out there is using lower-case tag names and people have grown accustomed to them.

In the case of normalization, this is the _default_ behavior, which is why I care about it.

But also going back in time, the reason I remember for introducing these functions was just to ensure that the html5lib tests pass which check against the adjusted foreign content tag names and attribute names. I don't feel these have a central important role in the spec compliance.

Your point is sound: it's trivial for calling code to lower-case-fold the tag names. Except, then they also have to remember to only do that for elements in the HTML namespace and not to do so for foreign elements. That leaves calling code calling this function and then immediately asking if it's an HTML element and then lower-casing.

$tag_name = $this->get_qualified_tag_name();
if ( 'html' === $this->get_namespace() ) {
        $tag_name = strtolower( $tag_name );
}

maybe I had a gut reaction since get_qualified_tag_name() _just_ made the exact same namespace check before returning, and it felt like spreading out the same semantic between the inside and outside of that method.

I was going to close this, but I think I'll leave it open at least a little longer to continue pondering.

@dmsnell commented on PR #7332:


5 weeks ago
#153

For general interest: here is the list of all-caps and mixed-case tag names from my survey. The list includes tag closers, and I didn't attempt to check if the closer casing matched the opening casing. Obvious HTML errors are evident, especially in the list of once-seen tags.

<details><summary>The list</summary>

ALL UPPER TAGS
    65,179: A
    30,621: BR
    28,545: TD
    23,392: FONT
    18,875: P
    15,786: IMG
    15,154: B
    13,723: TR
    13,142: LI
    10,102: OPTION
     6,578: META
     5,069: TABLE
     4,701: HEAD
     4,699: HTML
     4,658: TITLE
     4,576: BODY
     4,137: I
     4,008: ID
     3,945: EM
     3,476: H1
     3,167: CENTER
     2,532: HR
     2,351: SPAN
     1,654: INPUT
     1,529: STRONG
     1,515: UL
     1,098: AREA
       990: SCRIPT
       853: H2
       844: O:P
       792: H3
       785: WBR
       777: DIV
       740: LINK
       738: TH
       661: MENU
       620: BLOCKQUOTE
       610: TBODY
       418: U
       400: FORM
       379: H4
       318: SMALL
       270: STYLE
       266: FRAME
       200: PRE
       186: ADDRESS
       174: BIG
       168: MAP
       166: PARAM
       151: SELECT
       142: FRAMESET
       124: TT
        94: SPACER
        92: ABBR
        88: CITE
        81: SUP
        78: NOBR
        72: NOSCRIPT
        63: H6
        56: BASE
        52: H5
        51: COL
        49: IFRAME
        36: COLGROUP
        35: DD
        34: NOFRAMES
        33: DT
        30: OBJECT
        28: EMBED
        27: SUMMARY
        25: MARQUEE
        25: SUB
        22: HEADER
        22: LAYER
        20: EC
        19: BASEFONT
        19: OL
        15: FIGURE
        14: NAV
        13: DFN
        12: V:F
        12: ARTICLE
        12: ILAYER
        11: SECTION
        11: X-CLARIS-WINDOW
        11: X-CLARIS-TAGVIEW
        11: CAPTION
        11: BUTTON
         9: THEAD
         9: FIGCAPTION
         8: O
         8: ZBLINK
         8: LABEL
         7: CODE
         7: DIR
         7: LH
         7: AUDIO
         6: BLINK
         6: STYLE='MSO-BIDI-FONT-WEIGHT:
         5: X-SAS-WINDOW
         5: BGSOUND
         4: FOOTER
         4: TEXTAREA
         4: ALIGN=LEFT
         4: X-CLARIS-REMOTESAVE
         4: NOINDEX
         3: APPLET
         3: LEFT
         3: BOLD
         3: KBD
         3: SP
         2: Y
         2: E=
         2: C
         2: N͓
         2: MARK
         2: STRIKE
         2: WSJ
         2: NOLAYER
         2: ALIGN="LEFT"
         2: INSERT_COUNT*
         2: BD
         2: NOFRAME
         2: INS
         2: LNAME
         2: FNAME
         2: URL
         2: ACRONYM
Once-seen tags: A,, AAREA, ALIGN="RIGHT", ALIGN=CENTER, ASIDE, B<P, BGCOLOR="#000000", BOTTOM, BR<B, BRïï, CENTRE, CFINCLUDE, CLEAR, COLDEF, COLDEFS, CONNECTED,PREFERRED, CSACTION, CSACTIONDICT, CSACTIONITEM, CSACTIONS, CSSCRIPTDICT, CUFON, CUFONCANVAS, CUFONTEXT, DL, DOC, DOCTYPE, EF_B_RED, FIELDSET, GCSE:SEARCH, GCSE:SEARCHBOX-ONLY, H, HEADS_TAG, HRNOSHADE, HTM, HTML!, IF_ERRORPARAM, IF_ERRORSTR, IF_ERRORTYPE, INSERTFLASHHEAD, INSTITUTE, JSON, LEGEND, LI,<A, MAJOR, METANAME="DESCRIPTION", METANAME="KEYWORDS", NAME, NOAUTOLINK, NOF, O:LOCK, OCCUPATION, ONTOLOGY, ROWS, SAMP, SPOUSE, T, TAIL, TILTE, TIME, U7:P, UNION_TAG_INDEX_FOOTER, UNION_TAG_INDEX_HEADER_1, UNION_TAG_INDEX_HEADER_2, UNION_TAG_INDEX_TITLE, UP-21, V:FORMULAS, V:PATH, V:SHAPETYPE, V:STROKE

Mixed Tags
    32,047: Key
    32,041: Contents
    32,041: LastModified
    32,041: ETag
    32,041: Size
    28,041: StorageClass
     4,007: Owner
     4,004: DisplayName
     4,000: Generation
     4,000: MetaGeneration
     1,578: feColorMatrix
     1,466: feComposite
     1,390: feComponentTransfer
     1,390: feFuncA
     1,387: feFuncR
     1,387: feFuncG
     1,387: feFuncB
       807: linearGradient
       680: clipPath
       459: Error
       459: Code
       459: Message
       380: RequestId
       379: Option
       373: HostId
       155: feGaussianBlur
       125: feBlend
       122: feOffset
       111: bR
        91: Br
        80: tD
        79: feFlood
        75: Td
        74: feMergeNode
        71: Font
        64: o:SmartTagType
        49: textPath
        48: st1:State
        45: Meta
        43: radialGradient
        41: tR
        38: animateTransform
        38: ListBucketResult
        38: Name
        38: Prefix
        38: Marker
        38: IsTruncated
        37: feMerge
        37: st1:City
        34: MaxKeys
        33: Tr
        31: Center
        31: Table
        30: rdf:RDF
        30: asp:ListItem
        29: Img
        28: feMorphology
        26: Strong
        26: foaf:givenName
        26: foaf:familyName
        25: QueryParameterName
        25: QueryParameterValue
        25: Reason
        24: st1:PlaceName
        23: AccountName
        23: cc:Work
        21: Script
        21: Title
        20: RecommendDoc
        19: st1:PlaceType
        19: class="Text"
        16: Li
        13: noScript
        13: psi:contextVar
        13: contBox-x
        12: Input
        11: u51:SmartTagType
        10: Details
        10: Body
        10: ItemTemplate
        10: u46:SmartTagType
        10: u48:SmartTagType
         9: Dd
         9: foreignObject
         9: Th
         9: psi:queryVar
         9: u26:SmartTagType
         9: u40:SmartTagType
         9: u45:SmartTagType
         9: u52:SmartTagType
         9: u53:SmartTagType
         9: st1:Street
         8: u28:SmartTagType
         8: u29:SmartTagType
         8: u31:SmartTagType
         8: u42:SmartTagType
         8: u47:SmartTagType
         8: u49:SmartTagType
         8: u54:SmartTagType
         8: u55:SmartTagType
         8: Button
         8: asp:RequiredFieldValidator
         8: asp:TextBox
         7: color="#CC0000"
         7: MainOrArchivePage
         7: rdf:Description
         7: u23:SmartTagType
         7: u24:SmartTagType
         7: u27:SmartTagType
         7: u33:SmartTagType
         7: u36:SmartTagType
         7: u37:SmartTagType
         7: u43:SmartTagType
         7: Head
         7: color=#fFee00
         6: Event-Card-Open-Close-Toggle
         6: HFBusiness
         6: u1:SmartTagType
         6: u4:SmartTagType
         6: u25:SmartTagType
         6: u30:SmartTagType
         6: u34:SmartTagType
         6: u35:SmartTagType
         6: u38:SmartTagType
         6: u39:SmartTagType
         6: u41:SmartTagType
         6: u44:SmartTagType
         6: u50:SmartTagType
         6: u57:SmartTagType
         6: u58:SmartTagType
         6: u60:SmartTagType
         6: u62:SmartTagType
         6: u63:SmartTagType
         6: u64:SmartTagType
         6: u65:SmartTagType
         6: u66:SmartTagType
         6: u67:SmartTagType
         6: u68:SmartTagType
         6: u69:SmartTagType
         6: iconSm-x
         6: Select
         6: asp:Panel
         5: psi:sessionVar
         5: u32:SmartTagType
         5: u61:SmartTagType
         5: st1:PostalCode
         5: liNK
         4: InLineReplace
         4: Style
         4: psi:sortOp
         4: u7:SmartTagType
         4: u8:SmartTagType
         4: u9:SmartTagType
         4: u10:SmartTagType
         4: u11:SmartTagType
         4: u12:SmartTagType
         4: u17:SmartTagType
         4: u21:SmartTagType
         4: u22:SmartTagType
         4: u56:SmartTagType
         4: u59:SmartTagType
         4: NextMarker
         4: httpStatusCode
         4: Form
         3: Resource
         3: ListAllMyBucketsResult
         3: Buckets
         3: Ul
         3: u5:SmartTagType
         3: u6:SmartTagType
         3: u13:SmartTagType
         3: u14:SmartTagType
         3: u15:SmartTagType
         3: u16:SmartTagType
         3: u73:SmartTagType
         3: u72:SmartTagType
         3: u74:SmartTagType
         3: toggleSection
         3: Initial-scale=1.0"
         3: Ozelliklerimiz<
         3: BucketName
         3: asp:Label
         3: invalidTag
         3: String
         2: hR
         2: xmpMM:DerivedFrom
         2: QlÃIPq¼I“3ˆJ]ߝ¢*5׏¾¢GC
         2: Wo|¥†ƒ´bFôȉ®D:ýx3¨j8V~ùs¸xÑ,4P[\†ô÷sDóÃ1#ð£y)F?|ù‰
         2: k*H)ϼzЄ’â5˜U%Oýƒ

         2: Z.¶seiÁ%Ž<Aùƒ¯~õÀZÇv¸„¼ºXBË
                                        är9KÇãào¼KNôT·2 ÛÁcÚFáÌúŒ¾èMJ`—.ôSÕû”UÐÀ¡Õ7»H$2³„Èhe¤þPžEç‰ûP°IBZ)R]
                                                                                                            8R°ÊŒÆ2dd5æ
         2: sÓ?©JhJ¢ÉéëcZzֆAÝ6âd£5èI”å
         2: H©õj&[؇²ê³èsúòšÆýÑ
         2: NYž´Ø&bTɘ9žC¢ˆq¼Vº
         2: JÏS֖!Ðr¼K’“9‰äy´è(J±œÄ¤8xõ ’L&c1Ç
Ga [Yj*Ëf«:þ¹ŠDWW0¨ˆ‡ÜÑ(H™Ôž
         2: W°éÿ„Û=“ek¾6‹¹Æ^Äu
         2: hX¼Wy2ûºfzÐ[*v–Àq°«
         2: KMx¢ÛY75)-¾=¤vw3ê;¿î=-Sç‚\»ÑŒ7Ëu\2J^S[¤‚C&CC-'ððÒ©ä}'‰é¯XLjҒ ðXu•†AÔdd5æ
         2: Rñú9w1÷ñÿ\²þd=¡ùºÞ¿òel¿Ô³¤ø¥ãôrîcï‡cþ¹eüÈp~n·¯ü™[
         2: y+x¬û¤óéI¡*Y1"J+ʼn
         2: Q·öÝm§o}Ý)eóVQú=QNY€$GIËZĀ¥,LÌkÐPXh¾±œÂA–#ÌÿUp$<‘3,$ûG˜ë@Ü9@32e¯C@@Œæ8žt<ÿhàÿy¿ÍŠÏú#´âÞþ
         2: pñçîð†²’Ä²Zóö—èh5„ßϖê¶
         2: i‘!?Æ
%ˆµÀl[N&åoÖHA%$&G
         2: WK¼bøÒ²ÿ³«úO=tãóÒî2ãm%ü[6*
                                      =
         2: ré*Fu,Ü0,ž¯g!j4VÉTKݢÜ4LKgN?ØΤ§gѼž«o"(äFö˜Eo¡÷»ã4Hîן·ž@ËNºPê"áG“ӏ5È¡[‰Lq
         2: Zâè:P%ÞвŒXS1Œó<e@ʀK›ôw,*ËË1ҁ•s{¹¥
         2: kXu¦!.\
         2: AùCd»´{×ýÚà"CÁÂ҂0ë÷NG»
         2: o'©SVVp
         2: YƕIeUVë¦X«+M¦oòÖÞEöY{¸²Ö"ïb›)¯Ü‹o´4YZË
         2: Bþ®J¼½+˹»EUÅUððV2¾UiÜc‹KŽ¶PA(P*˜ÑE#
         2: j«Ct³èî֗{]ò”“Z¦Ûk]Ï%%ËzT~
³îËÖÒÁ®Ú×µó÷ybmMqjê©»±²pÐÕBp‚_ß<À}õ)iBËN’Ø(ã4à‰ÉÆêìl¤Ô¤êÛÑR¸VJ¬Ý[ëgHáÍÊë;hNqXbš¢ÖUùM¿&K8ªui›Mžaf¶¼[A{ö"ÐÛÕÝ}Y
                         dës\ajwjêö¾u.¤F¥(CxR²JN
         2: p»Jk<ð§Ÿ=¡+ü㜣h²Ò8ÖäõM¼êú©*V¬ÆZX·qñ+,’”½ÿ5e5xžHö+«ÙöÑX[Æ¿ºú
         2: n£Øbðl•ÄÜÄCilüì]aûÛ÷ª˜Æâb·XZ°%À
                                            jH*
         2: Pmj¼Ï'Þ²dõCUÙz¢ª©ªèJ¦«„@j
         2: G£„øg@]($÷G†TJŒ‰ó V¼§øP}ãà(ˆâ÷iÊ
tñtRŽá\Yt“ÝP*2
         2: Aù›¯våÀXpÙ~%*qà°à
         2: F%{þöéyàO4ôéEs#¢†=ó^G:²

         2: bgcolor="#FFFFFF"
         2: Valign="middle"
         2: tdOCTYPE
         2: u18:SmartTagType
         2: u19:SmartTagType
         2: u20:SmartTagType
         2: u70:SmartTagType
         2: u71:SmartTagType
         2: AHREF="http:
         2: Address
         2: socketType
         2: i:pgfRef
         2: ServerTime
         2: TraceId
         2: Transition
         2: font-face="Times
         2: Werewolf
         2: Civilian
         2: feTurbulence
         2: feDisplacementMap
         2: Area
         2: Html
         2: headHTML
         2: asp:HyperLink
         2: asp:DropDownList
         2: asp:CheckBox
Once-seen tags: AAddress, AHREF="Collections.html", AHREF="mailto:hydra@umail.ucsb.edu", Basilica, BlockQuote, Bstyle="color:black;background-color:#ffff66", Bucket, CENTEeR, Content-Type:, Endpoint, Footer, GallerySection, Gblockquote, HREF="http:, Hostip, HostipLookupResultSet, Iframe, In, InstragramWidget, Left, Link, Mmeta, Numbertemplate, P,Here, Page, Rosa, Rotten, SCRILongDateAGE="JavaS0riSunday,, SetModTime, Span, Special, TABle, UdeM_menu, UnknownOperationException, Xlink, YX’*~³;]<úæ
¹Ò˜Õ^ö¬Ö†Õ'ÿԊwN„JcZo^˜«èk\E£ò⑔»P\éäD~ƒÑ4Þ÷ä97
                                                    «Æ ý, alt="El, aname="OLE_LINK2", aria-label="Instagram", aria-label="YouTube", asp:Button, asp:CustomValidator, asp:LinkButton, asp:Localize, asp:RegularExpressionValidator, asp:ValidationSummary, bDopo, clientConfig, cmLogo, com:Text, contPad, countryAbbrev, countryName, customHeaders, displayName, displayShortName, dnn:DnnCssInclude, emailProvider, fOnt, face="Verdana", font="C60000, f~ÐىSBøhš†_ÝXÄĚÁ÷GV‰~ucš¦ác{3÷Övñ:B¹1°$æ1·¸ž°BÛ?YçÁkUHY6~ðå,€GVðƒ, gml:Null, gml:Point, gml:boundedBy, gml:featureMember, gml:pointProperty, google-site-verification=OgY2mih_AxAZzi7f8b33QTOGHoScolbNOE6aTlqold0, httpProtocol, incomingServer, ipLocation, j‚A|ó+§ðü©cÈå,œ¼²e[WId2ùSÂë#㈧Rxáô¾~æBµÕ™œÞôq–m#ºÃ—׆ÐÒÂ
              §Ž¡¶*€ó7†Qh2Ré, link="#0000FF", mes:Error, mes:ErrorMessage, meta name="description" content="Evo, meta name="keywords" content="Bike, meta property="og:description" content="Rent, navBarCenter, ns0:City, ns0:PlaceName, ns0:PlaceType, ns0:State, onMouseover="over_effect(event,'outset')", outgoingServer, pYes,, psi:ZOTERO_COinS, psi:iktList, psi:sortOptions, quicklinkComp, skippedTag, title="Fairfield, titleB, toggleCont, togglePad, xáP¾~æ$
Found 12,917,481 tags
  of those:
    ALL UPPER: 307,499 (2.38%)
    all lower: 12,388,675 (95.906%)
   Mixed Case: 221,307 (1.713%)

</details>

This ticket was mentioned in PR #7347 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#155

In the rules for parsing tokens in foreign content there are 2 places that indicated tokens should be processed "according to the rules given in the section corresponding to the current insertion mode in HTML content."

First (bold mine):

A start tag whose tag name is one of: "b", "big", "blockquote", "body", "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", "table", "tt", "u", "ul", "var"
A start tag whose tag name is "font", if the token has any attributes named "color", "face", or "size"
An end tag whose tag name is "br", "p"
Parse error.

While the current node is not a MathML text integration point, an HTML integration point, or an element in the HTML namespace, pop elements from the stack of open elements.

Reprocess the token according to the rules given in the section corresponding to the current insertion mode in HTML content.

And later (bold mine):

Any other end tag
Run these steps:

  1. Initialize node to be the current node (the bottommost node of the stack).
  2. If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.
  3. Loop: If node is the topmost element in the stack of open elements, then return. (fragment case)
  4. If node's tag name, converted to ASCII lowercase, is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.
  5. Set node to the previous entry in the stack of open elements.
  6. If node is not an element in the HTML namespace, return to the step labeled loop.
  7. Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.

While working on fragment parsing and html5lib-tests in https://github.com/dmsnell/wordpress-develop/pull/22, I discovered an infinite loop that seems to occur in the following situation. At an svg:svg context node, create a fragment parser for with HTML </p>. This is the first condition mentioned.

In a full parser, the instruction "While the current node is not a MathML text integration point, an HTML integration point, or an element in the HTML namespace, pop elements from the stack of open elements." would ensure that when reprocessing the token it does not re-enter the foreign content rules because by popping elements the token would no longer be foreign content. However, in a fragment parser there may not be any nodes to pop and the context element may continue to cause foreign content handling to be applied. But the instruction seems to indicate that the token should be handled in the current HTML insertion mode.

This is fixed by moving to the same place in both cases which is a reproduction of the HTML handling switch on the current insertion mode.

This fixes the infinite loop that that appeared in https://github.com/dmsnell/wordpress-develop/pull/22.

Observed in https://github.com/dmsnell/wordpress-develop/pull/22.

Follow-up to [58868]
See: #61576

#156 @dmsnell
4 weeks ago

In 59024:

HTML API: Prevent infinite loop in foreign content reprocessing step.

An infinite loop was discovered in specific situations within foreign content inside the HTML Processor when a given node inside foreign content must be handled in the rules for the current insertion mode.

This patch resolves the loop by handling those nodes directly instead of reprocessing the node, which previously was redirecting control flow back to where the loop started.

Developed in https://github.com/wordpress/wordpress-develop/7347
Discussed in https://core.trac.wordpress.org/ticket/61656

Follow-up to [58868].

Props jonsurrell.
See #61576.

@jonsurrell commented on PR #7425:


3 weeks ago
#159

@dmsnell My first thought was that this logic should be used to determine the parsing mode before calling Tag_Processor (parent) step:

https://github.com/WordPress/wordpress-develop/blob/34c861b6a53cb82721ce0e82a3e2024d91c6ef7e/src/wp-includes/html-api/class-wp-html-processor.php#L873-L892

And _maybe_ the tag procuessor just needs to know html/foreign more than the actual namespace it's in 🤔

@jonsurrell commented on PR #7425:


3 weeks ago
#160

Thanks for the fixes here, I think this is good to go @dmsnell.

#161 @dmsnell
2 weeks ago

In 59099:

HTML API: Switch to HTML namespace when entering Integration Points.

When encountering inline SVG and MathML content in an HTML document, there are certain "integration points" which transition back into the HTML parsing ruleset. Previously, the HTML API was incorrectly switching into the namespace of the element transitioning into that ruleset.

In this patch, the correct transition is made, where all integration points refer to HTML rules, while non-integration points refer to the rules of the namespace corresponding to the token itself.

Developed in https://github.com/wordpress/wordpress-develop/pull/7425
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#163 @davidbaumwald
2 weeks ago

  • Resolution set to fixed
  • Status changed from new to closed

With 6.7 Beta 1 releasing shortly, I'm closing this ticket as Fixed. Any follow-up work can happen in new tickets. If there's additional iteration needed here specifically, this can be reopened as a Task (Blessed) in 6.7.

Note: See TracTickets for help on using tickets.