Make WordPress Core

Opened 3 weeks ago

Last modified 8 hours ago

#61576 new enhancement

HTML API: Improved spec support in 6.7

Reported by: dmsnell Owned by:
Milestone: 6.7 Priority: normal
Severity: normal Version: 6.6
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

The HTML Processor, introduced in #58517, remains a work in progress until it supports the full HTML5 specification, or at least every part of it that will ever be supported. A few corners of the specification don't obviously fit well into the paradigm, specifically foster parenting and some parts of the adoption agency algorithm.

However, during the WordPress 6.7 development cycle the hope is to rapidly add as much of the remaining tag and algorithm support as possible, since the refactor during 6.6 (visiting all nodes, real and virtual, from #61348) unlocks most of the remaining rules from a design standpoint.

This is a tracking ticket for that work during this release cycle.


Notable Support

Change History (46)

This ticket was mentioned in PR #6968 on WordPress/wordpress-develop by @jonsurrell.


3 weeks ago
#1

  • Keywords has-patch added

Trac ticket: Core-61576

#2 @dmsnell
3 weeks ago

In 58676:

HTML API: Add current_node_is() helper method to stack of open elements.

As part of work to add more spec support to the HTML API, this new method
will make it easier to implement the logic when in the SELECT and TABLE
insertion modes.

Developed in https://github.com/WordPress/wordpress-develop/pull/6968
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.
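
As a rough illustration of what such a helper does, here is a minimal Python sketch (not the WordPress PHP code). The `OpenElements` class and its methods are hypothetical names chosen for this example; the only idea taken from the commit is that the "current node" is the most recently pushed entry on the stack of open elements.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class OpenElements:
    """Hypothetical stand-in for a parser's stack of open elements."""

    stack: list[str] = field(default_factory=list)

    def push(self, node_name: str) -> None:
        # Store names upper-cased so comparisons are case-insensitive.
        self.stack.append(node_name.upper())

    def pop(self) -> str:
        return self.stack.pop()

    def current_node_is(self, node_name: str) -> bool:
        # The current node is the last entry pushed onto the stack.
        return bool(self.stack) and self.stack[-1] == node_name.upper()


elements = OpenElements()
for name in ("html", "body", "select", "option"):
    elements.push(name)

print(elements.current_node_is("OPTION"))  # True
elements.pop()
print(elements.current_node_is("OPTION"))  # False
```

A helper like this keeps the insertion-mode rules readable, since many of them begin with "if the current node is an X element".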

This ticket was mentioned in PR #5908 on WordPress/wordpress-develop by @jonsurrell.


3 weeks ago
#4

  • Keywords has-unit-tests added

Trac ticket: Core-61576

A start tag whose tag name is "select"
Reconstruct the active formatting elements, if any.

Insert an HTML element for the token.

Set the frameset-ok flag to "not ok".

If the insertion mode is one of "in table", "in caption", "in table body", "in row", or "in cell", then switch the insertion mode to "in select in table". Otherwise, switch the insertion mode to "in select".
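
The final step quoted above can be sketched in a few lines of Python. This is only an illustration of the rule as quoted; the function name `mode_after_select_start_tag` and the constant `TABLE_RELATED_MODES` are made up for the example and do not appear in the PR.

```python
# Insertion modes that cause a SELECT start tag to enter "in select in table".
TABLE_RELATED_MODES = frozenset(
    {"in table", "in caption", "in table body", "in row", "in cell"}
)


def mode_after_select_start_tag(current_mode: str) -> str:
    """Return the insertion mode to switch to after a SELECT start tag."""
    if current_mode in TABLE_RELATED_MODES:
        return "in select in table"
    return "in select"


print(mode_after_select_start_tag("in body"))  # in select
print(mode_after_select_start_tag("in cell"))  # in select in table
```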

Skip fewer of the html5lib-tests:

OK, but incomplete, skipped, or risky tests!
-Tests: 607, Assertions: 174, Skipped: 433.
+Tests: 607, Assertions: 184, Skipped: 423.

#5 @dmsnell
3 weeks ago

In 58677:

HTML API: Support SELECT insertion mode.

As part of work to add more spec support to the HTML API, this patch adds
support for the SELECT, OPTION, and OPTGROUP elements, including the
requisite support for the IN SELECT insertion mode.

Developed in https://github.com/WordPress/wordpress-develop/pull/5908
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

This ticket was mentioned in PR #6972 on WordPress/wordpress-develop by @dmsnell.


3 weeks ago
#7

Trac ticket: Core-61576

## Summary

  • [ ] There are some failing tests that need to pass.
  • [ ] This needs a _full_ and careful audit.

## Description

As part of work to add more spec support to the HTML API, this patch adds support for the remaining missing tags in the IN BODY insertion mode. Not all of the added tags are supported, because in some cases they reset the insertion mode and are reprocessed where they will be rejected.

html5lib tests

- Tests: 607, Assertions: 174, Skipped: 433.
+ Tests: 607, Assertions: 224, Errors: 2, Failures: 17, Skipped: 381.

This ticket was mentioned in PR #6973 on WordPress/wordpress-develop by @dmsnell.


3 weeks ago
#8

Trac ticket: Core-61576

## Description

As part of work to add more spec support to the HTML API, this patch adds stubs for all of the remaining parser insertion modes. These modes are not all supported, but they will be necessary to continue adding support for other tags and markup.

#9 @dmsnell
3 weeks ago

In 58679:

HTML API: Stub out remaining insertion modes in the HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
stubs for all of the remaining parser insertion modes in the HTML Processor.
These modes are not all supported yet, but they will be necessary to continue
adding support for other tags and markup.

Developed in https://github.com/WordPress/wordpress-develop/pull/6973
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#11 @TobiasBg
3 weeks ago

@dmsnell: There's a single @since 6.4.0 in [58679] that probably also should be a @since 6.7.0, no?

#12 @dmsnell
3 weeks ago

In 58680:

HTML API: Fix wrong @since tag.

When the remaining insertion modes were stubbed in the HTML Processor,
a @since tag was mistakenly copied with 6.4.0 instead of 6.7.0.

This patch fixes the invalid tag.

Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58679].

Props tobiasbg.
See #61576.

#13 @dmsnell
3 weeks ago

@TobiasBg you have an incredible ability to notice details! thanks for the comment. I've fixed this in the commit above. Much appreciated! 🙇‍♂️

This ticket was mentioned in PR #6977 on WordPress/wordpress-develop by @dmsnell.


3 weeks ago
#14

Trac ticket: Core-61576

## Description

As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY.

Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses.

This ticket was mentioned in PR #6981 on WordPress/wordpress-develop by @dmsnell.


2 weeks ago
#15

Trac ticket: Core-61576

Since the HTML Processor started visiting all nodes in a document, both real and virtual, the breadcrumb accounting became a bit complicated and it's not entirely clear that it is fully reliable.

In this patch the breadcrumbs are rebuilt separately from the stack of open elements in order to eliminate the problem of the stateful stack interactions and the post-hoc event queue.

Breadcrumbs are greatly simplified as a result, and more verifiably correct, in this construction.
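
One way to picture breadcrumbs that are tracked separately from the parser's own stack is the Python sketch below. It is an assumption-laden illustration, not the patch itself: the event tuples, the function name `breadcrumbs_after_each_event`, and the choice to snapshot the breadcrumbs after each event is applied are all hypothetical.

```python
def breadcrumbs_after_each_event(events):
    """Replay ("push" | "pop", node_name) events and snapshot the breadcrumbs.

    The breadcrumbs live in their own list, so they depend only on the
    order of events and not on any other parser state.
    """
    breadcrumbs = []
    snapshots = []
    for operation, node_name in events:
        if operation == "push":
            breadcrumbs.append(node_name)
        elif operation == "pop":
            breadcrumbs.pop()
        else:
            raise ValueError(f"Unknown operation: {operation}")
        snapshots.append(tuple(breadcrumbs))
    return snapshots


events = [
    ("push", "HTML"),
    ("push", "BODY"),
    ("push", "P"),
    ("pop", "P"),
    ("push", "DIV"),
]
for snapshot in breadcrumbs_after_each_event(events):
    print(" > ".join(snapshot))
```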

This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.


2 weeks ago
#16

This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.


2 weeks ago
#17

This ticket was mentioned in PR #6984 on WordPress/wordpress-develop by @dmsnell.


2 weeks ago
#18

Trac ticket: Core-61576

Many tests from the html5lib test suite fail because of differences in
text handling between a DOM API and the HTML API, even though the
semantics of the parse are equivalent. For example, it's possible in
the HTML API to read multiple successive text nodes when the tokens
between them are ignored.

The test suite didn't account for this and so was failing tests. This
patch improves the construction of the representation to compare
against the test suite so that those tests don't fail inaccurately.
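
The idea of joining successive text nodes can be sketched as follows. This Python example is illustrative only: the `(kind, value)` tuples and the `join_text_nodes` name are assumptions for the sketch, and it operates on a flat list of sibling nodes rather than on the test suite's real tree representation.

```python
def join_text_nodes(nodes):
    """Merge runs of adjacent text nodes into a single text node.

    `nodes` is a list of (kind, value) tuples for sibling nodes, for
    example ("text", "abc") or ("element", "P").
    """
    merged = []
    for kind, value in nodes:
        if kind == "text" and merged and merged[-1][0] == "text":
            # Extend the previous text node instead of starting a new one.
            merged[-1] = ("text", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged


# Two text nodes that were separated only by an ignored token.
siblings = [("text", "Hello, "), ("text", "world"), ("element", "P"), ("text", "!")]
print(join_text_nodes(siblings))
```

After merging, the first two entries form one text node, which matches what a DOM-based expectation would contain, while the text after the element stays separate.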

This ticket was mentioned in PR #6988 on WordPress/wordpress-develop by @dmsnell.


2 weeks ago
#19

Trac ticket: Core-61576

The generate_implied_end_tags() algorithm has been comparing the current node to a list of node names, which means that it won't ever pop any elements from the stack of open elements.

This patch corrects the mistake by comparing node name against the list, thus fixing the algorithm. This was noted in development work for the 6.7 release.
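
The class of bug described here is easy to reproduce in miniature. The Python sketch below is not the WordPress code: `Node`, `IMPLIED_END_TAG_NAMES`, and both function names are hypothetical, and the tag list follows the HTML specification's "generate implied end tags" step without its "except for" variant.

```python
from dataclasses import dataclass

# Elements whose end tags may be implied, per the HTML specification.
IMPLIED_END_TAG_NAMES = (
    "DD", "DT", "LI", "OPTGROUP", "OPTION", "P", "RB", "RP", "RT", "RTC",
)


@dataclass
class Node:
    node_name: str


def generate_implied_end_tags_buggy(stack):
    # Bug: compares the node object itself against a list of names,
    # which never matches, so nothing is ever popped.
    while stack and stack[-1] in IMPLIED_END_TAG_NAMES:
        stack.pop()


def generate_implied_end_tags(stack):
    # Fix: compare the node's *name* against the list.
    while stack and stack[-1].node_name in IMPLIED_END_TAG_NAMES:
        stack.pop()


buggy = [Node("DIV"), Node("P"), Node("LI")]
generate_implied_end_tags_buggy(buggy)
print([node.node_name for node in buggy])  # ['DIV', 'P', 'LI']

fixed = [Node("DIV"), Node("P"), Node("LI")]
generate_implied_end_tags(fixed)
print([node.node_name for node in fixed])  # ['DIV']
```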

#20 @dmsnell
2 weeks ago

In 58702:

HTML API: Correct node name in generate_implied_end_tags().

The generate_implied_end_tags() algorithm has been comparing the
current node to a list of node names, which means that it won't ever
pop any elements from the stack of open elements.

This patch corrects the mistake by comparing node name against the
list, thus fixing the algorithm. This was noted in development work
for the 6.7 release.

Developed in https://github.com/WordPress/wordpress-develop/pull/6988
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell.
See #61576.

#22 @dmsnell
11 days ago

In 58712:

HTML API: Join successive text nodes in html5lib test representation.

Many tests from the html5lib test suite fail because of differences in
text handling between a DOM API and the HTML API, even though the
semantics of the parse are equivalent. For example, it's possible in
the HTML API to read multiple successive text nodes when the tokens
between them are ignored.

The test suite didn't account for this and so was failing tests. This
patch improves the construction of the representation to compare
against the test suite so that those tests don't fail inaccurately.

Developed in https://github.com/WordPress/wordpress-develop/pull/6984
Discussed in https://core.trac.wordpress.org/ticket/61576

Props bernhard-reiter, dmsnell, jonsurrell.
See #61576.

#24 @dmsnell
11 days ago

In 58713:

HTML API: Simplify breadcrumb accounting.

Since the HTML Processor started visiting all nodes in a document, both
real and virtual, the breadcrumb accounting became a bit complicated
and it's not entirely clear that it is fully reliable.

In this patch the breadcrumbs are rebuilt separately from the stack of
open elements in order to eliminate the problem of the stateful stack
interactions and the post-hoc event queue.

Breadcrumbs are greatly simplified as a result, and more verifiably
correct, in this construction.

Developed in https://github.com/WordPress/wordpress-develop/pull/6981
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58590].

Props bernhard-reiter, dmsnell.
See #61576.

#27 @hellofromTonya
7 days ago

In 58733:

HTML API: Fix "${var} in strings" deprecation notice in html5lib test.

Changeset [58712] introduced the following compile time PHP deprecation notice on >= PHP 8.2 test runs:

Deprecated: Using ${var} in strings is deprecated, use {$var} instead in /var/www/tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php on line 257
PHPUnit 9.6.20 by Sebastian Bergmann and contributors.

The ${ syntax for string interpolation was deprecated in PHP 8.2 and should not be used anymore.

Ref: https://wiki.php.net/rfc/deprecate_dollar_brace_string_interpolation

Follow-up to [58712].

Props jrf.
See #61530, #59654, #61576.

#28 @dmsnell
7 days ago

Thanks @hellofromTonya - this was a mistake on my part; never intended to add that in there.

@jonsurrell commented on PR #6040:


7 days ago
#29

@dmsnell This is ready for review.

This ticket was mentioned in PR #7041 on WordPress/wordpress-develop by @jonsurrell.


7 days ago
#30

Trac ticket: https://core.trac.wordpress.org/ticket/61576

This builds on TABLE support from https://github.com/WordPress/wordpress-develop/pull/6040 (merged here).

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests, compared with https://github.com/WordPress/wordpress-develop/pull/6040):

-Tests: 614, Assertions: 217, Skipped: 397.
+Tests: 614, Assertions: 223, Skipped: 391.

This ticket was mentioned in PR #7043 on WordPress/wordpress-develop by @jonsurrell.


7 days ago
#32

Insertion modes may include instructions like "process the token in
another insertion mode." This means that the step_in_X method may be
called to process in the insertion mode _without_ changing the
state of the insertion mode.

This can result in unsupported errors that are incorrect.

The bail messages for each step_in_ method should explicitly
mention its insertion mode to ensure the error messages are
correct.

Trac ticket: https://core.trac.wordpress.org/ticket/61576
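
To make the problem concrete, here is a small Python sketch of how a bail message derived from parser state can point at the wrong place. Everything in it is hypothetical (`SketchParser`, `UnsupportedMarkup`, the message wording, and the `hard_coded_messages` switch); it only mirrors the situation described above, where one insertion mode borrows another mode's rules without changing the stored insertion mode.

```python
class UnsupportedMarkup(Exception):
    """Hypothetical error raised when the sketch parser has to bail."""


class SketchParser:
    def __init__(self, hard_coded_messages: bool) -> None:
        self.insertion_mode = "IN BODY"
        self.hard_coded_messages = hard_coded_messages

    def step_in_body(self, token: str) -> None:
        # "Process the token using the rules for the IN HEAD insertion mode":
        # the rules are borrowed, but self.insertion_mode stays untouched.
        self.step_in_head(token)

    def step_in_head(self, token: str) -> None:
        # Pretend IN HEAD is not supported yet, so the parser must bail.
        mode = "IN HEAD" if self.hard_coded_messages else self.insertion_mode
        raise UnsupportedMarkup(f"No support for parsing in the {mode} insertion mode.")


def bail_message(hard_coded_messages: bool) -> str:
    parser = SketchParser(hard_coded_messages)
    try:
        parser.step_in_body("TEMPLATE")
    except UnsupportedMarkup as error:
        return str(error)
    return ""


print(bail_message(False))  # blames IN BODY, although the bail happened in IN HEAD
print(bail_message(True))   # names IN HEAD, where support is actually missing
```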

@dmsnell commented on PR #6972:


2 days ago
#35

@westonruter @sirreal I think all of the feedback has been addressed. additionally:

  • get_modifiable_text() in the Tag Processor has been updated to properly handle leading newlines and NULL bytes.
  • an additional test asserts this behavior for get_modifiable_text().
  • I've pulled in clear_up_to_last_marker() from #6040 (thanks @sirreal).
  • Fixed some mistakes/oversights after additional review.

I might merge this tomorrow, but I'd appreciate any reviews _in post_. This is so exciting: we're there - at the end of IN BODY.

@dmsnell commented on PR #6972:


21 hours ago
#36

Thanks for the work on this.

There is a remaining issue where a newline is removed from a text node that is only a newline, resulting in a text node with no text. The text node should not be visited at all in this case:

<pre>&#x0A;</pre>

This PR is big enough and in a good place, I don't mind landing now and handling that known issue in a follow-up.

Thanks @sirreal. I'm puzzled on this one. While there's no child node in the DOM, I find it strange that anyone would consider there to be no text node there. Perhaps I would feel differently about the actual newline byte 0x0A, but intentionally encoding it makes it seem that from a syntactic perspective, someone wanted it to be there.

Let's continue to explore this in the follow-up.

#37 @dmsnell
19 hours ago

In 58779:

HTML API: Add missing tags in IN BODY insertion mode to HTML Processor.

As part of work to add more spec support to the HTML API, this patch adds
support for the remaining missing tags in the IN BODY insertion mode. Not
all of the added tags are supported, because in some cases they reset the
insertion mode and are reprocessed where they will be rejected.

This patch also improves the support of get_modifiable_text(), removing
a leading newline inside a LISTING, PRE, or TEXTAREA element.

Developed in https://github.com/WordPress/wordpress-develop/pull/6972
Discussed in https://core.trac.wordpress.org/ticket/61576

Props dmsnell, jonsurrell, westonruter.
See #61576.
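
The leading-newline behavior mentioned in this commit can be sketched in Python as below. This is a simplified illustration under stated assumptions: the function name `strip_leading_newline` is hypothetical, only a single `\n` is handled, and the NULL-byte handling mentioned in the PR discussion is left out.

```python
# Elements whose first newline is dropped, as listed in the commit message.
LEADING_NEWLINE_ELEMENTS = frozenset({"LISTING", "PRE", "TEXTAREA"})


def strip_leading_newline(tag_name: str, text: str) -> str:
    """Drop one leading newline for LISTING, PRE, and TEXTAREA content."""
    if tag_name.upper() in LEADING_NEWLINE_ELEMENTS and text.startswith("\n"):
        return text[1:]
    return text


print(repr(strip_leading_newline("PRE", "\nline one\n")))  # 'line one\n'
print(repr(strip_leading_newline("DIV", "\nline one\n")))  # '\nline one\n'
print(repr(strip_leading_newline("PRE", "\n")))            # ''
```

The last call shows the edge case raised in the PR comments: text that is only a newline becomes an empty string, which is the open question left for a follow-up.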

#38 @dmsnell
18 hours ago

In 58780:

HTML API: Remove empty test file after adding support for missing elements.

When support was added for the remaining tags in the IN BODY insertion mode, a test
file indicating that support was necessary for certain parts of the parser was
removed, but it wasn't removed from SVN when sending over the patch from git.

This patch removes that empty file so that the WPCS workflows pass.

Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58779].

See #61576.

#40 @dmsnell
18 hours ago

In 58781:

HTML API: Fix unsupported insertion mode messages.

Insertion modes in an HTML parser may include instructions like "process
the token in the IN HEAD insertion mode." The rules do not change the
insertion mode of the parser, but the errors are triggered outside of the
rules for the current insertion mode. These will be misleading when
bailing on these instructions, because it will point someone to the wrong
place in the code to find the source of the error.

In this patch all of the bail-points due to lacking insertion mode support
are hard-coded to better orient someone to the section of the code lacking
support for handling the input HTML.

Developed in https://github.com/wordpress/wordpress-develop/pull/7043
Discussed in https://core.trac.wordpress.org/ticket/61576

Follow-up to [58679].

Props: dmsnell, jonsurrell.
See #61576.

@dmsnell commented on PR #6040:


17 hours ago
#42

@sirreal I've merged trunk into this branch, and resolved conflicts. Please review and ensure I didn't accidentally break your changes.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


17 hours ago
#43

Trac ticket: Core-61576

Add support for table elements:

  • TABLE
  • THEAD, TBODY, TFOOT
  • TR
  • TD, TH
  • COL, COLGROUP
  • CAPTION

Not all of the necessary insertion modes are implemented in this PR; for example, the "in caption" and "in select in table" modes are still missing.

HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests):

-Tests: 610, Assertions: 275, Skipped: 335.
+Tests: 610, Assertions: 301, Skipped: 309.

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


8 hours ago
#44

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


8 hours ago
#45

This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.


8 hours ago
#46
