Opened 3 months ago
Closed 2 weeks ago
#61576 closed enhancement (fixed)
HTML API: Improved spec support in 6.7
Reported by: | dmsnell | Owned by: | |
---|---|---|---|
Milestone: | 6.7 | Priority: | normal |
Severity: | normal | Version: | 6.6 |
Component: | HTML API | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
The HTML Processor, introduced in #58517, remains a work in progress until it fully supports the HTML5 specification, or until all that will be supported is (there are a few corners of the specification that don't obviously fit well into the paradigm, specifically foster parenting and some parts of the adoption agency algorithm).
However, during WordPress 6.7's development cycle it is hoped to rapidly add whatever remaining tag and algorithm support as is able, having the refactor during 6.6 (visiting all nodes, real and virtual, from #61348) which unclocks most of the remaining rules from a design standpoint.
This is a tracking ticket for that work during this release cycle.
Notable Support
- SVG and MathML, foreign content: wordpress-develop#6006
- TABLE and table-related elements: wordpress-develpo#6040
- SELECT, OPTION, OPTGROUP: wordpress-develop#5908
- FORM and form-related elements.
- Deprecated elements.
- Page-structure elements like HEAD, HTML, IFRAME, META.
- TEMPLATE
- PLAINTEXT element: wordpress-develop#5905
- Reset the insertion mode appropriately wordpress-develop#6020
- Active format reconstruction wordpress-develop#13
- Adoption agency algorithm wordpress-develop#14
Change History (163)
This ticket was mentioned in PR #6968 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#1
- Keywords has-patch added
This ticket was mentioned in PR #5908 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#4
- Keywords has-unit-tests added
Trac ticket: Core-61576
A start tag whose tag name is "select"
Reconstruct the active formatting elements, if any.
Insert an HTML element for the token.
Set the frameset-ok flag to "not ok".
If the insertion mode is one of "in table", "in caption", "in table body", "in row", or "in cell", then switch the insertion mode to "in select in table". Otherwise, switch the insertion mode to "in select".
Skip fewer of the html5lib-tests:
OK, but incomplete, skipped, or risky tests! -Tests: 607, Assertions: 174, Skipped: 433. +Tests: 607, Assertions: 184, Skipped: 423.
This ticket was mentioned in PR #6972 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#7
Trac ticket: Core-61576
## Summary
- [ ] There are some failing tests that need to pass.
- [ ] This needs a _full_ and careful audit.
## Description
As part of work to add more spec support to the HTML API, this patch adds support for the remaining missing tags in the IN BODY insertion mode. Not all of the added tags are supported, because in some cases they reset the insertion mode and are reprocessed where they will be rejected.
html5lib tests
- Tests: 607, Assertions: 174, Skipped: 433. + Tests: 607, Assertions: 224, Errors: 2, Failures: 17, Skipped: 381.
This ticket was mentioned in PR #6973 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#8
Trac ticket: Core-61576
## Description
As part of work to add more spec support to the HTML API, this patch adds stubs for all of the remaining parser insertion modes. These modes are not all supported, but they will be necessary to continue adding support for other tags and markup.
#11
@
3 months ago
@dmsnell: There's a single @since 6.4.0
in [58679] that probably also should be a @since 6.7.0
, no?
#13
@
3 months ago
@TobiasBg you have an incredible ability to notice details! thanks for the comment. I've fixed this in the commit above. Much appreciated! 🙇♂️
This ticket was mentioned in PR #6977 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#14
Trac ticket: Core-61576
## Description
As part of work to add more spec support to the HTML API, this patch adds support for the insertion modes from the initial start of a full document parse until IN BODY.
Modes after IN BODY are left to future work, but this change opens up the ability to start performing full document parses.
This ticket was mentioned in PR #6981 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#15
Trac ticket: Core-61576
Since the HTML Processor started visiting all nodes in a document, both real and virtual, the breadcrumb accounting became a bit complicated and it's not entirely clear that it is fully reliable.
In this patch the breadcrumbs are rebuilt separately from the stack of open elements in order to eliminate the problem of the stateful stack interactions and the post-hoc event queue.
Breadcrumbs are greatly simplified as a result, and more verifiably correct, in this construction.
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#16
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#17
This ticket was mentioned in PR #6984 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#18
Trac ticket: Core-61576
Many tests from the html5lib test suite fail because of differences in
text handling between a DOM API and the HTML API, even though the
semantics of the parse are equivalent. For example, it's possible in
the HTML API to read multiple successive text nodes when the tokens
between them are ignored.
The test suite didn't account for this and so was failing tests. This
patch improves the construction of the representation to compare
against the test suite so that those tests don't fail inaccurately.
This ticket was mentioned in PR #6988 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#19
Trac ticket: Core-61576
The generate_implied_end_tags()
algorithm has been comparing the current node to a list of node names, which means that it won't ever pop any elements from the stack of open elements.
This patch corrects the mistake by comparing node name against the list, thus fixing the algorithm. This was noted in development work for the 6.7 release.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#26
Trac ticket: https://core.trac.wordpress.org/ticket/61576
#28
@
3 months ago
Thanks @hellofromTonya - this was a mistake on my part; never intended to add that in there.
@jonsurrell commented on PR #6040:
3 months ago
#29
@dmsnell This is ready for review.
This ticket was mentioned in PR #7041 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#30
Trac ticket: https://core.trac.wordpress.org/ticket/61576
This builds on TABLE support from https://github.com/WordPress/wordpress-develop/pull/6040 (merged here).
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
, (compared with https://github.com/WordPress/wordpress-develop/pull/6040):
-Tests: 614, Assertions: 217, Skipped: 397. +Tests: 614, Assertions: 223, Skipped: 391.
Trac ticket:
This ticket was mentioned in PR #7042 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#31
This ticket was mentioned in PR #7043 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#32
Insertion modes may include instructions like "process the token in
another insertion mode." This means that the step_in_X method may be
called to process in the insertion mode _without_ changing the
state of the insertion mode.
This can result in unsupported errors that are incorrect.
The bail messages for each step_in_ method should explicitly
mention its insertion mode to ensure the error messages are
correct.
Trac ticket: https://core.trac.wordpress.org/ticket/61576
This ticket was mentioned in PR #7044 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#33
This builds on TABLE support from https://github.com/WordPress/wordpress-develop/pull/6040 (merged here).
Trac ticket: https://core.trac.wordpress.org/ticket/61576
This ticket was mentioned in PR #7051 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#34
Trac ticket: https://core.trac.wordpress.org/ticket/61576
3 months ago
#35
@westonruter @sirreal I think all of the feedback has been addressed. additionally:
get_modifiable_text()
in the Tag Processor has been updated to properly handle leading newlines and NULL bytes.- an additional test asserts this behavior for
get_modifiable_text()
. - I've pulled in
clear_up_to_last_marker()
from #6040 (thanks @sirreal). - Fixed some mistakes/oversights after additional review.
I might merge this tomorrow, but I'd appreciate any reviews _in post_. This is so exciting: we're there - at the end of IN BODY.
3 months ago
#36
Thanks for the work on this.
This is a remaining issue where a newline is removed from a text node that is only a newline, resulting in a text node with no text. The text node should not be visited at all in this case:
<pre>
</pre>This PR is big enough and in a good place, I don't mind landing now and handling that known issue in a follow-up.
Thanks @sirreal. I'm puzzled on this one. While there's no child node in the DOM, I find it strange that anyone would consider there to be no text node there. Perhaps I would feel differently about the actual newline byte 0x0A, but intentionally encoding it makes it seem that from a syntactic perspective, someone wanted it to be there.
Let's continue to explore this in the follow-up.
3 months ago
#42
@sirreal I've merged trunk
into this branch, and resolved conflicts. Please review and ensure I didn't accidentally break your changes.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#43
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#44
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#45
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#46
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#47
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#48
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#49
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#50
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#51
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
@jonsurrell commented on PR #6040:
3 months ago
#52
There's an error in the new form test with the change to auto-close the form where the form closer is not visited by next_token
. It's fixed with this patch, although I haven't investigated why:
-
src/wp-includes/html-api/class-wp-html-processor.php
diff --git before/src/wp-includes/html-api/class-wp-html-processor.php after/src/wp-includes/html-api/class-wp-html-processor.php index 1e6a18389b..0d1c7c1496 100644
old new class WP_HTML_Processor extends WP_HTML_Tag_Processor { 2356 2356 } 2357 2357 2358 2358 // This FORM is special because it immediately closes and cannot have other children. 2359 $this->state->current_token->has_self_closing_flag = true;2360 2361 2359 $this->insert_html_element( $this->state->current_token ); 2362 2360 $this->state->form_element = $this->state->current_token; 2361 $this->state->stack_of_open_elements->pop(); 2363 2362 return true; 2364 2363 }
I'll push that change and tests should be passing again. It's effectively reverting part of your updates so happy to discuss further.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#53
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
@jonsurrell commented on PR #6040:
3 months ago
#54
@dmsnell The changes you've pushed look good, thanks for catching a couple of bugs and I do like the method for creating virtual nodes.
The last remaining thing is to sort out the FORM element in table insertion mode that's immediately popped.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#55
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
3 months ago
#56
I'll push that change and tests should be passing again. It's effectively reverting part of your updates so happy to discuss further.
I was considering this because I wanted the code to function as the void tags function, but I can't figure out what's wrong. Along the way I found some bugs in next_token()
I introduced when fixing breadcrumbs, but the fix is incoming and easy. We don't close out the remaining open stack elements at the end of the document _because they never get popped_.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#57
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #6040 on WordPress/wordpress-develop by @jonsurrell.
3 months ago
#60
Trac ticket: Core-61576
Add support for table elements:
- TABLE
- THEAD, TBODY, TFOOT
- TR
- TD, TH
- COL, COLGROUP
- CAPTION
Not all necessary insertion modes are implemented in this PR, so e.g. "in caption" and "in select in table" modes are not implemented in this PR.
HTML5-lib test change (./vendor/bin/phpunit --group html-api-html5lib-tests
):
-Tests: 610, Assertions: 275, Skipped: 335. +Tests: 610, Assertions: 301, Skipped: 309.
This ticket was mentioned in PR #7085 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#61
When the model of breadcrumb generation in the HTML Processor and node traversal was simplified, the change introduced a bug whereby unclosed nodes at the end of a document would remain unvisited and unclosed.
In this patch, a fix is applied to ensure that all open elements close while traversing a document. A couple of minor documentation typos are fixed in the patch as well.
Follow-up to [58713].
See #61576.
3 months ago
#62
It would be helpful to add test coverage for this case. Something simple and similar to the existing tests here:
3 months ago
#63
It would be helpful to add test coverage for this case. Something simple and similar to the existing tests here:
Your wish is my command!
This ticket was mentioned in PR #6006 on WordPress/wordpress-develop by @dmsnell.
3 months ago
#64
Trac ticket: Core-61576
## Status
I'm nervous sharing this because it's full of flaws, needs extensive review, and documented test cases.
### Earlier versions
- First draft (ba2c8896f08ee36d3c61a500a86422b736a00923)
- By implementing the rules at the top of Tree Construction/Dispatch, we don't need
$skip_next_foreign_content_processing
, because we pop nodes off of the stack of open elements until we're back at anhtml
element, or an HTML integration point, or a MathML integration point. Returning tostep()
will choose _insertion mode_ instead of _foreign content_ and prevent an infinite loop.
- By implementing the rules at the top of Tree Construction/Dispatch, we don't need
### Notes
- There's an interplay we have to figure out with casing of tag names. For all of our comparison functions we check the upper-case variants, but there's value in reporting the foreign content tag names as lower-cased or as mixed case.
- There are some SVG tag names that are supposed to report in mixed case. For both SVG and MathML there are attributes which are supposed to report in mixed case. Unlike my first assumption, however, they still all parse with ASCII case insensitivity. That is,
<circle clippath=1 clipPath=2 CLIPPATH=3 cliPPath=4>
has a single attribute namedclipPath
and whose value is1
. This is convenient because we don't have to change the Tag Processor, but inconvenient when things are mostly based on lower-case attribute names. - We probably need to repeal the idea that tag names are upper-case and add another bit to communicate tag name vs. doctype declaration. Maybe? Can we test that
math:mi
ormath:MI
is in themath
namespace? There can be no HTML tag namedmath:mi
(it would beMATH:MI
). - Could be handy to have something like
base_class_get_tag()
orprivate function comparison_tag_name()
etc… to report an upper-case tag name, while preserving the case-variants required in foreign content to outside calls. - Tracking the namespace seems a bit fishy. When we're in the Tag Processor we can know that if we encounter an SVG or MATH tag when we're in the
html
namespace, that it should change the namespace tosvg
ormath
, respectively, and lower-case the tag names. However, the role of integration points and parsing things in the insertion mode is still vague. - Foreign content _is not_ an insertion mode. This is important when handling HTML tags inside HTML integration points. This is _very important_ for
get_modifiable_text()
which doesn't know if a text node inside foreign content is being processed _as foreign content_ or _in the insertion mode_, where it determines if NULL bytes should be replaced or removed. - It's probably necessary to start over again with these insights and more in mind.
## Description
We should reliable detect foreign content and we need to do it in the Tag Processor, specifically because of the rules for CDATA sections. The HTML Processor needs this as well to determine if things like self-closing flags for HTML elements should be respected.
Unlocks:
- proper detection of CDATA
- parsing of SVG and MathML
- indicating if one should expect a closing tag or not
2 months ago
#70
@sirreal @westonruter I've pushed some updates that incorporate trunk
and a few other fixes. notably, I had the wrong code for adjusted_current_node()
.
@jonsurrell commented on PR #6006:
2 months ago
#71
This is very tricky. I'll share some cases that are problematic right now:
### <svg><a/>Text after
svg:a self closing tag.
Errors with Cannot run adoption agency when "any other end tag" is required.
### <svg></svg><![CDATA[ bogus comment ]]>
### <svg><foreignObject><![CDATA[ bogus comment ]]>
### <math><mi><div></div></mi><mi>
@jonsurrell commented on PR #6981:
2 months ago
#72
I believe this fixed https://core.trac.wordpress.org/ticket/61560.
@jonsurrell commented on PR #7041:
2 months ago
#75
I'd be fine with splitting that up.
@jonsurrell commented on PR #6006:
2 months ago
#80
I've made some progress on this fixing some of the issues mentioned above: https://github.com/WordPress/wordpress-develop/pull/6006#issuecomment-2262585496
Right now some self-closing tags in foreign content break. HTML like <svg><a/>
seems to enter step_in_body, runs the adoption agency algorithm, and fails:
WP_HTML_Unsupported_Exception: Cannot run adoption agency when "any other end tag" is required. in /var/www/html/wp-includes/html-api/class-wp-html-processor.php:450 Stack trace: #0 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(5145): WP_HTML_Processor->bail('Cannot run adop...') #1 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(2423): WP_HTML_Processor->run_adoption_agency_algorithm() #2 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(4108): WP_HTML_Processor->step_in_body() #3 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(889): WP_HTML_Processor->step_in_foreign_content() #4 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(635): WP_HTML_Processor->step()}}}
This ticket was mentioned in PR #7141 on WordPress/wordpress-develop by @dmsnell.
2 months ago
#81
Trac-ticket: Core-61576.
Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the <body>
context. In this patch, any context node is allowed.
2 months ago
#82
@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:
- we have to enter within a certain tokenizer state, but that's not how our parser is designed.
- the DOM is able to store things like the text
</textarea>
inside aTEXTAREA
node, but HTML is not able to do this. we have to determine what to do in these cases.
This ticket was mentioned in PR #7157 on WordPress/wordpress-develop by @dmsnell.
2 months ago
#83
Trac ticket: Core-61576
As part of work to add more spec support to the HTML API, this patch adds support for the relevant foreign elements in the HTML algorithms within the stack of open elements. Although the HTML Processor cannot yet step into these elements, the format of how they will be represented was determined in the related work from which this patch is extracted.
This patch extracted from https://github.com/wordpress/wordpress-develop/pull/6006.
Developed in https://github.com/wordpress/wordpress-develop/pull/ Discussed in https://core.trac.wordpress.org/ticket/61576
Props: dmsnell, jonsurrell.
See #61576.
This ticket was mentioned in PR #7158 on WordPress/wordpress-develop by @dmsnell.
2 months ago
#84
Trac ticket: Core-61576
As part of work to add more spec support to the HTML API, this patch removes an insertion mode that was added but shouldn't have been. The rules for parsing foreign content are real, but they don't change the insertion mode for the parser - the interactions are more complex than this.
Follow-up to [58679].
Props dmsnell, jonsurrell.
See #61576.
2 months ago
#85
Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.
2 months ago
#86
Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.
2 months ago
#87
Instead of extracting this work, it's going in all at once. Extracting didn't turn out to be as clean as I had hoped it would be.
This ticket was mentioned in PR #7162 on WordPress/wordpress-develop by @jonsurrell.
2 months ago
#91
Trac ticket: https://core.trac.wordpress.org/ticket/61576
Follow-up to: [58868] (GitHub https://github.com/WordPress/wordpress-develop/pull/6006).
This ticket was mentioned in PR #7164 on WordPress/wordpress-develop by @jonsurrell.
2 months ago
#92
Trac ticket: https://core.trac.wordpress.org/ticket/61576
Follow-up to: [58868] (GitHub https://github.com/WordPress/wordpress-develop/pull/6006).
- Add a unit test for a bug that was discovered in manual testing and fixed in [58868].
- Modify behavior to follow-spec (resolves todo).
The spec states the following:
If the token has its self-closing flag set, then run the appropriate steps from the following list:
If the token's tag name is "script", and the new current node is in the SVG namespace
Acknowledge the token's self-closing flag, and then act as described in the steps for a "script" end tag below.
An end tag whose tag name is "script", if the current node is an SVG script element
…
Any other end tag
…
I believe the original implementation was incorrect and the intention is to move into the An end tag whose tag name is "script", if the current node is an SVG script element instructions and not the Any other end tag. Ultimately, the result was the same.
This also adds a missing return true
from the script end tag block.
This ticket was mentioned in PR #7165 on WordPress/wordpress-develop by @dmsnell.
2 months ago
#97
Trac ticket: Core-61576.
As part of work to add more spec support to the HTML API, this patch
adds support for the FRAMESET-related insertion modes, as well as the
set of missing _after_ insertion modes. These modes run at the end of
parsing a document, closing it and taking care of any lingering tags.
Developed in https://github.com/wordpress/wordpress-develop
Discussed in https://core.trac.wordpress.org/ticket/61576
See #61576.
2 months ago
#98
@sirreal there are failures in the tests that look like BODY isn't closing. If you have insight, I'd appreciate it.
@jonsurrell commented on PR #7165:
2 months ago
#99
I explored some ideas for dealing with those comments in https://github.com/dmsnell/wordpress-develop/pull/18.
For now, I'm going to push a change to this PR to switch the bail conditions in after body. It seems better to bail on comments outside of BODY
and HTML
nodes (we might even be able to skip this) and support a </body>
tag in the middle of the document that has almost no impact if text or tags follow it.
2 months ago
#100
I'm not sure how best to handle this 😕 Reviewing the spec, I believe that only comments are inserted outside of the BODY and HTML nodes. I wonder if we could collect these in-place comments in a couple of lists then iterate through them then the document is finished.
This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and next_token()
would pull them out when it sees the POP event for BODY and then for HTML.
From a performance standpoint, I don't see this being that troublesome, especially if all we stored in those lists were the WP_HTML_Token
or bookmark name.
@jonsurrell commented on PR #7141:
2 months ago
#101
The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂
I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.
The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, input, a string to parse, and an optional boolean allowDeclarativeShadowRoots (default false). It returns a list of zero or more nodes.
@jonsurrell commented on PR #7165:
2 months ago
#102
I wonder if we could collect these in-place comments in a couple of lists then iterate through them then the document is finished.
This seems like a plausible idea, @sirreal. Presumably we could reuse the event queue, such that we track these comments in those lists, and
next_token()
would pull them out when it sees the POP event for BODY and then for HTML.
I implemented that in this PR in a1304b5b9b77f93294fc832a5a102e8dcea40c3c. I reverted the change. I had issues getting the text for the comments outside of body because get_modifiable_text
is a method on the Tag Processor and it expects to operate on the token it's parsing. It may be enough to seek to all the comments to get their text, I did not try that.
This ticket was mentioned in PR #7195 on WordPress/wordpress-develop by @jonsurrell.
2 months ago
#103
- Add complete parsing of DOCTYPE tokens.
- Update full parser document mode (quirks mode) according to doctype tokens as described in the specification.
This change adds a new class to handle DOCTYPE token information according to the specification.
The Tag Processor handles parsing DOCTYPE Tokens in greater detail and on-demand in order to create instances of WP_HTML_Doctype_Info
. This is useful because DOCTYPE tokens may appear in many places in HTML but are ignores in most situations. This allows the detailed parsing of DOCTYPE tokens to be handled on-demand when a DOCTYPE token is reached under the appropriate circumstances.
The WP_HTML_Doctype_Info
class exposes a method to that handles the complex rules for determining quirks mode which involve inspecting the DOCTYPE token name, public identifier, system identifier, and force_quirks_flag.
Trac ticket: https://core.trac.wordpress.org/ticket/61576
---
<details><summary>Survey of existing DOCTYPE declarations</summary>
Download the DOCTYPE report and cat report-doctypes.txt
to see the color output.
Here is a preview:
</details>
8 weeks ago
#104
@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially).
One case I was specifically thinking through is a <script>
context. The idea is simple: parse the inner HTML of a SCRIPT element. The problem is when </script>
appears and there is more content. In the DOM you can actually create a text node containing </script>
but this is not possible in the HTML. The spec even notes situations like this where the DOM can create trees that cannot be expressed in HTML.
So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for _that_ function to call out something special for elements like SCRIPT
so that it falls into the equivalent of ->skip_script_data()
.
Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable.
However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the svg
or math
namespace.
@jonsurrell commented on PR #7141:
8 weeks ago
#105
I'll share some context from a conversation.
---
As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags and better interfaces can be provided.
---
The specification for fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context. Rarely are full HTML documents considered. In order to continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, <body>
provides a suitable context, but there could be other cases like <head>
. Again, this seems relatively straightforward to support. However, it gets more complicated with things like <svg>
as a context element, or especially <rect>
, or even trickier a script tag in the SVG namespace!
The context element will be used to establish the correct insertion mode _and_ the context element's namespace and possibly attributes determine how tokens should be handled (using rules for foreign content or one of the HTML insertion modes).
It may make sense to maintain this interface as-is. <body>
is likely a good context for most use-cases but in order to properly support fragments some advanced interface likely needs to support:
- Passing a context element along with its tag name, namespace, and attributes.
- The context element's document determines the quirks mode of the fragment parser.
@jonsurrell commented on PR #7195:
7 weeks ago
#106
Thank you! I've reviewed your changes and I'm happy with them.
This ticket was mentioned in PR #7232 on WordPress/wordpress-develop by @jonsurrell.
7 weeks ago
#107
This ticket was mentioned in PR #7230 on WordPress/wordpress-develop by @jonsurrell.
7 weeks ago
#108
@jonsurrell commented on PR #7141:
7 weeks ago
#113
in order to properly support fragments some advanced interface likely needs to support…
I had a thought that may be helpful.
I'd like to add an instance method to the HTML process like create_fragment_parser_at_node( string $html_fragment )
. This method would return a new HTML processor, created as a fragment parser with the context element created from the current tag opener (or null). This would allow use to set the document compat type and the context element and its attributes correctly according to the specification.
Once we have this method, advanced fragment parser creation could be handled by that method:
$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( `<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …
I think internally this method would be used for things like set_inner_html that require a fragment parser.
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
7 weeks ago
#116
7 weeks ago
#117
@sirreal mind sharing any thoughts on this? I don't think it adds much other than more descriptive bailing messages, but it does structure the algorithm more and include more comments. my thinking is that this might be useful to guide us when we spend more focus on trying to handle these situations.
@jonsurrell commented on PR #7230:
7 weeks ago
#118
A comment about html5lib-tests and this PR. There are currently 2 test failures on trunk. I believe a latent whitespace bug was exposed in #7232.
When removing all the related whitespace html5lib-test skips, there are a total of 8 failures on trunk. All of them are fixed by this PR:
- FAILURES! - Tests: 1500, Assertions: 1082, Failures: 8, Skipped: 418. + OK, but incomplete, skipped, or risky tests! + Tests: 1500, Assertions: 1076, Skipped: 424.
@jonsurrell commented on PR #7230:
7 weeks ago
#119
A comment about html5lib-tests and this PR. There are currently 2 test failures on trunk. I believe a latent whitespace bug was exposed in #7232.
When removing all the related whitespace html5lib-test skips, there are a total of 8 failures on trunk. All of them are fixed by this PR:
- FAILURES! - Tests: 1500, Assertions: 1082, Failures: 8, Skipped: 418. + OK, but incomplete, skipped, or risky tests! + Tests: 1500, Assertions: 1076, Skipped: 424.
@jonsurrell commented on PR #6982:
7 weeks ago
#120
I think the HTML5lib tests will help here. I'll try to make some progress on this.
FAILURES! -Tests: 1494, Assertions: 1076, Failures: 2, Skipped: 418. +Tests: 1494, Assertions: 1103, Failures: 11, Skipped: 391.
@jonsurrell commented on PR #6982:
7 weeks ago
#121
I'm exploring the Noah's Ark work to limit equivalent elements on the active formatting element stack in https://github.com/dmsnell/wordpress-develop/pull/19.
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
7 weeks ago
#122
@jonsurrell commented on PR #6983:
6 weeks ago
#123
This is difficult to interpret and interacts with active format reconstruction in interesting ways. I think my preference would be to work on that first (#6982) then come back to this.
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#124
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#125
6 weeks ago
#126
Note from conversation: can we simply bail if we have more than three of the same kind of tag name on the list of active formatting elements (up to the nearest marker) and then defer any format Noah's Ark processing?
this could potentially let us process the most common form of active format reconstruction without implementing the complicated and costly parts of the algorithm.
6 weeks ago
#127
Need to report attributes even on virtual reconstructed attributes. Prior to now we haven't actually recreated any formatting elements, but once we start doing so, we have to ensure we don't report that attributes are missing
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#130
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#131
@jonsurrell commented on PR #7230:
6 weeks ago
#132
After reviewing this and text subdivision, my preference is to go the other direction. I've pushed a change to stop running subdivision on CDATA and let this change handle CDATA on its own.
This ticket was mentioned in PR #7286 on WordPress/wordpress-develop by @jonsurrell.
6 weeks ago
#133
@jonsurrell commented on PR #7286:
6 weeks ago
#134
I've collected other potential "fixes" to the same problem, but I haven't been able to verify incorrect behaviors yet: https://github.com/sirreal/wordpress-develop/pull/7
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#139
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#140
This ticket was mentioned in PR #6983 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#141
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago
#142
This ticket was mentioned in PR #6982 on WordPress/wordpress-develop by @dmsnell.
5 weeks ago
#143
This ticket was mentioned in PR #7312 on WordPress/wordpress-develop by @jonsurrell.
5 weeks ago
#144
Improve support for content appearing after BODY, HTML nodes where
possible. Some HTML may be representable by the HTML API without
introducing ordering problems.
When HTML moves to the "after-{body,html,…}" insertion modes, BODY and
HTML remain on the stack of open elements. Certain content is inserted
in place (comments), some content is inserted in BODY (white space
text), and most other content will switch back to the "in body"
insertion mode.
The difficulty with the HTML API is that the result in the tree may be
out-of-order compared to the positioning of the corresponding tokens in
the HTML input text. For example:
</body>FOO
Produces this tree where the ordering is the same even though a BODY
closer is reached:
[DOCUMENT]
└HTML
├HEAD
├BODY
│ └#text: FOO
└#comment: comment
Compare with this case:
</body>FOO
The tree is identical, the FOO text is _out of order_ in the tree
compared to the HTML string. This case continues to be invalid because
when the FOO text needs to be reached, the parser would have already
reached a BODY tag closer and content outside of the body:
[DOCUMENT]
└HTML
├HEAD
├BODY
│ └#text: FOO
└#comment: comment
This approach deviates from the explicit steps in the spec and will pop
the BODY and HTML tags from the stack of open elements when content
should appear outside of them. Flags are set for each "after-" mode to
prevent returning to BODY and then continuing to produce content
outside, which would produce the unsupported out-of-order behavior.
See https://html.spec.whatwg.org/#parsing-main-afterbody.
Trac ticket: Core-61576
5 weeks ago
#145
I'm wondering if here we could lean on some of the work I've explored for XML well-formedness error-recovery.
Imagine we start with "proper HTML parsing in tree-order node traversal" but you could opt-in to certain errors.
$processor->parser_tolerance( WP_HTML_Processor::VISIT_NODES_OUT_OF_ORDER, true );
$processor->parser_tolerance( WP_HTML_Processor::OVERLOOK_FOSTERED_NODES, true );
I'm concerned with patches like this because I think there's a real semantic challenge when thinking about operations like set_inner_html()
. For a chunk of HTML where the content after BODY itnerleaves IN BODY and AFTER BODY content, can we provide a rational answer to the question: how should the document change when setting inner HTML of the BODY?
This ticket was mentioned in PR #7330 on WordPress/wordpress-develop by @dmsnell.
5 weeks ago
#146
The HTML specification indicates that an HTML tag with the name "IMAGE" should be renamed as "IMG" and handled as if it were an "IMG", but this only applies to elements in the HTML namespace.
In this patch the HTML Processor is updated to ensure that it doesn't remap the tag name when processing foreign content, such as SVG and MathML markup.
See #61576.
This ticket was mentioned in PR #7332 on WordPress/wordpress-develop by @dmsnell.
5 weeks ago
#149
Trac ticket: Core-61576.
Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.
This patch changes the behavior accordingly. No tests are impacted by this change.
Follow-up to [58867].
5 weeks ago
#150
@sirreal I don't feel strongly about this, but I do think that if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break. I've expanded the docblock with a comparison to get_tag()
.
@jonsurrell commented on PR #7332:
5 weeks ago
#151
I've thought about this some more and I don't think we should make this change.
Since this method is meant for printing and display, a more expected return value would be the lower-case variant of a given HTML tag name.
I'm not convinced this is the purpose of the method, although it depends what is meant by "printing and display."
Primarily, HTML API is for working with HTML input and HTML output. In HTML, the casing of tag names is irrelevant. There's no reason for svg tags to be "correctly" cased (<altGlyph>
instead of <altglyph>
), just like there's no reason to use upper or lower or any particular casing for HTML tag names. This method handles the correct casing for SVG element tag names, but that's not important for serializing and printing the svg tags in an HTML document.
It doesn't seem more correct to me to lowercase HTML tag names, MathML tag names, and then used some mixed casing on SVG tag names when the element name differs.
Here's my take on this method.
This method applies the rules from the specification on parsing foreign content:
If the adjusted current node is an element in the SVG namespace, and the token's tag name is one of the ones in the first column of the following table, change the tag name to the name given in the corresponding cell in the second column. (This fixes the case of SVG elements that are not all lowercase.)
| Tag name | Element name |
|:--- |:--- |
|altglyph
|altGlyph
|
|altglyphdef
|altGlyphDef
|
… | …
This seems to adjust for a difference between an HTML tag name (case insensitive) and an SVG element name. This adjustment is important as a consideration of tree construction where HTML tokens are transformed into elements in a tree.
At the moment, this roughly corresponds to `Node.nodeName` for elements (and `Element.tagName`). If we make this change, then it becomes an arbitrary decision.
[(An interesting note, qualified name
_is_ a DOM concept and does seem to correspond to this method as implemented.)
](https://dom.spec.whatwg.org/#concept-element-qualified-name)
And to state the obvious, it should be trivial for consuming code to lowercase tag names if desired.
---
if we want to consider the change it'd be best to do before 6.7 is released, as after that it would be a backwards-compatibility break
Definitely.
5 weeks ago
#152
It doesn't seem more correct to me to lowercase HTML tag names
It's less about being correct and more about expectations. I think if you survey a bunch of people and ask them how they feel about turning <p><a><span>
into <P><A><SPAN>
they will have opinions about that. Also, we can survey HTML in the wild and see what a sampling of global expectations might be, given the prevalence of the styling.
Here is a survey from my list of ~300k HTML pages.
Type of tag | Count | Percent |
---|---|---|
ALL UPPER | 307,499 | 2.7% |
all lower | 12,388,675 | 96% |
Mixed Case | 221,307 | 1.7% |
My point in sharing these numbers isn't to say they dictate what we do; just noting that the overwhelming majority of HTML out there is using lower-case tag names and people have grown accustomed to them.
In the case of normalization, this is the _default_ behavior, which is why I care about it.
But also going back in time, the reason I remember for introducing these functions was just to ensure that the html5lib tests pass which check against the adjusted foreign content tag names and attribute names. I don't feel these have a central important role in the spec compliance.
Your point is sound: it's trivial for calling code to lower-case-fold the tag names. Except, then they also have to remember to only do that for elements in the HTML namespace and not to do so for foreign elements. That leaves calling code calling this function and then immediately asking if it's an HTML element and then lower-casing.
$tag_name = $this->get_qualified_tag_name();
if ( 'html' === $this->get_namespace() ) {
$tag_name = strtolower( $tag_name );
}
maybe I had a gut reaction since get_qualified_tag_name()
_just_ made the exact same namespace check before returning, and it felt like spreading out the same semantic between the inside and outside of that method.
I was going to close this, but I think I'll leave it open at least a little longer to continue pondering.
5 weeks ago
#153
For general interest: here is the list of all-caps and mixed-case tag names from my survey. The list includes tag closers, and I didn't attempt to check if the closer casing matched the opening casing. Obvious HTML errors are evident, especially in the list of once-seen tags.
<details><summary>The list</summary>
ALL UPPER TAGS 65,179: A 30,621: BR 28,545: TD 23,392: FONT 18,875: P 15,786: IMG 15,154: B 13,723: TR 13,142: LI 10,102: OPTION 6,578: META 5,069: TABLE 4,701: HEAD 4,699: HTML 4,658: TITLE 4,576: BODY 4,137: I 4,008: ID 3,945: EM 3,476: H1 3,167: CENTER 2,532: HR 2,351: SPAN 1,654: INPUT 1,529: STRONG 1,515: UL 1,098: AREA 990: SCRIPT 853: H2 844: O:P 792: H3 785: WBR 777: DIV 740: LINK 738: TH 661: MENU 620: BLOCKQUOTE 610: TBODY 418: U 400: FORM 379: H4 318: SMALL 270: STYLE 266: FRAME 200: PRE 186: ADDRESS 174: BIG 168: MAP 166: PARAM 151: SELECT 142: FRAMESET 124: TT 94: SPACER 92: ABBR 88: CITE 81: SUP 78: NOBR 72: NOSCRIPT 63: H6 56: BASE 52: H5 51: COL 49: IFRAME 36: COLGROUP 35: DD 34: NOFRAMES 33: DT 30: OBJECT 28: EMBED 27: SUMMARY 25: MARQUEE 25: SUB 22: HEADER 22: LAYER 20: EC 19: BASEFONT 19: OL 15: FIGURE 14: NAV 13: DFN 12: V:F 12: ARTICLE 12: ILAYER 11: SECTION 11: X-CLARIS-WINDOW 11: X-CLARIS-TAGVIEW 11: CAPTION 11: BUTTON 9: THEAD 9: FIGCAPTION 8: O 8: ZBLINK 8: LABEL 7: CODE 7: DIR 7: LH 7: AUDIO 6: BLINK 6: STYLE='MSO-BIDI-FONT-WEIGHT: 5: X-SAS-WINDOW 5: BGSOUND 4: FOOTER 4: TEXTAREA 4: ALIGN=LEFT 4: X-CLARIS-REMOTESAVE 4: NOINDEX 3: APPLET 3: LEFT 3: BOLD 3: KBD 3: SP 2: Y 2: E= 2: C 2: NÍ 2: MARK 2: STRIKE 2: WSJ 2: NOLAYER 2: ALIGN="LEFT" 2: INSERT_COUNT* 2: BD 2: NOFRAME 2: INS 2: LNAME 2: FNAME 2: URL 2: ACRONYM Once-seen tags: A,, AAREA, ALIGN="RIGHT", ALIGN=CENTER, ASIDE, B<P, BGCOLOR="#000000", BOTTOM, BR<B, BRïï, CENTRE, CFINCLUDE, CLEAR, COLDEF, COLDEFS, CONNECTED,PREFERRED, CSACTION, CSACTIONDICT, CSACTIONITEM, CSACTIONS, CSSCRIPTDICT, CUFON, CUFONCANVAS, CUFONTEXT, DL, DOC, DOCTYPE, EF_B_RED, FIELDSET, GCSE:SEARCH, GCSE:SEARCHBOX-ONLY, H, HEADS_TAG, HRNOSHADE, HTM, HTML!, IF_ERRORPARAM, IF_ERRORSTR, IF_ERRORTYPE, INSERTFLASHHEAD, INSTITUTE, JSON, LEGEND, LI,<A, MAJOR, METANAME="DESCRIPTION", METANAME="KEYWORDS", NAME, NOAUTOLINK, NOF, O:LOCK, OCCUPATION, ONTOLOGY, ROWS, SAMP, SPOUSE, T, TAIL, TILTE, TIME, U7:P, UNION_TAG_INDEX_FOOTER, UNION_TAG_INDEX_HEADER_1, UNION_TAG_INDEX_HEADER_2, UNION_TAG_INDEX_TITLE, UP-21, V:FORMULAS, V:PATH, V:SHAPETYPE, V:STROKE Mixed Tags 32,047: Key 32,041: Contents 32,041: LastModified 32,041: ETag 32,041: Size 28,041: StorageClass 4,007: Owner 4,004: DisplayName 4,000: Generation 4,000: MetaGeneration 1,578: feColorMatrix 1,466: feComposite 1,390: feComponentTransfer 1,390: feFuncA 1,387: feFuncR 1,387: feFuncG 1,387: feFuncB 807: linearGradient 680: clipPath 459: Error 459: Code 459: Message 380: RequestId 379: Option 373: HostId 155: feGaussianBlur 125: feBlend 122: feOffset 111: bR 91: Br 80: tD 79: feFlood 75: Td 74: feMergeNode 71: Font 64: o:SmartTagType 49: textPath 48: st1:State 45: Meta 43: radialGradient 41: tR 38: animateTransform 38: ListBucketResult 38: Name 38: Prefix 38: Marker 38: IsTruncated 37: feMerge 37: st1:City 34: MaxKeys 33: Tr 31: Center 31: Table 30: rdf:RDF 30: asp:ListItem 29: Img 28: feMorphology 26: Strong 26: foaf:givenName 26: foaf:familyName 25: QueryParameterName 25: QueryParameterValue 25: Reason 24: st1:PlaceName 23: AccountName 23: cc:Work 21: Script 21: Title 20: RecommendDoc 19: st1:PlaceType 19: class="Text" 16: Li 13: noScript 13: psi:contextVar 13: contBox-x 12: Input 11: u51:SmartTagType 10: Details 10: Body 10: ItemTemplate 10: u46:SmartTagType 10: u48:SmartTagType 9: Dd 9: foreignObject 9: Th 9: psi:queryVar 9: u26:SmartTagType 9: u40:SmartTagType 9: u45:SmartTagType 9: u52:SmartTagType 9: u53:SmartTagType 9: st1:Street 8: u28:SmartTagType 8: u29:SmartTagType 8: u31:SmartTagType 8: u42:SmartTagType 8: u47:SmartTagType 8: u49:SmartTagType 8: u54:SmartTagType 8: u55:SmartTagType 8: Button 8: asp:RequiredFieldValidator 8: asp:TextBox 7: color="#CC0000" 7: MainOrArchivePage 7: rdf:Description 7: u23:SmartTagType 7: u24:SmartTagType 7: u27:SmartTagType 7: u33:SmartTagType 7: u36:SmartTagType 7: u37:SmartTagType 7: u43:SmartTagType 7: Head 7: color=#fFee00 6: Event-Card-Open-Close-Toggle 6: HFBusiness 6: u1:SmartTagType 6: u4:SmartTagType 6: u25:SmartTagType 6: u30:SmartTagType 6: u34:SmartTagType 6: u35:SmartTagType 6: u38:SmartTagType 6: u39:SmartTagType 6: u41:SmartTagType 6: u44:SmartTagType 6: u50:SmartTagType 6: u57:SmartTagType 6: u58:SmartTagType 6: u60:SmartTagType 6: u62:SmartTagType 6: u63:SmartTagType 6: u64:SmartTagType 6: u65:SmartTagType 6: u66:SmartTagType 6: u67:SmartTagType 6: u68:SmartTagType 6: u69:SmartTagType 6: iconSm-x 6: Select 6: asp:Panel 5: psi:sessionVar 5: u32:SmartTagType 5: u61:SmartTagType 5: st1:PostalCode 5: liNK 4: InLineReplace 4: Style 4: psi:sortOp 4: u7:SmartTagType 4: u8:SmartTagType 4: u9:SmartTagType 4: u10:SmartTagType 4: u11:SmartTagType 4: u12:SmartTagType 4: u17:SmartTagType 4: u21:SmartTagType 4: u22:SmartTagType 4: u56:SmartTagType 4: u59:SmartTagType 4: NextMarker 4: httpStatusCode 4: Form 3: Resource 3: ListAllMyBucketsResult 3: Buckets 3: Ul 3: u5:SmartTagType 3: u6:SmartTagType 3: u13:SmartTagType 3: u14:SmartTagType 3: u15:SmartTagType 3: u16:SmartTagType 3: u73:SmartTagType 3: u72:SmartTagType 3: u74:SmartTagType 3: toggleSection 3: Initial-scale=1.0" 3: Ozelliklerimiz< 3: BucketName 3: asp:Label 3: invalidTag 3: String 2: hR 2: xmpMM:DerivedFrom 2: QlÃIPq¼I3J]ߢ*5×¾¢GC 2: Wo|¥´bFôÈ®D:ýx3¨j8V~ùs¸xÑ,4P[\ô÷sDóÃ1#ð£y)F?|ù 2: k*H)ϼzÐâ5U%Oý 2: Z.¶seiÁ%<Aù¯~õÀZÇv¸¼ºXBË är9KÇãào¼KNôT·2 ÛÁcÚFáÌú¾èMJ`.ôSÕûUÐÀ¡Õ7»H$2³Èhe¤þPEçûP°IBZ)R] 8R°ÊÆ2dd5æ 2: sÓ?©JhJ¢ÉéëcZzÖAÝ6âd£5èIå 2: H©õj&[زê³èsúòÆýÑ 2: NY´Ø&bTÉ9C¢q¼Vº 2: JÏSÖ!Ðr¼K9äy´è(J±Ä¤8xõ L&c1Ç Ga [Yj*Ëf«:þ¹DWW0¨ÜÑ(HÔ 2: W°éÿÛ=ek¾6¹Æ^Äu 2: hX¼Wy2ûºfzÐ[*vÀq°« 2: KMx¢ÛY75)-¾=¤vw3ê;¿î=-Sç\»Ñ7Ëu\2J^S[¤C&CC-'ððÒ©ä}'é¯XLjÒ ðXuÂAÔdd5æ 2: Rñú9w1÷ñÿ\²þd=¡ùºÞ¿òel¿Ô³¤ø¥ãôrîcïcþ¹eüÈp~n·¯ü[ 2: y+x¬û¤óéI¡*Y1"J+Å 2: Q·öÝm§o}Ý)eóVQú=QNY$GIËZÄ¥,LÌkÐPXh¾±ÂA#ÌÿUp$<3,$ûGë@Ü9@32e¯C@@æ8t<ÿhàÿy¿ÍÏú#´âÞþ 2: pñçîð²Ä²Zóöèh5ßÏê¶ 2: i!?Æ %µÀl[N&åoÖHA%$&G 2: WK¼bøÒ²ÿ³«úO=tãóÒî2ãm%ü[6* = 2: ré*Fu,Ü0,¯g!j4VÉTKÝËÜ4LKgN?ØΤ§gѼ«o"(äFöEo¡÷»ã4Hî×·@ËNºPê"áGÓ5È¡[Lq 2: Zâè:P%ÞвXS1ó<e@ÊKôw,*ËË1Òs{¹¥ 2: kXu¦!.\ 2: AùCd»´{×ýÚà"CÁÂÒ0ë÷NG» 2: o'©SVVp 2: YÆIeUVë¦X«+M¦oòÖÞEöY{¸²Ö"ïb)¯Üo´4YZË 2: Bþ®J¼½+˹»EUÅUððV2¾UiÜcK¶PA(P*ÑE# 2: j«Ct³èîÖ{]òZ¦Ûk]Ï%%ËzT~ ³îËÖÒÁ®Ú×µó÷ybmMqjê©»±²pÐÕBp_ß<À}õ)iBËNØ(ã4àÉÆêìl¤Ô¤êÛÑR¸VJ¬Ý[ëgHáÍÊë;hNqXb¢ÖUùM¿&K8ªuiMaf¶¼[A{ö"ÐÛÕÝ}Y dës\ajwjêö¾u.¤F¥(CxR²JN 2: p»Jk<ð§=¡+üã£h²Ò8ÖäõM¼êú©*V¬ÆZX·qñ+,½ÿ5e5xHö+«ÙöÑX[Æ¿ºú 2: n£ØbðlÄÜÄCilüì]aûÛ÷ªÆâb·XZ°%À jH* 2: Pmj¼Ï'Þ²dõCUÙz¢ª©ªèJ¦«@j 2: G£øg@]($÷GTJó V¼§øP}ãà(â÷iÊ tñtRá\YtÝP*2 2: Aù¯våÀXpÙ~%*qà°à 2: F%{þöéyàO4ôéEs#¢=ó^G:² 2: bgcolor="#FFFFFF" 2: Valign="middle" 2: tdOCTYPE 2: u18:SmartTagType 2: u19:SmartTagType 2: u20:SmartTagType 2: u70:SmartTagType 2: u71:SmartTagType 2: AHREF="http: 2: Address 2: socketType 2: i:pgfRef 2: ServerTime 2: TraceId 2: Transition 2: font-face="Times 2: Werewolf 2: Civilian 2: feTurbulence 2: feDisplacementMap 2: Area 2: Html 2: headHTML 2: asp:HyperLink 2: asp:DropDownList 2: asp:CheckBox Once-seen tags: AAddress, AHREF="Collections.html", AHREF="mailto:hydra@umail.ucsb.edu", Basilica, BlockQuote, Bstyle="color:black;background-color:#ffff66", Bucket, CENTEeR, Content-Type:, Endpoint, Footer, GallerySection, Gblockquote, HREF="http:, Hostip, HostipLookupResultSet, Iframe, In, InstragramWidget, Left, Link, Mmeta, Numbertemplate, P,Here, Page, Rosa, Rotten, SCRILongDateAGE="JavaS0riSunday,, SetModTime, Span, Special, TABle, UdeM_menu, UnknownOperationException, Xlink, YX*~³;]<úæ ¹ÒÕ^ö¬ÖÕ'ÿÔwNJcZo^«èk\E£òâ»P\éäD~Ñ4Þ÷ä97 «Æ ý, alt="El, aname="OLE_LINK2", aria-label="Instagram", aria-label="YouTube", asp:Button, asp:CustomValidator, asp:LinkButton, asp:Localize, asp:RegularExpressionValidator, asp:ValidationSummary, bDopo, clientConfig, cmLogo, com:Text, contPad, countryAbbrev, countryName, customHeaders, displayName, displayShortName, dnn:DnnCssInclude, emailProvider, fOnt, face="Verdana", font="C60000, f~ÐÙSBøh_ÝXÄÄÁ÷GV~uc¦ác{3÷Övñ:B¹1°$æ1·¸°BÛ?YçÁkUHY6~ðå,GVð, gml:Null, gml:Point, gml:boundedBy, gml:featureMember, gml:pointProperty, google-site-verification=OgY2mih_AxAZzi7f8b33QTOGHoScolbNOE6aTlqold0, httpProtocol, incomingServer, ipLocation, jA|ó+§ðü©cÈå,¼²e[WId2ùSÂë#ã§Rxáô¾~æBµÕÞôqm#ºÃ×ÐÒ §¡¶*ó7Qh2Ré, link="#0000FF", mes:Error, mes:ErrorMessage, meta name="description" content="Evo, meta name="keywords" content="Bike, meta property="og:description" content="Rent, navBarCenter, ns0:City, ns0:PlaceName, ns0:PlaceType, ns0:State, onMouseover="over_effect(event,'outset')", outgoingServer, pYes,, psi:ZOTERO_COinS, psi:iktList, psi:sortOptions, quicklinkComp, skippedTag, title="Fairfield, titleB, toggleCont, togglePad, xáP¾~æ$ Found 12,917,481 tags of those: ALL UPPER: 307,499 (2.38%) all lower: 12,388,675 (95.906%) Mixed Case: 221,307 (1.713%)
</details>
@jonsurrell commented on PR #7141:
4 weeks ago
#154
This ticket was mentioned in PR #7347 on WordPress/wordpress-develop by @jonsurrell.
4 weeks ago
#155
In the rules for parsing tokens in foreign content there are 2 places that indicated tokens should be processed "according to the rules given in the section corresponding to the current insertion mode in HTML content."
First (bold mine):
A start tag whose tag name is one of: "b", "big", "blockquote", "body", "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", "table", "tt", "u", "ul", "var"
A start tag whose tag name is "font", if the token has any attributes named "color", "face", or "size"
An end tag whose tag name is "br", "p"
Parse error.
While the current node is not a MathML text integration point, an HTML integration point, or an element in the HTML namespace, pop elements from the stack of open elements.
Reprocess the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
And later (bold mine):
Any other end tag
Run these steps:
- Initialize node to be the current node (the bottommost node of the stack).
- If node's tag name, converted to ASCII lowercase, is not the same as the tag name of the token, then this is a parse error.
- Loop: If node is the topmost element in the stack of open elements, then return. (fragment case)
- If node's tag name, converted to ASCII lowercase, is the same as the tag name of the token, pop elements from the stack of open elements until node has been popped from the stack, and then return.
- Set node to the previous entry in the stack of open elements.
- If node is not an element in the HTML namespace, return to the step labeled loop.
- Otherwise, process the token according to the rules given in the section corresponding to the current insertion mode in HTML content.
While working on fragment parsing and html5lib-tests in https://github.com/dmsnell/wordpress-develop/pull/22, I discovered an infinite loop that seems to occur in the following situation. At an svg:svg
context node, create a fragment parser for with HTML </p>
. This is the first condition mentioned.
In a full parser, the instruction "While the current node is not a MathML text integration point, an HTML integration point, or an element in the HTML namespace, pop elements from the stack of open elements." would ensure that when reprocessing the token it does not re-enter the foreign content rules because by popping elements the token would no longer be foreign content. However, in a fragment parser there may not be any nodes to pop and the context element may continue to cause foreign content handling to be applied. But the instruction seems to indicate that the token should be handled in the current HTML insertion mode.
This is fixed by moving to the same place in both cases which is a reproduction of the HTML handling switch on the current insertion mode.
This fixes the infinite loop that that appeared in https://github.com/dmsnell/wordpress-develop/pull/22.
Observed in https://github.com/dmsnell/wordpress-develop/pull/22.
This ticket was mentioned in PR #7425 on WordPress/wordpress-develop by @jonsurrell.
3 weeks ago
#158
Compare this branch with nightly.
Trac ticket: #61576
@jonsurrell commented on PR #7425:
3 weeks ago
#159
@dmsnell My first thought was that this logic should be used to determine the parsing mode before calling Tag_Processor (parent) step
:
And _maybe_ the tag procuessor just needs to know html
/foreign
more than the actual namespace it's in 🤔
@jonsurrell commented on PR #7425:
3 weeks ago
#160
Thanks for the fixes here, I think this is good to go @dmsnell.
Trac ticket: Core-61576