Make WordPress Core

Opened 10 months ago

Closed 10 months ago

Last modified 9 months ago

#60283 closed enhancement (fixed)

HTML API: Support all HTML tags in standard

Reported by: jonsurrell's profile jonsurrell Owned by: dmsnell's profile dmsnell
Milestone: 6.5 Priority: normal
Severity: normal Version:
Component: HTML API Keywords: has-patch has-unit-tests needs-dev-note
Focuses: Cc:

Description

The HTML API does not yet support all HTML tags in the HTML standard. Support should be introduced for all HTML tags.

This ticket does not cover foreign content like SVG.

Unsupported tags are listed in WP_HTML_Processor::step_in_body, currently at this location.

Change History (35)

This ticket was mentioned in PR #5895 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#1

  • Keywords has-patch has-unit-tests added

Add "area", "br", "embed", "img", "keygen", "wbr" tag handling

A start tag whose tag name is one of: "area", "br", "embed", "img", "keygen", "wbr"
Reconstruct the active formatting elements, if any.
Insert an HTML element for the token. Immediately pop the current node off the stack of open elements.
Acknowledge the token's self-closing flag, if it is set.
Set the frameset-ok flag to "not ok".

This exception is currently unsupported, <br> is handled, but the special exception for </br> is not:

An end tag whose tag name is "br"
Parse error. Drop the attributes from the token, and act as described in the next entry; i.e. act as if this was a "br" start tag token with no attributes, rather than the end tag token that it actually is.

Trac ticket: https://core.trac.wordpress.org/ticket/60283

This ticket was mentioned in Slack in #core by jorbin. View the logs.


10 months ago

#3 @jorbin
10 months ago

  • Milestone changed from Awaiting Review to 6.5

@jonsurrell commented on PR #5895:


10 months ago
#4

@dmsnell Will you review please?

@jonsurrell commented on PR #5895:


10 months ago
#5

I compared this with the html5lib-tests branch (#5794) and this includes 4 additional tests, no failures.

This ticket was mentioned in PR #5897 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#6

Implement HR tag handling

Trac ticket: https://core.trac.wordpress.org/ticket/60283

This ticket was mentioned in PR #5903 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#7

Add PRE and LISTING tag handling.

A start tag whose tag name is one of: "pre", "listing"
If the stack of open elements has a p element in button scope, then close a p element.

Insert an HTML element for the token.

If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)

Set the frameset-ok flag to "not ok".

An end tag whose tag name is one of: "address", "article", "aside", "blockquote", "button", "center", "details", "dialog", "dir", "div", "dl", "fieldset", "figcaption", "figure", "footer", "header", "hgroup", "listing", "main", "menu", "nav", "ol", "pre", "search", "section", "summary", "ul"
If the stack of open elements does not have an element in scope that is an HTML element with the same tag name as that of the token, then this is a parse error; ignore the token.

Otherwise, run these steps:

Generate implied end tags.

If the current node is not an HTML element with the same tag name as that of the token, then this is a parse error.

Pop elements from the stack of open elements until an HTML element with the same tag name as the token has been popped from the stack.

Trac ticket: https://core.trac.wordpress.org/ticket/60283

@jonsurrell commented on PR #5903:


10 months ago
#8

This increases html5lib-test assertions to 347 vs 336 on #5794 (11 new passing tests).

This ticket was mentioned in PR #5905 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#9

I started work on <plaintext> tags, but they don't make much sense until we can support text nodes (https://github.com/WordPress/wordpress-develop/pull/5683). Pausing this work for now.

Trac ticket: https://core.trac.wordpress.org/ticket/60283

This ticket was mentioned in PR #5906 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#10

  • Add param, source, track tag handling
  • html5lib-tests

Trac ticket: https://core.trac.wordpress.org/ticket/60283

This ticket was mentioned in PR #5907 on WordPress/wordpress-develop by @jonsurrell.


10 months ago
#13

  • html5lib-tests
  • Add INPUT tag handling

Trac ticket:

I tested this with:

  • src/wp-includes/html-api/class-wp-html-processor.php

    diff --git a/src/wp-includes/html-api/class-wp-html-processor.php b/src/wp-includes/html-api/class-wp-html-processor.php
    index c3874fa51a..b2d267fd98 100644
    a b class WP_HTML_Processor extends WP_HTML_Tag_Processor { 
    170170         *
    171171         * @var WP_HTML_Processor_State
    172172         */
    173         private $state = null;
     173        public $state = null;
    174174
    175175        /**
    176176         * Used to create unique bookmark names.
  • tests/phpunit/tests/html-api/wpHtmlProcessor.php

    diff --git a/tests/phpunit/tests/html-api/wpHtmlProcessor.php b/tests/phpunit/tests/html-api/wpHtmlProcessor.php
    index d2cf982d72..adc076eae7 100644
    a b class Tests_HtmlApi_WpHtmlProcessor extends WP_UnitTestCase { 
    132132                $this->assertFalse( $p->next_tag( 'EM' ), 'Should have aborted before finding second EM as it required reconstructing the first EM.' );
    133133        }
    134134
     135        /**
     136         * Ensures that input elements correctly set (or not) frameset_ok.
     137         *
     138         * @ticket 60283
     139         *
     140         * @covers WP_HTML_Processor::step_in_body
     141         */
     142        public function test_input_tag_framest_ok() {
     143                $p = WP_HTML_Processor::create_fragment( '<input type="x">' );
     144                $this->assertTrue( $p->next_tag() );
     145                $this->assertTrue( $p->state->frameset_ok );
     146
     147                $p = WP_HTML_Processor::create_fragment( '<input type="hidden">' );
     148                $this->assertTrue( $p->next_tag() );
     149                $this->assertFalse( $p->state->frameset_ok );
     150
     151                $p = WP_HTML_Processor::create_fragment( '<input type="hIdDEn">' );
     152                $this->assertTrue( $p->next_tag() );
     153                $this->assertFalse( $p->state->frameset_ok );
     154
     155                $p = WP_HTML_Processor::create_fragment( '<input>' );
     156                $this->assertTrue( $p->next_tag() );
     157                $this->assertFalse( $p->state->frameset_ok );
     158        }
     159
    135160        /**
    136161         * Ensures that special handling of unsupported tags is cleaned up
    137162         * as handling is implemented. Otherwise there's risk of leaving special

#14 @dmsnell
10 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 57314:

HTML API: Add support for HR element.

Adds support for the following HTML elements to the HTML Processor:

  • HR

Previously, this element was not supported and the HTML Processor would bail when encountering
it. Now, with this patch, it will proceed to parse an HTML document when encountering one.

Developed in WordPress/wordpress-develop#5897

Props jonsurrell, dmsnell
Fixes #60283

#15 @dmsnell
10 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

#16 @dmsnell
10 months ago

Reopening for the other linked PR

@dmsnell commented on PR #5897:


10 months ago
#17

Merged in ecbd1376a6c322f20ccccc6133933d13fc3c3356

#18 @dmsnell
10 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 57316:

HTML API: Add support for BR, EMBED, & other tags.

Adds support for the following HTML elements to the HTML Processor:

  • AREA, BR, EMBED, KEYGEN, WBR
  • Only the opening BR tag is supported, as the invalid closer </br> involves more complicated rules, to be implemented later.

Previously, these elements were not supported and the HTML Processor
would bail when encountering them. With this patch it will proceed to
parse an HTML document when encountering those tags as long as other
normal conditions don't cause it to bail (such as complicated format
reconstruction rules).

Props jonsurrell, dmsnell
Fixes #60283

@dmsnell commented on PR #5895:


10 months ago
#19

Merged in https://core.trac.wordpress.org/changeset/57316
91e51f92a8c2c5e81b2a08cf2600615612cc5ab5

#20 @dmsnell
10 months ago

In 57317:

HTML API: Add support for PRE and LISTING elements.

Adds support for the following HTML elements to the HTML Processor:

  • PRE, LISTING

Previously, these elements were not supported and the HTML Processor would bail when encountering them. Now, with this patch applied, it will proceed to parse an HTML document when encountering those tags.

Developed in WordPress/wordpress-develop#5903

Props jonsurrell, dmsnell
Fixes #60283

@dmsnell commented on PR #5903:


10 months ago
#21

Merged in 85de4aa70dd4f02d1aeb87277b65535438530bcd
https://core.trac.wordpress.org/changeset/57317

@dmsnell commented on PR #5903:


10 months ago
#22

Forgot to push the conflict resolution, but it was limited to the line of the HTML Processor's class documentation listing supported tags. The resolution was to add both the KEYGEN and the LISTING elements to that line.

@dmsnell commented on PR #5905:


10 months ago
#23

I think we should actually change the Tag Processor before doing this, because we will continue to step into new tags if we don't.

#24 @dmsnell
10 months ago

In 57326:

HTML API: Support PARAM, SOURCE, and TRACK tags.

Adds support for the following HTML elements to the HTML Processor:

  • PARAM, SOURCE, TRACK

Previously these elements were not supported and the HTML Processor would bail when encountering them. Now, with this patch applied, it will proceed to parse an HTML document when encountering those tags.

Props jonsurrell, dmsnell
Fixes #60283

@dmsnell commented on PR #5906:


10 months ago
#25

Merged in [57326]
6653002c215e90aac4bde7191df98e8851af3a7e

@jonsurrell commented on PR #5906:


10 months ago
#26

We should try and add these to the class-level docblock when we add support … please find a way to fit them in towards the top of the file so that people can see what's supported.

I've missed that in all the recent PRs. Thanks for handling them so far, I'll try to include those in the future 👍

#27 @dmsnell
10 months ago

In 57343:

HTML API: Support INPUT tags.

Adds support for the following HTML elements to the HTML Processor:

  • INPUT

Previously this element was not supported and the HTML Processor would bail when encountering one. Now, with this patch applied, it will proceed to parse the HTML document.

Developed in https://github.com/WordPress/wordpress-develop/pull/5907
Discussed in https://core.trac.wordpress.org/ticket/60283

Props jonsurrell
See #60283

#29 @afercia
10 months ago

I'm not following closely the amazing work on the HTML processor so I may be missing something but I'm not sure I understand why non-conforming and obsolete HTML elements are being added to the processor.

For example, the LISTING and PARAM elements are listed in the 'Non-conforming features' of the spec and 'must not be used by authors'.

https://html.spec.whatwg.org/#non-conforming-features
16.2 Non-conforming features
Elements in the following list are entirely obsolete, and must not be used by authors:

listing
Use pre and code instead.

param
Use the data attribute of the object element to set the URL of the external resource.

@dmsnell can you please expand a bit, when you have a chance?

Also, since there are commits still referncing this ticket, I guess the ticket should stay open?

#30 @dmsnell
10 months ago

hi @afercia 👋

I'm glad you asked this great question.

it's hard to start answering without a preamble, which is that "invalid," "non-conforming," "non-normative,", "broken," and other words describing HTML must be understood as broad and highly context-sensitive terms. for the sake of any HTML parser there is no such thing as "invalid" markup. there is, but it doesn't mean anything other than "handle the markup as specified and note somewhere that it was invalid, if you want to communicate that it's invalid."

the section of the specification you are quoting is specifically discussing authoring HTML and yes, nobody should be creating a new LISTING tag. unfortunately for us and other parsers and browsers, that HTML might exist and we might have to read and understand it. there's no way to simply ignore a deprecated, obsolete, or non-conforming tag if carries its own semantic rules.

the XMP element, for example, switches into a separate parsing mode and the PARAM element is void. if we didn't include support for these tags then we would get off track when parsing any HTML that includes them. sadly, this means that every valid parser will have to support or at least recognize these elements forever in order to avoid breaking content already on the web. In fact, I recently observed that Github is sending </xmp> on their pages and I'm guessing it's specifically to eagerly mitigate any attack vector based on the invalid XMP, where if someone were to inject <xmp> onto the page then everything after it would be swallowed as plaintext until a closing </xmp> is found. Without that guard then the injection would effectively trash the page from that point forward.

it's not currently possible to generate tags in the HTML API, though we do allow creating invalid attribute names as long as they don't break the HTML syntax. these discussions are relevant because the HTML API provides great opportunity to enforce whatever rules we want. for example, should we disallow setting invalid tabindex values? I suspect that in time we'll find that there are cases we want to be strict but usually we will want to be permissive because of the risk of over-zealously rejecting content that otherwise would behave well in browsers.

for now though, the simple answer is that we must support any existing or legacy rules for deprecated elements in the spec to avoid mis-parsing HTML inputs containing them.

Also, since there are commits still referencing this ticket, I guess the ticket should stay open?

I was hoping that this ticket could close and remain closed. it has a broad title and I'd prefer that no new PRs are opened to be associated with it. the existing PLAINTEXT PR is almost surely not merging into 6.5. this ticket, as broad as it is, is also not comprehensive and so it's not being discussed in a way I'd expect a big tracking ticket to work.

what's your advice? should I rename the ticket? dissociate the already-associated PR from it?

#31 @afercia
10 months ago

@dmsnell thanks for your extended explanation :)

should I rename the ticket? dissociate the already-associated PR from it?

If you're planning to add more tags, I'd think this ticket must stay open as the work on it is not completed.

#32 @dmsnell
10 months ago

If you're planning to add more tags, I'd think this ticket must stay open as the work on it is not completed.

we are going to add more tags, but I prefer not to on this ticket. if I change the title to something like "HTML API: Support more tags" would that clarify?

every other addition has come in separate tickets, and I find it distracting to mush all the conversations into one. that is, if we keep this ticket open until the HTML Processor is complete I'm not sure what the point of the ticket is. it's not for discussion about whether we should support the tags; and if it's simply to relate the work to the HTML Processor, then maybe we should be using See #58517 or Follows [56274]?

#33 @afercia
10 months ago

Traditionally the process on Trac is a little different from the one typically used on GitHub for example in the Gutenberg project. On Trac, it's not uncommon, sometimes even recommended, to use a ticket as a 'tracking ticket' and refer commits to it.

Trac isn't only a development tool. It represents the history of WordPress. Grouping related changes under a single ticket can help making the history of the project clearer even after many years.

#34 @dmsnell
10 months ago

Thanks for sharing your thoughts, @afercia. I don't think there's any comparison to Github here, or at least none was intended on my part; this is solely about Trac.

There are a few things to note:

  • None of the HTML tag support patches have been done under a single Trac ticket. Some are individually not that complicated, but some are mini-projects on their own.
  • We'll likely be having lingering support patches for some amount of time.
  • This Trac ticket was not opened for the purposes I think you are describing; more or less it was accidentally created with a title that sounds like what you're suggesting.
  • As written, this ticket even excludes large portions of the HTML5 specification from its scope, making it non-comprehensive even for the title'd purpose.

So, as I responded to your first prompting, I think it would make sense, if we're requiring people not to split work into smaller pieces, to reuse #58517 since that seems to be more like what you are talking about requiring. I would appreciate your thoughts on that point; I'd like to imagine that this Trac ticket never existed, but it was created, and here we are.

Also I'd like some thoughts from folks if there's strong feelings here or if it'd be fine to stay the course on the HTML API developments taking place in separate tickets. It can be hard to anticipate beforehand how complicated some section of the HTML5 specification will be to implement, and the discussion on this ticket, holding only a handful of relatively easy parts to implement, is already difficult for me personally to follow; there's too much noise and scrolling. Additionally, we inevitably get to things like the PLAINTEXT PR, which doesn't really belong here because as we got into it we realized that it's part of the Tag Processor's work and not part of the HTML Processor, so any merging of the work between the two is pushing the inherent relation that's there.

#35 @stevenlinx
9 months ago

  • Keywords needs-dev-note added

as part of a standalone dev note post

Note: See TracTickets for help on using tickets.