Make WordPress Core

Opened 6 months ago

Closed 5 months ago

Last modified 5 months ago

#61348 closed enhancement (fixed)

HTML API: Report real and virtual nodes in the HTML Processor.

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 6.6 Priority: normal
Severity: normal Version: 6.6
Component: HTML API Keywords: has-patch has-unit-tests needs-dev-note dev-reviewed
Focuses: Cc:

Description

HTML is a kind of short-hand for a DOM structure. This means that there are many cases in HTML where an element's opening tag or closing tag is missing (or both). This is because many of the parsing rules imply creating elements in the DOM which may not exist in the text of the HTML.

The HTML Processor, being the higher-level counterpart to the Tag Processor, is already aware of these nodes, but since it's inception has not paused on them when scanning through a document. Instead, these are visible when pausing on a child of such an element, but otherwise not seen.

The HTML Processor ought to fully represent the DOM structure a browser would see, which includes representing these "virtual" nodes which are implicitly created.


For example, the HTML string <p><div>Content</p></div> looks like it contains overlapping P and DIV elements, but in reality the first P is implicitly closed by the <div> and the second </p> is unexpected and creates an empty P element.

The current HTML Processor in trunk will visit these tags in sequence (where a + indicates opening a node while - indicates closing one): +P +DIV #text -P -DIV.

The HTML Processor ought to represent this the way code traversing a DOM tree would: +P -P +DIV #text +P -P -DIV. Notably, in this sequence we can see the missing/implicit/virtual nodes that were created as part of applying the semantic HTML rules.

Change History (28)

This ticket was mentioned in PR #6348 on WordPress/wordpress-develop by @dmsnell.


6 months ago
#1

Trac ticket: Core-61348

## Summary

Creates virtual nodes when pushing to and popping from the stack of open elements. It's these nodes that are returned by next_tag(), while subclassed methods intercept tag information, all within the HTML Processor.

## Splitting time!

  • [x] Add get_current_depth() to return the depth of the currently-matched element in the stack of open elements. This needs to account for marker and other not-yet-implemented items in the stack.
  • [x] Add expects_closer() or similar function to indicate if the currently-matched element needs or expects a closing element.

## Questions

  • What do we do with void tags?
    • There are tradeoffs in how to represent tags without a closing tag (be they void elements or self-closing elements within foreign content or special atomic elements like SCRIPT). We can represent a unique open and close tag/event for them, or only expose the opening.
      • In exposing an open and a close there's a very straightforward and universal model of sequential open and close events for every tag.
      • It's weird to have a closing tag for an IMG.
    • I think that the value in having a close tag is outweighed by the awkwardness of it. For situations where we're doing things like tracking depth, we should rely on an internal method like get_depth() from the HTML API itself instead of exporting that HTML nuance onto the caller.

## Related work

  • WordPress/wordpress-develop#5562
  • WordPress/gutenberg#46345

## Examples

Convert to Markdown Output
https://github.com/WordPress/wordpress-develop/assets/5431237/c1abc5f2-ae11-434d-9923-5ad16a47745d https://github.com/WordPress/wordpress-develop/assets/5431237/4f1f91bc-74f8-4bf0-8c3d-45e4d6135dbb
Normalize HTML Output
https://github.com/WordPress/wordpress-develop/assets/5431237/c1abc5f2-ae11-434d-9923-5ad16a47745d https://github.com/WordPress/wordpress-develop/assets/5431237/b5c71e31-642a-4875-9907-e7e3aea7e30c

cc: @sirreal

#2 @dmsnell
6 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 58304:

HTML API: Report real and virtual nodes in the HTML Processor.

HTML is a kind of short-hand for a DOM structure. This means that there are
many cases in HTML where an element's opening tag or closing tag is missing (or
both). This is because many of the parsing rules imply creating elements in the
DOM which may not exist in the text of the HTML.

The HTML Processor, being the higher-level counterpart to the Tag Processor, is
already aware of these nodes, but since it's inception has not paused on them
when scanning through a document. Instead, these are visible when pausing on a
child of such an element, but otherwise not seen.

In this patch the HTML Processor starts exposing those implicitly-created nodes,
including opening tags, and closing tags, that aren't foudn in the text content
of the HTML input document.

Previously, the sequence of matched tokens when scanning with
WP_HTML_Processor::next_token() would depend on how the HTML document was written,
but with this patch, all semantically equal HTML documents will parse and scan in
the same exact manner, presenting an idealized or "perfect" view of the document
the same way as would occur when traversing a DOM in a browser.

Developed in https://github.com/WordPress/wordpress-develop/pull/6348
Discussed in https://core.trac.wordpress.org/ticket/61348

Props audrasjb, dmsnell, gziolo, jonsurrell.
Fixes #61348.

This ticket was mentioned in PR #6726 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#4

This is a follow-up to 163c3fd11026e0a6f93a7f88ab32dd1679ad0f88 / #6348 with two small fixes:

  • Fix non-static is_tag_closer method called statically
  • Fix call to parent is_tag_closer() that should call instance method

Trac ticket: https://core.trac.wordpress.org/ticket/61348

#5 @dmsnell
6 months ago

In 58365:

HTML API: Call $this->is_tag_closer() in HTML Processor.

The HTML Processor had been calling the parent class is_tag_closer()
method, but since visiting virtual nodes was introduced, it's important
that all of the methods are called on the subclass.

This patch fixes one issue identified where the parent method was called
instead, and it fixes another case where the change from calling the
parent method to the $this method was done improperly.

Developed in https://github.com/WordPress/wordpress-develop/pull/6726
Discussed in https://core.trac.wordpress.org/ticket/61348

Props jonsurrell.
See #61348.
Follow-up to [58304].

#7 @dmsnell
6 months ago

In 58441:

HTML API: Prevent Open Elements class from waking up.

This class accepts a Closure, but it should not be possible
to wake up with one from a serialized class instance.

Developed in https://github.com/WordPress/wordpress-develop/pull/6852
Discussed in https://core.trac.wordpress.org/ticket/61348

Props jonsurrell.
See #61348.
Follow-up to [58304].

This ticket was mentioned in PR #6860 on WordPress/wordpress-develop by @dmsnell.


6 months ago
#8

See Core-61348

Since the introduction of _visiting all nodes_ in the HTML Processor, the internal logic within the class can be confusing. This new method indicates if the currently-matched node is real or virtual, and can be used to clarify that logic.

@jonsurrell commented on PR #6860:


6 months ago
#9

This fixes a regression that was currently in 6.6 that has tests in #6765 where COMMENT_AS_PI_NODE_LOOKALIKE comments did not expose their "tag name" (PI _target_), it was always #comment. So <?target foo ?> would have no way to access the target` part of the comment.

#10 @dmsnell
6 months ago

  • Keywords needs-dev-note added

This ticket was mentioned in PR #6914 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#13

Follow-up to [58304]
See Core-61348

Previously the breadcrumbs were only generated for real nodes, and when visiting virtual nodes, the parser had already traversed past them to the next real node, advancing the breadcrumbs ahead of the matched token.

Test in the Playground

https://github.com/WordPress/wordpress-develop/assets/5431237/81bbcf66-d2ad-4662-9a71-d6c96b90923f

#14 @dmsnell
5 months ago

  • Keywords dev-feedback added
  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening #61348 to request fix for broken breadcrumbs.

See PR#6914

#15 @dmsnell
5 months ago

In 58588:

HTML API: Report breadcrumbs properly when visiting virtual nodes.

When [58304] introduced the abililty to visit virtual nodes in the HTML document,
those being the nodes which are implied by the HTML but no explicitly present in
the raw text, a bug was introduced in the get_breadcrumbs() method because it
wasn't updated to be aware of the virtual nodes. Therefore it would report the
wrong breadcrumbs for virtual nodes. Since the new get_depth() method is based
on the same logic it was also broken for virtual nodes.

In this patch, the breadcrumbs have been updated to account for the virtual nodes
and the depth method has been updated to rely on the fixed breadcrumb logic.

Developed in https://github.com/WordPress/wordpress-develop/pull/6914
Discussed in https://core.trac.wordpress.org/ticket/61348

Follow-up to [58304].

Props dmsnell, jonsurrell, zieladam.
See #61348.

#17 @dmsnell
5 months ago

Requesting backport into 6.6, fixing a bug introduced during the 6.6 release cycle.

I prepared a commit message, in case it helps.

HTML API: Report breadcrumbs properly when visiting virtual nodes.

When [58304] introduced the abililty to visit virtual nodes in the HTML document,
those being the nodes which are implied by the HTML but no explicitly present in
the raw text, a bug was introduced in the `get_breadcrumbs()` method because it
wasn't updated to be aware of the virtual nodes. Therefore it would report the
wrong breadcrumbs for virtual nodes. Since the new `get_depth()` method is based
on the same logic it was also broken for virtual nodes.

In this patch, the breadcrumbs have been updated to account for the virtual nodes
and the depth method has been updated to rely on the fixed breadcrumb logic.

Developed in https://github.com/WordPress/wordpress-develop/pull/6914
Discussed in https://core.trac.wordpress.org/ticket/61348

Reviewed by jonsurrell, zieladam.
Merges [58588] to the 6.6 branch.

Follow-up to [58304].

Props dmsnell, hellofromtonya, joemcgill, jonsurrell, zieladam.
Fixes #61348.

This ticket was mentioned in Slack in #core-committers by dmsnell. View the logs.


5 months ago

@gziolo commented on PR #6914:


5 months ago
#19

get_current_depth may report the wrong depth in some cases that are fixes for breadcrumbs with this change.

good point. in my head it was using breadcrumbs() already, but now it actually is with https://github.com/WordPress/wordpress-develop/commit/21bc01b956ded3bef4ed0df0371ba376bc6c1a8f

What is the minimal test cases that would cover this issue? We are in the RC phase, so better to stay on the safe side. When testing with the Playground link provided everything works as expected.

#20 @gziolo
5 months ago

  • Keywords dev-reviewed added; dev-feedback removed

Okay, to backport to the 6.6 branch.

#21 @gziolo
5 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 58590:

HTML API: Report breadcrumbs properly when visiting virtual nodes.

The breadcrumbs have been updated to account for the virtual nodes
and the depth method has been updated to rely on the fixed breadcrumb logic.

Reviewed by jonsurrell, zieladam, gziolo.
Merges [58588] to the 6.6 branch.

Follow-up to [58304].

Props dmsnell, hellofromtonya, joemcgill, jonsurrell, zieladam.
Fixes #61348.

@jonsurrell commented on PR #6914:


5 months ago
#22

</p> is a good test case for this. I'll add some unit tests in another PR.

You can see that the P tags are created at the wrong depth with the wrong breadcrumbs here (hover them for a tooltip and notice their nesting)

In this PR that's been fixed.

@gziolo commented on PR #6914:


5 months ago
#23

To close the loop here, I backported changes with https://core.trac.wordpress.org/changeset/58590 to the 6.6 release branch.

This ticket was mentioned in PR #6929 on WordPress/wordpress-develop by @jonsurrell.


5 months ago
#24

Add unit tests for virtual node breadcrumbs and depth. This includes tests for behaviors that were fixed in https://github.com/WordPress/wordpress-develop/pull/6914#issuecomment-2196299009.

Trac ticket: https://core.trac.wordpress.org/ticket/61348

@jonsurrell commented on PR #6914:


5 months ago
#25

It would be great to land https://github.com/WordPress/wordpress-develop/pull/6929 as a follow-up to this with unit tests.

#26 @gziolo
5 months ago

In 58592:

HTML API: Add tests for virtual node breadcrumbs and depth

Follow-up [58590].
See #61348.
Props jonsurrell, gziolo.

@jonsurrell commented on PR #6929:


5 months ago
#28

I'll address some follow-up test tweaks in https://github.com/WordPress/wordpress-develop/pull/7030.

Note: See TracTickets for help on using tickets.