Make WordPress Core

Opened 5 months ago

Last modified 8 weeks ago

#62270 new defect (bug)

Unable to set bookmark on </body> in WP_HTML_Processor

Reported by: westonruter's profile westonruter Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 6.4
Component: HTML API Keywords: has-unit-tests
Focuses: Cc:

Description

In the Optimization Detective plugin from the WordPress Core Performance Team, the HTML Tag Processor is currently used to locate the closing </body> tag, set a bookmark on it, and then inject pending markup at that point when get_updated_html() is called after the document has been iterated over.

This works well with WP_HTML_Tag_Processor, but it does not work with WP_HTML_Processor because when the processor is located at </body> the parser state gets set to STATE_COMPLETE and then set_bookmark() short-circuits returning false. It doesn't seem that STATE_COMPLETE should be reached until the end of the document is reached.

Change History (10)

This ticket was mentioned in PR #7614 on WordPress/wordpress-develop by @jonsurrell.


5 months ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket:

#2 @jonsurrell
5 months ago

Is this using the full processor or the fragment processor?

I don't expect the body close tag to be bookmarkable in the fragment processor (since it operates as if it were a fragment in BODY). However, the full processor still seems to fail.

#3 @westonruter
5 months ago

@jonsurrell yes, it fails on the full parser. I didn't try on the fragment parser.

#4 follow-up: @jonsurrell
5 months ago

Ah, this is fascinating.

The BODY tag closer doesn't remove BODY from the stack of open elements like most tag closers. It simply changes the insertion mode:

An end tag whose tag name is "body"
If the stack of open elements does not have a body element in scope, this is a parse error; ignore the token.

Switch the insertion mode to "after body".

Subsequent insertion modes may switch back to "in body" insertion mode. Here's an interesting example:

<body></body>
back in body.
</html>
<i>still</i> in body!
</body>
<!-- Body is "closed" because nothing re-enters. -->
HTML
├── HEAD
├── BODY
│   ├── #text: back in body.
│   ├── I
│   │   └── #text: still
│   └── #text: in body!
└── #comment: Body is "closed" because nothing re-enters.

The HTML processor attempts to hide these surprising details and present an idealized representation closer to the DOM tree. In that regard, it's doing the right thing. It only pops BODY and HTML off the stack when it reaches the end of the HTML.

(Right now the HTML API errors when adding content outside of HTML or BODY, see PR 7312 for work to improve that.)

(Aside: Upon review, I think the handling of BODY and HTML elements that don't impact the "tree" are buggy and are one reason that this bock is needed.)


I think it's correct that the closing BODY tag does not have a bookmark because it does not correspond to the removal of the BODY element. It _may_, but there are no guarantees and this would require peeking ahead in the HTML, something that the HTML API does not do at this time.

I don't have a solution to offer at this time, I need to think about it. I don't think the answer it to bookmark that location.

It's interesting to consider that when the processor actually takes the BODY tag off the stack (and pauses on the BODY closer) it's at the end of the document:

<html><body></body></html>
<!-- When we see </BODY> and </HTML> the parser is actually here! -->

If we were to inject HTML at this location, the result would be the following:

<html><body></body></html>
<div>injected html</div>

In most cases (except whitespace text or comments) the injected HTML would actually be inside the BODY tag in the tree, but the HTML is undesirable.

#5 in reply to: ↑ 4 @westonruter
5 months ago

  • Keywords has-patch removed

Replying to jonsurrell:

In most cases (except whitespace text or comments) the injected HTML would actually be inside the BODY tag in the tree, but the HTML is undesirable.

That's a good point. In my subclass, if a bookmark is unable to be set on </body> then it could just opt to append the injected content at the end of the document after </html> instead of inserting it at the desired bookmark before </body>.

Alternatively, I guess I could print out a comment at wp_footer() and then when iterating over the tokens and this comment is detected, the "end of body" bookmark could be set at that point.

These should be viable workarounds until setting a bookmark on </body> is supported.

#6 follow-up: @jonsurrell
5 months ago

In my subclass, if a bookmark is unable to be set on </body> then it could just opt to append the injected content at the end of the document after </html> instead of inserting it at the desired bookmark before </body>.

I was going to suggest something along these lines as a workaround. I think the only cases it would get wrong if a comment appears in the injected HTML before other elements or non-whitespace text. Anything else should be placed in the body element. It won't make the prettiest HTML, but it should work 🙂

#7 in reply to: ↑ 6 @westonruter
5 months ago

Replying to jonsurrell:

I think the only cases it would get wrong if a comment appears in the injected HTML before other elements or non-whitespace text. Anything else should be placed in the body element. It won't make the prettiest HTML, but it should work 🙂

Could you elaborate? What would go wrong in this case? So let's say a SCRIPT tag is intended to be added to the end of the </body>. If there is a comment preceding the SCRIPT tag then what? The comment would be left after </body> but then the SCRIPT tag and any other remaining content would be moved up to the end of the </body>?

#8 @jonsurrell
5 months ago

If there is a comment preceding the SCRIPT tag then what? The comment would be left after </body> but then the SCRIPT tag and any other remaining content would be moved up to the end of the </body>?

That's exactly right, except that if we're assuming the original HTML ended with </body></html> then the comment would be outside the HTML element.


To elaborate with a few examples:

Let's assume the original HTML ends as expected: </body></html>. This would put the parser into the wonderfully named "after after body" insertion mode.

  • A comment token is inserted as the last child of the document element.
  • Whitespace text is inserted as a child of BODY (but does NOT change the insertion mode).
  • Anything else (that isn't ignored) switches the insertion mode to "in body" and is reprocessed.

We can look at a few cases.

If the HTML (with appended HTML) looks like this:

</body></html>
	<!-- A comment -->
Text

We get this:

HTML
├── HEAD
├── BODY
│   ├── (… whatever was originally here …)
│   └── #text: \n\t\nText\n
└── #comment: A comment

The BODY element ends with \n\t\nText\n. The Document (outside of HTML) ends with the comment. Most instances of getting it wrong look something like this where comments would not be children of body if nothing has triggered the switch back to "in body" insertion mode.

However, if the HTML looks something like this:

</body></html>
<div>We want to append this to <code>BODY</code></div>
	<!-- A comment -->
Text

This is the result:

HTML
├── HEAD
└── BODY
    ├── (… whatever was originally here …)
    ├── #text: \n
    ├── DIV
    │   ├─ #text: We want to append this to
    │   └─ CODE
    │      └─ #text: BODY
    ├── #text: \n\t
    ├── #comment: A comment
    └── #text: Text\n

Then the DOM is exactly how we wanted it. Everything is under BODY an in the same order. This is because before the comment can appear out of place, the DIV caused the insertion mode to switch to "in body." As long as there's any non-whitespace text or another element before any comments, this should always hold. And I believe comments are the only thing that can have experience this problem.

#9 follow-up: @jorbin
8 weeks ago

In 59747:

Documentation: Update @since to reflect version this might ship in.

When originally committed, this code was targeting 6.7.1. However, it was not backported and included in 6.7.1. Will this be followed up by another version change? You'll need to stay tuned to next week's episode of "As the WordPress Turns" to find out!

Follow-up to [59285] and [59364].

See #62270.

#10 in reply to: ↑ 9 @westonruter
8 weeks ago

Replying to jorbin:

See #62270.

Actually, see #62269.

Note: See TracTickets for help on using tickets.