Make WordPress Core

Opened 4 weeks ago

Closed 3 weeks ago

Last modified 3 weeks ago

#62396 closed defect (bug) (fixed)

Serialize should include doctype when present

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 6.7.1 Priority: normal
Severity: normal Version: 6.7
Component: HTML API Keywords: has-patch has-unit-tests fixed-major commit
Focuses: Cc:

Description

When a full HTML document with a DOCTYPE is serialized by the HTML Processor, the DOCTYPE token should be included.

The DOCTYPE is important HTML content and impacts how the document is processed.

DOCTYPEs are currently omitted in output of WP_HTML_Processor::serialize.

Change History (29)

This ticket was mentioned in PR #7780 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#1

  • Keywords has-patch has-unit-tests added

Output DOCTYPE when calling WP_HTML_Processor::serialize on a full document including a DOCTYPE.

The DOCTYPE should be included in the serialized/normalized HTML output, it has an impact in how the document is handled, in particular whether the document should be handled in quirks or no-quirks mode.

Trac ticket: https://core.trac.wordpress.org/ticket/62396

#2 @jonsurrell
4 weeks ago

  • Version set to 6.7

#3 @Bernhard Reiter
4 weeks ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 59399:

HTML API: Include doctype in full parser serialize.

Output DOCTYPE when calling WP_HTML_Processor::serialize on a full document that includes a DOCTYPE.

The DOCTYPE should be included in the serialized/normalized HTML output as it has an impact in how the document is handled, in particular whether the document should be handled in quirks or no-quirks mode.

This only affects the serialization of full parsers at this time because DOCTYPE tokens are currently ignored in all possible fragments. The omission of the DOCTYPE is subtle but can change the serialized document's quirks/no-quirks mode.

Props jonsurrell.
Fixes #62396.

#5 @Bernhard Reiter
4 weeks ago

  • Keywords dev-feedback fixed-major added
  • Resolution fixed deleted
  • Status changed from closed to reopened

Re-opening to seek approval by another committer to backport to the 6.7 branch.

#6 @cbravobernal
4 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 59401:

HTML API: Include doctype in full parser serialize.

Output DOCTYPE when calling WP_HTML_Processor::serialize on a full document that includes a DOCTYPE.

The DOCTYPE should be included in the serialized/normalized HTML output as it has an impact in how the document is handled, in particular whether the document should be handled in quirks or no-quirks mode.

This only affects the serialization of full parsers at this time because DOCTYPE tokens are currently ignored in all possible fragments. The omission of the DOCTYPE is subtle but can change the serialized document's quirks/no-quirks mode.

Reviewed by cbravobernal.
Merges [59399] to the 6.7 branch.

Props jonsurrell.
Fixes #62396.

#7 follow-ups: @dmsnell
4 weeks ago

Thanks for addressing this point, but I would ask we reconsider this change.

First, thanks for noticing the typo from my original code. I apparently used the token name html in the switch when passing the token type #doctype. This fix should stay.

The change to recreate the original doctype I believe might be a mistake. It more faithfully recreates the input document, but that's not the goal of serialize(), which is to fully normalize the input into a form that's going to be consistent and predictable.

<?php
$p = WP_HTML_Processor::create_full_parser( '<!DOCTYPE dorktype><p>hi<!DOCTYPE html5>' );
echo $p->serialize();
// <!DOCTYPE dorktype><html><head></head><body><p>hi</p></body></html>

I’m in the anti-dork-type camp here. Now initially I thought this was very messy and due to the way the HTML Processor understands HTML we would always want to return standards-mode HTML - this is why it formerly only returned <!DOCTYPE html>. A little investigation reveals a more nuanced need, but one for which I think we have precedent.

We ought to preserve the mode of the input document but I don't think we should preserve the PUBLIC and SYSTEM identifiers. That is, only return one of two predefined DOCTYPE declarations.

If we have a standards document input, we return <!DOCTYPE html>. If we get a quirks-mode document input, we return something like <!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode"> to fully identify the document as such.


For quirks-mode input I was thinking of the TABLE-closes-a-P rule. If we print out that serialized document as standards-compliant then we end up reshaping the document structure plus we produce an extra empty P element upon re-parsing. So we have to stay there. The more obvious thing I should have considered is the CSS names and ASCII case sensitivity.

Regardless, what do you think about printing only a pre-defined DOCTYPE for quirks mode? We could reuse one of the W3C ones, but I think creating our own would also be nice and clearly communicate what's going on.

Fun fact: the -// prefix in the public identifier indicates that what follows is an unregistered owner. That is, since this isn't referring to an ISO standards document, it’s unregistered. Otherwise it would be +//. SGML lore 🙂

This ticket was mentioned in PR #7793 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#8

See this comment

We ought to preserve the mode of the input document but I don't think we should preserve the PUBLIC and SYSTEM identifiers. That is, only return one of two predefined DOCTYPE declarations.

If we have a standards document input, we return <!DOCTYPE html>. If we get a quirks-mode document input, we return something like <!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode"> to fully identify the document as such.

Trac ticket: https://core.trac.wordpress.org/ticket/62396

#9 in reply to: ↑ 7 @jonsurrell
4 weeks ago

Replying to dmsnell:

Thanks for addressing this point, but I would ask we reconsider this change.


We ought to preserve the mode of the input document but I don't think we should preserve the PUBLIC and SYSTEM identifiers. That is, only return one of two predefined DOCTYPE declarations.

If we have a standards document input, we return <!DOCTYPE html>. If we get a quirks-mode document input, we return something like <!DOCTYPE html PUBLIC "-//WordPress/HTML/quirks-mode"> to fully identify the document as such.

Those are good thoughts, I think agree. The essential thing is to maintain whether the document is in quirks or no-quirks mode.

For no-quirks, the standard HTML5 doctype is good.
For quirks mode, omitting the doctype is fine.

I've prepared a PR with those changes for consideration: https://github.com/WordPress/wordpress-develop/pull/7793

@jonsurrell commented on PR #7793:


4 weeks ago
#10

This change would by lossy regarding documents in `limited-quirks` mode. A case could be added for that:

The public identifier starts with: "-W3CDTD XHTML 1.0 Frameset"
The public identifier starts with: "-
W3CDTD XHTML 1.0 Transitional"
The system identifier is not missing and the public identifier starts with: "-W3CDTD HTML 4.01 Frameset"
The system identifier is not missing and the public identifier starts with: "-
W3CDTD HTML 4.01 Transitional"

@jonsurrell commented on PR #7793:


4 weeks ago
#11

So if we ignore the token we get a quirks-mode document?

That is my understanding. Processing a DOCTYPE token can result in quirks, no-quirks, or limited-quirks. The absence of the DOCTYPE proceeds through to subsequent insertion modes and sets the document mode to quirks. That logic is encoded in the HTML Processor here:

https://github.com/WordPress/wordpress-develop/blob/a27e6a8ecd5279a45e47ef88d048643740d07b5a/src/wp-includes/html-api/class-wp-html-processor.php#L1330-L1353

@jonsurrell commented on PR #7793:


4 weeks ago
#12

Do you think there's still value in positively affirming that it’s in quirks mode?

This is the question 🙂

If we're going to have a quirks-mode DOCTYPE, I think it should either be omission or the (normalized) input DOCTYPE.

In the same way that <!DOCTYPE html> is the ideal no-quirks doctype, I feel that omission is the ideal quirks-mode DOCTYPE.

There remains the issue of limited-quirks DOCTYPEs, and those probably need to have their (normalized) DOCTYPES preserved given it's unclear what the ideal limited-quirks DOCTYPE would be.

I don't feel strongly about any of these, or even that this PR is a better approach. I'm open to being convinced.

DOCTYPEs _do_ contain information that's accessible in the browsers. They can be normalized to a standard form, but I'm a bit reluctant to performa this "aggressive" normalization to quirks/limited-quirks/no-quirks DOCTYPEs.

This ticket was mentioned in PR #7804 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#13

Changeset [59399] fixed missing DOCTYPEs in normalized HTML output. It missed an edge case where public and system identifiers may contain double quotes, in which case they must be quoted with single quotes.

This change addresses that issue and adds tests.

Trac ticket: https://core.trac.wordpress.org/ticket/62396
Follow-up to [59399]

#14 in reply to: ↑ 7 @jonsurrell
4 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

#15 @jonsurrell
4 weeks ago

Replying to dmsnell:

The change to recreate the original doctype I believe might be a mistake. It more faithfully recreates the input document, but that's not the goal of serialize(), which is to fully normalize the input into a form that's going to be consistent and predictable.

I want to make clarify that the change landed in [59399] does fully normalize the document. The DOCTYPE is not repeated verbatim, it is normalized into a standard DOCTYPE form. Some examples:

// Input (fun fact, this is a no-quirks mode doctype!)
<!dOcTyPehtml	PublIC"pub-id"'fooo"bar"bazzzzzzz' 😛>
// Normalized
<!DOCTYPE html PUBLIC "pub-id" 'fooo"bar"bazzzzzzz'>

// Input
<!doctype DORKtype>
// Normalized
<!DOCTYPE dorktype>

These DOCTYPES are not equal byte-for-byte, but their parsed values are equivalent and the result is indistinguishable in the document while using normalized HTML text.

@jonsurrell commented on PR #7804:


4 weeks ago
#16

@ockham @cbravobernal This is an important follow-up to [59399] / https://github.com/WordPress/wordpress-develop/pull/7780

@dmsnell commented on PR #7793:


4 weeks ago
#17

DOCTYPEs do contain information that's accessible in the browsers. They can be normalized to a standard form, but I'm a bit reluctant to performa this "aggressive" normalization to quirks/limited-quirks/no-quirks DOCTYPEs.

they do, and this is one reason that I think it’s proper to normalize them as well. talking about _predictability_ and improved reliability for unreliable legacy and new systems, it makes me wonder who we’re benefiting by passing along a _dorktype_.

do we know the effects of limited-quirks mode? so far we don’t even track that do we? I guess I’m comfortable leaving it a known potential bug until we understand the implications of it, but the fix is trivial — add that extra case.

omission is the ideal quirks-mode DOCTYPE.

for all fragment parsers this is the default behavior. we’re considering the wholesale rewrite of a full document in this patch. perhaps that’s good reason to remind ourselves that the impact of this existing behavior and change is minimal.

still, it worries me leaving things ambiguous, since the same HTML could appear as standards-mode in which case it would parse differently. there’d be no clue to the provenance of the document as a quirks- or limited-quirks document. this is also why I thought about using a new public identifier linking back to WordPress — any downstream system could see that and link back to the HTML API. “Oh! this is quirks-mode produced by WordPress, that’s why it’s this way…” Maybe my worries are overblown.

@jonsurrell commented on PR #7793:


4 weeks ago
#18

do we know the effects of limited-quirks mode? so far we don’t even track that do we?

The DOCTYPE parser does recognize limited-quirks doctypes correctly according to the spec. The HTML processor doesn't care about it because it does not affect parsing or tree construction, I believe it only affects rendering.

I believe limited quirks keeps some of the rendering peculiarities of quirks mode, but does not include things like case-insensitive class selectors or the P > TABLE tree construction difference that's in quirks mode.

Here's an example of a rendering and/or tree construction differences:

Limited quirks No quirks
<kbd>https://github.com/user-attachments/assets/a320ab6d-c79d-4d89-92da-a634cfd323a7</kbd> <kbd>https://github.com/user-attachments/assets/245c30c7-90f9-4557-ad8e-b9d07ff2ff39</kbd>

Chrome, Firefox, and Safari all continue to support limited-quirks mode.

Wikipedia says this:

A third compatibility mode known as "limited quirks mode" (previously, "almost standards mode" or "strict mode"), which maintains the "traditional" vertical sizing of table cells according to the CSS2 specification, has been implemented in these browsers: Safari), Opera) 7.5 (and later), all Gecko)-based browsers since 1.0.1 (such as Firefox) and Internet Explorer 8.[2]

"Almost standards" mode rendering matches "standards" mode in all details except for one. The layout of images inside table cells is handled the same way "quirks" mode operates, instead, which is fairly consistent with legacy browsers such as Internet Explorer 7 (and earlier). This means that sliced-images-in-tables layouts are less likely to fall apart in browsers when in either "quirks" or "almost standards" mode, rather than "standards" mode.[5]

@jonsurrell commented on PR #7793:


4 weeks ago
#19

In my opinion, the safest, least surprising, and most obvious thing to do is serialize a normalized version of the DOCTYPE token as was done in https://github.com/WordPress/wordpress-develop/pull/7780. This respects the input and maintains the no-/limited-/quirks mode of the document. I don't think we can go wrong preserving these aspects of the original document.

I could be convinced that replacing no-quirks doctypes with the standard <!DOCTYPE html> is reasonable and good. However, I'm less convinced that quirks mode doctypes should be completely removed or radically rewritten, but if we replace the no-quirks doctypes then it seems like this should be done as well 😕

Finally, I'm really not comfortable with changing a document's mode from limited-quirks, which probably means that limited-quirks must respect the original doctype, which leads me back to thinking that the original doctype should just be respected.

@dmsnell commented on PR #7793:


3 weeks ago
#20

less convinced that quirks mode doctypes should be completely removed or radically rewritten

you’ve made a compelling case for preserving limited-quirks mode, though I’m just going to take you at your word that those screenshots are different — I don’t see it.

I’m still for radical rewriting though. In serialize() we also strip duplicate attributes, normalize attribute values, re-encode HTML text, drop ignored tokens, write tags that didn’t appear in the input, and perform all sorts of radical rewriting.

For the purpose of normalization and predictability, I still like the idea of pre-defined DOCTYPE tokens, just now with one known to create limited quirks. That breaks the reference to WordPress (since any new DOCTYPE will indicate quirks-mode) but it would also mean that all documents produced will have one of three very-easily searchable DOCTYPE declarations.

Again, this is an issue on the periphery I think, as most serializations will likely occur on fragments and most will be HTML5 (or should be). If not for the complexity involved I’d far prefer we produce HTML5 standards-mode documents for all inputs, but alas, that seems intractable (okay maybe I’ll be eating these words in a few months when we do it).

That means I will support your choice, but I think that preserving “dorktypes” is a misunderstanding of the goal of serialization() — it’s not about preserving but rather normalizing.

#21 @cbravobernal
3 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 59410:

HTML API: Fix normalized doctype pub/sys identifier quotes.

Changeset [59399] fixed missing DOCTYPEs in normalized HTML output. It missed an edge case where public and system identifiers may contain double quotes, in which case they must be quoted with single quotes.

This commit addresses that issue and adds tests.

Follow-up to [59399].

Props jonsurrell, luisherranz, apermo.
Fixes #62396.

#23 @cbravobernal
3 weeks ago

  • Keywords dev-reviewed added; dev-feedback removed
  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening #62396 to request backporting [59410] to 6.7 branch

Last edited 3 weeks ago by cbravobernal (previous) (diff)

#24 @cbravobernal
3 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 59411:

HTML API: Fix normalized doctype pub/sys identifier quotes.

Changeset [59399] fixed missing DOCTYPEs in normalized HTML output. It missed an edge case where public and system identifiers may contain double quotes, in which case they must be quoted with single quotes.

This commit addresses that issue and adds tests.

Follow-up to [59399].

Reviewed by cbravobernal.
Merges [59410] to the 6.7 branch.

Props jonsurrell, luisherranz, apermo.
Fixes #62396.

This ticket was mentioned in Slack in #core by cbravobernal. View the logs.


3 weeks ago

#27 @cbravobernal
3 weeks ago

  • Keywords commit added; dev-reviewed removed

#28 @desrosj
3 weeks ago

  • Summary changed from HTML API: serialize should include doctype when present to Serialize should include doctype when present

@jonsurrell commented on PR #7793:


3 weeks ago
#29

I'll close this, but I am open to reconsidering.

Note: See TracTickets for help on using tickets.