Make WordPress Core

Opened 4 weeks ago

Last modified 3 weeks ago

#62584 reopened feature request

Support more contexts in ::create_fragment method

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 6.8 Priority: normal
Severity: normal Version: 6.5
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

WP_HTML_Processor::create_fragment only supports the <body> context. It can and should support more contexts, for example:

<?php
WP_HTML_Processor::create_fragment( '<td>', '<table>' );

[59444] added the foundation to allow more contexts to be used via the new create_fragment_at_current_node method. This new method can be used along with a full processor to find the context node in the provided context and create a fragment at that location.

This is a follow-up to #62357.

Change History (15)

This ticket was mentioned in PR #7777 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#1

  • Keywords has-patch has-unit-tests added

✅ ~This PR builds on (and requires) #7348.~

This modifies ::create_fragment( $html, $context ) to use a full processor and create_fragment_at_node instead of the other way around. This makes more sense and makes the main factory methods more clear, where the state required for fragments is set up in create_fragment_at_node instead of in both create_fragment and create_fragment_at_current_node.

This allows for more HTML contexts to be provided to the basic create_fragment where the provided context HTML is appended to <!DOCTYPE html>, a full processor is created, the last tag opener is found, and a fragment parser is created at that node via create_fragment_at_current_node.

The HTML5lib tests are updated accordingly to use this new method to create fragments.

Trac ticket: https://core.trac.wordpress.org/ticket/62584
(previously https://core.trac.wordpress.org/ticket/62357)

Closes #7141.

#3 @Bernhard Reiter
4 weeks ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 59467:

HTML API: Allow more contexts in create_fragment.

This changeset modifies WP_HTML_Processor::create_fragment( $html, $context ) to use a full processor and create_fragment_at_node instead of the other way around. This makes more sense and makes the main factory methods more clear, where the state required for fragments is set up in create_fragment_at_node instead of in both create_fragment and create_fragment_at_current_node.

This allows for more HTML contexts to be provided to the basic create_fragment where the provided context HTML is appended to <!DOCTYPE html>, a full processor is created, the last tag opener is found, and a fragment parser is created at that node via create_fragment_at_current_node.

The HTML5lib tests are updated accordingly to use this new method to create fragments.

Props jonsurrell, dmsnell, bernhard-reiter.
Fixes #62584.

#5 @dmsnell
4 weeks ago

This is a thing of beauty, @jonsurrell.

https://cldup.com/ayU0YE4_qI.png

Unfortunately, something we didn’t anticipate is what happens when the fragment context is a token that’s ignored.

https://cldup.com/h7_2k-O_Jr.png

Now you and I know that <td> can’t live on its own - it’s ignored, but will the typical develop know this who is using the API and hasn’t spent months in the HTML spec?

Maybe as a follow-up we try and resolve this by doing something basic, like assuming BODY context as a wrapper in case the context fails to parse. For example, this algorithm:

fn get_context( html ):
  match parse( html ) {
    Parsed( context ) -> context
    NoTokens           -> get_context( "<body>{html}" )
  }

Do you think this would present any problems? I know this is related to our issue creating a context with elements like SCRIPT and #text, but it feels a bit different.

  • For <script> it reverts to BODY, so that’s not great, because we wan’t to reject these.
  • For <td> it creates BODY > TD.

So maybe there’s some figuring to do, but I think it would be very surprising and frustrating to pass <td> and have it fail. 🤔

#6 @jonsurrell
4 weeks ago

This is a thing of beauty…

😀 thank you!

what happens when the fragment context is a token that’s ignored.

If the context node can't be found (or the fragment processor cannot be created for any other reason), then the function returns null instead of an instance. There's also a _doing_it_wrong message No valid context element was detected. in case no context node can be found.

I don't think prepending <body> is a good solution, in fact it hides a problem that is appears to fix. A <body><td> context has the exact same problem as a <td> context: TD cannot appear in that location, the tag is ignored. In that case the context is _not_ TD, but BODY!

<tbody> context with <td> fragment is a more illustrative example.

Error (returns null) no context element found WP_HTML_Processor::create_fragment( '<td>', '<tbody>' ) demo.

Not expected - BODY context element WP_HTML_Processor::create_fragment( '<td>', '<body><tbody>' ). Resulting fragment is empty demo.

Correct - TR context element WP_HTML_Processor::create_fragment( '<td>', '<table><tbody>' ). Results in the following fragment demo:

└─TR
  └─TD

We could try to perform some heuristics on the context HTML and fix it. The html5lib tests do this, although its through the specific test format and not heuristic.

If the context HTML is something like <td>, <tr>, <tbody>… then it needs a preceding <table> to be detected. I'm reluctant to build this logic into the class, but maybe it's valuable and fixes some common problems.

Note that there are examples of exactly these types of tricky contexts in the function documentation:

<?php
// Usually, snippets of HTML ought to be processed in the default `<body>` context:
$processor = WP_HTML_Processor::create_fragment( '<p>Hi</p>' );

// Some fragments should be processed in the correct context like this SVG:
$processor = WP_HTML_Processor::create_fragment( '<rect width="10" height="10" />', '<svg>' );

// This fragment with TD tags should be processed in a TR context:
$processor = WP_HTML_Processor::create_fragment(
    '<td>1<td>2<td>3',
    '<table><tbody><tr>'
);

#7 @jonsurrell
4 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening to reconsider the interface and behavior of fragments in the HTML API. I spoke at length with @dmsnell and we decided to make the new methods (from this ticket and #62357) for creating non-body fragments private and consider some changes to the implementations that diverge from the HTML standard on fragment parsing.

A good example is the following:

<?php
WP_HTML_Processor::create_fragment( '<p>If this P is considered to be under the context node, that cannot be represented in HTML.', '<p>' );

As in implemented, this produces the following tree ("under" the context node of <p>):

└─P
  └─#text If this P is considered to be under the context node, that cannot be represented in HTML.

Through the lens of the HTML API, P > P is not something that can ever be represented as HTML and may not make sense to show as a context. The DOM APIs (where fragment parsing comes from) do support this behavior because the DOM supports many arbitrary ways of nesting elements that are unrepresentable in HTML.

Last edited 4 weeks ago by jonsurrell (previous) (diff)

This ticket was mentioned in PR #7907 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#8

See this comment:

Reopening to reconsider the interface and behavior of fragments in the HTML API. I spoke at length with @dmsnell and we decided to make the new methods (from this ticket and #62357) for creating non-body fragments private and consider some changes to the implementations that diverge from the HTML standard on fragment parsing.

A good example is the following:

WP_HTML_Processor::create_fragment( '<p>If this P is considered to be under the context node, that cannot be represented in HTML.', '<p>' );

As in implemented, this produces the following tree ("under" the context node of <p>):

└─P
  ├─#text If this P is considered to be under the context node, that's 
  ├─EM
  │ └─#text broken
  └─#text  HTML.

Through the lens of the HTML API, P > P is not something that can ever be represented as HTML and may not make sense to show as a context. The DOM APIs (where fragment parsing comes from) do support this behavior because the DOM supports many arbitrary ways of nesting elements that are unrepresentable in HTML.

Follow-up https://core.trac.wordpress.org/changeset/59444
Follow-up https://core.trac.wordpress.org/changeset/59467

Trac ticket:

This ticket was mentioned in PR #7912 on WordPress/wordpress-develop by @jonsurrell.


4 weeks ago
#9

WORK IN PROGRESS. DO NOT MERGE

This is exploring the idea of using a "context stack" inside fragment parsers. The goal is to maintain a coherent HTML structure of the fragment document.

This is not something that document objects (browsers) need to worry about because document trees can support any structure, including structures that are _unrepresentable_ in HTML, such as P > P (nested paragraph tags).

This diverges from the fragment parsing specification browser follow, where they only consider the _context element_ as a starting point for building a document. Because the tree can ensure that the document is contained within the node, there are no concerns of a nodes contents leaking out.

In the HTML API, this explores the idea of using a "context stack" in addition to the context element. The context stack is the entire available document up to and including the context element. These nodes are all "locked," and if they are removed from the stack the HTML API will bail (at this time it throws an exception, but that is not likely to be the final implementation).

Trac ticket: https://core.trac.wordpress.org/ticket/62584

@jonsurrell commented on PR #7907:


4 weeks ago
#10

See https://github.com/WordPress/wordpress-develop/pull/7912 for exploration on how to prevent unrepresentable HTML documents from being created as fragments.

#11 @Bernhard Reiter
4 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 59469:

HTML API: Make non-body fragment creation methods private.

The current implementation of create_fragment (and the underlying create_fragment_at_current_node) allows passing in a context that might result in a tree that cannot be represented by HTML. For example, a user might use <p> as context, and attempt to create a fragment that also consists of a paragraph element, <p>like this. This would result in a paragraph node nested inside another -- something that can never result from parsing HTML.

To prevent this, this changeset makes create_fragment_at_current_node private and limits create_fragment to only <body> as context, while a comprehensive solution to allow other contexts is being worked on.

Follow-up to [59444], [59467].
Props jonsurrell, dmsnell, bernhard-reiter.
Fixes #62584.

#13 @jonsurrell
4 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

@jonsurrell commented on PR #7912:


4 weeks ago
#14

This approach makes set_inner_html relatively easy to implement in a safe way: https://github.com/WordPress/wordpress-develop/pull/7932

@jonsurrell commented on PR #7912:


3 weeks ago
#15

When trying to control contexts like this, the insertion mode is also dangerous and is problematic in this PR as implemented.

The fragment implementation currently on trunk is correct where a BODY context element is not on the stack of open elements. This causes </body> tokens to be ignored. The same is true for </html>

In this PR, these checks do not behave as expected in the spec so could cause the parser to incorrectly move into after-body or after-after-body insertion modes.

Note: See TracTickets for help on using tickets.