Opened 4 weeks ago
Last modified 3 weeks ago
#62584 reopened feature request
Support more contexts in ::create_fragment method
Reported by: | jonsurrell | Owned by: | jonsurrell |
---|---|---|---|
Milestone: | 6.8 | Priority: | normal |
Severity: | normal | Version: | 6.5 |
Component: | HTML API | Keywords: | has-patch has-unit-tests |
Focuses: | Cc: |
Description
WP_HTML_Processor::create_fragment
only supports the <body>
context. It can and should support more contexts, for example:
<?php WP_HTML_Processor::create_fragment( '<td>', '<table>' );
[59444] added the foundation to allow more contexts to be used via the new create_fragment_at_current_node
method. This new method can be used along with a full processor to find the context node in the provided context and create a fragment at that location.
This is a follow-up to #62357.
Change History (15)
This ticket was mentioned in PR #7777 on WordPress/wordpress-develop by @jonsurrell.
4 weeks ago
#1
- Keywords has-patch has-unit-tests added
@jonsurrell commented on PR #7777:
4 weeks ago
#2
I've created https://core.trac.wordpress.org/ticket/62584 for this.
@Bernhard Reiter commented on PR #7777:
4 weeks ago
#4
Committed to Core in https://core.trac.wordpress.org/changeset/59467.
#5
@
4 weeks ago
This is a thing of beauty, @jonsurrell.
Unfortunately, something we didn’t anticipate is what happens when the fragment context is a token that’s ignored.
Now you and I know that <td>
can’t live on its own - it’s ignored, but will the typical develop know this who is using the API and hasn’t spent months in the HTML spec?
Maybe as a follow-up we try and resolve this by doing something basic, like assuming BODY
context as a wrapper in case the context fails to parse. For example, this algorithm:
fn get_context( html ): match parse( html ) { Parsed( context ) -> context NoTokens -> get_context( "<body>{html}" ) }
Do you think this would present any problems? I know this is related to our issue creating a context with elements like SCRIPT
and #text
, but it feels a bit different.
- For
<script>
it reverts toBODY
, so that’s not great, because we wan’t to reject these. - For
<td>
it createsBODY > TD
.
So maybe there’s some figuring to do, but I think it would be very surprising and frustrating to pass <td>
and have it fail. 🤔
#6
@
4 weeks ago
This is a thing of beauty…
😀 thank you!
what happens when the fragment context is a token that’s ignored.
If the context node can't be found (or the fragment processor cannot be created for any other reason), then the function returns null
instead of an instance. There's also a _doing_it_wrong
message No valid context element was detected.
in case no context node can be found.
I don't think prepending <body>
is a good solution, in fact it hides a problem that is appears to fix. A <body><td>
context has the exact same problem as a <td>
context: TD
cannot appear in that location, the tag is ignored. In that case the context is _not_ TD
, but BODY
!
<tbody>
context with <td>
fragment is a more illustrative example.
Error (returns null) no context element found WP_HTML_Processor::create_fragment( '<td>', '<tbody>' )
demo.
Not expected - BODY
context element WP_HTML_Processor::create_fragment( '<td>', '<body><tbody>' )
. Resulting fragment is empty demo.
Correct - TR
context element WP_HTML_Processor::create_fragment( '<td>', '<table><tbody>' )
. Results in the following fragment demo:
└─TR └─TD
We could try to perform some heuristics on the context HTML and fix it. The html5lib tests do this, although its through the specific test format and not heuristic.
If the context HTML is something like <td>
, <tr>
, <tbody>
… then it needs a preceding <table>
to be detected. I'm reluctant to build this logic into the class, but maybe it's valuable and fixes some common problems.
<?php // Usually, snippets of HTML ought to be processed in the default `<body>` context: $processor = WP_HTML_Processor::create_fragment( '<p>Hi</p>' ); // Some fragments should be processed in the correct context like this SVG: $processor = WP_HTML_Processor::create_fragment( '<rect width="10" height="10" />', '<svg>' ); // This fragment with TD tags should be processed in a TR context: $processor = WP_HTML_Processor::create_fragment( '<td>1<td>2<td>3', '<table><tbody><tr>' );
#7
@
4 weeks ago
- Resolution fixed deleted
- Status changed from closed to reopened
Reopening to reconsider the interface and behavior of fragments in the HTML API. I spoke at length with @dmsnell and we decided to make the new methods (from this ticket and #62357) for creating non-body fragments private and consider some changes to the implementations that diverge from the HTML standard on fragment parsing.
A good example is the following:
<?php WP_HTML_Processor::create_fragment( '<p>If this P is considered to be under the context node, that cannot be represented in HTML.', '<p>' );
As in implemented, this produces the following tree ("under" the context node of <p>
):
└─P └─#text If this P is considered to be under the context node, that cannot be represented in HTML.
Through the lens of the HTML API, P > P
is not something that can ever be represented as HTML and may not make sense to show as a context. The DOM APIs (where fragment parsing comes from) do support this behavior because the DOM supports many arbitrary ways of nesting elements that are unrepresentable in HTML.
This ticket was mentioned in PR #7907 on WordPress/wordpress-develop by @jonsurrell.
4 weeks ago
#8
Reopening to reconsider the interface and behavior of fragments in the HTML API. I spoke at length with @dmsnell and we decided to make the new methods (from this ticket and #62357) for creating non-body fragments private and consider some changes to the implementations that diverge from the HTML standard on fragment parsing.
A good example is the following:
WP_HTML_Processor::create_fragment( '<p>If this P is considered to be under the context node, that cannot be represented in HTML.', '<p>' );
As in implemented, this produces the following tree ("under" the context node of <p>):
└─P ├─#text If this P is considered to be under the context node, that's ├─EM │ └─#text broken └─#text HTML.Through the lens of the HTML API,
P > P
is not something that can ever be represented as HTML and may not make sense to show as a context. The DOM APIs (where fragment parsing comes from) do support this behavior because the DOM supports many arbitrary ways of nesting elements that are unrepresentable in HTML.
Follow-up https://core.trac.wordpress.org/changeset/59444
Follow-up https://core.trac.wordpress.org/changeset/59467
Trac ticket:
This ticket was mentioned in PR #7912 on WordPress/wordpress-develop by @jonsurrell.
4 weeks ago
#9
WORK IN PROGRESS. DO NOT MERGE
This is exploring the idea of using a "context stack" inside fragment parsers. The goal is to maintain a coherent HTML structure of the fragment document.
This is not something that document objects (browsers) need to worry about because document trees can support any structure, including structures that are _unrepresentable_ in HTML, such as P > P
(nested paragraph tags).
This diverges from the fragment parsing specification browser follow, where they only consider the _context element_ as a starting point for building a document. Because the tree can ensure that the document is contained within the node, there are no concerns of a nodes contents leaking out.
In the HTML API, this explores the idea of using a "context stack" in addition to the context element. The context stack is the entire available document up to and including the context element. These nodes are all "locked," and if they are removed from the stack the HTML API will bail (at this time it throws an exception, but that is not likely to be the final implementation).
Trac ticket: https://core.trac.wordpress.org/ticket/62584
@jonsurrell commented on PR #7907:
4 weeks ago
#10
See https://github.com/WordPress/wordpress-develop/pull/7912 for exploration on how to prevent unrepresentable HTML documents from being created as fragments.
@Bernhard Reiter commented on PR #7907:
4 weeks ago
#12
Committed to Core in https://core.trac.wordpress.org/changeset/59469.
@jonsurrell commented on PR #7912:
4 weeks ago
#14
This approach makes set_inner_html
relatively easy to implement in a safe way: https://github.com/WordPress/wordpress-develop/pull/7932
@jonsurrell commented on PR #7912:
3 weeks ago
#15
When trying to control contexts like this, the insertion mode is also dangerous and is problematic in this PR as implemented.
The fragment implementation currently on trunk is correct where a BODY
context element is not on the stack of open elements. This causes </body>
tokens to be ignored. The same is true for </html>
In this PR, these checks do not behave as expected in the spec so could cause the parser to incorrectly move into after-body or after-after-body insertion modes.
✅ ~This PR builds on (and requires) #7348.~
This modifies
::create_fragment( $html, $context )
to use a full processor andcreate_fragment_at_node
instead of the other way around. This makes more sense and makes the main factory methods more clear, where the state required for fragments is set up increate_fragment_at_node
instead of in bothcreate_fragment
andcreate_fragment_at_current_node
.This allows for more HTML contexts to be provided to the basic
create_fragment
where the provided context HTML is appended to<!DOCTYPE html>
, a full processor is created, the last tag opener is found, and a fragment parser is created at that node viacreate_fragment_at_current_node
.The HTML5lib tests are updated accordingly to use this new method to create fragments.
Trac ticket: https://core.trac.wordpress.org/ticket/62584
(previously https://core.trac.wordpress.org/ticket/62357)
Closes #7141.