Opened 9 months ago
Last modified 2 months ago
#63020 new enhancement
HTML API: Breadcrumbs should include element indices and attributes
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Milestone: | Future Release | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | HTML API | Keywords: | has-patch dev-feedback has-test-info has-unit-tests |
| Focuses: | Cc: |
Description (last modified by )
The Optimization Detective plugin from the Core Performance Team extends WP_HTML_Tag_Processor with some features from WP_HTML_Processor like get_breadcrumbs() and get_current_depth(). It also introduces its own method get_xpath() which computes an XPath expression to uniquely identify the element, for example:
/HTML/BODY/DIV[@class='wp-site-blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]
See full documentation for why the XPath is constructed like this. In short, /HTML and /HTML/BODY lack any node indices since there is no possibility for ambiguity. For children of the BODY, using node indices is not stable since arbitrary HTML may be printed at wp_body_open(), and for this reason it uses the id, role, or class attribute to add a disambiguating XPath predicate. For levels below this, elements are referenced as *[1][self::IMG] to target the an element that occurs at a specific position. If this were instead /HEADER[1] it would select the first IMG among other IMG elements, not the first IMG among all siblings. This ensures the XPath only matches an IMG when it is the first child, and it will no longer match if a P is inserted before it, for example.
All this to say, WP_HTML_Processor does not keep track of element node indices, and it doesn't expose the attributes for the tags in the open stack (e.g. to get the id, role, or class). This would seem to make it more difficult to implement get_xpath() than maybe it should be. Ideally computing the XPath wouldn't require subclassing at all, and the information could be obtained from existing public methods. In Optimization Detective, the WP_HTML_Tag_Processor class is extended and the next_token() method is overridden so it can construct its own breadcrumbs and then also compute the node indices and capture the attributes at a given depth.
All this to say, I suggest that in addition to get_breadcrumbs() that there be a way to get more information from the open stack of tags, including the attributes for each tag and the node index for each.
In other words, it's currently possible to construct an XPath like /HTML/BODY as follows:
<?php $xpath = array_map( function ( string $breadcrumb ): string { return "/$breadcrumb"; }, $processor->get_breadcrumbs() );
But I'm proposing something like get_element_breadcrumbs() which would return objects for each open tag on the stack instead of just the tag name. So then you could construct a full unambiguous XPath:
<?php $xpath = array_map( function ( WP_Element $breadcrumb ): string { $expression = '/*'; $expression .= sprintf( '[self::*]', $breadcrumb->get_tag() ); foreach ( array( 'id', 'role', 'class' ) as $attribute_name ) { $attribute = $breadcrumb->get_attribute( $attribute_name ); if ( is_string( $attribute ) ) { $expression .= sprintf( '[@%s="%s"]', $breadcrumb->get_tag(), addcslashes( $attribute, '\\"' ) ); break; } } return $expression; }, $processor->get_element_breadcrumbs() );
Attachments (1)
Change History (15)
#4
@
9 months ago
As you are probably aware, Automattic has paused contributions to WordPress for the moment and I’m unable to give this ticket deep consideration at the moment. Thanks @flixos90 for reaching out to me.
In the past I have opposed the inclusion of this in the Core classes for a few reasons: there’s a measurable performance impact on eagerly tracking all of the node indices. In addition, this cannot be reliably done in the Tag Processor, so it belongs exclusively in the HTML Processor where those guarantees are possible.
Finally, I suggest leaning hard into the subclass for things that are currently possible. I’m not sure how generic the specific form of XPath construction is going to be outside of the Optimization Detective.
There is an implementation detail I’ve thought a lot about called a “Fir Tree” which would track the locations of all nodes in a document in something not too dissimilar from React’s virtual DOM. There could be ways of deriving the node indices from that approach, though I will not be working on this in the foreseeable future.
Should this end up in Core I would prefer seeing it be an opt-in mechanism, since we’ve tried to make all performance impacts a developer choice, where nothing is surprisingly costly. If it isn’t essential, it’d be best to leave it off until someone needs it.
Thank you both for working on this!
This ticket was mentioned in PR #8718 on WordPress/wordpress-develop by @sukhendu2002.
8 months ago
#5
- Keywords has-patch added; needs-patch removed
Trac ticket: https://core.trac.wordpress.org/ticket/63020
#7
@
3 months ago
- Keywords has-patch added; needs-patch removed
I added get_element_breadcrumbs() to WP_HTML_Processor. It gives detailed breadcrumbs with element indices and key attributes (id, role, class) that the Optimization Detective plugin uses for XPath generation.
Each breadcrumb in the returned array includes:
- tag name
- namespace (html/svg/math)
- sibling index
- relevant attributes for disambiguation
I also added a full test suite—11 cases covering basics, indexing, deep nesting, attributes, and edge cases. Everything passes.
This should make it way easier to generate unambiguous XPaths without subclassing WP_HTML_Tag_Processor like Optimization Detective does now.
Implementation is in src/wp-includes/html-api/class-wp-html-processor.php, tests in tests/phpunit/tests/html-api/wpHtmlProcessorElementBreadcrumbs.php.
#9
@
3 months ago
@ibrahimhajjaj thank you for posting that patch and for documenting the functions and providing tests.
first of all, this patch and functionality seems very fitting for the Optimization Detective. I’m glad that the HTML Processor is able to provide the information it wants to generate its XPath queries.
this particular feature request, however, introduces a break from the performance characteristics so-far built into the system (well, at least as much as is possible). that is, the HTML API generally makes you aware of any performance impact you request. it avoids doing extra work up-front and for everyone in case someone needs the information. but if we decode, allocate, and store every attribute for every element during a parse, I’m worried about penalizing every use case for the very specific use case presented here.
storing attributes carries two costs: they are text-decoded so that calling-code doesn’t have to determine if and how it should be decoded and this carries a runtime cost usually skipped-over unless it’s needed; they can be very large and lead to memory bloat, increasing the cost of a page and sometimes getting close to doubling the memory requirements.
given that I work frequently with sub-classes, and given that the HTML API was designed to be subclassed, I’d like to understand better why sub-classing is not working, especially for a plugin as staffed and maintained as the Optimization Detective. we should not see sub-classes as failures or inherent things to eradicate.
please see my previous comment as well, which this is mostly a rewording of. proposing a patch is the easy part, but fitting in the needs of one very-specific use-case into a system meant to be a low-level mechanism for building HTML processors is much more difficult.
#10
@
3 months ago
Hi @dmsnell,
Thank you for the detailed feedback. I completely agree with the philosophy of keeping the core processor lean and using subclasses for specialized needs...To make sure I'm on the right track before posting a new patch, I wanted to run the proposed solution by you.
Based on your comments, I've moved the functionality into a new WP_HTML_Breadcrumbs_Processor subclass. The core WP_HTML_Processor class would remain completely untouched.
The key features of this subclass would be:
Strictly Opt-In: All features (index tracking, attribute tracking) are disabled by default via methods like enable_index_tracking(true).
Lazy Attribute Decoding: It avoids the eager cost you mentioned. Instead of decoding and storing all attributes for every element, it only decodes the specific allow-listed attributes (id, role, class) on-demand, when get_element_breadcrumbs() is called.
On-Demand Index Calculation: Indices are calculated by counting siblings at the required level when needed, not stored during parsing.
My question is: Does this approach—a dedicated subclass with lazy, on-demand decoding and explicit opt-in controls—align with the performance profile and architecture you had in mind for handling this use case?
If this sounds correct, I will post the patch for this subclass implementation soon.
Thank you again...
This ticket was mentioned in Slack in #core by welcher. View the logs.
2 months ago
#12
@
2 months ago
- Milestone changed from 6.9 to Future Release
Based on the feedback from @dmsnell and considering we're ~2 weeks away from 6.9 beta 1, I am going to set the milestone to Future Release. If there is traction before the deadline, we can move it back.
#13
follow-up:
↓ 14
@
2 months ago
My question is: Does this approach—a dedicated subclass with lazy, on-demand decoding and explicit opt-in controls—align with the performance profile and architecture you had in mind for handling this use case?
If this sounds correct, I will post the patch for this subclass implementation soon.
@ibrahimhajjaj yes it does, very much, inside the Optimization Detective plugin. I've yet to see something that suggests it should be baked in as part of Core. to get there, I think it would be good to see concrete motivation for it: a number of different plugins doing this same thing in this same way, where the implementation is full of easy-to-miss steps.
it seems like the motivating use case is a fairly degenerate form of XPath which is a kind of custom tracker for a specific element rather than something that represents normative XPath usage. that’s fine again, in a plugin, but if that’s the feature leading this design I would love to see more evidence of universal need for that.
now if there’s anything in the HTML API that makes building this custom subclass difficult I want to see if we can’t align those better.
suggest that…there be a way to get more information from the open stack of tags, including the attributes for each tag and the node index for each.
this is something that seems might be best done by tracking the bookmarks, which are technically already internally stored, and re-parsing as necessary. on the one hand, plugin code can already easily track the attributes for the open elements. the code @westonruter shared as something he would want to be able to write should be trivial to add to a subclass which does nothing more than wraps next_token().
<?php public $open_stack = array(); public $element_indices = array( 0 ); public function next_token() { if ( ! $this->expects_closer() ) { array_pop( $this->open_stack ); array_pop( $this->element_indices ); } if ( ! parent::next_token() ) { return false; } if ( $this->is_tag_closer() ) { array_pop( $this->open_stack ); array_pop( $this->element_indices ); } else { $element_index = ++$this->element_indices[ count( $this->element_indices ) - 1 ]; $this->open_stack[] = array( 'token_name' => $this->get_token_name(), 'id' => $this->get_attribute( 'id' ), 'role' => $this->get_attribute( 'role' ), 'class' => $this->get_attribute( 'class' ), 'elementIndex' => $element_index, ); } return true; }
this glosses over some of the details, but I think it provides what your patch provides, though with less code and in a way that shouldn’t need to be maintained in harmony with the rest of the HTML Processor. this is using only public methods. it also doesn’t force Core to make a decision for everyone on which attributes for any given element are essential or not (for example, this code and the proposed patch do not indicate if a given element is a last-child, something I would think some code might want, and another thing I would be skeptical of adding for all people in all situations).
#14
in reply to:
↑ 13
@
2 months ago
Replying to dmsnell:
it seems like the motivating use case is a fairly degenerate form of XPath which is a kind of custom tracker for a specific element rather than something that represents normative XPath usage. that’s fine again, in a plugin, but if that’s the feature leading this design I would love to see more evidence of universal need for that.
The implementation in Optimization Detective is an XPath subset, yes, but the breadcrumbs could be used to generate CSS selectors as well or any more specific breadcrumbs than just an tag name hierarchy.
now if there’s anything in the HTML API that makes building this custom subclass difficult I want to see if we can’t align those better.
It has been a few months since I've been in the head space for integration the HTML Processor. I remember I had tried to subclass to obtain the breadcrumbs but I encountered issues. Granted, that was a year ago or so. I need to revisit this and evaluate what need there is for improved helpers in core, but I won't have time before 6.9-beta1.
I think these are great ideas, so putting this into the 6.9 milestone for consideration.
@dmsnell It would be great to get your feedback.