Make WordPress Core

Opened 8 weeks ago

Last modified 7 weeks ago

#63020 new enhancement

HTML API: Breadcrumbs should include element indices and attributes

Reported by: westonruter's profile westonruter Owned by:
Milestone: 6.9 Priority: normal
Severity: normal Version:
Component: HTML API Keywords: needs-patch
Focuses: Cc:

Description (last modified by westonruter)

The Optimization Detective plugin from the Core Performance Team extends WP_HTML_Tag_Processor with some features from WP_HTML_Processor like get_breadcrumbs() and get_current_depth(). It also introduces its own method get_xpath() which computes an XPath expression to uniquely identify the element, for example:

/HTML/BODY/DIV[@class='wp-site-blocks']/*[1][self::HEADER]/*[1][self::DIV]/*[2][self::IMG]

See full documentation for why the XPath is constructed like this. In short, /HTML and /HTML/BODY lack any node indices since there is no possibility for ambiguity. For children of the BODY, using node indices is not stable since arbitrary HTML may be printed at wp_body_open(), and for this reason it uses the id, role, or class attribute to add a disambiguating XPath predicate. For levels below this, elements are referenced as *[1][self::IMG] to target the an element that occurs at a specific position. If this were instead /HEADER[1] it would select the first IMG among other IMG elements, not the first IMG among all siblings. This ensures the XPath only matches an IMG when it is the first child, and it will no longer match if a P is inserted before it, for example.

All this to say, WP_HTML_Processor does not keep track of element node indices, and it doesn't expose the attributes for the tags in the open stack (e.g. to get the id, role, or class). This would seem to make it more difficult to implement get_xpath() than maybe it should be. Ideally computing the XPath wouldn't require subclassing at all, and the information could be obtained from existing public methods. In Optimization Detective, the WP_HTML_Tag_Processor class is extended and the next_token() method is overridden so it can construct its own breadcrumbs and then also compute the node indices and capture the attributes at a given depth.

All this to say, I suggest that in addition to get_breadcrumbs() that there be a way to get more information from the open stack of tags, including the attributes for each tag and the node index for each.

In other words, it's currently possible to construct an XPath like /HTML/BODY as follows:

<?php
$xpath = array_map(
        function ( string $breadcrumb ): string {
                return "/$breadcrumb";
        },
        $processor->get_breadcrumbs()
);

But I'm proposing something like get_element_breadcrumbs() which would return objects for each open tag on the stack instead of just the tag name. So then you could construct a full unambiguous XPath:

<?php
$xpath = array_map(
        function ( WP_Element $breadcrumb ): string {
                $expression = '/*';
                $expression .= sprintf( '[self::*]', $breadcrumb->get_tag() );
                foreach ( array( 'id', 'role', 'class' ) as $attribute_name ) {
                        $attribute = $breadcrumb->get_attribute( $attribute_name );
                        if ( is_string( $attribute ) ) {
                                $expression .= sprintf( '[@%s="%s"]', $breadcrumb->get_tag(), addcslashes( $attribute, '\\"' ) );
                                break;
                        }
                }
                return $expression;
        },
        $processor->get_element_breadcrumbs()
);

Change History (4)

#1 @flixos90
8 weeks ago

  • Milestone changed from Future Release to 6.9

I think these are great ideas, so putting this into the 6.9 milestone for consideration.

@dmsnell It would be great to get your feedback.

#2 @westonruter
8 weeks ago

  • Description modified (diff)

#3 @westonruter
8 weeks ago

  • Description modified (diff)

#4 @dmsnell
7 weeks ago

As you are probably aware, Automattic has paused contributions to WordPress for the moment and I’m unable to give this ticket deep consideration at the moment. Thanks @flixos90 for reaching out to me.

In the past I have opposed the inclusion of this in the Core classes for a few reasons: there’s a measurable performance impact on eagerly tracking all of the node indices. In addition, this cannot be reliably done in the Tag Processor, so it belongs exclusively in the HTML Processor where those guarantees are possible.

Finally, I suggest leaning hard into the subclass for things that are currently possible. I’m not sure how generic the specific form of XPath construction is going to be outside of the Optimization Detective.

There is an implementation detail I’ve thought a lot about called a “Fir Tree” which would track the locations of all nodes in a document in something not too dissimilar from React’s virtual DOM. There could be ways of deriving the node indices from that approach, though I will not be working on this in the foreseeable future.

Should this end up in Core I would prefer seeing it be an opt-in mechanism, since we’ve tried to make all performance impacts a developer choice, where nothing is surprisingly costly. If it isn’t essential, it’d be best to leave it off until someone needs it.

Thank you both for working on this!

Note: See TracTickets for help on using tickets.