Make WordPress Core

Opened 8 weeks ago

#60698 new feature request

Add optimized set lookup class.

Reported by: dmsnell Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 6.5
Component: General Keywords:
Focuses: Cc:

Description

In the course of exploratory development in the HTML API there have been a few times where I wanted to test if a given string is in a set of statically-known strings, and a few times where I wanted to check if the next span of text represents an item in the set.

For the first case, in_array() is a suitable method, but it scans the array linearly, which isn't ideal when the test set is large.

<?php
if ( in_array( '&notin', $html5_named_character_references, true ) ) {
        // '&notin' is in the set.
}

For the second case, in_array() isn't adequate, and a more complicated lookup is necessary.

<?php
foreach ( $html5_named_character_references as $name ) {
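        // Limit the comparison to the candidate's length; a null length would compare the whole remainder of $html.
        if ( 0 === substr_compare( $html, $name, $at, /* length */ strlen( $name ), /* case insensitive */ true ) ) {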
                
                return $name;
        }
}

This second example demonstrates some catastrophic lookup characteristics when it's not certain whether the following input is any token from the set, let alone which one it might be. The code has to iterate over the whole search domain and compare each candidate against the input string, bailing only when one matches. When nothing matches, every candidate in the set has been tested for no result.

While reviewing code in various places I've noticed a similar pattern and need. It is mostly served by a regex that splits an input string into a large array, allocating a substring for each array element, followed by calls to in_array() inside the regex callback (or once the results array is passed to another function). This is memory-heavy and slow at runtime.
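
As a rough illustration of that pattern, here is a hypothetical sketch. The function name and the word list are invented for this example and are not taken from Core.

```php
<?php
/**
 * Hypothetical example of the split-then-scan pattern described above.
 */
function count_allowed_words_by_splitting( string $text, array $allowed ): int {
        // preg_split() allocates one substring per word before any test runs.
        $words = preg_split( '/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY );
        $count = 0;

        foreach ( $words as $word ) {
                // Each in_array() call is a linear scan over $allowed.
                if ( in_array( $word, $allowed, true ) ) {
                        ++$count;
                }
        }

        return $count;
}
```

The cost grows with both the number of words in the input and the size of the allowed list, and the intermediate array exists only to be thrown away.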


I'd like to propose a new class representing a relatively static set of terms or tokens. It would provide a test for membership within the set, and a way to find which term or token, if any, starts at a given offset in a string.

<?php
$named_character_references = new WP_Token_Set( [ '&notin', '&notin;', '&amp;', ] );

if ( $named_character_references->contains( '&notin' ) ) {
        
}

while ( true ) {
        $was_at = $at;
        $at     = strpos( $text, '&', $at );
        if ( false === $at ) {
                $output .= substr( $text, $was_at );
                break;
        }

        $name = $named_character_references->read_token( $text, $at );
        if ( false !== $name ) {
                $output .= substr( $text, $was_at, $at - $was_at );
                $output .= $named_character_replacements[ $name ];
                $at     += strlen( $name );
                continue;
        }

        // No named character reference was found: keep the text up to and
        // including this "&", then continue searching.
        $output .= substr( $text, $was_at, $at - $was_at + 1 );
        ++$at;
}
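
To make the two operations concrete, here is a minimal sketch of one way such a class could work. It is an assumption for illustration only and is not the implementation proposed in the linked PR: the class name Example_Token_Set and its length-grouped layout are hypothetical.

```php
<?php
/**
 * Hypothetical sketch only; not the class proposed in the linked PR.
 */
final class Example_Token_Set {
        /** @var array Tokens grouped by byte length, longest group first. */
        private $by_length = array();

        public function __construct( array $tokens ) {
                foreach ( $tokens as $token ) {
                        if ( '' === $token ) {
                                continue; // An empty token would match everywhere.
                        }
                        $this->by_length[ strlen( $token ) ][ $token ] = true;
                }
                // Longest groups first so read_token() prefers the longest match.
                krsort( $this->by_length );
        }

        public function contains( string $token ): bool {
                return isset( $this->by_length[ strlen( $token ) ][ $token ] );
        }

        /** @return string|false The token starting at $offset, or false if none does. */
        public function read_token( string $text, int $offset = 0 ) {
                $remaining = strlen( $text ) - $offset;

                foreach ( $this->by_length as $length => $group ) {
                        if ( $length > $remaining ) {
                                continue;
                        }
                        $candidate = substr( $text, $offset, $length );
                        if ( isset( $group[ $candidate ] ) ) {
                                return $candidate;
                        }
                }

                return false;
        }
}
```

In this sketch contains() is a single hash lookup, and read_token() does one substr() and one hash lookup per distinct token length rather than one comparison per token. A prefix-keyed table or a trie could avoid the substring allocations as well, at the cost of a more involved structure.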

Further, because WordPress largely deals with large and relatively static token sets (named character references, allowable URL schemes, file types, loaded templates, etc…), it would be nice to be able to precompute the lookup tables if they are at all costly, as doing so on every PHP load is unnecessarily burdensome.
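
One possible shape for that precomputation is sketched below, under stated assumptions: the function name is hypothetical, the table is grouped by token length purely for illustration, and a temporary file stands in for a file that would normally be generated at build time. It relies on var_export() producing valid PHP source for arrays.

```php
<?php
/**
 * Hypothetical build step: group tokens by length and emit the table as PHP source.
 */
function example_export_token_table( array $tokens ): string {
        $table = array();
        foreach ( $tokens as $token ) {
                $table[ strlen( $token ) ][ $token ] = true;
        }
        // Longest groups first, so a reader of the table can prefer longer matches.
        krsort( $table );

        return "<?php\nreturn " . var_export( $table, true ) . ";\n";
}

// Build time: write the generated source to a file (a temporary file in this sketch).
$path = tempnam( sys_get_temp_dir(), 'tokens' );
file_put_contents( $path, example_export_token_table( array( '&notin', '&notin;', '&amp;' ) ) );

// Run time: loading the file returns the finished table without rebuilding it.
$table = require $path;
unlink( $path );
```

With a generated file like this, the grouping work happens once when the file is built, and each request only pays for loading a literal array.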

A bonus feature would be a method to add and a method to remove terms.


In #5373 I have proposed such a WP_Token_Set and used it in #5337 to create a spec-compliant, low-memory-overhead, and efficient replacement for esc_attr().

The replacement esc_attr() is able to more reliably parse attribute values than the existing code and it does so more efficiently, avoiding numerous memory allocations and lookups.
