Make WordPress Core

Changes between Initial Version and Version 1 of Ticket #60698

05/09/2024 02:10:41 AM (5 weeks ago)


  • Ticket #60698

    • Property Keywords has-patch needs-unit-tests added
    • Property Milestone changed from Awaiting Review to 6.6
    • Property Version changed from 6.5 to trunk
    • Property Summary changed from Add optimized set lookup class. to Token Map: Introduce an efficient lookup and translation class for string mappings.
  • Ticket #60698 – Description

    initial v1  
    1 In the course of exploratory development in the HTML API there have been a few times where I wanted to test if a given string is in a set of statically-known strings, and a few times where I wanted to check if the next span of text represents an item in the set.
     1Motivated by the need to properly transform HTML named character references (like `&`) I found the need for a new semantic class which can efficiently perform search and replacement of a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.
    3 For the first case, `in_array()` is a suitable method, but isn't always ideal when the test set is large.
     3== How does WordPress typically perform this kind of substitution?
     5=== An associative array with a generated regular expression callback.
     7It's common to see a pattern like this, where the lookup is stored as an array and a regular expression is built from it.
    7 if ( in_array( '&notin', $html5_named_character_references, true ) )
     11$smilies = array(
     12  ':(' => "\xf0\x9f\x99\x81",
     13  ':)' => "\xf0\x9f\x99\x82",
     14  ':?' => "\xf0\x9f\x98\x95",
     15  ':D' => "\xf0\x9f\x98\x80",
     19        '/' . implode( '|', array_keys( $smilies ) ) . '/',
     20        function ( $smily ) use ( $smilies ) { return $smilies[ $smily[0] ]; }
    10 For the second case, `in_array()` isn't adequate, and a more complicated lookup is necessary.
     24This method constructs a large regular expression pattern on every request and it's not always clear what performance implications there might be with the long list of alternation groups. These can lead to some cataclysmic runtimes and the code surrounding these to setup the regular expression pattern and replacments tend to appear somewhat exotic, masking the intent of the original goal of the code.
     26=== Calls to `str_replace()` with two arrays.
    14 foreach ( $html5_named_character_references as $name ) {
    15         if ( 0 === substr_compare( $html, $name, $at, /* length */ null, /* case insensitive */ true ) ) {
    16                 …
    17                 return $name;
    18         }
    19 }
     30$chars = array(
     31        'ª' => 'a',
     32        'º' => 'o',
     33        'À' => 'A',
     34        'Á' => 'A',
     35        'Â' => 'A',
     36        'Ã' => 'A',
     37        'Ä' => 'A',
     40str_replace( array_keys( $chars ), array_values( $chars ), $text );
    22 This second example demonstrates some catastrophic lookup characteristics when it's not certain if the following input is any token from the set, let alone which one it might be. The at-hand code has to iterate the search domain and then compare every single possibility against the input string, bailing when one is found.
     43This method is clear in the code, but the runtime can be very inefficient due to the fact that it's required to iterate over all of the array keys until a string match is found. Like the regular expression, performance can degrade quickly when there are hundreds or thousands of replacements.
    24 While reviewing code in various places I've noticed a similar pattern and need, mostly being served by `in_array()` and a regex that splits apart an input string into a large array, allocating substrings for each array element, and then calling `in_array()` inside the regex callback (or when the results array is passed to another function). This is all memory heavy and inefficient in the runtime.
     45=== Scanning with array lookup.
     47Sometimes code is crawling through a string and testing for set membership using an array.
     52        '~&([^;]+;)~',
     53        function ( $name ) use ( $html_entities ) {
     54                $allowed = in_array( $name[0], $html_entities, true ) );
     55                return $allowed ? $name : "&amp;{$name[1]}";
     56        }
     60This can work if the tokens follow a searchable pattern, but it's hard to apply complicated rules for the token pattern with the regular expression syntax, and this method also suffers from slow set-membership checking in the lookup array. For small arrays it's fine, but  not for large ones, especially if the document contains a high number of the input tokens.
     64All of these existing methods work best when it's clear that a given string contains one of the tokens, but what if it's not certain? The performance characteristics of the array-based approaches are worst when the document is missing the input tokens. These approaches are memory heavy and inefficient computationally.
    30 I'd like to propose a new class whose semantic is a relatively static set of terms or tokens which provides a test for membership within the set, and what the next matching term or token is at a given offset in a string, if the next sequence of characters matches one.
     68== A different approach.
     70I'm proposing a new class whose semantic is a relatively static mapping of terms or tokens to replacement strings. This class provides a number of purpose-specific methods supporting token-replacement operations, including:
     72- Fast set-membership testing, to determine if a given string is one of the tokens.
     73- Fast mapping from a given token to its replacement value.
     74- Determining if one of the tokens exists at a specific offset, and if so, what it is and what its mapping is.
    34 $named_character_references = WP_Token_Set( [ '&notin', '&notin;', '&amp;', … ] );
     78$named_character_references = WP_Token_Map::from_array( [ 'notin' => '∉', 'notin;' => '∉', 'amp;' => '&', … ] );
    36 if ( $named_character_references->contains( '&notin' ) ) {
    37         …
     80$matched_string_snippet = '&notin';
     81if ( $named_character_references->contains( substr( $matched_string_snippet, 1 ) ) ) {
     82        // "&notin" is a valid token
     85// Replace all named character references with their replacements.
    4086while ( true ) {
    4187        $was_at = $at;
    4692        }
    48         $name = $named_character_reference->read_token( $text, $at );
    49         if ( false !== $name ) {
     94        $name_length = 0;
     95        $replacement = $named_character_reference->read_token( $text, $at, $name_length );
     96        if ( false !== $replacement ) {
    5097                $output .= substr( $text, $was_at, $at - $was_at );
    51                 $output .= $named_character_replacements[ $name ];
    52                 $at     += strlen( $name );
     98                $output .= $replacement;
     99                $at     += $name_length;
    53100                continue;
    54101        }
    61 ----
     108In this example the leading `&` has been pruned from the lookup table to save space and time, but it very well could be added into the table.
    63 Further, because WordPress largely deals with large and relatively static token sets (named character references, allowable URL schemes, file types, loaded templates, etc…), it would be nice to be able to precompute the lookup tables if they are at all costly, as doing so on every PHP load is unnecessarily burdensome.
    65 A bonus feature would be a method to add and a method to remove terms.
     110Because this class introduces specific semantic meaning, that is, looking up and mapping static string tokens to static string replacements, we can employ a number of optimizations to reduce the overall memory footprint and also to optimize the runtime. Such optimizations have been incorporated into the linked PR.
    69 In [ #5373] I have proposed such a `WP_Token_Set` and used it in [ #5337] to create a spec-compliant, low-memory-overhead, and efficient replacement for `esc_attr()`.
     114In [ #6387] I have built a spec-compliant HTML text decoder which utilizes this token map. The performance of the new decoder is approximately 20% slower than calling `html_entity_decode()` directly, except it properly decodes what PHP can't. In fact, the performance bottleneck in that work is not even in the token map, but in converting a sequence of digits into a UTF-8 string from the given code point.
    71 The replacement `esc_attr()` is able to more reliably parse attribute values than the existing code and it does so more efficiently, avoiding numerous memory allocations and lookups.
     116== Other potential uses
     118 - Converting smilies like `:)`
     119 - Converting emoji sequences like `:happy:`
     120 - Determining if a given verb/action is allowed in an API call.