#60698 closed feature request (fixed)
Token Map: Introduce an efficient lookup and translation class for string mappings.
| Reported by: | dmsnell | Owned by: | dmsnell |
|---|---|---|---|
| Milestone: | 6.6 | Priority: | normal |
| Severity: | normal | Version: | 6.6 |
| Component: | General | Keywords: | has-patch needs-unit-tests 2nd-opinion needs-dev-note |
| Focuses: | | Cc: | |
Description
Motivated by the need to properly transform HTML named character references (like `&`), I found the need for a new semantic class which can efficiently perform search and replacement over a set of static tokens. Existing patterns in the codebase are not sufficient for the HTML need, and I suspect there are other use-cases where this class would help.
It may be helpful to review the dev note draft before reviewing this work:
https://make.wordpress.org/core/?p=113042&preview=1&_ppp=fbe827d6b7
**How does WordPress typically perform this kind of substitution?**
**An associative array with a generated regular expression callback.**

It's common to see a pattern like this, where the lookup is stored as an array and a regular expression is built from it.
```php
<?php
$smilies = array(
	':(' => "\xf0\x9f\x99\x81",
	':)' => "\xf0\x9f\x99\x82",
	':?' => "\xf0\x9f\x98\x95",
	':D' => "\xf0\x9f\x98\x80",
);

$text = preg_replace_callback(
	// The keys contain regex metacharacters, so they must be quoted.
	'/' . implode( '|', array_map( 'preg_quote', array_keys( $smilies ) ) ) . '/',
	function ( $smily ) use ( $smilies ) {
		return $smilies[ $smily[0] ];
	},
	$text
);
```
This method constructs a large regular expression pattern on every request, and it's not always clear what performance implications the long list of alternation groups might have. These can lead to catastrophic runtimes, and the surrounding code that sets up the regular expression pattern and replacements tends to appear somewhat exotic, masking the original intent of the code.
**Calls to `str_replace()` with two arrays.**
```php
<?php
$chars = array(
	'ª' => 'a',
	'º' => 'o',
	'À' => 'A',
	'Á' => 'A',
	'Â' => 'A',
	'Ã' => 'A',
	'Ä' => 'A',
);

$text = str_replace( array_keys( $chars ), array_values( $chars ), $text );
```
This method is clear in the code, but the runtime can be very inefficient because it must iterate over all of the array keys until a string match is found. As with the regular expression, performance can degrade quickly when there are hundreds or thousands of replacements.
**Scanning with array lookup.**
Sometimes code is crawling through a string and testing for set membership using an array.
```php
<?php
$text = preg_replace_callback(
	'~&([^;]+;)~',
	function ( $name ) use ( $html_entities ) {
		$allowed = in_array( $name[1], $html_entities, true );
		// Keep recognized character references; escape the ampersand otherwise.
		return $allowed ? $name[0] : "&amp;{$name[1]}";
	},
	$text
);
```
This can work if the tokens follow a searchable pattern, but it's hard to express complicated token rules in regular expression syntax, and this method also suffers from slow set-membership checking in the lookup array. For small arrays it's fine, but not for large ones, especially if the document contains a high number of the input tokens.
All of these existing methods work best when it's clear that a given string contains one of the tokens, but what if that's not certain? The performance characteristics of the array-based approaches are worst when the document is missing the input tokens. These approaches are memory-heavy and computationally inefficient.
**A different approach.**
I'm proposing a new class whose semantic is a relatively static mapping of terms or tokens to replacement strings. This class provides a number of purpose-specific methods supporting token-replacement operations, including:
- Fast set-membership testing, to determine if a given string is one of the tokens.
- Fast mapping from a given token to its replacement value.
- Determining if one of the tokens exists at a specific offset, and if so, what it is and what its mapping is.
```php
<?php
$named_character_references = WP_Token_Map::from_array( [
	'notin'  => '∉',
	'notin;' => '∉',
	'amp;'   => '&',
	…
] );

$matched_string_snippet = '&notin';
if ( $named_character_references->contains( substr( $matched_string_snippet, 1 ) ) ) {
	// "notin" is a valid token.
}

// Replace all named character references with their replacements.
$output = '';
$at     = 0;
while ( true ) {
	$was_at = $at;
	$at     = strpos( $text, '&', $at );
	if ( false === $at ) {
		$output .= substr( $text, $was_at );
		break;
	}

	$name_length = 0;
	$replacement = $named_character_references->read_token( $text, $at, $name_length );
	if ( false !== $replacement ) {
		$output .= substr( $text, $was_at, $at - $was_at );
		$output .= $replacement;
		$at     += $name_length;
		continue;
	}

	// No named character reference was found; continue searching.
	++$at;
}
```
In this example the leading `&` has been pruned from the lookup table to save space and time, but it could just as well be included in the table.
Because this class introduces specific semantic meaning, that is, looking up and mapping static string tokens to static string replacements, we can employ a number of optimizations to reduce the overall memory footprint and also to optimize the runtime. Such optimizations have been incorporated into the linked PR.
In #6387 I have built a spec-compliant HTML text decoder which utilizes this token map. The performance of the new decoder is approximately 20% slower than calling `html_entity_decode()` directly, except that it properly decodes what PHP can't. In fact, the performance bottleneck in that work is not even in the token map, but in converting a sequence of digits into a UTF-8 string from the given code point.
My proposal is adding a new class `WP_Token_Map` providing at least two methods for normal use:

- `contains( $token )` returns whether the passed string is in the set.
- `read_token( $text, $offset = 0, &$skip_bytes )` indicates whether the character sequence starting at the given offset in the passed string forms a token in the set, and if so, returns the replacement for that token. It also sets `$skip_bytes` to the length of the token so that calling code can skip past the matched token.
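To make these semantics concrete, here is a minimal usage sketch based on the descriptions above; the return values shown in the comments are assumptions drawn from this proposal rather than from shipped code.

```php
<?php
$map = WP_Token_Map::from_array( array(
	'gt;' => '>',
	'lt;' => '<',
) );

$map->contains( 'gt;' );  // true
$map->contains( 'copy' ); // false

// In 'a &gt; b' the token "gt;" starts at byte offset 3.
$skip_bytes  = 0;
$replacement = $map->read_token( 'a &gt; b', 3, $skip_bytes );
// $replacement === '>' and $skip_bytes === 3, the byte length of 'gt;'.
```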
It also provides utility functions for pre-computing these classes, as they are designed for relatively static cases where the lookup code is intended to be generated dynamically but to stay static over time. For example, HTML5 defines the set of named character references and indicates that the list shall not change or be expanded (see the HTML5 spec). Precomputing can save on the start-up cost of building the optimized lookup tables.
- `static::from_array( array $mappings )` generates a new token map from the given array, whose keys are tokens and whose values are the replacements.
- `to_array()` dumps the set of mappings into an array suitable for passing back into `from_array()`.
- `static::from_precomputed_table( ...$table )` instantiates a token map from a precomputed table, skipping the computation for building the table and sorting the tokens.
- `precomputed_php_source_table()` generates PHP source code which can be loaded with the previous static method, for maintenance of the core static token sets.
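As a sketch of how these utilities might fit together, under the assumption that `precomputed_php_source_table()` emits a complete PHP expression (the file name here is hypothetical):

```php
<?php
// Build the map once from a plain associative array.
$map = WP_Token_Map::from_array( array(
	'notin;' => '∉',
	'amp;'   => '&',
) );

// Round-trip through a plain array for inspection or serialization.
$map2 = WP_Token_Map::from_array( $map->to_array() );

// Generate PHP source for the optimized internal table and commit it,
// so production code skips the table-building and token-sorting cost.
file_put_contents(
	'my-token-table.php',
	"<?php\nreturn " . $map->precomputed_php_source_table() . ";\n"
);

// Later: load the precomputed map, paying only file-inclusion cost.
$map3 = require 'my-token-table.php';
```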
**Other potential uses**
- Converting smilies like `:)`
- Converting emoji sequences like `:happy:`
- Determining if a given verb/action is allowed in an API call (see the sketch below).
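For instance, the verb/action check could look like the following sketch; the names are hypothetical, and a membership-only map can use empty strings as replacement values.

```php
<?php
// Hypothetical allow-list; the values go unused when only testing membership.
$allowed_actions = WP_Token_Map::from_array( array(
	'create' => '',
	'read'   => '',
	'update' => '',
	'delete' => '',
) );

if ( ! $allowed_actions->contains( $requested_action ) ) {
	// Reject the API call.
}
```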
Change History (30)
#1 (5 months ago)
- Description modified (diff)
- Keywords has-patch needs-unit-tests added
- Milestone changed from Awaiting Review to 6.6
- Summary changed from Add optimized set lookup class. to Token Map: Introduce an efficient lookup and translation class for string mappings.
- Version changed from 6.5 to trunk
#4 (5 months ago): This ticket was mentioned in PR #5373 on WordPress/wordpress-develop by @dmsnell.
Trac ticket: Core-60698 (previously Core-60841)
`esc_attr()` does an `in_array()` lookup for an array with many entries. could we replace that with an efficient lookup data structure?
- for a normative test attribute value: from 4.1 µs to 3.9 µs
- for an extreme case (a 12 MB `single-page.html` file): from 90 ms to 84 ms
- the key length can be tuned
- for HTML4+ legacy comparison this approach doesn't work nearly as well as for HTML5 named character references, of which there are over 2300
- adding a small set of common checks up-front didn't show any demonstrable impact, e.g. `if ( $i === 'gt' || $i === 'lt' || $i === 'amp' ) { return "&$i;"; }`
- probably `in_array()` is just as fast for small arrays, but the idea of collapsing tokens into a linear string has merit for cache locality, whereas the arrays _always_ involve indirect memory lookups (doing so here with fixed width slowed down both tests, KEY_SIZE = 8; see the sketch after this list)
- ❓ will PHP end up storing subsequent array values in contiguous memory when declared as an array+string literal? how can we test this?
- ❓ would binary search on a linear string improve lookup? (seems doubtful since we can probably pre-load enough of the string into memory that it doesn't help, unless the linear string is very long)
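To illustrate what collapsing tokens into a linear string with a fixed-width key might look like, here is a sketch of the idea only; it is not the patch's implementation, and the constants are illustrative.

```php
<?php
// Tokens padded to a fixed width and concatenated into a single string,
// so a membership scan reads contiguous memory instead of chasing
// pointers through hashed array buckets.
const KEY_SIZE = 8;

$keys = 'amp;    gt;     lt;     notin;  ';

function linear_contains( string $keys, string $token ): bool {
	if ( strlen( $token ) > KEY_SIZE ) {
		return false;
	}
	$padded = str_pad( $token, KEY_SIZE );
	for ( $i = 0, $end = strlen( $keys ); $i < $end; $i += KEY_SIZE ) {
		if ( substr( $keys, $i, KEY_SIZE ) === $padded ) {
			return true;
		}
	}
	return false;
}

var_dump( linear_contains( $keys, 'gt;' ) );   // bool(true)
var_dump( linear_contains( $keys, 'copy;' ) ); // bool(false)
```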
I like the idea of having a class to delegate this behavior. it would be trivial on static literal lists to precompute and perform more analysis, e.g. for small data sets have a `WP_Small_Token_Set` that only stores a linear list (or is this equivalent to storing a long key list?)
This ticket was mentioned in Slack in #core by dmsnell. View the logs. (5 months ago)
This ticket was mentioned in Slack in #core by dmsnell. View the logs. (5 months ago)
#7 (5 months ago)
- Keywords 2nd-opinion added
Can you point to the spots in core where this would be used? I think adding an API without also solving specific issues in core is a recipe for an API that doesn't benefit WordPress and thus isn't going to be beneficial to its users.
#8 (5 months ago)
Thanks @jorbin - I tried to do that but apparently failed to communicate it well; sorry.
This was motivated by the needs of the HTML API for properly decoding HTML named character references, and the linked PR demonstrates using this class for that purpose.
The best performance characteristics I could come up with using existing mechanisms were over nine times slower than with this optimized lookup. The decoder itself is required because PHP is unable to accurately decode the HTML entities according to the specification, partly due to its legacy of using an XML parser, but also partly due to its interface which doesn't acknowledge the more complicated rules in the HTML spec.
The linked PR contains substantially more warrant for the decoder, as well as benchmarks and performance notes comparing different considered approaches.
#9 (5 months ago)
Thanks for the quick response @dmsnell.
So there are three spots where `html_entity_decode` is swapped, and in return WordPress is now responsible for maintaining an additional library?

Not sure why the diff isn't clear, but it's hard to tell which tests in tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php would have failed before but are now passing. It also looks like there is at least one test, `tests1/line0737`, that wasn't in the SKIP_TESTS array before but is now. Is it possible to make this section clearer so it's possible to know which tests are specifically being added and removed? I also opened #61209 to help with this in the future, so that the tests are run automatically and future test runs can be used for this comparison.
For the specific implementation, my first thought is: why is `WP_HTML_Decoder` a class if nearly everything is going to be a public static function? There also seems to be a mix of code that is intended to be used at runtime and code that is used to generate the entities map. Ideally, code that isn't intended for either core or extenders to use at runtime isn't included at runtime.
Also, `src/wp-includes/html-api/html5-named-character-entities.php` isn't an auto-generated class, but the comment says it is. I assume this is the output of `precomputed_php_source_table()`? I think the comments should be clearer. Why is phpcs disabled on it? Is there any way to avoid adding a new global?
#10 (5 months ago)
> So there are three spots where `html_entity_decode` is swapped, and in return WordPress is now responsible for maintaining an additional library?
This isn't how I see the problem, as we wouldn't be doing any of this work if we had a way to avoid misparsing HTML. It's my hope that we can push fixes into PHP, but for now we don't have a way to accurately decode HTML text, and that is leading to other challenges. This swap of three functions is replacing defective code with fixed code.
We will maintain an additional library, but I do see value in it beyond the HTML API, which is why I am proposing it as a separate addition. Its use may be limited, but I was exploring Emoji replacement on WordPress.com and found some obvious use for the very purpose of the token map: quick lookup for low-overhead matching and replacement.
Also, the HTML5 specification guarantees that the list of named character references will never change. No more will be added. None will be removed.
> Not sure why the diff isn't clear, but it's hard to tell which tests in tests/phpunit/tests/html-api/wpHtmlProcessorHtml5lib.php would have failed before but are now passing.

The diff is unclear because the WPCS linting rules require changing the entire whitespace structure of the array in response to a tiny change. It can help to view the diff ignoring whitespace, either through GitHub or with `git diff -w`.
I'll check on the newly skipped test; I remember adding it, but I think it's unrelated to the change, or possibly it was masked by previously skipped tests now being able to run.
> why is `WP_HTML_Decoder` a class if nearly everything is going to be a public static function?
This is all great stuff to discuss in the Text Decoder ticket #61072 and PR. The short answer is that I was under the impression that WordPress still avoids namespacing in PHP and this helps avoid bloating the global namespace for what amounts to a single main public method and a number of helper utilities.
> There also seems to be a mix of code that is intended to be used at runtime and code that is used to generate the entities map. Ideally, code that isn't intended for either core or extenders to use at runtime isn't included at runtime.
Are you talking about the `WP_Token_Map` methods? I do think they are useful on their own and that folks will find them useful. If it were up to me, none of the PHP classes or modules would be loaded eagerly, but I've been required to do that, which I believe is for future auto-loading efforts.
If we remove the `precomputed_php_source_table()` function, that's not incredibly important to me, but we do miss out on the opportunity to share the value Core gains from the class. It has made testing easier, as well as updating the "cached" lookup tables, because all you need is WordPress.

I'd be open to exploring other approaches you have to offer, especially if they retain the ability for others to get the same benefit that Core does (e.g. a plugin looking for a set list of words to replace, or checking if a stock ticker is known, etc…).
> I think the comments should be clearer. Why is phpcs disabled on it? Is there any way to avoid adding a new global?
Can definitely update the comment. It's auto-generated in the sense that it comes from `precomputed_php_source_table()`. On this point, I did not include the code meant to build the table because I didn't see it belonging in the runtime. In a strange way I was doing what I think you were asking me to do above; I think we're aligned on that notion, though maybe we haven't fully seen eye to eye about how and where to apply the principle.
`phpcs` is disabled on it because it would considerably complicate the code generation to appease the linter, and then to keep the code generation updated as the rules change. Even simple things like choosing single or double quotes have bloated my attempts to match the styling it wants, and that ends up confusing me, whereas the code as-is is at least focused on the task. In addition, the file lights up in some IDEs with styling hints that aren't relevant in this file. It's not meant to be read, only written.
Could we avoid the global? Absolutely. What is your proposal? Mainly I wanted this to be fast and as low-overhead as possible. As a global it's available everywhere and adds a few hundred kilobytes to WordPress' main runtime. We can pass it around as a variable without messing with function calls or additional allocations, and it will be called in tight loops, so I wanted to avoid anything dynamic where we can.

This isn't something where I have a lot of experience. The existing attempts at enumerating the named character references are globals, and I in large part followed their steps because it seemed well-suited to the task.
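For context, a minimal sketch contrasting the two patterns under discussion; the names and table contents are illustrative stand-ins, not the actual generated code.

```php
<?php
// Stand-in for the generated table; the real shape is whatever
// precomputed_php_source_table() emits.
$precomputed = array( /* ... */ );

// The pattern in the patch: expose the map as a global, built once at load.
global $html5_named_character_entities;
$html5_named_character_entities = WP_Token_Map::from_precomputed_table( ...$precomputed );

// A conceivable alternative without a new global: a lazily-built accessor.
// This trades the global for a function call on each access, which is
// exactly the overhead described above as undesirable in tight loops.
function wp_html5_named_character_entities() {
	static $map = null;
	if ( null === $map ) {
		$precomputed = array( /* ... */ );
		$map = WP_Token_Map::from_precomputed_table( ...$precomputed );
	}
	return $map;
}
```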
#11 (5 months ago)
@jorbin I've added a generator script for the HTML5 named character references. Not sure where to put it; I tossed it into the HTML API test directory. Happy to move it wherever is appropriate. It comes with a more complete comment as well, and is fully auto-generated now.

As for the skipped test, I'm not sure why I added it, unless it was due to an earlier bug that I didn't realize was there. I've removed it and all the tests still pass, so now the only way the text decoder patch impacts the test suite is by removing tests from the skip lists.
#12 (5 months ago)
I'm prepping a note on Make for introducing this class.
Just because the text is there doesn't mean it will stay or that it represents the final form.
Given that this is a kind of low-level data structure though I wanted to inform people what it is and how to use it and when.
https://make.wordpress.org/core/?p=113042&preview=1&_ppp=452181011b
#14 (5 months ago)

Some CI jobs report that some static test helpers are private and can't be accessed.
#15 (5 months ago)

> I've been reviewing the implementation and I have a concern. As far as I can tell, case-insensitive behavior is unreliable on some strings, specifically multibyte strings with upper/lowercase variants like ü. Have you considered this?
Yes, absolutely.

I did add case insensitivity during the testing cycle kind of as an afterthought, but I like having it. It's intended to aid the purpose for which the class was made, which is focused on ASCII casing. This may sound strange, but almost every basic casing function is primarily, if not entirely, ASCII-oriented, because proper case-folding is a much more complicated topic, and no, `mb_strtoupper()` is not enough, at all.
So I think the expectation that this works for `ü` is more subjective. My gut feeling is that we can clarify this in the documentation and let it be. There's no way this is going to properly detect more complicated case variants, but then again, nothing in this domain does. Otherwise we can remove it, but I like how the option gives us the ability to implement functions like `WP_HTML_Processor::is_special()` without allocating a new string and upper-casing it.
#16 (5 months ago): @jonsurrell commented on PR #5373:
It seems fine to me if we don't support case-insensitive non-ASCII token matching, as long as we're clear about it.
#17 (5 months ago)

> It seems fine to me if we don't support case-insensitive non-ASCII token matching, as long as we're clear about it.
Ha! I went to add these and found I had already noted this in the function docblock.
> `'case-insensitive'` to ignore ASCII case or default of `'case-sensitive'`
Text issues are apparently always on my mind. Let me know if you still think this is unclear. I'm worried that if we try to add more documentation when it's already right there at the cursor when calling the function, it won't be any more helpful, just more noise.
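For illustration, a hedged sketch of the call shape that docblock line describes; the parameter position and values are assumptions drawn from this discussion.

```php
<?php
$map = WP_Token_Map::from_array( array(
	'img' => '',
	'ü'   => '',
) );

$map->contains( 'IMG', 'case-insensitive' ); // true: ASCII case is ignored.
$map->contains( 'IMG' );                     // false: default 'case-sensitive'.
$map->contains( 'Ü', 'case-insensitive' );   // false: multibyte case variants are not folded.
```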
#18 (5 months ago): @jonsurrell commented on PR #5373:

> I already noted this in the function docblock … Let me know if you still think this is unclear.
I think this is fine. I was searching for words like "mb" or "multibyte", but it seems like ASCII is correct 👍
I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?
> 8.2.0: Case conversion no longer depends on the locale set with `setlocale()`. Only ASCII characters will be converted.
That seems like exactly what we want, but I've looked around the PHP source, the changelog, etc., and I haven't managed to find exactly what changed or examples of the difference.
#19 (5 months ago)

> I did notice a changelog entry on the PHP documentation page that makes me wonder if we might need to support PHP versions that have varying behavior on non-ASCII treatment. I wasn't able to trigger any differing behavior myself, but do you know what this is about?
I've struggled to test this with the HTML API. In effect, some system locales could cause, I believe, a few characters to be case-folded differently, but this is such a broken system that I'm not of the opinion that it's a big risk here. I'm willing to let this sit as a hypothetical bug until we have a clearer understanding of it.
#20 (5 months ago)
@jorbin do you have any further thoughts on this?
I'd like to merge this by the end of the week unless anyone has further objections. I believe I've addressed all the feedback so far, as best I can.
#21 (5 months ago)

> One suggestion that can be done in a follow-up: I think automating pulling https://html.spec.whatwg.org/entities.json and checking whether the autogenerated file has changed would be beneficial. It can likely be done on a weekly schedule and only in trunk.
this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).
my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.
do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.
I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?
#22 (5 months ago)

> I think we can add a new token-map group. It also might be worth putting it in the html-api group, since this is where the bigger effort lies.
@aaronjorbin following the html5lib test suite, I added the `wpTokenMap` test module to the `@group html-api-token-map` group. happy to change if this isn't ideal; I was looking for more examples and didn't see any. it can also be set as a `@package` and `@subpackage`, but I was looking for `@subgroup` and didn't find any.
#23 (5 months ago)

> @aaronjorbin following the html5lib test suite, I added the `wpTokenMap` test module to the `@group html-api-token-map` group. happy to change if this isn't ideal; I was looking for more examples and didn't see any. it can also be set as a `@package` and `@subpackage`, but I was looking for `@subgroup` and didn't find any.
`@subgroup` isn't an annotation that's part of PHPUnit. I think this group makes sense, thanks!
#24 (5 months ago)

> > One suggestion that can be done in a follow-up: I think automating pulling https://html.spec.whatwg.org/entities.json and checking whether the autogenerated file has changed would be beneficial. It can likely be done on a weekly schedule and only in trunk.
>
> this is a good suggestion, @aaronjorbin - and thanks for the review! (I'll be addressing the other feedback inline where you left it).
>
> my one main reluctance to automate the download is that the specification requires that the list never change, and automating it only adds a point of failure to the process, well, a point of failure plus some latency.
>
> do you think that it's worth adding this additional complexity in order to change what is defined to not change? I believe that a wide cascade of failures would occur around the internet at large if we found new character references in HTML.
>
> I'm happy to add this if there are strong feelings about it, though if that's the case, maybe we could consider it a follow-up task so as not to delay this functionality? a separate task where we can discuss the merits of the automation?
Upon reading https://github.com/whatwg/html/blob/main/FAQ.md#html-should-add-more-named-character-references I withdraw this suggestion. I do think it might make sense to link that and mention in the readme that the entities JSON should never change, for others who come across this with similar thinking to mine.
#25 (5 months ago)

> I do think it might make sense to link that and mention in the readme that the entities JSON should never change, for others who come across this with similar thinking to mine.
Thanks again @aaronjorbin!
It's already mentioned at the top of the README, in the character reference class (as well as in the autogenerator script where it comes from), the PR description, and the Trac ticket description.
Is that good enough or was there somewhere else you wanted it? Maybe the way I wrote it wasn't as direct or clear as it should be?
#26 (5 months ago)
- Owner set to dmsnell
- Resolution set to fixed
- Status changed from new to closed
In 58188:
#30 (4 months ago)
Dev note prepped: Introducing the Token Map (public preview)
#60841 was marked as a duplicate.