Opened 9 years ago
Closed 7 years ago
#35293 closed defect (bug) (fixed)
Emoji Regex in wp_encode_emoji() is wildly inaccurate
Reported by: | pento | Owned by: | pento |
---|---|---|---|
Milestone: | 4.9 | Priority: | normal |
Severity: | normal | Version: | 4.2 |
Component: | Emoji | Keywords: | |
Focuses: | performance | Cc: |
Description
It's just plain wrong, it misses masses of edge cases.
We should figure out how to keep the regex from twemoji.js
in sync, instead.
Attachments (11)
Change History (60)
#3
@
9 years ago
Since there isn't an initial patch for this, punting from 4.5.
@pento, feel free to re-milestone if you'd like to tackle this.
#4
@
9 years ago
- Milestone changed from 4.5 to Future Release
- Owner set to pento
- Status changed from new to assigned
#5
@
8 years ago
@pento Any updates on this? Looks like wp_staticize_emoji()
doesn't support Unicode 8.0 and 9.0 andwp_encode_emoji()
needs an update for 9.0. Attached a simple test in 35293.patch.
#6
follow-up:
↓ 7
@
8 years ago
I don't have any good options at the moment.
The best I have is to add a grunt build
rule to update the regex based on the current regex in twemoji.js
. That's pretty fragile, though - changes to twemoji.js
's formatting would break it. In theory, it would only break when updating twemoji.js
, so maybe it wouldn't be too bad?
This would also be a good time for precommit rules - if the twemoji.js
regex is updated, but the corresponding PHP one isn't, die horribly.
#7
in reply to:
↑ 6
@
8 years ago
Replying to pento:
The best I have is to add a
grunt build
rule to update the regex based on the current regex intwemoji.js
. That's pretty fragile, though - changes totwemoji.js
's formatting would break it. In theory, it would only break when updatingtwemoji.js
, so maybe it wouldn't be too bad?
I don't think that this is a big issue, the current formatting exists since version 1.0.0. We could also write our own task to generate the regex: https://github.com/twitter/twemoji/blob/v2.1.0/twemoji-generator.js#L214-L263
In 35293.2.patch I added support for diversity, doesn't work for new Emoji though.
#8
@
8 years ago
Related, we already have a test that ensures our PHP and JS regexes for shortcode attributes match: https://core.trac.wordpress.org/browser/tags/4.5/tests/phpunit/tests/shortcode.php?rev=37298&marks=637-648#L637
#9
@
8 years ago
- Keywords emoji removed
35293.2.diff is the framework for generating the PHP regex from the twemoji.js
regex.
Proceeding from here is... tricky. The Twemoji regex uses UTF-16 code points, which PHP didn't support until PCRE 8.3.2 (PHP 5.4.14). There's no way to nicely convert the code point ranges to a PHP-compatible regex.
The main problem with using the method from twemoji-generator.js
is that it requires a local copy of the Twemoji images, to check which images Twemoji supports. It also takes us a further step away from the actual regex we need to build, creating potential inconsistencies.
I would not be adverse to providing an accurate regex for PHP versions that support it, and a more approximate fallback for those that don't.
#11
@
7 years ago
- Keywords has-patch needs-testing needs-unit-tests added
- Milestone changed from Future Release to 4.8.1
35293.3.diff Works!
A few things to note:
- The fallback regex that
wp_emoji_regex()
returns is wrong, since it now matches entire emoji, instead of individual bytes making up an emoji. - The regex that the Gruntfile task generates can probably be optimised a bit - it really only needs to unravel surrogate pair code points (\ud800-\udfff), rather than every code point.
- Needs unit tests. So. Many. Unit Tests.
#12
@
7 years ago
- Keywords needs-unit-tests removed
In 35293.4.diff:
- Expand the fallback regex in
wp_emoji_regex()
. - Add unit tests.
- Improve inline docs.
I tested optimising the regex that the Gruntfile task generates, but ran into a couple of problems - it didn't really save much space, and the code points still need to be unravelled for the entities
regex, as that can't use character ranges for matching.
#13
@
7 years ago
35293-performance.php tests the performance of the old and new approaches.
These functions aren't used in directly user-facing situations, the closest we have is RSS feeds. wp_staticize_emoji()
is run on each post in an RSS feed, to convert emoji characters to images. So, to simulate an RSS feed, the old and new versions of wp_staticize_emoji()
were run 10 times - once for each post in the feed. The time differences below are for 10 runs of each function.
(Percentages refer to the percentage of characters in the string that are emoji.)
Short Posts (100 characters)
0%: wp_staticize_emoji() is 2.5ms faster than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 0.1ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 0ms faster than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 0ms faster than wp_staticize_emoji2().
Medium Posts (1000 characters)
0%: wp_staticize_emoji() is 0ms faster than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 0.3ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 0.5ms slower than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 0.4ms slower than wp_staticize_emoji2().
Long Posts (10,000 characters)
0%: wp_staticize_emoji() is 0.1ms slower than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 42.1ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 55.5ms slower than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 132ms slower than wp_staticize_emoji2().
Epic Posts (100,000 characters)
0%: wp_staticize_emoji() is 0.9ms slower than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 3830.6ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 4555.5ms slower than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 11294.4ms slower than wp_staticize_emoji2().
I ran the 50% numbers for fun, sadly, I don't expect many people to be writing huge posts that are half emoji. :-)
The interesting numbers are the 0% and 1% cases, which I suspect would be the most common. For short and medium posts, the difference is negligible. For longer posts, however, the new approach is exponentially faster.
#15
@
7 years ago
- Keywords fixed-major added; needs-testing removed
- Resolution fixed deleted
- Status changed from closed to reopened
Reopening for merging to 4.8 branch.
#16
follow-up:
↓ 18
@
7 years ago
Micro-optimization: _wp_staticize_emoji()
now calls the two filters for each emoji, previously it was just once for the whole $text
. Should it be cached in a static variable?
#18
in reply to:
↑ 16
@
7 years ago
Replying to ocean90:
Micro-optimization:
_wp_staticize_emoji()
now calls the two filters for each emoji, previously it was just once for the whole$text
. Should it be cached in a static variable?
People who use lots of emoji deserve the fastest possible experience. No optimisation is too micro for them.
#20
@
7 years ago
Going to leave this ticket open for a bit, before I backport to the 4.8 branch - just in case any bugs come up.
This ticket was mentioned in Slack in #core by westonruter. View the logs.
7 years ago
#23
@
7 years ago
- Milestone changed from 4.8.1 to 4.9
- Resolution fixed deleted
- Status changed from closed to reopened
Moving out of 4.8.1 milestone due to impending [41069] revert on 4.8 branch from @westi.
#25
@
7 years ago
There is a typo in wp_staticize_emoji2
in the performance testing - props @jmdodd - with it fixed I see these numbers:
`
Short Posts
===========
0%: wp_staticize_emoji() is 0.4ms slower than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 0.3ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 1.3ms faster than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 0.7ms faster than wp_staticize_emoji2().
Medium Posts
============
0%: wp_staticize_emoji() is 1.2ms faster than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 5.8ms faster than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 6.4ms faster than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 9.2ms faster than wp_staticize_emoji2().
Long Posts
==========
0%: wp_staticize_emoji() is 10.4ms faster than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 34.3ms faster than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 47.5ms faster than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 20ms faster than wp_staticize_emoji2().
Epic Posts
==========
0%: wp_staticize_emoji() is 119.4ms faster than wp_staticize_emoji2().
1%: wp_staticize_emoji() is 1992.9ms slower than wp_staticize_emoji2().
10%: wp_staticize_emoji() is 2290ms slower than wp_staticize_emoji2().
50%: wp_staticize_emoji() is 6918.2ms slower than wp_staticize_emoji2().
`
This is with PHP 5.6.30 (cli) (built: Feb 7 2017 16:06:52)
but anyway this isn't ready yet.
#27
@
7 years ago
- Keywords has-patch removed
Alright! Thank you to everyone who handled this, I'm going to be doing some performance testing.
The baseline test (comparing previous behaviour, and the current state of the trunk) is here: https://travis-ci.org/pento/test-41501/builds/260016757
Note: "New" refers to whichever variation of the new code is currently being tested. "Old" refers to the old code.
There are a couple of interesting things to note:
- Performance for all tests on PHP 5.4-5.6 is fairly similar. New is always much slower, except for a handful of edge cases.
- There's a big jump in performance on PHP 7.0, then small improvements in both PHP 7.1 and PHP nightly. New is about the same speed as Old, or faster as the post length or emoji percentage increases. An interesting exception is on the zh_TW posts, with 0% emoji - New is significantly faster.
So, I'm going to be exploring a few different options for improving performance on old PHP, while not killing performance on new PHP.
TEST 1
Short circuit the New staticize function, when there are no emoji. Adding a fast-but-possibly-matches-non-emoji test may allow 0% en_US tests to run faster, with only a minor penalty on other languages, or posts containing emoji.
Add the following code at the start of wp_staticize_emoji2()
:
if ( ( ( function_exists( 'mb_check_encoding' ) && mb_check_encoding( $text, 'ASCII' ) ) || ! preg_match( '/[^\x00-\x7F]/', $text ) ) && false === strpos( $text, '&#x' ) ) {
// The text doesn't contain anything that might be emoji, so we can return early.
return $text;
}
Data: https://travis-ci.org/pento/test-41501/builds/260021583
Analysis:
- Negligible impact on all tests in PHP 7.0+
- Negligible impact on PHP 5.4-5.6, non en_US languages.
- Negligible impact on PHP 5.4-5.6, en_US, 1% and 10% emoji.
- Significant performance improvements on PHP 5.4-5.6, en_US, 0% emoji. On Long posts, processing time decreased from 360ms to 0.2 ms. Super Long decreased from 3700ms to 0.9ms.
Conclusion: Test 1 changes should be included.
#28
@
7 years ago
TEST 2
Remove the u
modifier from the entities regex. It was mistakenly included in the original patch, but isn't required.
Data: https://travis-ci.org/pento/test-41501/builds/260036411
Analysis:
- All versions of PHP saw similar results.
- Slight penalty on en_US posts. Some variation, but generally no more than 10%.
- Range of ~10% penalty on short post to ~10% improvement on de_DE. Improves with post length.
- Range of negligible change to 10% improvement on zh_TW. Improves with post length.
Conclusion: Test 2 should be included. The en_US penalties are lower for the most common cases (0% and 1% emoji), and and the additional few ms on shorter posts aren't an issue.
#29
@
7 years ago
TEST 3
Convert the regex to match against UTF-32 byte patterns. My theory is that fixed byte length characters will be faster to match, as they're also faster to process in the mb_*()
functions.
Data: https://travis-ci.org/pento/test-41501/builds/260073851
Analysis:
- It's slower across all PHP versions, string lengths, languages, and emoji usage.
Conclusion: Test 3 should not be included.
#30
@
7 years ago
TEST 4
Regexen are clearly slow, and there are 2661 emoji in the Twemoji library. Let's try putting it in an array, and str_replace()
with each element in said array.
Data: https://travis-ci.org/pento/test-41501/jobs/260153544
Analysis:
- For PHP 5.4-5.6, this faster than the baseline for every test except short posts, where time is up from <10ms, to ~30ms.
- For Long and Super Long posts, the difference is dramatic - down from 500ms to 40ms, and 4600ms to 100ms, respectively.
- For PHP 7.0+, the difference is similar, but less dramatic - the baseline was already usably fast.
- Compared to Old, there's still some work to do - there are several cases where Old processes in <5 ms, where New takes 30+ms. For longer posts, however, New is faster.
Conclusion: Test 4 is worth attempting to optimise further.
#31
@
7 years ago
Test 4.1
Based on Test 4, the staticize regex now runs as an array, too, along with some minor optimisations.
Data: https://travis-ci.org/pento/test-41501/builds/260169923
Analysis:
- Moderate improvements across the board for 0% emoji, as the staticize emoji loop doesn't run if there are no html entities to check.
- Longer posts that do contain emoji are slightly slower than in test 4.
Conclusion: There are still some optimisations to be made, but we're in the ballpark, at least.
#32
@
7 years ago
Test 4.2
While testing, I realised that the staticize code splits by HTML tags, but the test data had no HTML. I've added that in, which caused 4.1 to become a bit slower.
Fortunately, I was also able to optimise the loss, and a bit more, away by making staticize check if there's an instance of the current emoji to replace, before building the HTML string to replace it with.
Data: https://travis-ci.org/pento/test-41501/builds/260190895
Analysis:
- This test was generally faster than 4 and 4.1, despite the slower test data.
- Applying a similar optimisation to
wp_encode_emoji2()
seems to have no benefit.
Conclusion: Still on the right track.
#33
@
7 years ago
Test 5
Combine Test 4.2 with the early bail from Test 1.
Data: https://travis-ci.org/pento/test-41501/builds/260197654
Analysis:
- As expected, test 5 results are approximately the same as Test 4.2, except for the en_US, 0% emoji tests. These tests are much faster.
Conclusion: This takes care of the primary uses cases - shorter posts, and longer posts with no emoji. There's still some work to be done on longer posts that do contain emoji, but it's looking solid.
#34
@
7 years ago
Test 6
The slowest part of the code seems to be encoding, so reducing the number of characters to replace would be prudent. This test breaks the emoji down into their individual characters, and encodes each one separately. This reduces the number of items to encode from 2661 to 1186, but probably increases the number of str_replace()
calls each loop has to do.
Data: https://travis-ci.org/pento/test-41501/builds/260416806
Analysis:
- Performance changes seem to be fairly consistent across all PHP versions.
- Moderate improvements for short-medium posts with 0% or 1% emoji.
- Severe performance penalties for very long posts with lots of emoji. For example, PHP 7, en_US, 10% emoji, super long post increased from 2700ms to 3400ms (though is still faster than the old code, which took 3700ms).
Conclusion: Likely good to include, the performance penalties are for fairly severe edge cases, which haven't bitten us prior to the work on this ticket.
#35
@
7 years ago
TEST 6.1
After Test 6, try replacing the str_replace()
with a simple preg_replace()
. PHP's string functions are relatively slow for UTF-8, PCRE may be able to do things faster.
Data: https://travis-ci.org/pento/test-41501/builds/260434164
Analysis:
- Moderate to significant performance improvements across the board.
- New is now close or better than Old for the primary uses cases: between 10ms slower and 20ms faster for 0% emoji, Short to Long posts, all languages, all PHP versions.
- For secondary use cases (Short - Medium posts, all languages, 1% emoji), New is still slower, but not unworkably so. Old is usually 0.5-2ms, New is between 1-20ms, depending on PHP version.
- For the edge cases that were causing severe performance issues on WP.com (super long posts, 0% emoji), New is now significantly faster than Old across all languages for PHP 5.4-5.6, and en_US on PHP 7.0+. It's still slower for non-en_US PHP 7.0+, but not server-crashingly slow.
Conclusion: Test 6.1 fixes up a lot of the performance penalties from Test 6, particularly the ones that cause WP.com issues.
#36
@
7 years ago
TEST 7
Over 90% of sites use a version of MySQL that supports utf8mb4, which means that they'll be storing emoji as characters in their database, rather than HTML encoded entities. As such, it seems wasteful to HTML encoded emoji before replacing them, when we can probably skip the entire encoding process and replace the actual characters with their corresponding <img>
tag.
Data: https://travis-ci.org/pento/test-41501/builds/260458177
Analysis:
- At best, this is slightly faster for a few cases. In all other cases, it's much slower. This is primarily caused by the optimisations added earlier to short circuit expensive parts of the code no longer working.
Conclusion: This path is worth exploring a bit further, to see if the short circuits can be re-added, but may end up being a dead end.
#37
@
7 years ago
Test 7.1
Inspired by 6.1, use preg_replace()
for replacing the emoji characters, instead of str_replace()
.
Data: https://travis-ci.org/pento/test-41501/builds/260462732
Analysis:
- The same speed or slower than Test 7 across almost all cases.
Conclusion: preg_replace()
was not useful here, not worth using.
#38
@
7 years ago
Test 7.2
Add a very loose regex match, for ranges of characters that contain emoji. If something matches in these ranges, there might be emoji to process. If nothing matches, there are definitely no emoji.
Data: https://travis-ci.org/pento/test-41501/builds/260469301
Analysis:
- Some small performance improvements, but nothing significant. Mostly still slower than Test 6.1.
Conclusion: There are some other options to explore here.
#39
@
7 years ago
Test 7.3
Looping over the entire emoji array for every block in staticize is pretty wasteful. Instead, looping over it at the start to create a smaller array of emoji that we'll probably be replacing means we reduce the size of the inner loop significantly.
Data: https://travis-ci.org/pento/test-41501/builds/260498759
Analysis:
- We're now about the same or significantly faster on all tests, when compared to 6.1.
- When compared to the old behaviour, 7.3 is about the same, or a bit slower for primary use cases.
Conclusion: Despite now being the fastest, the optimisation applied here could be applied to 6.1, too. That'll need more testing to compare.
This ticket was mentioned in Slack in #core by ocean90. View the logs.
7 years ago
#42
@
7 years ago
- Keywords has-patch added
I'm back! 35293.5.diff is looking okay, it's not quite as fast as the original code, but it's in the same ballpark, and within a few ms for the most common cases.
Now, it needs testing. :-)
#43
@
7 years ago
- Focuses performance added
- Keywords needs-testing added
plugin-35293.php Is a plugin-ised version of this change, so it can be more easily tested.
35293.6.diff is just a fresh patch to apply cleanly against trunk.
#44
follow-up:
↓ 45
@
7 years ago
When testing for 4.8.1, @westonruter discovered a bug, from #core
I create a post with:
This is an emoji: 🤗✨🤷♂️❤️✅
But then when the post refreshes after saving it comes back with:
This is an emoji: ✨♂️❤️✅
I was able to recreate the above forcing wpdb:has_cap( 'utf8mb' )
to return false
, ie forcing the database to use utf8.
After applying this patch, the returned value was as in the screen shot in 35293.png.
#45
in reply to:
↑ 44
@
7 years ago
Replying to peterwilsoncc:
I was able to recreate the above forcing
wpdb:has_cap( 'utf8mb4' )
to returnfalse
, ie forcing the database to use utf8.
I can reproduce this behaviour like this, but I can't reproduce it if it I also convert the wp_posts.post_content
character set to utf8
. wp_insert_post
HTML encodes the character according to the post_content
field character set, not the DB connection character set. Things are going to get weird if your connection is utf8
but your database is utf8mb4
, just like it does with any mis-matched connection/storage character sets.
#47
@
7 years ago
Note: Unlike the earlier attempt, I'm not planning on backporting this to the 4.8 branch.
wp_encode_emoji()
should be all possible emoji (when they're encoded as an HTML entity for storing in autf8
database),wp_staticize_emoji()
should match the twemoji regex.