#57301 closed defect (bug) (fixed)
Emoji feature detection is incorrect
| Reported by: | sergiomdgomes | Owned by: | dmsnell |
|---|---|---|---|
| Milestone: | 6.2 | Priority: | normal |
| Severity: | normal | Version: | 5.3 |
| Component: | Emoji | Keywords: | has-patch commit |
| Focuses: | javascript, performance | Cc: | |
Description
WP adds an inline script that performs feature detection for emoji support in the browser; the source is available at wp-includes/js/wp-emoji-loader.js (https://github.com/WordPress/WordPress/blob/master/wp-includes/js/wp-emoji-loader.js).
This feature detection is performed by rendering two different sequences to a canvas and comparing the results. The sequences are stored as arrays of hexadecimal numbers.
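The render-and-compare approach can be sketched roughly like this (a simplified illustration, not the actual wp-emoji-loader.js code; the real script also manages the canvas setup and multiple test sets):

```javascript
// Simplified sketch: draw each candidate string to a canvas and
// compare the resulting data URLs. If the browser renders the two
// strings identically, it does not distinguish the tested sequence.
function rendersIdentically( context, canvas, string1, string2 ) {
	context.clearRect( 0, 0, canvas.width, canvas.height );
	context.fillText( string1, 0, 0 );
	var first = canvas.toDataURL();

	context.clearRect( 0, 0, canvas.width, canvas.height );
	context.fillText( string2, 0, 0 );
	return first === canvas.toDataURL();
}
```

When a browser supports a combined sequence (say, the UN flag vs. the same flag letters interrupted by a zero-width space), the two renderings differ; identical renderings indicate missing support.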
Unfortunately, for a few revisions of this file now (more specifically since #45769), mismatches have been introduced between the encoded format of the sequences and the way they're decoded: the newly added sequences are all sequences of code points, but the decoding function (emojiSetsRenderIdentically) expects sequences of char codes.
This causes every browser to fail the feature detection, even when it correctly supports the tested emoji, and the support polyfill to be loaded unnecessarily.
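The mismatch is easy to reproduce in a console. Below is a minimal illustration using the transgender flag sequence, one of the affected tests: fromCharCode truncates each argument to 16 bits, so a supplementary-plane code point like 0x1F3F3 silently becomes the unrelated code unit 0xF3F3, and the decoded string never matches what fromCodePoint would build.

```javascript
// Code points for the transgender flag sequence (🏳️‍⚧️).
var codePoints = [ 0x1F3F3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ];

// What the decoder actually builds: fromCharCode applies ToUint16,
// so 0x1F3F3 is truncated to 0xF3F3.
var viaCharCodes = String.fromCharCode.apply( null, codePoints );

// What was intended.
var viaCodePoints = String.fromCodePoint.apply( null, codePoints );

console.log( viaCharCodes === viaCodePoints ); // false
```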
Attachments (13)
Change History (47)
#1
@
2 years ago
The above patch should fix the issue, and has been tested on several versions of Chrome, Safari, Firefox, and IE across several operating systems (where applicable).
#2
@
2 years ago
Note: the linked ticket above is incorrect, as I used the patchset number instead of the ticket number. The original ticket that introduced the problem is #47852.
#3
follow-up:
↓ 5
@
2 years ago
Love me some text encoding fixes 🤟
If the browser doesn't support String.fromCodePoint, assume that it doesn't correctly support the latest emoji either.
Is this a valid assumption? Or maybe I should ask, what are we referring to when we say "latest emoji"? For instance, are we talking about the latest revision of the Unicode standard with whatever emoji are newly added? There have been nine Unicode updates (each adding new emoji) since fromCodePoint was introduced (the count differs per browser), so I guess I'm confused about what this comment is asserting.
I tried installing versions of Chrome and Firefox from before they supported fromCodePoint, but both attempts failed. That being said, they are very, very old (eight and ten years, I think), so I would guess this isn't going to be an issue for most people.
The only reason I ask is that I suspect we have more support for emoji than we do for fromCodePoint, and if, for instance, we produced "\ud83c\udd70" as [0xd83c, 0xdd70].map( c => String.fromCharCode(c) ).join('') instead of [0x1f170].map( c => String.fromCodePoint(c) ).join(''), then we'd get the same result even though we might assert the support isn't there (because fromCodePoint isn't).
Those two strings are identical, so if there's any confusion here, it may not matter. Still, I think it's possible for a browser and its JavaScript engine to be old enough that fromCodePoint isn't available while the emoji still render as expected.
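For what it's worth, the two decoders do agree whenever the input is already split into UTF-16 code units, since surrogate code unit values are themselves accepted by fromCodePoint; a quick check:

```javascript
// UTF-16 code units for U+1F170 (🅰): a surrogate pair.
var units = [ 0xD83C, 0xDD70 ];

// Both functions produce the same two-code-unit string here, because
// fromCodePoint accepts lone surrogate values as code points.
var same = String.fromCharCode.apply( null, units ) ===
	String.fromCodePoint.apply( null, units );

console.log( same ); // true
```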
The sequences are stored as arrays of hexadecimal numbers
You had me at arrays of numbers but lost me with "hexadecimal" numbers. I'm guessing you mean they are strings like "0x1F170"? Or are they just numbers which happen to be written as hexadecimal literals (source code is 0x1f170 instead of the identical number, 127344)? And is "hexadecimal" relevant here, or just a tangent?
Again, maybe not important; I was just thrown off by the wording, in case it matters and I'm missing the bit that makes it important.
#4
follow-up:
↓ 6
@
2 years ago
Okay, well, oops: looking at the linked ticket, it looks like the input list from Twemoji is stored as HTML hexadecimal character references, which might explain the confusion. The input is &#x1F170;, representing code point U+1F170, as hexadecimal character references have always referred to code points and not code units.
Strangely, it looks like the previous code in 47852.diff also used this representation and also referenced characters above the Basic Multilingual Plane (that is, above the range of code points which can be represented in UTF-16 without surrogate pairs).
And if that's the case, then I'm confused about how fromCharCode worked at all, lol, so I know I'm misunderstanding something. Is there something big I'm missing in what the problem was or how it appeared in the linked patch?
Is the effect that we've been saying for years that a given browser doesn't support Emoji when in fact it does?
#5
in reply to:
↑ 3
@
2 years ago
Replying to dmsnell:
Is this a valid assumption? Or maybe I should ask, what are we referring to when we say "latest emoji"?
In this case, we're referring to Emoji 14, which is what the current version of Twemoji in use in wp-emoji (Twemoji 14.0.2) polyfills, and what the current tests check for.
I think the statement is 100% correct, which is to say that any browser that does not support fromCodePoint also does not support Emoji 14. But the impact of getting this wrong is minimal, in that we only add an unnecessary polyfill, which is what's currently happening on every browser anyway.
if, for instance, we used "\ud83c\udd70" as [0xd83c, 0xdd70].map( c => String.fromCharCode(c) ).join() instead of [0x1f170].map( c => String.fromCodePoint(c) ).join() then we'd get the same result even though we might assert it's not there (because fromCodePoint isn't).
I imagine that our current set of tests isn't perfect, yes, even if they do use a canvas to actually render things and compare the rendered result. But again, the impact is low if we get it wrong; we just harmlessly polyfill something we don't need to.
You had me at arrays of numbers but lost me with "hexadecimal" numbers.
Sorry, you're right that it isn't particularly precise; the goal was merely to help reviewers scan for them visually in the existing code. I meant that they're numbers that just happen to be written as hex literals.
The issue that the PR fixes arises from confusion about what these hex literals represent: char codes vs. code points.
#6
in reply to:
↑ 4
@
2 years ago
Replying to dmsnell:
Strangely, it looks like the previous code in 47852.diff also used this representation and also referenced characters above the "code char" range (those code points which can be represented in UTF-16 without surrogate pairs).
And if that's the case, then I'm confused about how fromCharCode worked at all, lol, so I know I'm misunderstanding something. Is there something big I'm missing in what the problem was or how it appeared in the linked patch?
As far as I can tell, before #47852 everything was indeed a char code and not a code point. Here's a link to the repo at that point in time: https://github.com/WordPress/WordPress/blob/a12598977269f1f78c524d112218196d723cb42e/wp-includes/js/wp-emoji-loader.js
Is the effect that we've been saying for years that a given browser doesn't support Emoji when in fact it does?
Yup, that's exactly it! As far as I can tell, since #47852 every browser fails the test and loads the polyfill, because the test is flawed. #47852 only introduced one incorrect test (the transgender flag test), but subsequent changes expanded the mistake to other tests as well.
#7
@
2 years ago
More broadly, and as food for thought, I'm not entirely sure why we're hardcoding arrays of hexadecimal literals, which then need to be transformed into a string for the tests. It seems to me that it would make more sense to directly include the desired code sequences as JS string literals and call it a day.
Perhaps we're afraid of a contributor coming along with an editor that messes it up, by e.g. rewriting them to an equivalent sequence? Or perhaps this comes from a time when we couldn't rely on browsers supporting unicode characters in string literals?
I don't think we necessarily need to change this, and I think the current patch works well as a forward-looking and robust solution, it just struck me as odd.
#8
follow-up:
↓ 9
@
2 years ago
any browser that does not support fromCodePoint also does not support Emoji 14
This is the point I was questioning, actually. While I couldn't get a running copy of a browser that doesn't support fromCodePoint, I was able to get Firefox 75 running (previous versions of Firefox opened as a transparent window with no rendering; I think it's an issue with Ventura 🤷‍♂️. Chrome has been very difficult to get running on older versions).
In Firefox 75, which was released April 7, 2020, I loaded a test site at https://s.xkq.io/unicode.html. This rendered correctly, handling Emoji added in Unicode 14, released on September 14, 2021.
So I think what's happening is that the browser renders via OS/platform support and doesn't have to directly support Unicode updates. Again, this may not matter practically for the patch here, but I think it's inaccurate to assume that a lack of String.fromCodePoint implies a lack of updated Unicode rendering support. To know that, we'd have to examine the OS release, or do the rendering trick we already do.
everything was indeed a char code and not a code point.
okay now I see, as I was looking at the patch earlier and not the file.
isIdentical = emojiSetsRenderIdentically( [ 0xD83C, 0xDDFA, 0xD83C, 0xDDF3 ], [ 0xD83C, 0xDDFA, 0x200B, 0xD83C, 0xDDF3 ] );
Looks like this tracks its lineage back to the original code
Strangely, though, back when that was introduced it was already possible to write these as string literals, so I'm not sure why we're passing them into fromCharCode. They are listed as code units but fed directly into a string builder, so I think they could be listed as '\ud83c\uddfa\ud83c\uddf3' without any risk of support issues.
On that note, they would still appropriately be stored as code units, because there's no way in JavaScript to directly enter code points apart from calling something like fromCodePoint.
---
On this note, I now wonder what you discovered is wrong about the original code, as passing a sequence of code units to fromCharCode still properly decodes the string in the same way that fromCodePoint does; this is a result of how JS strings are UTF-16.
You can see with a simple test:
String.fromCodePoint.apply( null, [ 0xd83c, 0xdd70 ] ) === String.fromCharCode.apply( null, [ 0xd83c, 0xdd70 ] ); // both equal '🅰'
Does this patch materially change anything? I guess I'm still fundamentally confused about what's broken and what's fixed.
---
For the purposes of this patch, I wonder if there's a reason to introduce another indirection through fromCodePoint rather than resorting to string literals. Seems like this would be less confusing. Maybe @pento knows why this wouldn't work?
emojiSetsRenderIdentically(
	// U+1F1FA U+1F1F3: "regional indicator U, regional indicator N"
	'\ud83c\uddfa\ud83c\uddf3',
	// U+1F1FA ZWS U+1F1F3: "regional indicator U, zero-width space, regional indicator N"
	'\ud83c\uddfa\u200b\ud83c\uddf3'
)
Not only is this less confusing, it brings us back to a direct measure of whether the browser renders the emoji as we expect, rather than testing that the browser both builds the string and renders it as we expect.
#9
in reply to:
↑ 8
@
2 years ago
Replying to dmsnell:
any browser that does not support fromCodePoint also does not support Emoji 14
This is the point I was questioning, actually.
[Followed by explanation]
Thanks for your clarification; the OS seems to have a larger influence than I thought.
Let's put it this way, then: there's a strong correlation between old browser versions and old OS versions, which means there's a strong correlation between lack of support for String.fromCodePoint and lack of support for Emoji 14.
Global browser support for String.fromCodePoint is currently at 96.4%, which means that at most we'd get it wrong and unnecessarily serve a polyfill to 3.6% of users, and it will never be as high as that because of the aforementioned correlation.
But even if it were that high, unnecessarily serving a polyfill to 3.6% of users is a significant improvement over blindly serving it to everyone.
On this note, I now wonder what you discovered is wrong about the original code, as passing a sequence of code units to fromCharCode still properly decodes the string in the same way that fromCodePoint does, and this a result of how JS strings are UTF-16.
[...]
Does this patch materially change anything? I guess I'm still fundamentally confused on what's broken and what's fixed.
This patch definitely changes things, in that it fixes the bug. I apologise if my explanation so far has been less than clear (I am not well-versed in deep Unicode technicals and thus probably get a lot of details wrong), but I did test the fixed script in a number of browsers and OSes, and mentioned that in the original submission. The goal here is to fix a bug, so it's basic due diligence to ensure that the fix actually works.
Going back to why: the difference lies in the fact that String.fromCodePoint works for any code point, whereas String.fromCharCode effectively does not work correctly for code points 0x010000 – 0x10FFFF.
Your simple test worked for you because you picked the values from a correct, older test, instead of one of the new tests introduced since #47852 that effectively broke the feature detection.
Here's an example where it doesn't work, straight from the current code:
String.fromCodePoint.apply( null, [ 0x1F3F3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ] ) ===
	String.fromCharCode.apply( null, [ 0x1F3F3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ] );
> false
And here's MDN's explanation for it:
String.fromCharCode() cannot return supplementary characters (i.e. code points 0x010000 – 0x10FFFF) by specifying their code point. Instead, it requires the UTF-16 surrogate pair in order to return a supplementary character.
So what this patch does is make code points 0x010000 – 0x10FFFF actually work, since they've been used in the tests since #47852.
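Put differently, the same flag sequence does round-trip through fromCharCode once the supplementary-plane code point is split into its surrogate pair; a quick check of that equivalence:

```javascript
// The transgender flag built from code points...
var viaCodePoints = String.fromCodePoint.apply(
	null,
	[ 0x1F3F3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ]
);

// ...and the same string built from UTF-16 code units, with U+1F3F3
// split into the surrogate pair 0xD83C 0xDFF3.
var viaCodeUnits = String.fromCharCode.apply(
	null,
	[ 0xD83C, 0xDFF3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ]
);

console.log( viaCodePoints === viaCodeUnits ); // true
```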
For the purposes of this patch, I wonder if there's a reason to introduce another indirection through fromCodePoint rather than resorting to string literals. Seems like this would be less confusing.
That seems like the logical best solution over the incorrect approach I proposed in my last comment, assuming we're not missing anything here.
However, we'd still need to be defensive regarding feature support. Namely, we'd need to determine what the failure mode is for browsers that don't support unicode escape sequences, and we'd probably need to feature-detect that lack of support and default to polyfilling emoji, similar to what my current patch does when it detects a lack of support for String.fromCodePoint.
#10
follow-up:
↓ 11
@
2 years ago
you picked the values from a correct, older test, instead of one of the new tests that were introduced since #47852 and which effectively broke the feature-detection.
Ah right, and that makes sense. I didn't see that we have directly encoded the integers above 0xFFFF in the newer patch.
So what this patch does is make code points 0x010000 – 0x10FFFF actually work
For curiosity's sake, these code points would have worked before, but we would have had to split them up into code units, as we did for U+1F170, and as is still the case for the tests for the UN flag and the English flag (they are also represented by these higher code points). That is, instead of String.fromCharCode.apply( null, [ 0x1F1F3 ] ) it would have had to be String.fromCharCode.apply( null, [ 0xD83C, 0xDDF3 ] ); these represent the same strings.
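The split follows the standard UTF-16 encoding arithmetic; a small helper (hypothetical, not part of any attached patch) makes the relationship explicit:

```javascript
// Convert a supplementary-plane code point (U+10000 – U+10FFFF) into
// its UTF-16 surrogate pair using the standard formula.
function toSurrogatePair( codePoint ) {
	var offset = codePoint - 0x10000;
	return [
		0xD800 + ( offset >> 10 ),  // high (lead) surrogate
		0xDC00 + ( offset & 0x3FF ) // low (trail) surrogate
	];
}

toSurrogatePair( 0x1F1F3 ); // [ 0xD83C, 0xDDF3 ], regional indicator N
```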
That seems like the logical best solution
Yeah, I agree, unless someone has a reason why this wouldn't work. In some cases, like the test where I saw the use of a zero-width space, I can understand why seeing the code points in their numeric form would be helpful. But then again, we can still have that in a string with '🇺\u200b🇳'. Or maybe this is the whole point, because now it looks funny. 🤷‍♂️
However, we'd still need to be defensive regarding feature support. Namely, we'd need to determine what the failure mode is for browsers that don't support unicode escape sequences, and we'd probably need to feature-detect that lack of support and default to polyfilling emoji, similar to what my current patch does when it detects a lack of support for String.fromCodePoint.
I hope I haven't delayed this patch, but I do think it's clear we're discussing two different issues. "Unicode support" should be entirely independent from JavaScript or the JavaScript version. Essentially what we want to ask is "are these Unicode code points rendered as expected in the browser," and this patch addresses a separate issue: a bug in the detection script.
That bug is that someone changed the representation of the input test sets from lists of UTF-16 code units to lists of code points, and for those code points that require surrogate pairs in UTF-16, this broke things because they didn't update fromCharCode to fromCodePoint.
Food for thought: we can fix this bug without excluding older browser versions that don't support fromCodePoint. It seems like a reasonable choice to exclude them, but it's not necessary to fix the problem, and if we fix it we can retain full support without excluding anyone: change the supplementary-plane code points into code unit sequences.
isIdentical = emojiSetsRenderIdentically(
-	[ 0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF ],
-	[ 0x1FAF1, 0x1F3FB, 0x200B, 0x1FAF2, 0x1F3FF ]
+	[ 0xD83E, 0xDEF1, 0xD83C, 0xDFFB, 0x200D, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF ],
+	[ 0xD83E, 0xDEF1, 0xD83C, 0xDFFB, 0x200B, 0xD83E, 0xDEF2, 0xD83C, 0xDFFF ]
);
We can do this in a modern browser with the following one-liner:
( ( input ) => '[' + String.fromCodePoint.apply( null, input ).split( '' ).map( ( c ) => '0x' + c.charCodeAt( 0 ).toString( 16 ).padStart( 4, '0' ).toUpperCase() ).join( ', ' ) + ']' )( [ 0x1FAF1, 0x1F3FB, 0x200D, 0x1FAF2, 0x1F3FF ] )
Three things we could do to prevent further future breakage:
- leave a comment explaining that these need to be UTF-16 code units according to JavaScript's string representation
- throw if emojiSetsRenderIdentically receives a number above 0xFFFF, in hopes of alerting a developer earlier in the process
- perform the conversion automatically in a way older browsers can handle
function emojiSetsRenderIdentically( set1, set2 ) {
	var stringFromSet = function ( set ) {
		var i, codeUnits = [];
		for ( i = 0; i < set.length; i++ ) {
			if ( set[ i ] <= 0xFFFF ) {
				codeUnits.push( set[ i ] );
			} else {
				// Split large code points into their UTF-16 surrogate pairs,
				// because we need this to create a JavaScript string.
				codeUnits.push(
					0xD800 - ( 0x10000 >> 10 ) + ( set[ i ] >> 10 ),
					0xDC00 + ( set[ i ] & 0x3FF )
				);
			}
		}
		return String.fromCharCode.apply( null, codeUnits );
	};

	// Cleanup from previous test.
	context.clearRect( 0, 0, canvas.width, canvas.height );
	context.fillText( stringFromSet( set1 ), 0, 0 );
	var rendered1 = canvas.toDataURL();

	// Cleanup from previous test.
	context.clearRect( 0, 0, canvas.width, canvas.height );
	context.fillText( stringFromSet( set2 ), 0, 0 );
	var rendered2 = canvas.toDataURL();

	return rendered1 === rendered2;
}
This little extra code safeguards everything so it'll all be what we expect. Of course, looking at the file wp-emoji-loader.js, I can see that some sets still list surrogate pairs while others list single code points; maybe we should make everything consistent while we're here and try to avoid future confusion on the same point?
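A quick sanity check of the splitting expression used above (0xD800 - (0x10000 >> 10) is just 0xD7C0, the usual lead-surrogate offset), comparing its output against the native fromCodePoint:

```javascript
// The same split expression as in the proposed stringFromSet helper.
function splitToUnits( cp ) {
	return [
		0xD800 - ( 0x10000 >> 10 ) + ( cp >> 10 ),
		0xDC00 + ( cp & 0x3FF )
	];
}

// Verify against native decoding across sample supplementary code points.
var allMatch = [ 0x1F170, 0x1F1F3, 0x1FAF1, 0x10FFFF ].every( function ( cp ) {
	return String.fromCharCode.apply( null, splitToUnits( cp ) ) ===
		String.fromCodePoint( cp );
} );

console.log( allMatch ); // true
```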
#11
in reply to:
↑ 10
@
2 years ago
The discussion has turned out pretty long, but I think we're mostly in alignment now. Let me try to summarise our options here, so that we can pick one and move forward:
1. The current patch.
   - Small set of changes.
   - Allows developers to specify sequences using higher code points or surrogate pairs.
   - Somewhat reduced browser support vs. the status quo.
2. Change all of the existing sequences to surrogate pairs, and enforce that going forward. E.g. [ 0xD83C, 0xDFF3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F ].
   - Small set of changes.
   - Forces developers to understand what surrogate pairs are, and to actually use surrogate pairs instead of higher code points.
   - Maintains the current level of browser support.
   - Can be complemented with comments or errors to help ensure developers don't make the mistake of using higher code points.
3. Work around the absence of fromCodePoint by effectively reimplementing its functionality on top of fromCharCode, translating higher code points to surrogate pairs.
   - Larger set of changes.
   - More code, which means a larger script and potentially more maintenance.
   - Allows developers to specify sequences using higher code points or surrogate pairs.
   - Maintains the current level of browser support.
4. Use unicode escape sequences in string literals instead of arrays of code points. E.g. '\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F'.
   - Larger set of changes.
   - Less code than the status quo, which means a smaller script and less maintenance / chance of errors.
   - Still easy to see which sequences are being tested; the numbers are all there, just in a different format.
   - Forces developers to understand what surrogate pairs are, and to actually use surrogate pairs instead of higher code points.
   - Maintains the current level of browser support (it's the same as for fromCharCode).
   - Strings are perfectly clear and consistent, with everything being the same type of escape sequence.
   - Can be complemented with comments or errors to help ensure developers don't make the mistake of using higher code points via unicode code point escape sequences.
5. Use unicode code point escape sequences in string literals instead of arrays of code points. E.g. '\u{1F3F3}\uFE0F\u200D\u26A7\uFE0F' or '\u{1F3F3}\u{FE0F}\u{200D}\u{26A7}\u{FE0F}'.
   - Larger set of changes.
   - Less code than the status quo, which means a smaller script and less maintenance / chance of errors.
   - Still easy to see which sequences are being tested; the numbers are all there, just in a different format.
   - Allows developers to specify escape sequences using higher code points or surrogate pairs.
   - Slightly reduced browser support vs. the status quo.
   - Strings are mostly clear and consistent, with everything being an escape sequence of some sort.
6. Directly use unescaped characters in string literals. E.g. '🏳️‍⚧️'.
   - Larger set of changes.
   - Less code than the status quo, which means a smaller script and less maintenance / chance of errors.
   - Easiest workflow for future changes: just copy and paste the emoji into a string literal.
   - Hardest option to reason about; it's difficult to analyse exactly which sequences are being tested.
   - Potential for a contributor's chosen editor to mess things up if it doesn't handle unicode string literals correctly, somehow.
   - Unclear what the browser support and failure mode for this option are; would need testing.
Note that for all of the above options, reduced browser support should mean (pending implementation and testing) that we simply default to including the polyfill, instead of anything actually breaking. So it's really not a big deal at all.
So, all things considered, I'm leaning towards option 5 (code point escape sequences in string literals). It's easy to understand the sequences, it allows contributors to use higher code points or surrogate pairs without worrying about things too much, it maintains nearly the same level of browser support, and it actually makes the code smaller and simpler.
Which option do you think is best, @dmsnell? Did I miss anything?
#12
@
2 years ago
Also, to address the browser compatibility question definitively, I understand and agree that Emoji support and JS support are completely unrelated, and ideally should be determined entirely separately.
But for practical reasons, I think it's absolutely fine to just give up and unconditionally serve the polyfill to older browsers if that makes our implementation simpler or otherwise better.
Nothing will break (it's not even a matter of graceful degradation), and these old browsers are not supported by Core anyway, so let's not lose sight of the forest for the trees in trying to achieve rigorous correctness.
#13
@
2 years ago
Thanks for the summary, @sergiomdgomes.
Whatever we do, I think we should take the opportunity to remove the inconsistency between listing direct code points and code unit sequences. That different tests use different approaches for the same purpose is confusing and is bound to continue to lead to more confusion.
My preference, I think, is also moving to strings and removing fromCharCode altogether, but I don't prefer \u{1f3f3}, since that again creates a backwards-compatibility problem where none needs to exist. It seems more direct for this code to test Unicode support in the browser and not JavaScript support. As you note, it's not the worst problem to have, but in this case it seems just as easy to directly measure what we want without introducing compatibility problems as it does to introduce them.
So in short, \uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F is my preference because:
- there's less code and less indirection
- it doesn't create a browser-compatibility issue
- it warrants a comment explaining to enter code units and to avoid entering supplementary-plane code points
- if someone wants to directly embed supplementary-plane characters they can (e.g. 🇺🇳); they just can't use the escape sequence without first translating into code units (e.g. not \u{1F1FA}\u{1F1F3} but rather \uD83C\uDDFA\uD83C\uDDF3)
- if someone does enter higher code points, then the breakage will fall back to the behavior you're proposing
#14
@
2 years ago
Not to keep fanning the fire of this discussion, but I see that Chrome and Safari have always supported full code point string literals, while Edge added support in version 12 and Firefox in version 40. That's pretty old, but still, I guess, in the same epoch as adding support for fromCodePoint, though the limitation is mostly exclusive to Firefox.
If we think it's a fair tradeoff to let those pre-Aug. 2015 Firefox versions be wrong, then I would change my preference to listing the code points inside the strings, like \u{1F1F3}, for clarity's sake. If we prefer to avoid creating unnecessary compatibility issues, then I have pushed 57301-2.diff as an update.
#15
@
2 years ago
Thank you for the patch, @dmsnell!
Alright, we need to deal with this compatibility bugbear once and for all if we want to make a truly informed decision here. I really didn't want to have to put in this amount of effort, but let's stop being hand-wavy, dig deep, and see what this "compatibility" really gets us.
First of all, your proposed patch produces a syntax error in both Firefox < 40 and Edge < 12 as-is. If we really want this to work there, we'd first need to put in some more effort, namely by removing trailing commas in function calls. This is an additional restriction beyond the ones imposed by the JS style guide, which even recommends some ES6 constructs like class, but for the sake of argument let's go with it and say that we expect contributors to treat this file differently.
I tested this in several scenarios via Browserstack, after removing the trailing commas:
- Firefox 39 on Windows 11: support comes back as false, because apparently Windows 11 does not support the UN flag emoji. Every browser fails the test in Windows 11.
- Firefox 39 on Windows 10: Windows 10 seems to support even fewer emoji than Windows 11, with other tests also failing. Every browser fails the test in Windows 10.
- Firefox 39 on macOS Monterey: support comes back as false, even though the OS supports all tested emoji (as confirmed by the latest Chrome, Safari, and Firefox on the same OS). The browser itself is failing us here, possibly due to using older, incompatible system calls.
- Firefox 39 on macOS Ventura: the browser doesn't open properly, as your independent testing had confirmed.
Edge < 12 will only run in Windows 10, which also doesn't support the UN flag emoji (and probably never will, as Microsoft has moved on to Windows 11, as evidenced by the current Emoji level support disparity). It's not available on Browserstack so I wasn't able to test with it, but given the other tests we know it won't work correctly anyway due to a lack of OS-level support.
Moving on to mobile OSes, old mobile Firefox versions are not available on the Play Store to install, because of API level restrictions. This means that there's effectively no plausible way for a regular user to have an old Firefox mobile on a modern Android phone. And even if the user were to enable sideloading and purposely sideload an old version of Firefox for whatever arcane reason, it's doubtful it would work given the changes to Android that have taken place in the last 7 years.
Firefox on iOS, like any other browser, uses the system WebKit engine instead, so it'll use a modern engine if running on a modern iOS. Which means that it's effectively impossible to have an old browser engine on a modern iOS, unless you jailbreak your device, port a browser and its engine to iOS, compile the whole thing, and sideload it.
So, given all of the above, who are we solving things for? Linux users running modern system unicode / text libraries but an ancient version of Firefox that they compiled themselves (because the package manager version is obviously much newer, and because old binaries are almost certainly ABI-incompatible)?
As far as I can tell, after my research and testing, attempting to maintain compatibility solves the issue for exactly zero users out there.
But moving on to hypotheticals, you could make the argument that this will one day help Firefox < 40 users on Windows 11 (once it adds support for the UN flag emoji). But then you also need to consider that:
- The failure mode (once the trailing commas are removed) is to load an unnecessary polyfill without anything breaking. That's what's happening on every browser right now because of the bug we're trying to solve. It's not ideal from a performance point of view, but it works, and Firefox < 40 / Edge < 12 users have much bigger performance issues elsewhere anyway. I can't overstate how this is not in any way, shape, or form, something we should be caring about given the level of usage of these browsers and the expectations of the users who run them.
- Core does not support those browsers, and other things will fail. wp-includes has other scripts that will fail in Firefox < 40, and this will only get worse as time goes by.
- Realistically, how many Firefox < 40 users can there possibly be on Windows 11?! That's a 7-year-old browser on a 1-year-old OS, which is not exactly a common use case.
As such, my research leads me to conclude that maintaining this level of compatibility brings absolutely no benefit in the real world today, that it's unlikely to bring any future benefit, and that it's thus merely a theoretical concern.
So what are our options?
I'm more than happy for us to use \uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F, \u{1F3F3}\uFE0F\u200D\u26A7\uFE0F, or \u{1F3F3}\u{FE0F}\u{200D}\u{26A7}\u{FE0F}; to me it's all the same whether the current tests use higher code points or surrogate pairs, as long as they work.
However, I don't think we should in any way try to convince contributors to put in the effort to learn about higher code points vs surrogate pairs, and to convince them to use surrogate pairs as the "better option". Nor to espouse a concern with compatibility that has no actual bearing in the real world.
So I'd remove the following entirely:
 * Input strings in this function should split supplementary plane code points
 * into their constituent UTF-16 code units to avoid browser compatibility issues.
 * These are code points above U+FFFF. For example, the character 🅰 is U+1F170,
 * and while on newer browsers we can write that as '\u{1f170}', on older browsers
 * we need the code unit sequence '\uD83C\uDD70' instead (Edge < 12, Firefox < 40).
We might as well remove the trailing commas too, to avoid the syntax error that actually breaks things and prevents the polyfill from being loaded in older browsers, given that we've identified the issue is there.
So I'm attaching my proposed patch based on yours, which:
- maintains your string literals
- removes the trailing commas
- removes the aforementioned comment
I hope we can all agree on this version, given everything I've outlined in this post; but if not, I hereby preemptively agree to any modifications you may want, in the interest of landing a fix before either of us becomes unavailable for further discussion due to the quickly-approaching holiday season.
#16
@
2 years ago
For all my testing above, there is one important thing I forgot to test, and which invalidates some of my earlier points: '\u{1f170}' doesn't work in older browsers, as expected, but unfortunately its failure mode is a full-on syntax error. This is difficult to work around, as parsing happens before execution, so it's not something we can reasonably expect to feature-detect.
This behaviour means that if we were to have '\u{1f170}' in the script, an older browser wouldn't fall back to loading the polyfill; it would instead cancel script execution entirely, load nothing, and thus fail to display emoji correctly.
This changes the equation somewhat, and brings some justification to the aforementioned comment, but I'm still not entirely sold on the idea of expecting contributors to use surrogate pairs instead of simply giving up on these browsers.
With that in mind, I'm happy to move forward with either my latest patch or with yours (ideally with the trailing commas removed, which produce syntax errors of their own).
#17
follow-up:
↓ 18
@
2 years ago
I really didn't want to have to put in this amount of effort, but let's stop being hand-wavy, dig deep, and see what this "compatibility" really gets us.
A shared sentiment! Though I wasn't at all trying to be hand-wavy, which is why I looked up specific versions, years they were released, and went through the effort the test in old browser versions. To reiterate, I'm not trying to make you pick one approach or another; all I wanted to do was uncover oddities that jumped out, and to provide a double-check that what we're doing is what we think we're doing. I know you're capable, because you consistently do great work; on my best days though I personally appreciate someone double-checking me and trying to make sure I'm doing what I think I am, which is where I'm coming from when sharing on this ticket.
First of all, your proposed patch produces a syntax error in both Firefox < 40 and Edge < 12 as-is.
For example 😆 And thanks for noticing that - bad habits from working in other codebases.
As such, my research leads me to conclude that maintaining this level of compatibility brings absolutely no benefit in the real world today, that it's unlikely to bring any future benefit, and that it's thus merely a theoretical concern.
I have zero problem with making these assessments, which is why I have tried to communicate multiple times that if it's not a problem then it's okay whatever we do, even suggesting breaking compatibility in order to make the input strings clearer if that's the route we go.
Core does not support those browsers
This pretty much settles the compatibility question I think, right? Now if we don't have to break something, why do it, but if we've already drawn the line then I think that's all the justification needed to move on.
Given that it seems like I might have unintentionally caused grief, which was not my intention, I think I'll just step away from this. Happy to clarify any of my points if you want, but I think everything I have to offer can be found in the thread of comments. Thanks for working on this issue!
#18
in reply to:
↑ 17
@
2 years ago
Replying to dmsnell:
Given that it seems like I might have unintentionally caused grief, which was not my intention, I think I'll just step away from this. Happy to clarify any of my points if you want, but I think everything I have to offer can be found in the thread of comments. Thanks for working on this issue!
I apologise that my frustration came across in my reply and made you feel like you had to withdraw. I should have spent more time tempering my words, and have focused less on me and more on the results. I'm sorry.
I truly appreciate your insight, @dmsnell, and believe we have a better result thanks to your contribution to this ticket.
#19
@
2 years ago
For anyone following this thread: TL;DR 57301-3.diff is ready for review. Getting it merged into Core would avoid unnecessarily loading an emoji polyfill on browser / OS combinations that support the current emoji set.
#20
@
2 years ago
/cc a few people who have touched this before, would be nice to not have to load the emoji JS on every page view.
@peterwilsoncc @desrosj @pento
#21
@
2 years ago
- Milestone changed from Awaiting Review to 6.2
- Version changed from trunk to 5.3
Thanks for your work on this folks.
I've run a test on JS bin to render the canvas elements and can see the bug you describe, both with the existing code and with the switch to fromCodePoint.
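The decoding mismatch can also be reproduced outside the canvas test; a minimal sketch (illustrative, not the patch code) of why every comparison failed:

```javascript
// The emoji sequences were stored as code points, but decoded with
// String.fromCharCode, which coerces each number with ToUint16 and so
// truncates anything above 0xFFFF (0x1F3F3 becomes the unrelated 0xF3F3).
var codePoints = [0x1F3F3, 0xFE0F, 0x200D, 0x26A7, 0xFE0F];

var asCharCodes  = String.fromCharCode.apply(String, codePoints);
var asCodePoints = String.fromCodePoint.apply(String, codePoints);

console.log(asCharCodes === asCodePoints); // false: the rendered canvases never match
console.log(asCodePoints === '\uD83C\uDFF3\uFE0F\u200D\u26A7\uFE0F'); // true
```

Because the two rendered strings can never be equal, the "renders identically" check fails in every browser, which is exactly the always-load-the-polyfill behaviour described in this ticket.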
I suspect an array of character points was used initially to allow two things:
- copy paste of emoji to replace zero width joiners with zero width spaces
- reduce the risk of typos (granted, this didn't play out as intended)
I don't mind much either way if it switches to a string though.
Core does not support those browsers
This pretty much settles the compatibility question I think, right? Now if we don't have to break something, why do it, but if we've already drawn the line then I think that's all the justification needed to move on.
There's a bit of a catch here. WordPress's support is primarily targeted towards the admin: browsers supported when updating a website.
As the emoji detection script runs on the front end of all WordPress sites, it does need to fail gracefully in old browsers. Gracefully meaning in such a way that other JavaScript on the page continues to execute.
This is to account for site developers needing to support a wider range of browsers on the front end of the site.
I've moved this on to the 6.2 milestone to be included in the next major wp release.
#22
@
23 months ago
Thank you for looking at this, @peterwilsoncc! 👍 I'm not entirely sure what the process looks like from this point onward, so please let me know if you need anything from my end.
#23
@
22 months ago
@dmsnell @sergiomdgomes @peterwilsoncc Would anyone of you be interested in owning this ticket? While this is a bug and therefore does not have to be completed before Beta next Tuesday, it would be great to get someone to own this to try to get it across the finish line for 6.2.
#24
@
22 months ago
@flixos90 I believe that the patch is ready, but it's unclear what more is expected from this ticket. What do you need from me?
#25
@
22 months ago
@dmsnell I'm only asking for someone to own this ticket; somebody has to take responsibility for it and commit it.
I am personally not qualified enough in this area to assess whether the patch is good to go, so I'd prefer that someone more familiar with it, who has already engaged here, take ownership.
#26
@
22 months ago
Ah, thanks for clarifying. I give my approval for https://core.trac.wordpress.org/attachment/ticket/57301/57301-3.diff, but I don't have commit access, so I don't think I can own it. If I can, then I'm happy to, but I might need some instruction.
#27
@
22 months ago
- Owner set to dmsnell
- Status changed from new to assigned
@dmsnell You can definitely own it. :)
Thanks for confirming your approval. Would you mind providing a brief outline on (at least one of the) ways to test / QA this change?
I don't feel comfortable judging the code, but happy to give it a brief test and commit once I verified that.
#28
@
22 months ago
The easiest way to test this behavior is to load a page in WordPress and then inspect _wpemojiSettings in the browser dev tools. The supports property will show whether the patch fixes the motivating bug.
- Start with a blank WordPress site.
- Set define( 'SCRIPT_DEBUG', true ) in wp-config.php so that the source JS file is served instead of the minified one. Remove wp-includes/js/wp-emoji-loader.min.js to ensure it won't be served.
- Load the home page for that site.
- Open the dev tools for your browser and inspect _wpemojiSettings.supports.
- For modern browsers, everything, flag, and emoji should be false, and the twemoji.js and wp-emoji.js scripts should load to provide support for the Emoji.
- Apply 57301-3.diff and reload the browser.
- This time everything, flag, and emoji should be true, and neither of the two emoji support scripts should load.
Looks like the attached pictures appeared inline, so I won't copy them here into the instructions, but I have provided screenshots of tests in Safari, Firefox, and Chrome on macOS. Additionally, I tested this on IE 5 in Windows 98; both before and after the patch it reported no support for the Emoji, which isn't surprising because that support is genuinely missing. Still, the page did not crash, which is the more important aspect of testing.
Since I could not open the dev tools in IE 5, I loaded this script instead, which serves the Emoji detector directly outside of WordPress and is usable in any browser. Fun fact: JSON.stringify() and Object.keys() are not supported in IE 5.

```
<?php // test.php ?>
<script>_wpemojiSettings = { supports: {} };</script>
<script src="wp-includes/js/wp-emoji-loader.js"></script>
<script>
setTimeout( function() {
	var e = document.createElement('pre');
	var keys = ['everything', 'everythingExceptFlag', 'flag', 'emoji'];
	for ( i = 0; i < keys.length; i++ ) {
		e.innerHTML += keys[i] + ": " + _wpemojiSettings.supports[ keys[i] ] + "<br>\n";
	}
	document.body.appendChild( e );
}, 1000 );
</script>
```
For mixed-support testing one needs to examine this on an OS that is old enough to lack more recent Unicode updates, but I don't think that part of the code has been under scrutiny or change, whereas the change is mostly isolated to fixing two things:
- there was an outright bug in adding code points outside of the Basic Multilingual Plane, which fromCharCode cannot properly translate
- there was a change from storing the sequences as arrays of numbers to storing them as strings, to simplify the indirect way these strings were created
So all-in-all this is tested, confirmed, and safe. The only remaining questions are additional enhancements beyond the scope of this ticket, which are risks already present in WP that this patch does not also ameliorate (that being, preventing someone from adding new Emoji detection that breaks again).
#29
@
22 months ago
- Keywords has-patch commit added
@dmsnell Thank you for the detailed testing instructions and feedback. I gave this a test as well, and it works great.
It's an amazing performance win as well, since in modern browsers this means avoiding the emoji support polyfill load. Even on a production site with the minified wp-emoji-release.min.js file, that means avoiding 4.7 KB of extra JS. 👍
#30
follow-up:
↓ 31
@
22 months ago
Thanks @flixos90
It's an amazing performance win as well
While this is true it also seems odd to say it this way. @sergiomdgomes fixed a regression which had introduced the performance hit where it isn't necessary, so this was working as expected before, then it broke, and now it's back to where it was before.
For now it's good but there's still room for improvement in preventing this regression from occurring again, something this PR doesn't resolve (which is fine).
#31
in reply to:
↑ 30
@
22 months ago
Replying to dmsnell:
Thanks @flixos90
It's an amazing performance win as well
While this is true it also seems odd to say it this way. @sergiomdgomes fixed a regression which had introduced the performance hit where it isn't necessary, so this was working as expected before, then it broke, and now it's back to where it was before.
Fair point. It's safe to say this fixes the performance regression from before though, so a performance win in that way :)
#33
@
22 months ago
Looks like a lot has happened since I last looked at this ticket (is there any way of getting email notifications?)
Thank you for taking ownership of the ticket, @dmsnell! And thank you @flixos90 for handling the rest! 👍
#34
@
20 months ago
Now that this is fixed (since WordPress 6.2 update) I wonder if there is any way to get the Twemojis back, because they did look much better on some of my sites.
Would it be technically possible to add a switch for choosing Twemojis to be shown in WordPress front- and backend again?
Patch that fixes the issue by having the emojiSetsRenderIdentically function expect code points