#24661 closed defect (bug) (fixed)
remove_accents is not removing combining accents
Reported by: | NumidWasNotAvailable | Owned by: | audrasjb |
---|---|---|---|
Milestone: | 6.1 | Priority: | normal |
Severity: | normal | Version: | 1.2.1 |
Component: | Formatting | Keywords: | has-patch has-unit-tests commit add-to-field-guide |
Focuses: | Cc: |
Description
As stated in ticket #22363, the function remove_accents
does not remove combining accents.
Here is my test:
remove_accents( 'Capture d’écran 2013-02-20 à 23.36.06.png' );
This returns Capture d’écran 2013-02-20 à 23.36.06.png
since é
and à
are made up with combining accents (e.g. à
is a
+ `
).
Attachments (20)
Change History (74)
#6
@
11 years ago
I tested the behavior in Safari 6 on OSX 10.8 to see if the bug is still present - it is. So I think it's still worthwhile to fix this issue that potentially can affect every OS X or iOS user.
For the case you need a file for testing, here's a PNG that's named exactly the way screen captures are named on a French language OS X. Those are the files that I see triggering that error most frequently.
I'm zipping it, for the case Trac would try to auto-fix those characters.
#7
@
11 years ago
- Keywords commit added; dev-feedback 3.7-early removed
- Milestone changed from 3.7 to 3.8
This looks great. Would be nice to have additional test data (such as a collection of filenames like what tar.gz uploaded), versus just ranges that get turned into empty strings.
#8
@
11 years ago
Today one of my German-speaking customers bumped into that problem, which motivated me to re-test the patch against the current beta (3.7-beta1-25639).
I applied both 22363.2.patch:ticket:22363 (makes remove_accents
run for sanitize_file_name
), and 24661.0.rev24527.patch (improves remove_accents
by fixing combined characters).
My findings are the same. To trigger the bug, I need to upload the customer's files via FF24/OSX. When uploading with Chrome or Safari, the browser seems to auto-convert the characters, while FF leaves them as combined characters. So it's an edge case, but within a population of 95–100 million native German speakers, 40% of them running FF, it still occurs too frequently (not mentioning the French)...
Before the patch:
http://sandbox.4o4.ch/wp-git/?attachment_id=555
Filename: Buttermödeli.jpg
Result: broken in Safari, working fine in FF/Chrome.
After the patch, uploading the same image:
http://sandbox.4o4.ch/wp-git/?attachment_id=557
Filename: Buttermodeli.jpg
Result: working fine everywhere, including Safari.
Files follow in attachment.
@
11 years ago
previous unit test plus tests for a collection of filenames that could be problematic
#10
@
11 years ago
The current patch uses preg_replace()
with the u
-modifier without checking if the PCRE UTF-8 extension is active. This will throw warnings and blanks the string. There should be a check for the extension and a fallback to plain UTF-8 bytes. And since we're using the UTF-8 extension, we can use the Unicode character properties for filtering. See attached patch.
#11
@
11 years ago
Here are my test results of strings using combined accents:
remove_accents( 'Cáo nâu lanh lẹ nhảy qua người lười biếng. Do bạch kim rất quý, sẽ để lắp vô xương' ); --> string(82) "Cao nau lanh le nhay qua nguoi luoi bieng. Do bach kim rat quy, se de lap vo xuong" remove_accents( 'Příliš žluťoučký kůň úpěl ďábelské kódy. Pójdźże, kiń tę chmurność w głąb flaszy! Päťtýždňové vĺčatá nervózne štekajú na môjho ďatľa v tŕní.' ); --> string(140) "Prilis zlutoucky kun upel dabelske kody. Pojdzze, kin te chmurnosc w glab flaszy! Pattyzdnove vlcata nervozne stekaju na mojho datla v trni." remove_accents( 'Les naïfs ægithales hâtifs pondant à noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.' ); --> string(115) "Les naifs aegithales hatifs pondant a noel ou il gele sont surs d'etre decus en voyant leurs droles d'oeufs abimes." remove_accents( 'a᷄ (a + combining macron-acute). +⃟ (plus with enclosing diamond). a︠e︡ (ae with a combining ligature)' ); --> string(98) "a (a + combining macron-acute). + (plus with enclosing diamond). ae (ae with a combining ligature)"
@
8 years ago
Refresh, modified to target only "Mn" chars following Latin chars. Includes unit tests.
#16
@
8 years ago
Refresh of the patch, modified to target only "Mn" non-spacing combining marks, and only if they follow Latin characters. This follows the approach taken by the ICU Latin-ASCII.xml transliteration rule file. The single-byte case (when PCRE UTF-8/UCP is unavailable) uses regexs generated from the UCD data files UnicodeData.txt and Scripts.txt.
The patch includes unit tests drawn from the previous ones.
It would address the OP bug report in the parent ticket #22363, assuming remove_accents
is added as a filter on sanitize_file_name
.
#17
@
8 years ago
- Milestone changed from Future Release to 4.7
WP_LATIN_REGEX_ALTS
should probably be a filter instead of a constant.
#18
@
8 years ago
- Keywords needs-refresh removed
Hi, do you mean put the constant through an apply_filters()
? If so, is there a use case?
Could use just ASCII [A-Za-z]
instead of WP_LATIN_SCRIPT_REGEX_ALT
(and \p{Latin}
), as the stripping here comes after the ASCII mapping, rather than before as in the ICU situation, but against that the $chars
mapping isn't complete.
#19
@
8 years ago
We should not make it exposable at all for now and look per use case that may come up if we want to expose it.
This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.
8 years ago
#21
@
8 years ago
Also wondering for the if check if we need to do something similar as _wp_can_use_pcre_u
#22
@
8 years ago
@pento, do you have any thoughts on the approach, especially with respect to the list of characters, in 24661.2.patch?
#24
@
8 years ago
A _wp_can_use_pcre_u
-like function would be handy, for the unit test for one thing.
A detail about the ICU Latin-ASCII is that the first thing it does is globally filter the input with :: [[:Latin:][:Common:][:Inherited:][〇]] ;
, so the later \p{Mn}
is actually only matching the combining marks that make it though the filter, which only occur in \p{Inherited}
. So the same intersection could be done here, which reduces the single byte regex alt to 554 code points, making it a lot less scary:
define( 'WP_MN_INHERITED_REGEX_ALTS', '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]|\xd2[\x85\x86]|\xd9[\x8b-\x95\xb0]|\xe0\xa5[\x91\x92]|\xe1(?:\xaa[\xb0-\xbd]|\xb3[\x90-\x92\x94-\xa0\xa2-\xa8\xad\xb4\xb8\xb9]|\xb7[\x80-\xb5\xbb-\xbf])|\xe2\x83[\x90-\x9c\xa1\xa5-\xb0]|\xe3(?:\x80[\xaa-\xad]|\x82[\x99\x9a])|\xef\xb8[\x80-\x8f\xa0-\xad]|\xf0(?:\x90(?:\x87\xbd|\x8b\xa0)|\x9d(?:\x85[\xa7-\xa9\xbb-\xbf]|\x86[\x80-\x82\x85-\x8b\xaa-\xad]))|\xf3\xa0(?:[\x84-\x86][\x80-\xbf]|\x87[\x80-\xaf])' ); // 554 code points.
It complicates the UCP usage though, eg /(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u
This ticket was mentioned in Slack in #core-images by mike. View the logs.
8 years ago
#26
@
8 years ago
This patch incorporates the suggestion of @markoheijnen to introduce a _wp_can_use_pcre_u
-like function. It doesn't bother with the \p{Inherited}
change, as it complicates the UCP case, which (as hinted at by markoheijnen in Slack) is undoubtedly the "high-runner" by a country mile. The unit test has been expanded to include all Unicode 5.0.0 Mn characters.
#27
@
8 years ago
What's the point of the extra file and the constants? Why would I define WP_MN_REGEX_ALTS
by myself?
#28
@
8 years ago
The only point of the extra file is that it's generated from Unicode data, so this keeps it separate, accessible via the constants. The test for WP_MN_REGEX_ALTS
is just whether to include the file or not.
#29
@
8 years ago
Changed to use statics instead of global defines. This addresses #comment:19 and hopefully makes the code clearer.
This ticket was mentioned in Slack in #core by mike. View the logs.
8 years ago
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
8 years ago
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
8 years ago
#33
@
8 years ago
- Keywords dev-feedback removed
Initial thoughts on 24661.5.patch:
- Where are
gen_cat_regex_alts.php
andgen_script_regex_alts.php
? Ideally, these should be run as part of the build process, so it's built automatically. - I'm undecided about
regex_alts.php
being a separate file, it feels unnecessary. The regex will only be parsed if it's run, so the performance difference on load time will be negligible. - The
test_remove_accents_mn_combining_marks()
test only tests Unicode 5, it should include tests for Unicode 9. Similarly, I'd like to see these tests generated automatically. - The testing parameter for
_wp_can_use_pcre_ucp()
feels pretty hacky. It's less bad due to being an internal function, but still not great. What's the performance benefit of using thestatic
flag over just running the regex every time it's called? - I currently have no comment on the change to
remove_accents()
, because it's Saturday, and I'm not doing that deep a dive on my weekend. :-)
#34
@
8 years ago
Thanks for looking at this.
gen_cat_regex_alts.php
and gen_script_regex_alts.php
(and gen_remove_accents_ranges.php
below) were just on my local machine. As a temporary measure I've uploaded them (after a bit of a clean up) to the tools directory of the UNFC Normalizer plugin, as they use a functions library developed for it.
I'm not sure they should be part of the build process, seeing as they depend on infrequently changing Unicode data. However I will clean them up further and separate them out so they're suitable for upload to the trunc tools directory if required.
Nobody likes the separate file, so I'll upload a new patch with the regexs just copy-and-pasted in. It was never a performance thing, more for ease of generation (and to hide their ugliness). This now also argues against them being part of the build process though.
The reason it's only testing Unicode 5.0.0 is due to the PCREs that are bundled with PHP 5.2.4 to PHP 7 being built with Unicode data ranging from Unicode 5.0.0 to Unicode 7.0.0, and thus giving varying results for character properties over their various lifetimes. I could generate different ranges depending on the version of PCRE being used when the tests are run, but I thought that was a bit OTT so went for the lowest common denominator.
The tests were generated automatically, by gen_remove_accents_ranges.php
, which as mentioned above I've uploaded.
_wp_can_use_pcre_ucp()
is a direct analogue of the existing _wp_can_use_pcre_u()
function which you might have missed. Again it's not really a performance thing more for convenience of testing, as well as being semantic and usable elsewhere.
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
8 years ago
#36
@
8 years ago
- Milestone changed from 4.7 to Future Release
I hate to punt this ticket yet again, but we are into beta for 4.7 and with so much going on, I don't know that commits here would get the attention they really need and deserve. Punting; an intrepid committer can always bring it back.
This ticket was mentioned in Slack in #core by mike. View the logs.
7 years ago
#38
follow-up:
↓ 39
@
3 years ago
Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.
Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.
How can I help move this issue forward?
#39
in reply to:
↑ 38
;
follow-up:
↓ 40
@
3 years ago
Replying to rodrigosevero:
Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.
Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.
How can I help move this issue forward?
I think there is a working solution mentioned in https://core.trac.wordpress.org/ticket/35951#comment:4
I am waiting for feedback on this approach for months now.
The hosting team is officially recommending the intl
PHP extension which is needed to use this patch:
https://make.wordpress.org/hosting/2021/05/20/why-hosters-should-install-the-php-intl-extension/
There is a polyfill mentioned here https://core.trac.wordpress.org/ticket/52654#comment:14 which could be a solution for everyone on a hoster not having the intl
extension.
You could try to push this solution and provide the polyfill as a fallback.
Or you could provide the wanted solution in a plugin like https://github.com/Zodiac1978/tl-normalizer (https://de.wordpress.org/plugins/normalizer/) if you can't wait for a core solution.
Another plugin is https://github.com/gitlost/unfc-normalize (https://wordpress.org/plugins/unfc-normalize/).
@
3 years ago
Patch to solve expanded characters not being substituted using PHP's own normalizer_normalize function.
#40
in reply to:
↑ 39
@
3 years ago
Replying to zodiac1978:
Replying to rodrigosevero:
Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.
Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.
How can I help move this issue forward?
I think there is a working solution mentioned in https://core.trac.wordpress.org/ticket/35951#comment:4
I am waiting for feedback on this approach for months now.
I tried this solution. It didn't work for me: the UTF-8 expanded accented characters were still not substituted.
I just provided a small patch inspired by the above solution you mentioned that worked for me.
The hosting team is officially recommending the
intl
PHP extension which is needed to use this patch:
https://make.wordpress.org/hosting/2021/05/20/why-hosters-should-install-the-php-intl-extension/
Yes. That's great!
This ticket was mentioned in PR #2820 on WordPress/wordpress-develop by ironprogrammer.
2 years ago
#42
- Keywords has-unit-tests added
Refresh of previous patch for #24661, with addition of unit test to convert NFD-encoded strings.
Trac ticket: https://core.trac.wordpress.org/ticket/24661
#43
@
2 years ago
- Keywords needs-testing added
Thank you, everyone who has contributed toward this issue, and @zodiac1978 and @rodrigosevero for your relatively recent touches! 🙌🏻
Summary
This ticket has been around a while, so to summarize the situation in 2022:
- The
intl
extension's normalizer_normalize support was added to PHP 5.3.0+. - For native support, PHP needs to be built with
--enable-intl
(thanks @jrf via #52654#comment:13), if not already included in your bundle. intl
support can alternatively be installed via PECL.intl
has been recommended and is officially referenced in the Hosting Handbook.- WordPress currently supports PHP 5.6+.
- The proposed patch includes the recommended
function_exists()
check to ensure support.
With these factors it seems we have a pretty good runway to move this forward 🤞🏻. [Famous last words.]
Refresh
Based on @rodrigosevero's patch, I've opened PR 2820 as a refresh, and added a unit test for NFD sequences (alternate Unicode representations of the same characters) that are not part of the current function's strtr()
translation array.
The unit test includes Latin1 Supplement, but I wouldn't mind feedback on other sets to test, if necessary.
Related issues: #30130, #35951, #47763, #52654.
CC: @azaozz @hellofromTonya @SergeyBiryukov
#44
@
2 years ago
- Milestone changed from Future Release to 6.1
#45
@
2 years ago
- Keywords needs-refresh added; needs-testing removed
The patch fixes a long term issue, well known by Mac users in French since each screenshot uploaded as-is was incorrectly sanitized.
I'll add a new PR to address conflicting files, then it should be ready for commit
.
This ticket was mentioned in PR #3000 on WordPress/wordpress-develop by audrasjb.
2 years ago
#46
- Keywords needs-refresh removed
Trac ticket: https://core.trac.wordpress.org/ticket/24661
2 years ago
#47
Closed in favor of https://github.com/WordPress/wordpress-develop/pull/3000
#48
@
2 years ago
- Keywords commit added
The above PR refreshes the previous one against trunk. Unit tests are passing.
I think we're –finally– good for commit
.
@
2 years ago
After patch: accentuated character is properly sanitized (tested on MacOS Big Sur v.11.6.5)
@
2 years ago
Before the patch: the accented character has disappeared (tested on Windows 10 Professional)
@
2 years ago
After patch: the accented character is always properly sanitized (tested on Windows 10 Professional)
#51
@
2 years ago
- Owner set to audrasjb
- Resolution set to fixed
- Status changed from new to closed
In 53754:
2 years ago
#52
Committed in https://core.trac.wordpress.org/changeset/53754
2 years ago
#53
Committed in https://core.trac.wordpress.org/changeset/53754
Unit test