Make WordPress Core

Opened 11 years ago

Closed 2 years ago

Last modified 2 years ago

#24661 closed defect (bug) (fixed)

remove_accents is not removing combining accents

Reported by: numidwasnotavailable's profile NumidWasNotAvailable Owned by: audrasjb's profile audrasjb
Milestone: 6.1 Priority: normal
Severity: normal Version: 1.2.1
Component: Formatting Keywords: has-patch has-unit-tests commit add-to-field-guide
Focuses: Cc:

Description

As stated in ticket #22363, the function remove_accents does not remove combining accents.

Here is my test:

remove_accents( 'Capture d’écran 2013-02-20 à 23.36.06.png' );

This returns Capture d’écran 2013-02-20 à 23.36.06.png since é and à are made up with combining accents (e.g. à is a + `).

Attachments (20)

24661.0.rev24527.patch (773 bytes) - added by NumidWasNotAvailable 11 years ago.
24661.0.tests.rev1301.patch (1.2 KB) - added by numidwasnotavailable 11 years ago.
Unit test
Capture d’écran 2013-03-30 à 11.36.32.png.zip (89.8 KB) - added by tar.gz 11 years ago.
A file containing combined diacritics
HeavyMetalUmlaut.zip (481.2 KB) - added by tar.gz 11 years ago.
Two real-world attachments with German combined diacritics.
RemoveAccents.php.patch (1.9 KB) - added by nunomorgadinho 11 years ago.
previous unit test plus tests for a collection of filenames that could be problematic
24661.1.patch (1.1 KB) - added by p_enrique 11 years ago.
Use PCRE extension only when it's available
24661.2.patch (9.3 KB) - added by gitlost 8 years ago.
Refresh, modified to target only "Mn" chars following Latin chars. Includes unit tests.
24661.3.patch (13.4 KB) - added by gitlost 8 years ago.
Add _wp_can_use_pcre_ucp() function.
24661.4.patch (13.5 KB) - added by gitlost 8 years ago.
Use statics instead of global defines.
24661.5.patch (13.4 KB) - added by gitlost 8 years ago.
Er, do it right.
24661.6.patch (13.0 KB) - added by gitlost 8 years ago.
Don't use separate file for single-byte regexs.
normalize_before_removing_accents.patch (486 bytes) - added by rodrigosevero 3 years ago.
Patch to solve expanded characters not being substituted using PHP's own normalizer_normalize function.
Capture d’écran 2022-07-20 à 18.51.19.png (226.2 KB) - added by audrasjb 2 years ago.
Before patch: see the slug of the media
Capture d’écran 2022-07-20 à 18.52.28.png (226.7 KB) - added by audrasjb 2 years ago.
After patch: all good!
wrong.png (218.5 KB) - added by laboiteare 2 years ago.
before-patch
with-patch.png (215.5 KB) - added by laboiteare 2 years ago.
After patch
test-before-patch.png (28.7 KB) - added by nuryko 2 years ago.
Before patch: accentuated character remains (tested on MacOS Big Sur v.11.6.5)
test-after-patch.png (28.3 KB) - added by nuryko 2 years ago.
After patch: accentuated character is properly sanitized (tested on MacOS Big Sur v.11.6.5)
test-avant-patch.png (219.5 KB) - added by virgar 2 years ago.
Before the patch: the accented character has disappeared (tested on Windows 10 Professional)
test-après-patch.png (215.9 KB) - added by virgar 2 years ago.
After patch: the accented character is always properly sanitized (tested on Windows 10 Professional)

Download all attachments as: .zip

Change History (74)

#1 @NumidWasNotAvailable
11 years ago

  • Cc wordpress-org@… added
  • Keywords has-patch added

#2 @SergeyBiryukov
11 years ago

  • Keywords needs-unit-tests added
  • Version changed from trunk to 1.2.1

#3 @numidwasnotavailable
11 years ago

  • Keywords dev-feedback added

#4 @SergeyBiryukov
11 years ago

  • Keywords 3.7-early added
  • Milestone changed from Awaiting Review to Future Release

#5 @wonderboymusic
11 years ago

  • Milestone changed from Future Release to 3.7

these are all marked 3.7-early

#6 @tar.gz
11 years ago

I tested the behavior in Safari 6 on OSX 10.8 to see if the bug is still present - it is. So I think it's still worthwhile to fix this issue that potentially can affect every OS X or iOS user.

For the case you need a file for testing, here's a PNG that's named exactly the way screen captures are named on a French language OS X. Those are the files that I see triggering that error most frequently.

I'm zipping it, for the case Trac would try to auto-fix those characters.

@tar.gz
11 years ago

A file containing combined diacritics

#7 @nacin
11 years ago

  • Keywords commit added; dev-feedback 3.7-early removed
  • Milestone changed from 3.7 to 3.8

This looks great. Would be nice to have additional test data (such as a collection of filenames like what tar.gz uploaded), versus just ranges that get turned into empty strings.

#8 @tar.gz
11 years ago

Today one of my German-speaking customers bumped into that problem, which motivated me to re-test the patch against the current beta (3.7-beta1-25639).

I applied both 22363.2.patch:ticket:22363 (makes remove_accents run for sanitize_file_name), and 24661.0.rev24527.patch (improves remove_accents by fixing combined characters).

My findings are the same. To trigger the bug, I need to upload the customer's files via FF24/OSX. When uploading with Chrome or Safari, the browser seems to auto-convert the characters, while FF leaves them as combined characters. So it's an edge case, but within a population of 95–100 million native German speakers, 40% of them running FF, it still occurs too frequently (not mentioning the French)...

Before the patch:
http://sandbox.4o4.ch/wp-git/?attachment_id=555
Filename: Buttermödeli.jpg
Result: broken in Safari, working fine in FF/Chrome.

After the patch, uploading the same image:
http://sandbox.4o4.ch/wp-git/?attachment_id=557
Filename: Buttermodeli.jpg
Result: working fine everywhere, including Safari.

Files follow in attachment.

@tar.gz
11 years ago

Two real-world attachments with German combined diacritics.

@nunomorgadinho
11 years ago

previous unit test plus tests for a collection of filenames that could be problematic

#9 @nunomorgadinho
11 years ago

  • Cc nuno.morgadinho@… added
  • Keywords needs-unit-tests removed

#10 @p_enrique
11 years ago

The current patch uses preg_replace() with the u-modifier without checking if the PCRE UTF-8 extension is active. This will throw warnings and blanks the string. There should be a check for the extension and a fallback to plain UTF-8 bytes. And since we're using the UTF-8 extension, we can use the Unicode character properties for filtering. See attached patch.

@p_enrique
11 years ago

Use PCRE extension only when it's available

#11 @p_enrique
11 years ago

Here are my test results of strings using combined accents:

remove_accents( 'Cáo nâu lanh lẹ nhảy qua người lười biếng. Do bạch kim rất quý, sẽ để lắp vô xương' );
 --> string(82) "Cao nau lanh le nhay qua nguoi luoi bieng. Do bach kim rat quy, se de lap vo xuong"

remove_accents( 'Příliš žluťoučký kůň úpěl ďábelské kódy. Pójdźże, kiń tę chmurność w głąb flaszy! Päťtýždňové vĺčatá nervózne štekajú na môjho ďatľa v tŕní.' );
 --> string(140) "Prilis zlutoucky kun upel dabelske kody. Pojdzze, kin te chmurnosc w glab flaszy! Pattyzdnove vlcata nervozne stekaju na mojho datla v trni."

remove_accents( 'Les naïfs ægithales hâtifs pondant à noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.' );
 --> string(115) "Les naifs aegithales hatifs pondant a noel ou il gele sont surs d'etre decus en voyant leurs droles d'oeufs abimes."

remove_accents( 'a᷄ (a + combining macron-acute). +⃟ (plus with enclosing diamond). a︠e︡ (ae with a combining ligature)' );
 --> string(98) "a (a + combining macron-acute). + (plus with enclosing diamond). ae (ae with a combining ligature)"

#12 @p_enrique
11 years ago

  • Cc heikki@… added

#13 @nacin
11 years ago

  • Keywords commit removed
  • Milestone changed from 3.8 to Future Release

#14 @chriscct7
9 years ago

  • Keywords needs-refresh added

@gitlost
8 years ago

Refresh, modified to target only "Mn" chars following Latin chars. Includes unit tests.

#16 @gitlost
8 years ago

Refresh of the patch, modified to target only "Mn" non-spacing combining marks, and only if they follow Latin characters. This follows the approach taken by the ICU Latin-ASCII.xml transliteration rule file. The single-byte case (when PCRE UTF-8/UCP is unavailable) uses regexs generated from the UCD data files UnicodeData.txt and Scripts.txt.

The patch includes unit tests drawn from the previous ones.

It would address the OP bug report in the parent ticket #22363, assuming remove_accents is added as a filter on sanitize_file_name.

#17 @SergeyBiryukov
8 years ago

  • Milestone changed from Future Release to 4.7

WP_LATIN_REGEX_ALTS should probably be a filter instead of a constant.

Last edited 8 years ago by SergeyBiryukov (previous) (diff)

#18 @gitlost
8 years ago

  • Keywords needs-refresh removed

Hi, do you mean put the constant through an apply_filters()? If so, is there a use case?

Could use just ASCII [A-Za-z] instead of WP_LATIN_SCRIPT_REGEX_ALT (and \p{Latin}), as the stripping here comes after the ASCII mapping, rather than before as in the ICU situation, but against that the $chars mapping isn't complete.

#19 @markoheijnen
8 years ago

We should not make it exposable at all for now and look per use case that may come up if we want to expose it.

This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.


8 years ago

#21 @markoheijnen
8 years ago

Also wondering for the if check if we need to do something similar as _wp_can_use_pcre_u

#22 @kirasong
8 years ago

@pento, do you have any thoughts on the approach, especially with respect to the list of characters, in 24661.2.patch?

#23 @kirasong
8 years ago

  • Keywords dev-feedback added

Related: #22363

#24 @gitlost
8 years ago

A _wp_can_use_pcre_u-like function would be handy, for the unit test for one thing.

A detail about the ICU Latin-ASCII is that the first thing it does is globally filter the input with :: [[:Latin:][:Common:][:Inherited:][〇]] ;, so the later \p{Mn} is actually only matching the combining marks that make it though the filter, which only occur in \p{Inherited}. So the same intersection could be done here, which reduces the single byte regex alt to 554 code points, making it a lot less scary:

define( 'WP_MN_INHERITED_REGEX_ALTS', '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]|\xd2[\x85\x86]|\xd9[\x8b-\x95\xb0]|\xe0\xa5[\x91\x92]|\xe1(?:\xaa[\xb0-\xbd]|\xb3[\x90-\x92\x94-\xa0\xa2-\xa8\xad\xb4\xb8\xb9]|\xb7[\x80-\xb5\xbb-\xbf])|\xe2\x83[\x90-\x9c\xa1\xa5-\xb0]|\xe3(?:\x80[\xaa-\xad]|\x82[\x99\x9a])|\xef\xb8[\x80-\x8f\xa0-\xad]|\xf0(?:\x90(?:\x87\xbd|\x8b\xa0)|\x9d(?:\x85[\xa7-\xa9\xbb-\xbf]|\x86[\x80-\x82\x85-\x8b\xaa-\xad]))|\xf3\xa0(?:[\x84-\x86][\x80-\xbf]|\x87[\x80-\xaf])' ); // 554 code points.

It complicates the UCP usage though, eg /(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u

This ticket was mentioned in Slack in #core-images by mike. View the logs.


8 years ago

@gitlost
8 years ago

Add _wp_can_use_pcre_ucp() function.

#26 @gitlost
8 years ago

This patch incorporates the suggestion of @markoheijnen to introduce a _wp_can_use_pcre_u-like function. It doesn't bother with the \p{Inherited} change, as it complicates the UCP case, which (as hinted at by markoheijnen in Slack) is undoubtedly the "high-runner" by a country mile. The unit test has been expanded to include all Unicode 5.0.0 Mn characters.

#27 @ocean90
8 years ago

What's the point of the extra file and the constants? Why would I define WP_MN_REGEX_ALTS by myself?

#28 @gitlost
8 years ago

The only point of the extra file is that it's generated from Unicode data, so this keeps it separate, accessible via the constants. The test for WP_MN_REGEX_ALTS is just whether to include the file or not.

@gitlost
8 years ago

Use statics instead of global defines.

#29 @gitlost
8 years ago

Changed to use statics instead of global defines. This addresses #comment:19 and hopefully makes the code clearer.

@gitlost
8 years ago

Er, do it right.

This ticket was mentioned in Slack in #core by mike. View the logs.


8 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 years ago

#33 @pento
8 years ago

  • Keywords dev-feedback removed

Initial thoughts on 24661.5.patch:

  • Where are gen_cat_regex_alts.php and gen_script_regex_alts.php? Ideally, these should be run as part of the build process, so it's built automatically.
  • I'm undecided about regex_alts.php being a separate file, it feels unnecessary. The regex will only be parsed if it's run, so the performance difference on load time will be negligible.
  • The test_remove_accents_mn_combining_marks() test only tests Unicode 5, it should include tests for Unicode 9. Similarly, I'd like to see these tests generated automatically.
  • The testing parameter for _wp_can_use_pcre_ucp() feels pretty hacky. It's less bad due to being an internal function, but still not great. What's the performance benefit of using the static flag over just running the regex every time it's called?
  • I currently have no comment on the change to remove_accents(), because it's Saturday, and I'm not doing that deep a dive on my weekend. :-)

#34 @gitlost
8 years ago

Thanks for looking at this.

gen_cat_regex_alts.php and gen_script_regex_alts.php (and gen_remove_accents_ranges.php below) were just on my local machine. As a temporary measure I've uploaded them (after a bit of a clean up) to the tools directory of the UNFC Normalizer plugin, as they use a functions library developed for it.

I'm not sure they should be part of the build process, seeing as they depend on infrequently changing Unicode data. However I will clean them up further and separate them out so they're suitable for upload to the trunc tools directory if required.

Nobody likes the separate file, so I'll upload a new patch with the regexs just copy-and-pasted in. It was never a performance thing, more for ease of generation (and to hide their ugliness). This now also argues against them being part of the build process though.

The reason it's only testing Unicode 5.0.0 is due to the PCREs that are bundled with PHP 5.2.4 to PHP 7 being built with Unicode data ranging from Unicode 5.0.0 to Unicode 7.0.0, and thus giving varying results for character properties over their various lifetimes. I could generate different ranges depending on the version of PCRE being used when the tests are run, but I thought that was a bit OTT so went for the lowest common denominator.

The tests were generated automatically, by gen_remove_accents_ranges.php, which as mentioned above I've uploaded.

_wp_can_use_pcre_ucp() is a direct analogue of the existing _wp_can_use_pcre_u() function which you might have missed. Again it's not really a performance thing more for convenience of testing, as well as being semantic and usable elsewhere.

@gitlost
8 years ago

Don't use separate file for single-byte regexs.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 years ago

#36 @helen
8 years ago

  • Milestone changed from 4.7 to Future Release

I hate to punt this ticket yet again, but we are into beta for 4.7 and with so much going on, I don't know that commits here would get the attention they really need and deserve. Punting; an intrepid committer can always bring it back.

This ticket was mentioned in Slack in #core by mike. View the logs.


7 years ago

#38 follow-up: @rodrigosevero
3 years ago

Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.

Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.

How can I help move this issue forward?

#39 in reply to: ↑ 38 ; follow-up: @zodiac1978
3 years ago

Replying to rodrigosevero:

Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.

Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.

How can I help move this issue forward?

I think there is a working solution mentioned in https://core.trac.wordpress.org/ticket/35951#comment:4

I am waiting for feedback on this approach for months now.

The hosting team is officially recommending the intl PHP extension which is needed to use this patch:
https://make.wordpress.org/hosting/2021/05/20/why-hosters-should-install-the-php-intl-extension/

There is a polyfill mentioned here https://core.trac.wordpress.org/ticket/52654#comment:14 which could be a solution for everyone on a hoster not having the intl extension.

You could try to push this solution and provide the polyfill as a fallback.

Or you could provide the wanted solution in a plugin like https://github.com/Zodiac1978/tl-normalizer (https://de.wordpress.org/plugins/normalizer/) if you can't wait for a core solution.

Another plugin is https://github.com/gitlost/unfc-normalize (https://wordpress.org/plugins/unfc-normalize/).

@rodrigosevero
3 years ago

Patch to solve expanded characters not being substituted using PHP's own normalizer_normalize function.

#40 in reply to: ↑ 39 @rodrigosevero
3 years ago

Replying to zodiac1978:

Replying to rodrigosevero:

Well, we are at 5.7.2 five years later and it seems that this issue haven't been solved yet.

Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/35951 seems to work.

How can I help move this issue forward?

I think there is a working solution mentioned in https://core.trac.wordpress.org/ticket/35951#comment:4

I am waiting for feedback on this approach for months now.

I tried this solution. It didn't work for me: the UTF-8 expanded accented characters were still not substituted.

I just provided a small patch inspired by the above solution you mentioned that worked for me.

The hosting team is officially recommending the intl PHP extension which is needed to use this patch:
https://make.wordpress.org/hosting/2021/05/20/why-hosters-should-install-the-php-intl-extension/

Yes. That's great!

#41 @ironprogrammer
2 years ago

#55807 was marked as a duplicate.

This ticket was mentioned in PR #2820 on WordPress/wordpress-develop by ironprogrammer.


2 years ago
#42

  • Keywords has-unit-tests added

Refresh of previous patch for #24661, with addition of unit test to convert NFD-encoded strings.

Trac ticket: https://core.trac.wordpress.org/ticket/24661

#43 @ironprogrammer
2 years ago

  • Keywords needs-testing added

Thank you, everyone who has contributed toward this issue, and @zodiac1978 and @rodrigosevero for your relatively recent touches! 🙌🏻

Summary

This ticket has been around a while, so to summarize the situation in 2022:

With these factors it seems we have a pretty good runway to move this forward 🤞🏻. [Famous last words.]

Refresh

Based on @rodrigosevero's patch, I've opened PR 2820 as a refresh, and added a unit test for NFD sequences (alternate Unicode representations of the same characters) that are not part of the current function's strtr() translation array.

The unit test includes Latin1 Supplement, but I wouldn't mind feedback on other sets to test, if necessary.

Related issues: #30130, #35951, #47763, #52654.

CC: @azaozz @hellofromTonya @SergeyBiryukov

#44 @azaozz
2 years ago

  • Milestone changed from Future Release to 6.1

The PR looks good. Would be great if the people experiencing this problem can test it. Also it looks like it will fix #35951 and #47763 as @ironprogrammer mentions above.

Imho this is (finally) almost ready for commit, just waiting for few more people to test.

Last edited 2 years ago by azaozz (previous) (diff)

@audrasjb
2 years ago

Before patch: see the slug of the media

@audrasjb
2 years ago

After patch: all good!

#45 @audrasjb
2 years ago

  • Keywords needs-refresh added; needs-testing removed

The patch fixes a long term issue, well known by Mac users in French since each screenshot uploaded as-is was incorrectly sanitized.

I'll add a new PR to address conflicting files, then it should be ready for commit.

This ticket was mentioned in PR #3000 on WordPress/wordpress-develop by audrasjb.


2 years ago
#46

  • Keywords needs-refresh removed

#48 @audrasjb
2 years ago

  • Keywords commit added

The above PR refreshes the previous one against trunk. Unit tests are passing.

I think we're –finally– good for commit.

@laboiteare
2 years ago

before-patch

@laboiteare
2 years ago

After patch

#49 @laboiteare
2 years ago

Hello, as you can see on my screenshots above, patch is working for me too.

@nuryko
2 years ago

Before patch: accentuated character remains (tested on MacOS Big Sur v.11.6.5)

@nuryko
2 years ago

After patch: accentuated character is properly sanitized (tested on MacOS Big Sur v.11.6.5)

@virgar
2 years ago

Before the patch: the accented character has disappeared (tested on Windows 10 Professional)

@virgar
2 years ago

After patch: the accented character is always properly sanitized (tested on Windows 10 Professional)

#50 @audrasjb
2 years ago

Thanks all for testing.

#51 @audrasjb
2 years ago

  • Owner set to audrasjb
  • Resolution set to fixed
  • Status changed from new to closed

In 53754:

Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC, via the normalizer_normalize() PHP function which is available with the recommended intl PHP extension.

This fixes an issue where NFD characters were not properly sanitized. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.

#54 @milana_cap
2 years ago

  • Keywords add-to-field-guide added
Note: See TracTickets for help on using tickets.