WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 5 weeks ago

#24661 new defect (bug)

remove_accents is not removing combining accents

Reported by: NumidWasNotAvailable Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 1.2.1
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

As stated in ticket #22363, the function remove_accents does not remove combining accents.

Here is my test:

remove_accents( 'Capture d’écran 2013-02-20 à 23.36.06.png' );

This returns Capture d’écran 2013-02-20 à 23.36.06.png since é and à are made up with combining accents (e.g. à is a + `).

Attachments (11)

24661.0.rev24527.patch (773 bytes) - added by NumidWasNotAvailable 3 years ago.
24661.0.tests.rev1301.patch (1.2 KB) - added by numidwasnotavailable 3 years ago.
Unit test
Capture d’écran 2013-03-30 à 11.36.32.png.zip (89.8 KB) - added by tar.gz 3 years ago.
A file containing combined diacritics
HeavyMetalUmlaut.zip (481.2 KB) - added by tar.gz 3 years ago.
Two real-world attachments with German combined diacritics.
RemoveAccents.php.patch (1.9 KB) - added by nunomorgadinho 3 years ago.
previous unit test plus tests for a collection of filenames that could be problematic
24661.1.patch (1.1 KB) - added by p_enrique 3 years ago.
Use PCRE extension only when it's available
24661.2.patch (9.3 KB) - added by gitlost 3 months ago.
Refresh, modified to target only "Mn" chars following Latin chars. Includes unit tests.
24661.3.patch (13.4 KB) - added by gitlost 2 months ago.
Add _wp_can_use_pcre_ucp() function.
24661.4.patch (13.5 KB) - added by gitlost 2 months ago.
Use statics instead of global defines.
24661.5.patch (13.4 KB) - added by gitlost 2 months ago.
Er, do it right.
24661.6.patch (13.0 KB) - added by gitlost 7 weeks ago.
Don't use separate file for single-byte regexs.

Download all attachments as: .zip

Change History (47)

#1 @NumidWasNotAvailable
3 years ago

  • Cc wordpress-org@… added
  • Keywords has-patch added

#2 @SergeyBiryukov
3 years ago

  • Keywords needs-unit-tests added
  • Version changed from trunk to 1.2.1

#3 @numidwasnotavailable
3 years ago

  • Keywords dev-feedback added

#4 @SergeyBiryukov
3 years ago

  • Keywords 3.7-early added
  • Milestone changed from Awaiting Review to Future Release

#5 @wonderboymusic
3 years ago

  • Milestone changed from Future Release to 3.7

these are all marked 3.7-early

#6 @tar.gz
3 years ago

I tested the behavior in Safari 6 on OSX 10.8 to see if the bug is still present - it is. So I think it's still worthwhile to fix this issue that potentially can affect every OS X or iOS user.

For the case you need a file for testing, here's a PNG that's named exactly the way screen captures are named on a French language OS X. Those are the files that I see triggering that error most frequently.

I'm zipping it, for the case Trac would try to auto-fix those characters.

@tar.gz
3 years ago

A file containing combined diacritics

#7 @nacin
3 years ago

  • Keywords commit added; dev-feedback 3.7-early removed
  • Milestone changed from 3.7 to 3.8

This looks great. Would be nice to have additional test data (such as a collection of filenames like what tar.gz uploaded), versus just ranges that get turned into empty strings.

#8 @tar.gz
3 years ago

Today one of my German-speaking customers bumped into that problem, which motivated me to re-test the patch against the current beta (3.7-beta1-25639).

I applied both 22363.2.patch:ticket:22363 (makes remove_accents run for sanitize_file_name), and 24661.0.rev24527.patch (improves remove_accents by fixing combined characters).

My findings are the same. To trigger the bug, I need to upload the customer's files via FF24/OSX. When uploading with Chrome or Safari, the browser seems to auto-convert the characters, while FF leaves them as combined characters. So it's an edge case, but within a population of 95–100 million native German speakers, 40% of them running FF, it still occurs too frequently (not mentioning the French)...

Before the patch:
http://sandbox.4o4.ch/wp-git/?attachment_id=555
Filename: Buttermödeli.jpg
Result: broken in Safari, working fine in FF/Chrome.

After the patch, uploading the same image:
http://sandbox.4o4.ch/wp-git/?attachment_id=557
Filename: Buttermodeli.jpg
Result: working fine everywhere, including Safari.

Files follow in attachment.

@tar.gz
3 years ago

Two real-world attachments with German combined diacritics.

@nunomorgadinho
3 years ago

previous unit test plus tests for a collection of filenames that could be problematic

#9 @nunomorgadinho
3 years ago

  • Cc nuno.morgadinho@… added
  • Keywords needs-unit-tests removed

#10 @p_enrique
3 years ago

The current patch uses preg_replace() with the u-modifier without checking if the PCRE UTF-8 extension is active. This will throw warnings and blanks the string. There should be a check for the extension and a fallback to plain UTF-8 bytes. And since we're using the UTF-8 extension, we can use the Unicode character properties for filtering. See attached patch.

@p_enrique
3 years ago

Use PCRE extension only when it's available

#11 @p_enrique
3 years ago

Here are my test results of strings using combined accents:

remove_accents( 'Cáo nâu lanh lẹ nhảy qua người lười biếng. Do bạch kim rất quý, sẽ để lắp vô xương' );
 --> string(82) "Cao nau lanh le nhay qua nguoi luoi bieng. Do bach kim rat quy, se de lap vo xuong"

remove_accents( 'Příliš žluťoučký kůň úpěl ďábelské kódy. Pójdźże, kiń tę chmurność w głąb flaszy! Päťtýždňové vĺčatá nervózne štekajú na môjho ďatľa v tŕní.' );
 --> string(140) "Prilis zlutoucky kun upel dabelske kody. Pojdzze, kin te chmurnosc w glab flaszy! Pattyzdnove vlcata nervozne stekaju na mojho datla v trni."

remove_accents( 'Les naïfs ægithales hâtifs pondant à noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.' );
 --> string(115) "Les naifs aegithales hatifs pondant a noel ou il gele sont surs d'etre decus en voyant leurs droles d'oeufs abimes."

remove_accents( 'a᷄ (a + combining macron-acute). +⃟ (plus with enclosing diamond). a︠e︡ (ae with a combining ligature)' );
 --> string(98) "a (a + combining macron-acute). + (plus with enclosing diamond). ae (ae with a combining ligature)"

#12 @p_enrique
3 years ago

  • Cc heikki@… added

#13 @nacin
3 years ago

  • Keywords commit removed
  • Milestone changed from 3.8 to Future Release

#14 @chriscct7
14 months ago

  • Keywords needs-refresh added

@gitlost
3 months ago

Refresh, modified to target only "Mn" chars following Latin chars. Includes unit tests.

#16 @gitlost
3 months ago

Refresh of the patch, modified to target only "Mn" non-spacing combining marks, and only if they follow Latin characters. This follows the approach taken by the ICU Latin-ASCII.xml transliteration rule file. The single-byte case (when PCRE UTF-8/UCP is unavailable) uses regexs generated from the UCD data files UnicodeData.txt and Scripts.txt.

The patch includes unit tests drawn from the previous ones.

It would address the OP bug report in the parent ticket #22363, assuming remove_accents is added as a filter on sanitize_file_name.

#17 @SergeyBiryukov
3 months ago

  • Milestone changed from Future Release to 4.7

WP_LATIN_REGEX_ALTS should probably be a filter instead of a constant.

Last edited 3 months ago by SergeyBiryukov (previous) (diff)

#18 @gitlost
3 months ago

  • Keywords needs-refresh removed

Hi, do you mean put the constant through an apply_filters()? If so, is there a use case?

Could use just ASCII [A-Za-z] instead of WP_LATIN_SCRIPT_REGEX_ALT (and \p{Latin}), as the stripping here comes after the ASCII mapping, rather than before as in the ICU situation, but against that the $chars mapping isn't complete.

#19 @markoheijnen
3 months ago

We should not make it exposable at all for now and look per use case that may come up if we want to expose it.

This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.


3 months ago

#21 @markoheijnen
3 months ago

Also wondering for the if check if we need to do something similar as _wp_can_use_pcre_u

#22 @mikeschroder
3 months ago

@pento, do you have any thoughts on the approach, especially with respect to the list of characters, in 24661.2.patch?

#23 @mikeschroder
3 months ago

  • Keywords dev-feedback added

Related: #22363

#24 @gitlost
3 months ago

A _wp_can_use_pcre_u-like function would be handy, for the unit test for one thing.

A detail about the ICU Latin-ASCII is that the first thing it does is globally filter the input with :: [[:Latin:][:Common:][:Inherited:][〇]] ;, so the later \p{Mn} is actually only matching the combining marks that make it though the filter, which only occur in \p{Inherited}. So the same intersection could be done here, which reduces the single byte regex alt to 554 code points, making it a lot less scary:

define( 'WP_MN_INHERITED_REGEX_ALTS', '\xcc[\x80-\xbf]|\xcd[\x80-\xaf]|\xd2[\x85\x86]|\xd9[\x8b-\x95\xb0]|\xe0\xa5[\x91\x92]|\xe1(?:\xaa[\xb0-\xbd]|\xb3[\x90-\x92\x94-\xa0\xa2-\xa8\xad\xb4\xb8\xb9]|\xb7[\x80-\xb5\xbb-\xbf])|\xe2\x83[\x90-\x9c\xa1\xa5-\xb0]|\xe3(?:\x80[\xaa-\xad]|\x82[\x99\x9a])|\xef\xb8[\x80-\x8f\xa0-\xad]|\xf0(?:\x90(?:\x87\xbd|\x8b\xa0)|\x9d(?:\x85[\xa7-\xa9\xbb-\xbf]|\x86[\x80-\x82\x85-\x8b\xaa-\xad]))|\xf3\xa0(?:[\x84-\x86][\x80-\xbf]|\x87[\x80-\xaf])' ); // 554 code points.

It complicates the UCP usage though, eg /(?<=\p{Latin})(?:(?=\p{Inherited})\p{Mn})+/u

This ticket was mentioned in Slack in #core-images by mike. View the logs.


2 months ago

@gitlost
2 months ago

Add _wp_can_use_pcre_ucp() function.

#26 @gitlost
2 months ago

This patch incorporates the suggestion of @markoheijnen to introduce a _wp_can_use_pcre_u-like function. It doesn't bother with the \p{Inherited} change, as it complicates the UCP case, which (as hinted at by markoheijnen in Slack) is undoubtedly the "high-runner" by a country mile. The unit test has been expanded to include all Unicode 5.0.0 Mn characters.

#27 @ocean90
2 months ago

What's the point of the extra file and the constants? Why would I define WP_MN_REGEX_ALTS by myself?

#28 @gitlost
2 months ago

The only point of the extra file is that it's generated from Unicode data, so this keeps it separate, accessible via the constants. The test for WP_MN_REGEX_ALTS is just whether to include the file or not.

@gitlost
2 months ago

Use statics instead of global defines.

#29 @gitlost
2 months ago

Changed to use statics instead of global defines. This addresses #comment:19 and hopefully makes the code clearer.

@gitlost
2 months ago

Er, do it right.

This ticket was mentioned in Slack in #core by mike. View the logs.


2 months ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 weeks ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 weeks ago

#33 @pento
7 weeks ago

  • Keywords dev-feedback removed

Initial thoughts on 24661.5.patch:

  • Where are gen_cat_regex_alts.php and gen_script_regex_alts.php? Ideally, these should be run as part of the build process, so it's built automatically.
  • I'm undecided about regex_alts.php being a separate file, it feels unnecessary. The regex will only be parsed if it's run, so the performance difference on load time will be negligible.
  • The test_remove_accents_mn_combining_marks() test only tests Unicode 5, it should include tests for Unicode 9. Similarly, I'd like to see these tests generated automatically.
  • The testing parameter for _wp_can_use_pcre_ucp() feels pretty hacky. It's less bad due to being an internal function, but still not great. What's the performance benefit of using the static flag over just running the regex every time it's called?
  • I currently have no comment on the change to remove_accents(), because it's Saturday, and I'm not doing that deep a dive on my weekend. :-)

#34 @gitlost
7 weeks ago

Thanks for looking at this.

gen_cat_regex_alts.php and gen_script_regex_alts.php (and gen_remove_accents_ranges.php below) were just on my local machine. As a temporary measure I've uploaded them (after a bit of a clean up) to the tools directory of the UNFC Normalizer plugin, as they use a functions library developed for it.

I'm not sure they should be part of the build process, seeing as they depend on infrequently changing Unicode data. However I will clean them up further and separate them out so they're suitable for upload to the trunc tools directory if required.

Nobody likes the separate file, so I'll upload a new patch with the regexs just copy-and-pasted in. It was never a performance thing, more for ease of generation (and to hide their ugliness). This now also argues against them being part of the build process though.

The reason it's only testing Unicode 5.0.0 is due to the PCREs that are bundled with PHP 5.2.4 to PHP 7 being built with Unicode data ranging from Unicode 5.0.0 to Unicode 7.0.0, and thus giving varying results for character properties over their various lifetimes. I could generate different ranges depending on the version of PCRE being used when the tests are run, but I thought that was a bit OTT so went for the lowest common denominator.

The tests were generated automatically, by gen_remove_accents_ranges.php, which as mentioned above I've uploaded.

_wp_can_use_pcre_ucp() is a direct analogue of the existing _wp_can_use_pcre_u() function which you might have missed. Again it's not really a performance thing more for convenience of testing, as well as being semantic and usable elsewhere.

@gitlost
7 weeks ago

Don't use separate file for single-byte regexs.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


6 weeks ago

#36 @helen
5 weeks ago

  • Milestone changed from 4.7 to Future Release

I hate to punt this ticket yet again, but we are into beta for 4.7 and with so much going on, I don't know that commits here would get the attention they really need and deserve. Punting; an intrepid committer can always bring it back.

Note: See TracTickets for help on using tickets.