Make WordPress Core

Opened 3 years ago

Last modified 8 months ago

#24661 new defect (bug)

remove_accents is not removing combining accents

Reported by: NumidWasNotAvailable Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 1.2.1
Component: Formatting Keywords: has-patch needs-refresh
Focuses: Cc:


As stated in ticket #22363, the function remove_accents does not remove combining accents.

Here is my test:

remove_accents( 'Capture d’écran 2013-02-20 à 23.36.06.png' );

This returns Capture d’écran 2013-02-20 à 23.36.06.png since é and à are made up with combining accents (e.g. à is a + `).

Attachments (6)

24661.0.rev24527.patch (773 bytes) - added by NumidWasNotAvailable 3 years ago.
24661.0.tests.rev1301.patch (1.2 KB) - added by numidwasnotavailable 3 years ago.
Unit test
Capture d’écran 2013-03-30 à 11.36.32.png.zip (89.8 KB) - added by tar.gz 3 years ago.
A file containing combined diacritics
HeavyMetalUmlaut.zip (481.2 KB) - added by tar.gz 3 years ago.
Two real-world attachments with German combined diacritics.
RemoveAccents.php.patch (1.9 KB) - added by nunomorgadinho 3 years ago.
previous unit test plus tests for a collection of filenames that could be problematic
24661.1.patch (1.1 KB) - added by p_enrique 3 years ago.
Use PCRE extension only when it's available

Download all attachments as: .zip

Change History (21)

#1 @NumidWasNotAvailable
3 years ago

  • Cc wordpress-org@… added
  • Keywords has-patch added

#2 @SergeyBiryukov
3 years ago

  • Keywords needs-unit-tests added
  • Version changed from trunk to 1.2.1

#3 @numidwasnotavailable
3 years ago

  • Keywords dev-feedback added

#4 @SergeyBiryukov
3 years ago

  • Keywords 3.7-early added
  • Milestone changed from Awaiting Review to Future Release

#5 @wonderboymusic
3 years ago

  • Milestone changed from Future Release to 3.7

these are all marked 3.7-early

#6 @tar.gz
3 years ago

I tested the behavior in Safari 6 on OSX 10.8 to see if the bug is still present - it is. So I think it's still worthwhile to fix this issue that potentially can affect every OS X or iOS user.

For the case you need a file for testing, here's a PNG that's named exactly the way screen captures are named on a French language OS X. Those are the files that I see triggering that error most frequently.

I'm zipping it, for the case Trac would try to auto-fix those characters.

3 years ago

A file containing combined diacritics

#7 @nacin
3 years ago

  • Keywords commit added; dev-feedback 3.7-early removed
  • Milestone changed from 3.7 to 3.8

This looks great. Would be nice to have additional test data (such as a collection of filenames like what tar.gz uploaded), versus just ranges that get turned into empty strings.

#8 @tar.gz
3 years ago

Today one of my German-speaking customers bumped into that problem, which motivated me to re-test the patch against the current beta (3.7-beta1-25639).

I applied both 22363.2.patch:ticket:22363 (makes remove_accents run for sanitize_file_name), and 24661.0.rev24527.patch (improves remove_accents by fixing combined characters).

My findings are the same. To trigger the bug, I need to upload the customer's files via FF24/OSX. When uploading with Chrome or Safari, the browser seems to auto-convert the characters, while FF leaves them as combined characters. So it's an edge case, but within a population of 95–100 million native German speakers, 40% of them running FF, it still occurs too frequently (not mentioning the French)...

Before the patch:
Filename: Buttermödeli.jpg
Result: broken in Safari, working fine in FF/Chrome.

After the patch, uploading the same image:
Filename: Buttermodeli.jpg
Result: working fine everywhere, including Safari.

Files follow in attachment.

3 years ago

Two real-world attachments with German combined diacritics.

3 years ago

previous unit test plus tests for a collection of filenames that could be problematic

#9 @nunomorgadinho
3 years ago

  • Cc nuno.morgadinho@… added
  • Keywords needs-unit-tests removed

#10 @p_enrique
3 years ago

The current patch uses preg_replace() with the u-modifier without checking if the PCRE UTF-8 extension is active. This will throw warnings and blanks the string. There should be a check for the extension and a fallback to plain UTF-8 bytes. And since we're using the UTF-8 extension, we can use the Unicode character properties for filtering. See attached patch.

3 years ago

Use PCRE extension only when it's available

#11 @p_enrique
3 years ago

Here are my test results of strings using combined accents:

remove_accents( 'Cáo nâu lanh lẹ nhảy qua người lười biếng. Do bạch kim rất quý, sẽ để lắp vô xương' );
 --> string(82) "Cao nau lanh le nhay qua nguoi luoi bieng. Do bach kim rat quy, se de lap vo xuong"

remove_accents( 'Příliš žluťoučký kůň úpěl ďábelské kódy. Pójdźże, kiń tę chmurność w głąb flaszy! Päťtýždňové vĺčatá nervózne štekajú na môjho ďatľa v tŕní.' );
 --> string(140) "Prilis zlutoucky kun upel dabelske kody. Pojdzze, kin te chmurnosc w glab flaszy! Pattyzdnove vlcata nervozne stekaju na mojho datla v trni."

remove_accents( 'Les naïfs ægithales hâtifs pondant à noël où il gèle sont sûrs d'être déçus en voyant leurs drôles d'œufs abîmés.' );
 --> string(115) "Les naifs aegithales hatifs pondant a noel ou il gele sont surs d'etre decus en voyant leurs droles d'oeufs abimes."

remove_accents( 'a᷄ (a + combining macron-acute). +⃟ (plus with enclosing diamond). a︠e︡ (ae with a combining ligature)' );
 --> string(98) "a (a + combining macron-acute). + (plus with enclosing diamond). ae (ae with a combining ligature)"

#12 @p_enrique
3 years ago

  • Cc heikki@… added

#13 @nacin
3 years ago

  • Keywords commit removed
  • Milestone changed from 3.8 to Future Release

#14 @chriscct7
11 months ago

  • Keywords needs-refresh added
Note: See TracTickets for help on using tickets.