Make WordPress Core

Opened 2 years ago

Last modified 2 years ago

#54818 new defect (bug)

Some file names are no longer sanitized

Reported by: Chaton666
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version:
Component: Media
Keywords:
Focuses: administration
Cc:


When uploading a file whose name contains accented characters, the accents are no longer replaced by their unaccented equivalents.
This appears to be specific to JPEG files.
Tested with PDF and PNG: the accents are removed correctly.

See the attached screenshot with JPEG, PNG and PDF examples.

I can reproduce this with WP 5.9 RC2 and WP 5.8.3.

Attachments (2)

filenames.png (105.1 KB) - added by Chaton666 2 years ago.
54818.diff (1.0 KB) - added by Chaton666 2 years ago.


Change History (10)

#1 @audrasjb
2 years ago

  • Milestone changed from Awaiting Review to 5.9

Thanks for opening this issue @Chaton666, let's move this to milestone 5.9 for further investigation.

#2 @Chaton666
2 years ago

It looks like the problem is in the sanitize_file_name function (wp-includes/formatting.php).
With a file named "Exemple-1-activités-de-construction.jpg", the function returns at this check and the accent is not removed:

// Return if only one extension.
if ( count( $parts ) <= 2 ) {
	/** This filter is documented in wp-includes/formatting.php */
	return apply_filters( 'sanitize_file_name', $filename, $filename_raw );
}
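
To make the quoted check concrete, here is a rough Python paraphrase (not WordPress code). It assumes that $parts is the filename split on "." characters, which the quoted snippet does not show; has_single_extension is a hypothetical name used only for this illustration.

```python
def has_single_extension(filename: str) -> bool:
    """Hypothetical paraphrase of the quoted check.

    Assumption: the parts come from splitting the filename on ".".
    """
    parts = filename.split(".")
    # One dot gives two parts, so the function would return early here.
    return len(parts) <= 2


print(has_single_extension("Exemple-1-activités-de-construction.jpg"))  # True
print(has_single_extension("archive.tar.gz"))  # False
```

Under that assumption, the example filename has a single dot, so it takes the early return. That alone does not explain why the accent survives, which is what the later comments track down.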

#3 @Chaton666
2 years ago

Well, it has nothing to do with the JPEG file type.
With a file named "Exemple-2-activités-de-construction.png" (PNG), the accent is not removed either.
I guess this filename contains some pattern that is not handled by str_replace and preg_replace.

Last edited 2 years ago by Chaton666

#4 @Chaton666
2 years ago

  • Summary changed from JPEG file names are no longer sanitized to Some file names are no longer sanitized

#5 @Chaton666
2 years ago

After some research, my "é" is not a single precomposed é character.
Found on StackOverflow:

"Your character é is actually 0x65cc81, rather than the more usual single Unicode codepoint in UTF-8 0xc3a9 (é LATIN SMALL LETTER E WITH ACUTE (U+00E9)). 0x65cc81 is a Unicode "Combining sequence": 0x65 is e "LATIN SMALL LETTER E" (U+0065) and 0xcc81 is ́ "COMBINING ACUTE ACCENT (U+0301)"."

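The difference described in the quote can be checked with a short Python sketch (used here only as a neutral illustration; it is not part of WordPress). The variable names are hypothetical.

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point (U+00E9)
combining = "e\u0301"    # e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

# The two strings look identical on screen but have different UTF-8 bytes.
print(precomposed.encode("utf-8").hex())  # c3a9
print(combining.encode("utf-8").hex())    # 65cc81

# A plain string comparison or replacement keyed on U+00E9 will not match.
print(precomposed == combining)  # False

# NFC normalization composes the sequence into the single code point.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
```

This is why a replacement table keyed on the precomposed é leaves the combining form untouched.
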
#6 @Chaton666
2 years ago

Here is a patch adding "Latin small letter e with combining acute accent" to the characters filtered by the remove_accents function.

#7 @Chaton666
2 years ago

  • Keywords has-patch added

#8 @audrasjb
2 years ago

  • Keywords has-patch removed
  • Milestone changed from 5.9 to Awaiting Review
  • Severity changed from major to normal
  • Version 5.8.3 deleted

Alright, this is a known issue and it wasn't introduced in the 5.9 cycle.
Let's move this back to the awaiting review queue.

We also need a fix for other combinations.
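
One way to cover other combinations, rather than listing each sequence individually, is to normalize the string first. The Python sketch below illustrates that idea only; it is not the attached patch and not how remove_accents is implemented, and strip_accents is a hypothetical helper name.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("activite\u0301s"))  # combining form   -> activites
print(strip_accents("activit\u00e9s"))   # precomposed form -> activites
print(strip_accents("cre\u0300me"))      # another accent   -> creme
```

Characters with no canonical decomposition, such as ß or ø, pass through unchanged, so a mapping table would still be needed for those. Normalizing to NFC before applying an existing table is another option along the same lines.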
