WordPress.org

Make WordPress Core

Opened 8 months ago

Last modified 4 weeks ago

#22363 new defect (bug)

Accents in attachment filenames should be sanitized

Reported by: tar.gz Owned by:
Priority: normal Milestone: Future Release
Component: Upload Version: 3.4
Severity: normal Keywords: has-patch
Cc: code@…, knut@…

Description

There is an inconsistency in the way WP is sanitizing post slugs and attachment filenames.

Sanitizing the post slugs is a Good Thing(tm) for non-english users who use diacritics in their post titles.

Example: If I write a post with the title "Moiré patterns", the actual page slug will be: "moire-patterns". The space is replaced with a hyphen, the "é" becomes "e". Even if I try to change the slug manually into "moiré", WP won't let me (for my own good, since that URL would break in lesser capable browsers).

For some reason, WP doesn't apply that error-correction to attachment filenames.

Example: If I attach a file named "moiré pattern.png", it gets renamed into "moiré-pattern.png".

We can see that the space (and some other forbidden characters) are corrected by sanitize_file_name(), but diacritics such as "é" or "ä" are left as they are.

Currently, most modern browsers are capable of displaying files with diacritics, but some of them still fail (most prominently, Safari).

For the sake of cross-browser compatibility, attachment filenames should benefit from the same safety measures that we apply for the post slugs (I guess that's the remove_accents() function).

Attachments (5)

diacritics-ff-safari.png (22.3 KB) - added by tar.gz 8 months ago.
Diacritics resulting in broken images in Safari.
22363.patch (505 bytes) - added by SergeyBiryukov 5 months ago.
22363.2.patch (658 bytes) - added by SergeyBiryukov 5 months ago.
møiré pättern.png (78.8 KB) - added by tar.gz 5 months ago.
A test file with diacritics. The title has been typed under MacOSX.
img-src-Trac-WP.png (36.2 KB) - added by tar.gz 5 months ago.
The img src code, in Trac and in WP

Download all attachments as: .zip

Change History (23)

tar.gz8 months ago

Diacritics resulting in broken images in Safari.

comment:1 tar.gz8 months ago

  • Summary changed from Accentes in attachement filenames should be sanitized to Accents in attachement filenames should be sanitized

comment:2 tar.gz8 months ago

  • Cc code@… added
  • Summary changed from Accents in attachement filenames should be sanitized to Accents in attachment filenames should be sanitized

comment:3 tar.gz8 months ago

Possibly related: #15955, #16191, #16330

comment:5 SergeyBiryukov7 months ago

  • Version changed from trunk to 3.4

comment:6 knutsp7 months ago

  • Cc knut@… added

comment:7 tar.gz7 months ago

Ok, so I attempted a rudimentary patch, by adding a remove_accents() step at the beginning of wp_unique_filename(), in wp-includes/functions.php.

Like, adding

$filename = remove_accents($filename);

just after line 1606.

But no luck, this doesn't work, it doesn't change anything. A file named "moiré.jpg" still keeps the accent after being uploaded.

comment:8 SergeyBiryukov6 months ago

#23029 was marked as a duplicate.

comment:9 tar.gz5 months ago

By the way, I'm offering a free drink to the developer who patches this issue:

http://www.freedomsponsors.org/core/issue/121/accents-in-attachment-filenames-should-be-sanitized

SergeyBiryukov5 months ago

SergeyBiryukov5 months ago

comment:10 SergeyBiryukov5 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

22363.patch calls remove_accents() in sanitize_file_name(), like we do in sanitize_title().

22363.2.patch adds it as a filter instead.

comment:11 tar.gz5 months ago

I applied 22363.patch and 22363.2.patch on WP 3.6-alpha-23288, but it's not working for me: when uploading "moiré.jpg" and "höhö.jpg", they don't get corrected.

What am I doing wrong ?

Last edited 5 months ago by tar.gz (previous) (diff)

comment:12 SergeyBiryukov5 months ago

Hmm, they both work for me.

Tested in PHP 5.2.14 on Windows and PHP 5.2.17 on Linux. When uploading "moiré.jpg" and "höhö.jpg", the resulting files are "moire.jpg" and "hoho.jpg", respectively.

comment:13 tar.gz5 months ago

Indeed, the results are different in Linux, Windows, OSX.

Here is the current status of my testing:

Viewing of images with accents ("moiré.jpg") is broken in:

  • Safari 5.1.7 on OSX (wasn't able to test Safari 6 yet).
  • Safari on iOS 5.1.1 (test device: iPod)
  • Safari on iOS 6 (test device: iPad)

With the patch you provided, when uploading a file named "Forêt.jpg":

  • Windows Vista / IE7 : works, saves as "Foret.jpg".
  • Ubuntu 11.04 / Firefox 18 : works, saves as "Foret.jpg".
  • Mac OSX 10.6 / Firefox 18 : doesn't work, saves as "Forêt.jpg".

Then I tested with some other characters, and found out that it's even more complicated:

Uploading on Ubuntu/Firefox: ê, ç, ä get converted (to e, c, a) but the "ö" remains as it is. A file named "höhö.jpg" does not get renamed.

Uploading a file named "møiré pättern.png" under OSX/Firefox:

  • On an unpatched WP 3.5, only the blank space is converted into hyphen, the file is saved as: "møiré-pättern.png".
  • On WP 3.5 with your patch applied, the file is saved as "moiré-pättern.png" - the nordic "ø" has been converted into "o".

So something is working, it's just that some accented characters aren't correctly recognized! I hope this brings us on the right track.

One thing that comes to my mind is that there are two different ways to generate those accented characters, one of them being "combined". For instance, there is a "single glyph" version of "é" that a hex editor displays as "C3 A9", and a combined (base+diacritic) version that displays as "65 CC 81".

And indeed, if I paste the filename of "moiré-pättern.png" into a text file and open it with some hex editor, I see that the ø is a singly glyph (= gets converted correctly), while the é and ä are combined characters.

I imagine that this could be the source of the inconsistencies? So the result actually depends upon the OS, and perhaps even the type of keyboard, on which the filenames have been typed.

One more test. If I copy-paste that "møiré pättern" string from the filename into the title field of a new post, WP generates the following permalink: "moire-pättern". That's interesting: the combined-character é has been fixed by WP, but the combined-character ä hasn't.

FYI, my test server is running PHP 5.3.10.

And by the way, congrats on your Bug Gardener nomination :)

tar.gz5 months ago

A test file with diacritics. The title has been typed under MacOSX.

tar.gz5 months ago

The img src code, in Trac and in WP

comment:14 tar.gz5 months ago

Above you can see a quick demonstration of how Trac handles the same file.

Trac uses a different approach: all the characters get converted into html, so the resulting file is actually named "m%C3%B8ir%C3%A9%20p%C3%A4ttern.png". This works fine in Safari.

Personally, I prefer the "human-readable" approach of WP, which in theory would result in the file being renamed "moire-pattern.png". But currently it isn't able to catch all those ugly diacritics.

comment:15 tar.gz5 months ago

One more thing: during my tests, I came across a possible workaround.

We can make those files viewable in safari if we convert the filenames to html in the source code.

When embedding the image into a post, if we link to it with this path (the code given by the Media Uploader):

<img src="/wp-content/uploads/moiré-pättern.png" alt="møiré pättern" />

It will be broken in Safari. But if we write the link like this:

<img src="/wp-content/uploads/moire%CC%81-pa%CC%88ttern.png" alt="møiré pättern" />

then it will display just fine, although it's exactly the same file.

Version 2, edited 5 months ago by tar.gz (previous) (next) (diff)

comment:16 tar.gz5 months ago

Some more information after further research:

  • The same issue obviously appears for any type of attachment, such as a PDF that would be linked from within a post. In Safari, linking to a PDF file with diacritics will lead to the 404 page.
  • On WordPress.com, a solution has been implemented: when uploading the "moiré-pättern.png" file from above, it gets renamed into "mc3b8irecc81-pacc88ttern.png". So there is some sanitization in place, which is lacking in WordPress.org. And it works in OSX.

Here are two more links from support forums, showing how this issue affects users around the world:

All this testing actually leads me to think that I should open another ticket, not about renaming the files, but about generating URL-encoded links when embedding files into a post (or generating galleries with the shortcode). According to RFC 3986 (URI Generic Syntax), accented characters should be percent-encoded. I guess the simple solution would be to cleanse those filenames with the rawurlencode() PHP function when generating the link.

With that strategy, we would have a pretty good fix, even without touching the upload / file renaming process.

comment:17 ryan4 weeks ago

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

Last edited 4 weeks ago by ryan (previous) (diff)

comment:18 ryan4 weeks ago

  • Milestone changed from 3.6 to Future Release
Note: See TracTickets for help on using tickets.