Opened 6 months ago
Last modified 4 months ago
#22363 new defect (bug)
Accents in attachment filenames should be sanitized
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | 3.6 |
| Component: | Upload | Version: | 3.4 |
| Severity: | normal | Keywords: | has-patch |
| Cc: | code@…, knut@… |
Description
There is an inconsistency in the way WP is sanitizing post slugs and attachment filenames.
Sanitizing the post slugs is a Good Thing(tm) for non-english users who use diacritics in their post titles.
Example: If I write a post with the title "Moiré patterns", the actual page slug will be: "moire-patterns". The space is replaced with a hyphen, the "é" becomes "e". Even if I try to change the slug manually into "moiré", WP won't let me (for my own good, since that URL would break in lesser capable browsers).
For some reason, WP doesn't apply that error-correction to attachment filenames.
Example: If I attach a file named "moiré pattern.png", it gets renamed into "moiré-pattern.png".
We can see that the space (and some other forbidden characters) are corrected by sanitize_file_name(), but diacritics such as "é" or "ä" are left as they are.
Currently, most modern browsers are capable of displaying files with diacritics, but some of them still fail (most prominently, Safari).
For the sake of cross-browser compatibility, attachment filenames should benefit from the same safety measures that we apply for the post slugs (I guess that's the remove_accents() function).
Attachments (5)
Change History (21)
- Summary changed from Accentes in attachement filenames should be sanitized to Accents in attachement filenames should be sanitized
- Cc code@… added
- Summary changed from Accents in attachement filenames should be sanitized to Accents in attachment filenames should be sanitized
A quick sample of forum entries from people bumping into that problem:
- http://wordpress.org/support/topic/plugin-nextgen-gallery-a-solution-for-accented-characters-in-uploaded-filenames
- http://wordpress.org/support/topic/plugin-grand-flagallery-best-photo-gallery-problem-with-accents-in-filenames?replies=3
- http://wordpress.org/support/topic/plugin-wp-post-to-pdf-problem-with-the-image-names-with-accented-characters?replies=1
- http://wordpress.org/support/topic/images-with-diacritic-letter-dont-work?replies=10
- http://wordpress.org/support/topic/plugin-nextgen-gallery-pictures-with-diacritics-are-not-making-thumbs?replies=1
Most of them are talking about specific plugins, but if WP could fix the diacritics at the upload, all those problems would be fixed as well.
comment:5
SergeyBiryukov — 6 months ago
- Version changed from trunk to 3.4
Ok, so I attempted a rudimentary patch, by adding a remove_accents() step at the beginning of wp_unique_filename(), in wp-includes/functions.php.
Like, adding
$filename = remove_accents($filename);
just after line 1606.
But no luck, this doesn't work, it doesn't change anything. A file named "moiré.jpg" still keeps the accent after being uploaded.
comment:8
SergeyBiryukov — 5 months ago
#23029 was marked as a duplicate.
By the way, I'm offering a free drink to the developer who patches this issue:
http://www.freedomsponsors.org/core/issue/121/accents-in-attachment-filenames-should-be-sanitized
SergeyBiryukov — 4 months ago
SergeyBiryukov — 4 months ago
comment:10
SergeyBiryukov — 4 months ago
- Keywords has-patch added
- Milestone changed from Awaiting Review to 3.6
22363.patch calls remove_accents() in sanitize_file_name(), like we do in sanitize_title().
22363.2.patch adds it as a filter instead.
comment:11
tar.gz — 4 months ago
I applied 22363.patch and 22363.2.patch on WP 3.6-alpha-23288, but it's not working for me: when uploading "moiré.jpg" and "höhö.jpg", they don't get corrected.
What am I doing wrong ?
comment:12
SergeyBiryukov — 4 months ago
Hmm, they both work for me.
Tested in PHP 5.2.14 on Windows and PHP 5.2.17 on Linux. When uploading "moiré.jpg" and "höhö.jpg", the resulting files are "moire.jpg" and "hoho.jpg", respectively.
comment:13
tar.gz — 4 months ago
Indeed, the results are different in Linux, Windows, OSX.
Here is the current status of my testing:
Viewing of images with accents ("moiré.jpg") is broken in:
- Safari 5.1.7 on OSX (wasn't able to test Safari 6 yet).
- Safari on iOS 5.1.1 (test device: iPod)
- Safari on iOS 6 (test device: iPad)
With the patch you provided, when uploading a file named "Forêt.jpg":
- Windows Vista / IE7 : works, saves as "Foret.jpg".
- Ubuntu 11.04 / Firefox 18 : works, saves as "Foret.jpg".
- Mac OSX 10.6 / Firefox 18 : doesn't work, saves as "Forêt.jpg".
Then I tested with some other characters, and found out that it's even more complicated:
Uploading on Ubuntu/Firefox: ê, ç, ä get converted (to e, c, a) but the "ö" remains as it is. A file named "höhö.jpg" does not get renamed.
Uploading a file named "møiré pättern.png" under OSX/Firefox:
- On an unpatched WP 3.5, only the blank space is converted into hyphen, the file is saved as: "møiré-pättern.png".
- On WP 3.5 with your patch applied, the file is saved as "moiré-pättern.png" - the nordic "ø" has been converted into "o".
So something is working, it's just that some accented characters aren't correctly recognized! I hope this brings us on the right track.
One thing that comes to my mind is that there are two different ways to generate those accented characters, one of them being "combined". For instance, there is a "single glyph" version of "é" that a hex editor displays as "C3 A9", and a combined (base+diacritic) version that displays as "65 CC 81".
And indeed, if I paste the filename of "moiré-pättern.png" into a text file and open it with some hex editor, I see that the ø is a singly glyph (= gets converted correctly), while the é and ä are combined characters.
I imagine that this could be the source of the inconsistencies? So the result actually depends upon the OS, and perhaps even the type of keyboard, on which the filenames have been typed.
One more test. If I copy-paste that "møiré pättern" string from the filename into the title field of a new post, WP generates the following permalink: "moire-pättern". That's interesting: the combined-character é has been fixed by WP, but the combined-character ä hasn't.
FYI, my test server is running PHP 5.3.10.
And by the way, congrats on your Bug Gardener nomination :)
comment:14
tar.gz — 4 months ago
Above you can see a quick demonstration of how Trac handles the same file.
Trac uses a different approach: all the characters get converted into html, so the resulting file is actually named "m%C3%B8ir%C3%A9%20p%C3%A4ttern.png". This works fine in Safari.
Personally, I prefer the "human-readable" approach of WP, which in theory would result in the file being renamed "moire-pattern.png". But currently it isn't able to catch all those ugly diacritics.
comment:15
tar.gz — 4 months ago
One more thing: during my tests, I came across a possible workaround.
We can make those files viewable in safari if we URL-encode the filenames.
When embedding the image into a post, if we link to it with this path (the code given by the Media Uploader):
<img src="/wp-content/uploads/moiré-pättern.png" alt="møiré pättern" />
It will be broken in Safari. But if we write the link like this:
<img src="/wp-content/uploads/moire%CC%81-pa%CC%88ttern.png" alt="møiré pättern" />
then it will display just fine, although it's exactly the same file.
comment:16
tar.gz — 4 months ago
Some more information after further research:
- The same issue obviously appears for any type of attachment, such as a PDF that would be linked from within a post. In Safari, linking to a PDF file with diacritics will lead to the 404 page.
- On WordPress.com, a solution has been implemented: when uploading the "moiré-pättern.png" file from above, it gets renamed into "mc3b8irecc81-pacc88ttern.png". So there is some sanitization in place, which is lacking in WordPress.org. And it works in OSX.
Here are two more links from support forums, showing how this issue affects users around the world:
- https://discussions.apple.com/thread/4381117 (has some good info!)
- http://wordpress.org/support/topic/uploaded-image-with-accents-in-name-image-dont-show-in-safari-6?replies=5
All this testing actually leads me to think that I should open another ticket, not about renaming the files, but about generating URL-encoded links when embedding files into a post (or generating galleries with the shortcode). According to RFC 3986 (URI Generic Syntax), accented characters should be percent-encoded. I guess the simple solution would be to cleanse those filenames with the rawurlencode() PHP function when generating the link.
With that strategy, we would have a pretty good fix, even without touching the upload / file renaming process.

Diacritics resulting in broken images in Safari.