WordPress.org

Make WordPress Core

Opened 20 months ago

Last modified 4 months ago

#22363 new defect (bug)

Accents in attachment filenames should be sanitized

Reported by: tar.gz Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 3.4
Component: Upload Keywords: has-patch needs-testing dev-feedback 4.0-early needs-unit-tests
Focuses: Cc:

Description

There is an inconsistency in the way WP is sanitizing post slugs and attachment filenames.

Sanitizing the post slugs is a Good Thing(tm) for non-english users who use diacritics in their post titles.

Example: If I write a post with the title "Moiré patterns", the actual page slug will be: "moire-patterns". The space is replaced with a hyphen, the "é" becomes "e". Even if I try to change the slug manually into "moiré", WP won't let me (for my own good, since that URL would break in lesser capable browsers).

For some reason, WP doesn't apply that error-correction to attachment filenames.

Example: If I attach a file named "moiré pattern.png", it gets renamed into "moiré-pattern.png".

We can see that the space (and some other forbidden characters) are corrected by sanitize_file_name(), but diacritics such as "é" or "ä" are left as they are.

Currently, most modern browsers are capable of displaying files with diacritics, but some of them still fail (most prominently, Safari).

For the sake of cross-browser compatibility, attachment filenames should benefit from the same safety measures that we apply for the post slugs (I guess that's the remove_accents() function).

Attachments (16)

diacritics-ff-safari.png (22.3 KB) - added by tar.gz 20 months ago.
Diacritics resulting in broken images in Safari.
22363.patch (505 bytes) - added by SergeyBiryukov 18 months ago.
22363.2.patch (658 bytes) - added by SergeyBiryukov 18 months ago.
møiré pättern.png (78.8 KB) - added by tar.gz 18 months ago.
A test file with diacritics. The title has been typed under MacOSX.
img-src-Trac-WP.png (36.2 KB) - added by tar.gz 18 months ago.
The img src code, in Trac and in WP
22363.3.rev24527.patch (2.5 KB) - added by NumidWasNotAvailable 13 months ago.
22363.4.rev24527.patch (2.7 KB) - added by NumidWasNotAvailable 13 months ago.
22363.5.diff (704 bytes) - added by tar.gz 9 months ago.
Same as 22363.2.patch, but updated to match the current context of default-filters.php
sanitize_file_name_breaking_characters_utf8.txt (22.5 KB) - added by p_enrique 8 months ago.
A list of Unicode characters that break sanitize_file_name()
22363.5.patch (3.8 KB) - added by p_enrique 8 months ago.
Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower
22363.6.patch (4.9 KB) - added by p_enrique 8 months ago.
Better check for PCRE UTF-8, updated comment block
test-SanitizeFileName.patch (3.0 KB) - added by p_enrique 8 months ago.
Updated unit test for sanitize_file_name()
22363.7.patch (5.0 KB) - added by p_enrique 8 months ago.
Check for ASCII-only instead of seems_utf8() before strtolower()
22363.8.patch (5.2 KB) - added by p_enrique 8 months ago.
Be sensitive to original encoding that might differ from UTF-8
22363.9.patch (5.2 KB) - added by p_enrique 7 months ago.
Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding
22363.10.patch (4.7 KB) - added by p_enrique 7 months ago.
Put the filters back in the same place

Download all attachments as: .zip

Change History (70)

tar.gz20 months ago

Diacritics resulting in broken images in Safari.

comment:1 tar.gz20 months ago

  • Summary changed from Accentes in attachement filenames should be sanitized to Accents in attachement filenames should be sanitized

comment:2 tar.gz20 months ago

  • Cc code@… added
  • Summary changed from Accents in attachement filenames should be sanitized to Accents in attachment filenames should be sanitized

comment:3 tar.gz20 months ago

Possibly related: #15955, #16191, #16330

comment:5 SergeyBiryukov20 months ago

  • Version changed from trunk to 3.4

comment:6 knutsp20 months ago

  • Cc knut@… added

comment:7 tar.gz20 months ago

Ok, so I attempted a rudimentary patch, by adding a remove_accents() step at the beginning of wp_unique_filename(), in wp-includes/functions.php.

Like, adding

$filename = remove_accents($filename);

just after line 1606.

But no luck, this doesn't work, it doesn't change anything. A file named "moiré.jpg" still keeps the accent after being uploaded.

comment:8 SergeyBiryukov19 months ago

#23029 was marked as a duplicate.

comment:9 tar.gz18 months ago

By the way, I'm offering a free drink to the developer who patches this issue:

http://www.freedomsponsors.org/core/issue/121/accents-in-attachment-filenames-should-be-sanitized

SergeyBiryukov18 months ago

SergeyBiryukov18 months ago

comment:10 SergeyBiryukov18 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

22363.patch calls remove_accents() in sanitize_file_name(), like we do in sanitize_title().

22363.2.patch adds it as a filter instead.

comment:11 tar.gz18 months ago

I applied 22363.patch and 22363.2.patch on WP 3.6-alpha-23288, but it's not working for me: when uploading "moiré.jpg" and "höhö.jpg", they don't get corrected.

What am I doing wrong ?

Last edited 18 months ago by tar.gz (previous) (diff)

comment:12 SergeyBiryukov18 months ago

Hmm, they both work for me.

Tested in PHP 5.2.14 on Windows and PHP 5.2.17 on Linux. When uploading "moiré.jpg" and "höhö.jpg", the resulting files are "moire.jpg" and "hoho.jpg", respectively.

comment:13 tar.gz18 months ago

Indeed, the results are different in Linux, Windows, OSX.

Here is the current status of my testing:

Viewing of images with accents ("moiré.jpg") is broken in:

  • Safari 5.1.7 on OSX (wasn't able to test Safari 6 yet).
  • Safari on iOS 5.1.1 (test device: iPod)
  • Safari on iOS 6 (test device: iPad)

With the patch you provided, when uploading a file named "Forêt.jpg":

  • Windows Vista / IE7 : works, saves as "Foret.jpg".
  • Ubuntu 11.04 / Firefox 18 : works, saves as "Foret.jpg".
  • Mac OSX 10.6 / Firefox 18 : doesn't work, saves as "Forêt.jpg".

Then I tested with some other characters, and found out that it's even more complicated:

Uploading on Ubuntu/Firefox: ê, ç, ä get converted (to e, c, a) but the "ö" remains as it is. A file named "höhö.jpg" does not get renamed.

Uploading a file named "møiré pättern.png" under OSX/Firefox:

  • On an unpatched WP 3.5, only the blank space is converted into hyphen, the file is saved as: "møiré-pättern.png".
  • On WP 3.5 with your patch applied, the file is saved as "moiré-pättern.png" - the nordic "ø" has been converted into "o".

So something is working, it's just that some accented characters aren't correctly recognized! I hope this brings us on the right track.

One thing that comes to my mind is that there are two different ways to generate those accented characters, one of them being "combined". For instance, there is a "single glyph" version of "é" that a hex editor displays as "C3 A9", and a combined (base+diacritic) version that displays as "65 CC 81".

And indeed, if I paste the filename of "moiré-pättern.png" into a text file and open it with some hex editor, I see that the ø is a singly glyph (= gets converted correctly), while the é and ä are combined characters.

I imagine that this could be the source of the inconsistencies? So the result actually depends upon the OS, and perhaps even the type of keyboard, on which the filenames have been typed.

One more test. If I copy-paste that "møiré pättern" string from the filename into the title field of a new post, WP generates the following permalink: "moire-pättern". That's interesting: the combined-character é has been fixed by WP, but the combined-character ä hasn't.

FYI, my test server is running PHP 5.3.10.

And by the way, congrats on your Bug Gardener nomination :)

tar.gz18 months ago

A test file with diacritics. The title has been typed under MacOSX.

tar.gz18 months ago

The img src code, in Trac and in WP

comment:14 tar.gz18 months ago

Above you can see a quick demonstration of how Trac handles the same file.

Trac uses a different approach: all the characters get converted into html, so the resulting file is actually named "m%C3%B8ir%C3%A9%20p%C3%A4ttern.png". This works fine in Safari.

Personally, I prefer the "human-readable" approach of WP, which in theory would result in the file being renamed "moire-pattern.png". But currently it isn't able to catch all those ugly diacritics.

comment:15 tar.gz18 months ago

One more thing: during my tests, I came across a possible workaround.

We can make those files viewable in safari if we URL-encode the filenames.

When embedding the image into a post, if we link to it with this path (the code given by the Media Uploader):

<img src="/wp-content/uploads/moiré-pättern.png" alt="møiré pättern" />

It will be broken in Safari. But if we write the link like this:

<img src="/wp-content/uploads/moire%CC%81-pa%CC%88ttern.png" alt="møiré pättern" />

then it will display just fine, although it's exactly the same file.

Last edited 18 months ago by tar.gz (previous) (diff)

comment:16 tar.gz18 months ago

Some more information after further research:

  • The same issue obviously appears for any type of attachment, such as a PDF that would be linked from within a post. In Safari, linking to a PDF file with diacritics will lead to the 404 page.
  • On WordPress.com, a solution has been implemented: when uploading the "moiré-pättern.png" file from above, it gets renamed into "mc3b8irecc81-pacc88ttern.png". So there is some sanitization in place, which is lacking in WordPress.org. And it works in OSX.

Here are two more links from support forums, showing how this issue affects users around the world:

All this testing actually leads me to think that I should open another ticket, not about renaming the files, but about generating URL-encoded links when embedding files into a post (or generating galleries with the shortcode). According to RFC 3986 (URI Generic Syntax), accented characters should be percent-encoded. I guess the simple solution would be to cleanse those filenames with the rawurlencode() PHP function when generating the link.

With that strategy, we would have a pretty good fix, even without touching the upload / file renaming process.

comment:17 follow-up: ryan14 months ago

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

Last edited 14 months ago by ryan (previous) (diff)

comment:18 ryan14 months ago

  • Milestone changed from 3.6 to Future Release

comment:19 in reply to: ↑ 17 tar.gz13 months ago

Replying to ryan:

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

On the other hand, such manipulations are applied when files are uploaded to WordPress.com.

Also, what do you think about percent-encoding the file URLs when inserting images into a post? (or when storing the file URL into the database - not sure how the process works internally). That would pretty much solve the issue, without changing the actual filename.

comment:20 NumidWasNotAvailable13 months ago

About sanitizing filenames

Modern browsers handle Unicode characters in URLs. Actually they perform an automatic translation using percent encoding. In some configurations it is quite transparent for the users.

RFC 3986 says that non ASCII charcaters are not allowed in URIs.

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser. This is understandable, just have a look at the statistics at StatCounter.

However, some users may encounter an issue since URLs are not RFC 3986 compliant. Furthermore, I would be interested to know what configurations are concerned by this issue so far.

I am coming up with the following patch: filenames are sanitized using sanitize_title as suggested by tar.gz.

About remove_accents

In this ticket we have discovered that remove_accents was not handling Unicode combining characters at all.

I think we can create a new ticket for this issue.

Version 0, edited 13 months ago by NumidWasNotAvailable (next)

comment:21 NumidWasNotAvailable13 months ago

  • Cc NumidWasNotAvailable added

comment:22 tar.gz13 months ago

Thanks for those patches!

I just tested 22363.3.rev24527.patch in combination with 24661.0.rev24527.patch from #24661 - works great, it fixes the problem. Here is my test page: http://sandbox.4o4.ch/wp-git/?p=506

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser.

This is not entirely true, since for *Post URLs*, WP *does* remove accented characters.

Secondly, the reason I brought up this issue, is that one of the "modern browsers", Safari (on OSX and iOS) is failing to display files with combined characters in their name (unless the url is %-encoded). Depending on the country where you live, that can be a significant segment - for Switzerland, StatCounter ranks Safari at 20% (and 60% for mobile).

Finally, those files with combined diacritics are also poorly supported by FTP clients. Whenever I have to migrate a WP site, in a French or German speaking area, some files will allways be broken during FTP transfer, and will need manual replacement. Fixing them at upload would simply eliminate all those potential future issues.

comment:23 NumidWasNotAvailable13 months ago

I updated the patch because filenames without extension were not processed properly.

comment:24 SergeyBiryukov13 months ago

  • Keywords needs-unit-tests added

comment:25 numidwasnotavailable11 months ago

Tests fail to pass with 22363.4.rev24527.patch. Indeed sanitize_file_name no longer performs extension munging.

I propose a rollback to 22363.2.patch) considering the fix in #24661.

comment:26 tar.gz10 months ago

Is there anything I can do to advance this bug, in order to see the issue fixed in WP 3.7? More testing needed?

comment:27 SergeyBiryukov10 months ago

  • Milestone changed from Future Release to 3.7

Looks like 22363.2.patch is the way to go here, and we should look into removing combined accents in #24661.

tar.gz9 months ago

Same as 22363.2.patch, but updated to match the current context of default-filters.php

comment:28 tar.gz9 months ago

For the record, Julio Poitier, well respected French WP coder, is also advocating this fix, and gets much positive feedback for it:
http://www.geekpress.fr/wordpress/astuce/suppression-accents-media-1903/

comment:29 juliobox9 months ago

Thanks tar.gz ;)

comment:30 SergeyBiryukov9 months ago

  • Milestone changed from 3.7 to 3.8

comment:31 GregLone8 months ago

  • Cc greg@… added

comment:32 p_enrique8 months ago

I've been experimenting with this issue. Now, I know I'm repeating most of the things said above, but I'd like to confirm the following:

  1. As for filenames:
    • Non-ASCII filenames will break on Windows. This has to do with how PHP interacts with the Windows file system. There might be a workaround that allows filenames to contain characters contained on the current "code page", but other characters will break filenames.
    • *nix filesystems allow any characters in filenames (minus the reserved characters, of course).
    • WordPress allows uploaded files to have any characters in their names (minus some reserved characters).
    • UTF-8 filenames work OK on WordPress on a *nix platform.
    • URL-encoded (percent-encoded) filenames work on any platform, but make the filename 6 times longer in certain scripts, such as Cyrillic ('%D0%B6' for the character 'ж', for example).
    • The maximum file name length under NTFS is 259 characters including the path.
    • URL-encoded filenames are not a solution. <img src='ca%C3%B1o.jpg'/> will 404 even though the file 'ca%C3%B1o.jpg' exists, since the server decodes the request to 'caño.jpg'.
    • remove_accents() is only a partial solution since 1) it has no support at all for non-Latin based scripts and 2) it doesn't do anything with such things as curly quotes, dashes, copyright symbols or other similar things (which are included in sanitize_title_with_dashes())
    • sanitize_title() as such is no good for filenames, since it URL-encodes the non-ASCII parts.
  1. As for URLs:
    • The only allowed characters in URLs according to RFC3986 are only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes. Everything else should be encoded.
    • The attachment anchor href and img src attributes are not encoded on WP.
    • Not encoding the above may break things in some browsers.
    • Not encoding an URI may break other software that recognizes pasted URIs.
    • An encoded anchor href is handled transparently by modern browsers. For examples, see The Russian Wikipedia, where hovering on a link shows Cyrillic text but in the source, the URI is %-encoded.
Last edited 8 months ago by p_enrique (previous) (diff)

comment:33 p_enrique8 months ago

  • Cc heikki@… added

comment:34 p_enrique8 months ago

I'm not sure if I should open a new ticket for this, maybe someone will advise me. Anyway, I've been testing sanitize_file_name( 'X.jpg' ) where X is an Unicode character that is a number or a letter (matching regex /[\p{L}\p{N}]/u). Alarmingly, there are many rather common characters that will result in a malformed, broken string being returned:

(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er

A complete list is attached.

Edit:
The culprit is the preg_replace without a Unicode pattern modifier. I found that the U-modifier is used here and there in WordPress sources, usually without any checking. I'd write some patches now, but I don't know how to do it & don't have time to learn just now.

Last edited 8 months ago by p_enrique (previous) (diff)

p_enrique8 months ago

A list of Unicode characters that break sanitize_file_name()

comment:35 tar.gz8 months ago

Thanks p_enrique for the feedback.

If those characters you mention present an issue, how comes that they don't produce malformed post slugs when using them in a post title?

Basically, the aim of this ticket is that file names should be sanitized in the same manner as post slugs.

Another question: based on your observations on percent-encoding, would it make sense to open another ticket, asking for all permalinks in WP (including the file permalinks) to be percent-encoded? See #comment:16 above.

comment:36 p_enrique8 months ago

tar.gz: They don't produce malformed slugs because sanitize_post_title() urlencodes everything that's not ASCII. This works great for anything that's an URL. But the mechanism cannot be used for file names, since the browser cannot access a file that has a urlencoded file name as I point out in #comment:32.

comment:38 p_enrique8 months ago

My recent patch uses wp_strip_all_tags, adds filter remove_accents and relies on preg_replace (with possible UTF-8 modifier) to remove other characters with support for the sanitize_file_name_chars filter. Also converting the filename to lower case if mb_string extension is available.

Here are some test results with my recent patch:

sanitize_file_name( "Posyłają\tSzczegóły\r\n(Październik).jpg" );
 -> string(34) "posylaja-szczegoly-pazdziernik.jpg"

sanitize_file_name( "<strong>Mọi người đều có quyền <foo> tự do tham gia vào đời sống văn hoá của cộng đồn</strong>" );
 -> string(97) "mọi-nguòi-dèu-có-quyèn-tụ-do-tham-gia-vào-dòi-sóng-van-hoá-của-cọng-dòn"

sanitize_file_name( "-ÉŁè Šíæ 20 % —40° “über/Þøß” βαΣιΛειος à © Ñœijç.õs¿ & Kö.yr.ä = 10 €-.gif.gif" );
 -> string(75) "ele-siae-20-40-uber-thos-βασιλειος-a-noeijc.os-ko.yr_.a-10.gif.gif"

sanitize_file_name( "---Все люди 1) рождаются свободными и 2) равными в своем 2A) достоинстве и 2B) ПРАВАХ!!!---.jpg" );
 -> string(140) "все-люди-1-рождаются-свободными-и-2-равными-в-своем-2a-достоинстве-и-2b-правах.jpg"

sanitize_file_name( "ณ ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ โลกจะใช้เพียง Unicode เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน .jpg" );
 -> string(246) "ณ-ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ-โลกจะใช้เพียง-unicode-เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน.jpg"

sanitize_file_name( "Invalid Unicode string: abc-\xff-def"
 -> string(0) ""

(There seem to be some Vietnamese characters not captured by remove_accents.)

Last edited 8 months ago by p_enrique (previous) (diff)

comment:39 p_enrique8 months ago

I have two more comments on this matter:

  1. I think there are plugins like cyr2lat, arabic-to-lat etc. that convert characters in Non-Latin scripts to ASCII. These will work if they hook to sanitize_file_name.
  1. There should be an extra filter for Windows-based servers that tries to convert everything possible to ASCII and then strips out the rest. Otherwise, the attachments and images will break under Windows. But such a filter would need a fallback filename for a case where converting/stripping leaves (practically) no ASCII at all. For example, the Russian file name (example 4 above) when stripping all non-ASCII and multiple dashes leaves 1-2-2a-2b.jpg How should such a case be handled?
Last edited 8 months ago by p_enrique (previous) (diff)

comment:40 p_enrique8 months ago

Patch updated. This now gives correct results:

sanitize_file_name( "A.String.With.Many.Dots.gif.gif.gif.ÖÑÉ.twõ.très.fö.ür" );
 --> string(61) "a.string.with_.many_.dots_.gif.gif.gif.one_.two_.tres_.fo_.ur"}}}

p_enrique8 months ago

Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower

comment:41 p_enrique8 months ago

Patch updated again. Added a call to html_entity_decode() if PCRE UTF-8 is available, fixed some typos, consolidated preg statements.

p_enrique8 months ago

Better check for PCRE UTF-8, updated comment block

p_enrique8 months ago

Updated unit test for sanitize_file_name()

comment:42 p_enrique8 months ago

  • Keywords needs-testing added; needs-unit-tests removed

I don't know if you're supposed to write unit tests for your own contributions, but I did anyway. Here they are. Please note that I modified one of the earlier tests since the earlier behaviour was to strip illegal characters, now they are replaced with dashes.

p_enrique8 months ago

Check for ASCII-only instead of seems_utf8() before strtolower()

p_enrique8 months ago

Be sensitive to original encoding that might differ from UTF-8

comment:43 cristovaov7 months ago

hi, I've been following this ticket and previously plugged patch 22363.2 (http://wp.me/KmdR)
Trying to apply patch 22363.8 also as a plugin but my technical knowledge is rather limited: getting 'Fatal error: Cannot redeclare sanitize_file_name()'

Could someone have a look? I've put my code as a gist here:
https://gist.github.com/cristovaov/7674225 --thanks

comment:44 cristovaov7 months ago

  • Cc cristovaov added

comment:45 NumidWasNotAvailable7 months ago

  • Cc NumidWasNotAvailable removed

p_enrique7 months ago

Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding

comment:46 follow-up: p_enrique7 months ago

@cristovaov: You can't make a plugin out of a patch that modifies core WP files. In this case, my patch modifies the function sanitize_file_name() in /wp-includes/formatting.php. If you put the code in a plugin, you'll get precisely that error message since the function is already declared in the file I mentioned. If you want to experiment, you can try the SVN install as described in the Contributor Handbook and then applying the patch. I'd be grateful for any feedback about my patch.

comment:47 SergeyBiryukov7 months ago

The sanitize_file_name filter should stay where it is now. It should be possible to override the sanitization if needed.

p_enrique7 months ago

Put the filters back in the same place

comment:48 p_enrique7 months ago

  • Keywords dev-feedback added

comment:49 in reply to: ↑ 46 cristovaov7 months ago

@p_enrique: Thanks for clearing that, I was blindly assuming it was possible.
I'll look at doing an svn install--

Replying to p_enrique:

@cristovaov: You can't make a plugin out of a patch that modifies core WP files.

comment:50 nacin7 months ago

  • Keywords 3.9-early added

I think this is great but it needs a lot of time to be soaked. Let's try it out in 3.9.

comment:51 nacin7 months ago

  • Milestone changed from 3.8 to Future Release

comment:52 markoheijnen6 months ago

  • Milestone changed from Future Release to 3.9

Nacin can we give this a try this release?

comment:53 ircbot4 months ago

This ticket was mentioned in IRC in #wordpress-dev by SergeyBiryukov. View the logs.

comment:54 DH-Shredder4 months ago

  • Keywords 4.0-early needs-unit-tests added; 3.9-early removed
  • Milestone changed from 3.9 to Future Release

We definitely want to do this, but it needs unit tests.

Of note, there is a lot more we can probably learn with regards to UTF8/PCRE from #22692

Note: See TracTickets for help on using tickets.