Make WordPress Core

Opened 12 years ago

Closed 4 years ago

Last modified 4 years ago

#22363 closed defect (bug) (fixed)

Accents in attachment filenames should be sanitized

Reported by: targz's profile tar.gz Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 5.5 Priority: normal
Severity: critical Version: 3.4
Component: Media Keywords: has-patch has-dev-note
Focuses: Cc:

Description

There is an inconsistency in the way WP is sanitizing post slugs and attachment filenames.

Sanitizing the post slugs is a Good Thing(tm) for non-english users who use diacritics in their post titles.

Example: If I write a post with the title "Moiré patterns", the actual page slug will be: "moire-patterns". The space is replaced with a hyphen, the "é" becomes "e". Even if I try to change the slug manually into "moiré", WP won't let me (for my own good, since that URL would break in lesser capable browsers).

For some reason, WP doesn't apply that error-correction to attachment filenames.

Example: If I attach a file named "moiré pattern.png", it gets renamed into "moiré-pattern.png".

We can see that the space (and some other forbidden characters) are corrected by sanitize_file_name(), but diacritics such as "é" or "ä" are left as they are.

Currently, most modern browsers are capable of displaying files with diacritics, but some of them still fail (most prominently, Safari).

For the sake of cross-browser compatibility, attachment filenames should benefit from the same safety measures that we apply for the post slugs (I guess that's the remove_accents() function).

Attachments (23)

diacritics-ff-safari.png (22.3 KB) - added by tar.gz 12 years ago.
Diacritics resulting in broken images in Safari.
22363.patch (505 bytes) - added by SergeyBiryukov 12 years ago.
22363.2.patch (658 bytes) - added by SergeyBiryukov 12 years ago.
møiré pättern.png (78.8 KB) - added by tar.gz 12 years ago.
A test file with diacritics. The title has been typed under MacOSX.
img-src-Trac-WP.png (36.2 KB) - added by tar.gz 12 years ago.
The img src code, in Trac and in WP
22363.3.rev24527.patch (2.5 KB) - added by NumidWasNotAvailable 12 years ago.
22363.4.rev24527.patch (2.7 KB) - added by NumidWasNotAvailable 12 years ago.
22363.5.diff (704 bytes) - added by tar.gz 12 years ago.
Same as 22363.2.patch, but updated to match the current context of default-filters.php
sanitize_file_name_breaking_characters_utf8.txt (22.5 KB) - added by p_enrique 11 years ago.
A list of Unicode characters that break sanitize_file_name()
22363.5.patch (3.8 KB) - added by p_enrique 11 years ago.
Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower
22363.6.patch (4.9 KB) - added by p_enrique 11 years ago.
Better check for PCRE UTF-8, updated comment block
test-SanitizeFileName.patch (3.0 KB) - added by p_enrique 11 years ago.
Updated unit test for sanitize_file_name()
22363.7.patch (5.0 KB) - added by p_enrique 11 years ago.
Check for ASCII-only instead of seems_utf8() before strtolower()
22363.8.patch (5.2 KB) - added by p_enrique 11 years ago.
Be sensitive to original encoding that might differ from UTF-8
22363.9.patch (5.2 KB) - added by p_enrique 11 years ago.
Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding
22363.10.patch (4.7 KB) - added by p_enrique 11 years ago.
Put the filters back in the same place
22363.11.patch (8.1 KB) - added by markoheijnen 9 years ago.
22363.diff (8.9 KB) - added by swissspidy 9 years ago.
22363.12.patch (4.2 KB) - added by gitlost 9 years ago.
A minimal patch.
Revelation‏.png (7.1 KB) - added by programmin 8 years ago.
Another file with special (hidden) character
22363.2.diff (683 bytes) - added by nilovelez 5 years ago.
added remove_accents to sanitize_file_name() in /wp-includes/formatting.php
22363.3.diff (650 bytes) - added by audrasjb 5 years ago.
Media: Remove accent from media file names (patch refreshed against trunk)
22363.4.diff (650 bytes) - added by audrasjb 5 years ago.
Partial patch refreshed

Download all attachments as: .zip

Change History (164)

@tar.gz
12 years ago

Diacritics resulting in broken images in Safari.

#1 @tar.gz
12 years ago

  • Summary changed from Accentes in attachement filenames should be sanitized to Accents in attachement filenames should be sanitized

#2 @tar.gz
12 years ago

  • Cc code@… added
  • Summary changed from Accents in attachement filenames should be sanitized to Accents in attachment filenames should be sanitized

#3 @tar.gz
12 years ago

Possibly related: #15955, #16191, #16330

#5 @SergeyBiryukov
12 years ago

  • Version changed from trunk to 3.4

#6 @knutsp
12 years ago

  • Cc knut@… added

#7 @tar.gz
12 years ago

Ok, so I attempted a rudimentary patch, by adding a remove_accents() step at the beginning of wp_unique_filename(), in wp-includes/functions.php.

Like, adding

$filename = remove_accents($filename);

just after line 1606.

But no luck, this doesn't work, it doesn't change anything. A file named "moiré.jpg" still keeps the accent after being uploaded.

#8 @SergeyBiryukov
12 years ago

#23029 was marked as a duplicate.

#9 @tar.gz
12 years ago

By the way, I'm offering a free drink to the developer who patches this issue:

http://www.freedomsponsors.org/core/issue/121/accents-in-attachment-filenames-should-be-sanitized

#10 @SergeyBiryukov
12 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

22363.patch calls remove_accents() in sanitize_file_name(), like we do in sanitize_title().

22363.2.patch adds it as a filter instead.

#11 @tar.gz
12 years ago

I applied 22363.patch and 22363.2.patch on WP 3.6-alpha-23288, but it's not working for me: when uploading "moiré.jpg" and "höhö.jpg", they don't get corrected.

What am I doing wrong ?

Last edited 12 years ago by tar.gz (previous) (diff)

#12 @SergeyBiryukov
12 years ago

Hmm, they both work for me.

Tested in PHP 5.2.14 on Windows and PHP 5.2.17 on Linux. When uploading "moiré.jpg" and "höhö.jpg", the resulting files are "moire.jpg" and "hoho.jpg", respectively.

#13 @tar.gz
12 years ago

Indeed, the results are different in Linux, Windows, OSX.

Here is the current status of my testing:

Viewing of images with accents ("moiré.jpg") is broken in:

  • Safari 5.1.7 on OSX (wasn't able to test Safari 6 yet).
  • Safari on iOS 5.1.1 (test device: iPod)
  • Safari on iOS 6 (test device: iPad)

With the patch you provided, when uploading a file named "Forêt.jpg":

  • Windows Vista / IE7 : works, saves as "Foret.jpg".
  • Ubuntu 11.04 / Firefox 18 : works, saves as "Foret.jpg".
  • Mac OSX 10.6 / Firefox 18 : doesn't work, saves as "Forêt.jpg".

Then I tested with some other characters, and found out that it's even more complicated:

Uploading on Ubuntu/Firefox: ê, ç, ä get converted (to e, c, a) but the "ö" remains as it is. A file named "höhö.jpg" does not get renamed.

Uploading a file named "møiré pättern.png" under OSX/Firefox:

  • On an unpatched WP 3.5, only the blank space is converted into hyphen, the file is saved as: "møiré-pättern.png".
  • On WP 3.5 with your patch applied, the file is saved as "moiré-pättern.png" - the nordic "ø" has been converted into "o".

So something is working, it's just that some accented characters aren't correctly recognized! I hope this brings us on the right track.

One thing that comes to my mind is that there are two different ways to generate those accented characters, one of them being "combined". For instance, there is a "single glyph" version of "é" that a hex editor displays as "C3 A9", and a combined (base+diacritic) version that displays as "65 CC 81".

And indeed, if I paste the filename of "moiré-pättern.png" into a text file and open it with some hex editor, I see that the ø is a singly glyph (= gets converted correctly), while the é and ä are combined characters.

I imagine that this could be the source of the inconsistencies? So the result actually depends upon the OS, and perhaps even the type of keyboard, on which the filenames have been typed.

One more test. If I copy-paste that "møiré pättern" string from the filename into the title field of a new post, WP generates the following permalink: "moire-pättern". That's interesting: the combined-character é has been fixed by WP, but the combined-character ä hasn't.

FYI, my test server is running PHP 5.3.10.

And by the way, congrats on your Bug Gardener nomination :)

@tar.gz
12 years ago

A test file with diacritics. The title has been typed under MacOSX.

@tar.gz
12 years ago

The img src code, in Trac and in WP

#14 @tar.gz
12 years ago

Above you can see a quick demonstration of how Trac handles the same file.

Trac uses a different approach: all the characters get converted into html, so the resulting file is actually named "m%C3%B8ir%C3%A9%20p%C3%A4ttern.png". This works fine in Safari.

Personally, I prefer the "human-readable" approach of WP, which in theory would result in the file being renamed "moire-pattern.png". But currently it isn't able to catch all those ugly diacritics.

#15 @tar.gz
12 years ago

One more thing: during my tests, I came across a possible workaround.

We can make those files viewable in safari if we URL-encode the filenames.

When embedding the image into a post, if we link to it with this path (the code given by the Media Uploader):

<img src="/wp-content/uploads/moiré-pättern.png" alt="møiré pättern" />

It will be broken in Safari. But if we write the link like this:

<img src="/wp-content/uploads/moire%CC%81-pa%CC%88ttern.png" alt="møiré pättern" />

then it will display just fine, although it's exactly the same file.

Last edited 12 years ago by tar.gz (previous) (diff)

#16 @tar.gz
12 years ago

Some more information after further research:

  • The same issue obviously appears for any type of attachment, such as a PDF that would be linked from within a post. In Safari, linking to a PDF file with diacritics will lead to the 404 page.
  • On WordPress.com, a solution has been implemented: when uploading the "moiré-pättern.png" file from above, it gets renamed into "mc3b8irecc81-pacc88ttern.png". So there is some sanitization in place, which is lacking in WordPress.org. And it works in OSX.

Here are two more links from support forums, showing how this issue affects users around the world:

All this testing actually leads me to think that I should open another ticket, not about renaming the files, but about generating URL-encoded links when embedding files into a post (or generating galleries with the shortcode). According to RFC 3986 (URI Generic Syntax), accented characters should be percent-encoded. I guess the simple solution would be to cleanse those filenames with the rawurlencode() PHP function when generating the link.

With that strategy, we would have a pretty good fix, even without touching the upload / file renaming process.

#17 follow-up: @ryan
12 years ago

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

Last edited 12 years ago by ryan (previous) (diff)

#18 @ryan
12 years ago

  • Milestone changed from 3.6 to Future Release

#19 in reply to: ↑ 17 @tar.gz
12 years ago

Replying to ryan:

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

On the other hand, such manipulations are applied when files are uploaded to WordPress.com.

Also, what do you think about percent-encoding the file URLs when inserting images into a post? (or when storing the file URL into the database - not sure how the process works internally). That would pretty much solve the issue, without changing the actual filename.

#20 @NumidWasNotAvailable
12 years ago

About sanitizing filenames

Modern browsers handle Unicode characters in URLs. Actually they perform an automatic translation using percent encoding. In some configurations it is quite transparent for the users.

RFC 3986 says that non ASCII charcaters are not allowed in URIs.

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser. This is understandable, just have a look at the statistics at StatCounter.

However, some users may encounter an issue since URLs are not RFC 3986 compliant. Furthermore, I would be interested to know what configurations are concerned by this issue so far.

I am coming up with the following patch: filenames are sanitized using sanitize_title as suggested by tar.gz.

About remove_accents

In this ticket we discovered that remove_accents was not handling Unicode combining characters at all.

I created a new ticket for this issue: #24661.

Last edited 12 years ago by NumidWasNotAvailable (previous) (diff)

#21 @NumidWasNotAvailable
12 years ago

  • Cc NumidWasNotAvailable added

#22 @tar.gz
12 years ago

Thanks for those patches!

I just tested 22363.3.rev24527.patch in combination with 24661.0.rev24527.patch from #24661 - works great, it fixes the problem. Here is my test page: http://sandbox.4o4.ch/wp-git/?p=506

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser.

This is not entirely true, since for *Post URLs*, WP *does* remove accented characters.

Secondly, the reason I brought up this issue, is that one of the "modern browsers", Safari (on OSX and iOS) is failing to display files with combined characters in their name (unless the url is %-encoded). Depending on the country where you live, that can be a significant segment - for Switzerland, StatCounter ranks Safari at 20% (and 60% for mobile).

Finally, those files with combined diacritics are also poorly supported by FTP clients. Whenever I have to migrate a WP site, in a French or German speaking area, some files will allways be broken during FTP transfer, and will need manual replacement. Fixing them at upload would simply eliminate all those potential future issues.

#23 @NumidWasNotAvailable
12 years ago

I updated the patch because filenames without extension were not processed properly.

#24 @SergeyBiryukov
12 years ago

  • Keywords needs-unit-tests added

#25 @numidwasnotavailable
12 years ago

Tests fail to pass with 22363.4.rev24527.patch. Indeed sanitize_file_name no longer performs extension munging.

I propose a rollback to 22363.2.patch) considering the fix in #24661.

#26 @tar.gz
12 years ago

Is there anything I can do to advance this bug, in order to see the issue fixed in WP 3.7? More testing needed?

#27 @SergeyBiryukov
12 years ago

  • Milestone changed from Future Release to 3.7

Looks like 22363.2.patch is the way to go here, and we should look into removing combined accents in #24661.

@tar.gz
12 years ago

Same as 22363.2.patch, but updated to match the current context of default-filters.php

#28 @tar.gz
12 years ago

For the record, Julio Poitier, well respected French WP coder, is also advocating this fix, and gets much positive feedback for it:
http://www.geekpress.fr/wordpress/astuce/suppression-accents-media-1903/

#29 @juliobox
12 years ago

Thanks tar.gz ;)

#30 @SergeyBiryukov
12 years ago

  • Milestone changed from 3.7 to 3.8

#31 @GregLone
11 years ago

  • Cc greg@… added

#32 @p_enrique
11 years ago

I've been experimenting with this issue. Now, I know I'm repeating most of the things said above, but I'd like to confirm the following:

  1. As for filenames:
    • Non-ASCII filenames will break on Windows. This has to do with how PHP interacts with the Windows file system. There might be a workaround that allows filenames to contain characters contained on the current "code page", but other characters will break filenames.
    • *nix filesystems allow any characters in filenames (minus the reserved characters, of course).
    • WordPress allows uploaded files to have any characters in their names (minus some reserved characters).
    • UTF-8 filenames work OK on WordPress on a *nix platform.
    • URL-encoded (percent-encoded) filenames work on any platform, but make the filename 6 times longer in certain scripts, such as Cyrillic ('%D0%B6' for the character 'ж', for example).
    • The maximum file name length under NTFS is 259 characters including the path.
    • URL-encoded filenames are not a solution. <img src='ca%C3%B1o.jpg'/> will 404 even though the file 'ca%C3%B1o.jpg' exists, since the server decodes the request to 'caño.jpg'.
    • remove_accents() is only a partial solution since 1) it has no support at all for non-Latin based scripts and 2) it doesn't do anything with such things as curly quotes, dashes, copyright symbols or other similar things (which are included in sanitize_title_with_dashes())
    • sanitize_title() as such is no good for filenames, since it URL-encodes the non-ASCII parts.
  1. As for URLs:
    • The only allowed characters in URLs according to RFC3986 are only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes. Everything else should be encoded.
    • The attachment anchor href and img src attributes are not encoded on WP.
    • Not encoding the above may break things in some browsers.
    • Not encoding an URI may break other software that recognizes pasted URIs.
    • An encoded anchor href is handled transparently by modern browsers. For examples, see The Russian Wikipedia, where hovering on a link shows Cyrillic text but in the source, the URI is %-encoded.
Last edited 11 years ago by p_enrique (previous) (diff)

#33 @p_enrique
11 years ago

  • Cc heikki@… added

#34 @p_enrique
11 years ago

I'm not sure if I should open a new ticket for this, maybe someone will advise me. Anyway, I've been testing sanitize_file_name( 'X.jpg' ) where X is an Unicode character that is a number or a letter (matching regex /[\p{L}\p{N}]/u). Alarmingly, there are many rather common characters that will result in a malformed, broken string being returned:

(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er

A complete list is attached.

Edit:
The culprit is the preg_replace without a Unicode pattern modifier. I found that the U-modifier is used here and there in WordPress sources, usually without any checking. I'd write some patches now, but I don't know how to do it & don't have time to learn just now.

Last edited 11 years ago by p_enrique (previous) (diff)

@p_enrique
11 years ago

A list of Unicode characters that break sanitize_file_name()

#35 @tar.gz
11 years ago

Thanks p_enrique for the feedback.

If those characters you mention present an issue, how comes that they don't produce malformed post slugs when using them in a post title?

Basically, the aim of this ticket is that file names should be sanitized in the same manner as post slugs.

Another question: based on your observations on percent-encoding, would it make sense to open another ticket, asking for all permalinks in WP (including the file permalinks) to be percent-encoded? See #comment:16 above.

#36 @p_enrique
11 years ago

tar.gz: They don't produce malformed slugs because sanitize_post_title() urlencodes everything that's not ASCII. This works great for anything that's an URL. But the mechanism cannot be used for file names, since the browser cannot access a file that has a urlencoded file name as I point out in #comment:32.

#38 @p_enrique
11 years ago

My recent patch uses wp_strip_all_tags, adds filter remove_accents and relies on preg_replace (with possible UTF-8 modifier) to remove other characters with support for the sanitize_file_name_chars filter. Also converting the filename to lower case if mb_string extension is available.

Here are some test results with my recent patch:

sanitize_file_name( "Posyłają\tSzczegóły\r\n(Październik).jpg" );
 -> string(34) "posylaja-szczegoly-pazdziernik.jpg"

sanitize_file_name( "<strong>Mọi người đều có quyền <foo> tự do tham gia vào đời sống văn hoá của cộng đồn</strong>" );
 -> string(97) "mọi-nguòi-dèu-có-quyèn-tụ-do-tham-gia-vào-dòi-sóng-van-hoá-của-cọng-dòn"

sanitize_file_name( "-ÉŁè Šíæ 20 % —40° “über/Þøß” βαΣιΛειος à © Ñœijç.õs¿ & Kö.yr.ä = 10 €-.gif.gif" );
 -> string(75) "ele-siae-20-40-uber-thos-βασιλειος-a-noeijc.os-ko.yr_.a-10.gif.gif"

sanitize_file_name( "---Все люди 1) рождаются свободными и 2) равными в своем 2A) достоинстве и 2B) ПРАВАХ!!!---.jpg" );
 -> string(140) "все-люди-1-рождаются-свободными-и-2-равными-в-своем-2a-достоинстве-и-2b-правах.jpg"

sanitize_file_name( "ณ ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ โลกจะใช้เพียง Unicode เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน .jpg" );
 -> string(246) "ณ-ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ-โลกจะใช้เพียง-unicode-เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน.jpg"

sanitize_file_name( "Invalid Unicode string: abc-\xff-def"
 -> string(0) ""

(There seem to be some Vietnamese characters not captured by remove_accents.)

Last edited 11 years ago by p_enrique (previous) (diff)

#39 @p_enrique
11 years ago

I have two more comments on this matter:

  1. I think there are plugins like cyr2lat, arabic-to-lat etc. that convert characters in Non-Latin scripts to ASCII. These will work if they hook to sanitize_file_name.
  1. There should be an extra filter for Windows-based servers that tries to convert everything possible to ASCII and then strips out the rest. Otherwise, the attachments and images will break under Windows. But such a filter would need a fallback filename for a case where converting/stripping leaves (practically) no ASCII at all. For example, the Russian file name (example 4 above) when stripping all non-ASCII and multiple dashes leaves 1-2-2a-2b.jpg How should such a case be handled?
Last edited 11 years ago by p_enrique (previous) (diff)

#40 @p_enrique
11 years ago

Patch updated. This now gives correct results:

sanitize_file_name( "A.String.With.Many.Dots.gif.gif.gif.ÖÑÉ.twõ.très.fö.ür" );
 --> string(61) "a.string.with_.many_.dots_.gif.gif.gif.one_.two_.tres_.fo_.ur"}}}

@p_enrique
11 years ago

Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower

#41 @p_enrique
11 years ago

Patch updated again. Added a call to html_entity_decode() if PCRE UTF-8 is available, fixed some typos, consolidated preg statements.

@p_enrique
11 years ago

Better check for PCRE UTF-8, updated comment block

@p_enrique
11 years ago

Updated unit test for sanitize_file_name()

#42 @p_enrique
11 years ago

  • Keywords needs-testing added; needs-unit-tests removed

I don't know if you're supposed to write unit tests for your own contributions, but I did anyway. Here they are. Please note that I modified one of the earlier tests since the earlier behaviour was to strip illegal characters, now they are replaced with dashes.

@p_enrique
11 years ago

Check for ASCII-only instead of seems_utf8() before strtolower()

@p_enrique
11 years ago

Be sensitive to original encoding that might differ from UTF-8

#43 @cristovaov
11 years ago

hi, I've been following this ticket and previously plugged patch 22363.2 (http://wp.me/KmdR)
Trying to apply patch 22363.8 also as a plugin but my technical knowledge is rather limited: getting 'Fatal error: Cannot redeclare sanitize_file_name()'

Could someone have a look? I've put my code as a gist here:
https://gist.github.com/cristovaov/7674225 --thanks

#44 @cristovaov
11 years ago

  • Cc cristovaov added

#45 @NumidWasNotAvailable
11 years ago

  • Cc NumidWasNotAvailable removed

@p_enrique
11 years ago

Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding

#46 follow-up: @p_enrique
11 years ago

@cristovaov: You can't make a plugin out of a patch that modifies core WP files. In this case, my patch modifies the function sanitize_file_name() in /wp-includes/formatting.php. If you put the code in a plugin, you'll get precisely that error message since the function is already declared in the file I mentioned. If you want to experiment, you can try the SVN install as described in the Contributor Handbook and then applying the patch. I'd be grateful for any feedback about my patch.

#47 @SergeyBiryukov
11 years ago

The sanitize_file_name filter should stay where it is now. It should be possible to override the sanitization if needed.

@p_enrique
11 years ago

Put the filters back in the same place

#48 @p_enrique
11 years ago

  • Keywords dev-feedback added

#49 in reply to: ↑ 46 @cristovaov
11 years ago

@p_enrique: Thanks for clearing that, I was blindly assuming it was possible.
I'll look at doing an svn install--

Replying to p_enrique:

@cristovaov: You can't make a plugin out of a patch that modifies core WP files.

#50 @nacin
11 years ago

  • Keywords 3.9-early added

I think this is great but it needs a lot of time to be soaked. Let's try it out in 3.9.

#51 @nacin
11 years ago

  • Milestone changed from 3.8 to Future Release

#52 @markoheijnen
11 years ago

  • Milestone changed from Future Release to 3.9

Nacin can we give this a try this release?

This ticket was mentioned in IRC in #wordpress-dev by SergeyBiryukov. View the logs.


11 years ago

#54 follow-up: @kirasong
11 years ago

  • Keywords 4.0-early needs-unit-tests added; 3.9-early removed
  • Milestone changed from 3.9 to Future Release

We definitely want to do this, but it needs unit tests.

Of note, there is a lot more we can probably learn with regards to UTF8/PCRE from #22692

#55 @chriscct7
10 years ago

  • Keywords 4.0-early removed

#56 in reply to: ↑ 54 ; follow-up: @zodiac1978
9 years ago

Replying to mikeschroder:

We definitely want to do this, but it needs unit tests.

There *are* unit tests. Or not?
https://core.trac.wordpress.org/attachment/ticket/22363/test-SanitizeFileName.patch

This ticket was mentioned in Slack in #core-images by mike. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by paaljoachim. View the logs.


9 years ago

#59 in reply to: ↑ 56 @kirasong
9 years ago

  • Keywords needs-refresh added; has-patch needs-unit-tests removed

Replying to zodiac1978:

Thre *are* unit tests. Or not?
https://core.trac.wordpress.org/attachment/ticket/22363/test-SanitizeFileName.patch

Right you are; my very late apologies. I'd like to take another look at this, but it needs a refresh by someone more knowledgable in formatting.php than I.

cc: @miqrogroove @pento Is this up your alley (in addition to those already on this ticket)?

This ticket was mentioned in Slack in #core-images by paaljoachim. View the logs.


9 years ago

#61 @markoheijnen
9 years ago

  • Keywords needs-refresh removed

I have updated it and changed some tests.

The filter sanitize_file_name_chars is almost not needed anymore with PCRE. The moment PCRE isn't available a lot of things still need to cleaned up. I assume this is a bug because output should always be the same (or close enough).

This ticket was mentioned in Slack in #core-images by paaljoachim. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by chriscct7. View the logs.


9 years ago

#64 @chriscct7
9 years ago

From the core-images chat:
Marko:

First thing I want to do is split the unit tests so it runs with and without PCRE. Secondly looking at what to do with sanitize_file_name_chars. New code almost completely ignores it. Which could be fine.

https://wordpress.slack.com/archives/core-images/p1455214794000335

#65 @ocean90
9 years ago

#35809 was marked as a duplicate.

#66 @SergeyBiryukov
9 years ago

#35931 was marked as a duplicate.

#67 @Mte90
9 years ago

any hope for this in 4.5? yesterday we found a site with that problem on safari.

This ticket was mentioned in Slack in #core by mte90. View the logs.


9 years ago

#69 @SergeyBiryukov
9 years ago

#36645 was marked as a duplicate.

#70 @Mimergt
9 years ago

  • Component changed from Upload to Permalinks
  • Severity changed from normal to critical

Some patch or similar... on 4.5.3 the problem back to safari :(

I try to use the sanitize_file_name but nothing works.

This ticket was mentioned in Slack in #core-images by ocean90. View the logs.


9 years ago

#72 @jorbin
9 years ago

@swissspidy What are your thoughts on this one?

This ticket was mentioned in Slack in #core-images by mike. View the logs.


9 years ago

#74 @joemcgill
9 years ago

  • Milestone changed from Future Release to 4.7

Let's try to address this early in 4.7.

@swissspidy
9 years ago

#75 @swissspidy
9 years ago

  • Keywords has-patch added

I'd really like to get this into 4.7.

22363.diff makes the last patch apply cleanly against trunk again, so that should get us going. Worked well on my local install.

Replying to chriscct7:

From the core-images chat:
Marko:

First thing I want to do is split the unit tests so it runs with and without PCRE. Secondly looking at what to do with sanitize_file_name_chars. New code almost completely ignores it. Which could be fine.

https://wordpress.slack.com/archives/core-images/p1455214794000335

I agree with that. Unfortunately I'm not really familiar with all the PCRE stuff in PHP so I'd need some help there. IIRC @SergeyBiryukov has some knowledge there.

Looking at an earlier comment, @miqrogroove and @pento should be quite familiar with the formatting component, which can't hurt either.

#76 @DrProtocols
9 years ago

Whilst a little late to the game I would say resolving this issue (and perhaps if the resolution also extended to transliterating all international characters in a uploaded filenames and ensuring all lower-case) would be a massive win in terms of site portability since issues can often arise with character set support on different platforms and also case-sensitive vs case-insensitive filesystems.

The IEEE Portable Filename Character Set (http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html section 3.276) but with the added condition of lower-case only is an ideal goal for any uploaded media file.

To paraphrase Shakespeare, "A file by any other name still displays the same".

#77 @pento
9 years ago

  • Keywords dev-feedback removed

Playing with these parts of Core can cause things to break in strange ways. Let's find out what breaks!

22363.diff looks reasonable, I'm okay with #yolofriday-ing it.

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 years ago

#79 @gitlost
9 years ago

I think there's a number of problems with the latest proposed patch.

The encoding of the filename can only be guessed at as far as I know - it's just a string of bytes with no encoding info from the client and is not connected to the page's (or blog's) charset, so there should be no defaulting to the blog's charset. The default should just be nothing.

It's not clear to me why wp_strip_all_tags() and html_entity_decode() and entity stripping are being done, as opposed to the current solution of just stripping angle brackets and ampersands via special_chars, which seems to be as effective and is less drastic.

A small point is that the conversion of the special_chars filter to a concatenated character class regex is not strictly backward compatible, as people could have been doing multi-byte stripping, which would fail now.

The use of the $utf8_modifier is only needed when the regex contains a multi-byte pattern, and there's only 2 places in the proposed patch where that is (intentionally) the case, one of which is problematic [\s-]+ and the other arguable (?!\.)[^\p{L}\p{Nd}]+. A third use on ^[a-zA-Z]{2,5}\d?$ looks unintentional as \d will match \p{Nd} if UTF-8 mode is set and PCRE has UCP support (which unfortunately are not necessarily the same thing - see #22692#comment:39). Other uses are unneeded.

The [\s-]+ use is problematic and as it stands could actually re-introduce the bug that corrupts UTF-8 that was fixed up above (see #26094) if the filename is in UTF-8 and PCRE UTF-8 mode is unavailable (as $utf9_modifier would then be blank). Even if that's fixed it still has issues due to the difficulty in predicting what it will match - in single-byte mode it's PHP locale dependent, and in UTF-8 mode it's UCP dependent. Its introduction appears to be driven by wanting to encompass both the U+00A0 substitution (see #27281 but it's not particularly clear from that bug report whether the fix was appropriate) and the simple substitution [\r\n\t -]+. I think it would be better and more targeted to leave the current [\r\n\t -]+ as is, and to replace the current preg_match( "#\x{00a0}#siu", ' ', $filename ) with a straightforward str_replace( "\xc2\xa0", ' ', $filename ) (to be done only when the filename is in UTF-8).

The (?!\.)[^\p{L}\p{Nd}]+ use should only be done if the filename is in UTF-8, and also technically should check that UCP support is available, as a previous patch 22363.5.patch above does. This is the substantive change that fixes up the bug in remove_accents(), and works. However I think the bug in remove_accents() - which could be renamed transliterate_commonly_used_latin_to_ascii() - should be fixed in remove_accents(), as specified in the branched off bug #24661. Also to be picky(!) it removes too much, not just decomposed latin accents. The solutions mentioned in #24661 are more targeted.

The use of mb_strtolower() should probably also only be done if the filename is in UTF-8, and then explicitly pass UTF-8 as the encoding (see sanitize_title_with_dashes()), rather than using mb_detect_encoding(), which has issues.

#80 @joemcgill
9 years ago

Related: #37989. Looks like we need to consider use cases where people are expecting multi-word file names to be respected as the title in the media editor. 22363.diff converts the title of an attachment named This is a file.jpg to this-is-a-file-jpg, which is unexpected.

This ticket was mentioned in Slack in #core-images by mike. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 years ago

#83 @dustinbolton
9 years ago

While also a little late to the game, I also wanted to chime in that we regularly encounter users attempting to move their site to different servers only to be met with difficulty due to certain characters in filenames not playing nicely with their new host. It's an ongoing problem that we are seeing in greater numbers as international use of WordPress grows and more 'offending' files are uploaded into users' sites every day. Would love to see this make it in as it's been a growing known issue for 4+ years now. :)

Thanks,
Dustin Bolton
BackupBuddy Developer

This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 years ago

#87 @kirasong
9 years ago

  • Component changed from Permalinks to Media
  • Owner set to mikeschroder
  • Status changed from new to assigned

Assigning to myself for information gathering/a decision with how to proceed for 4.7.

#88 follow-up: @gitlost
9 years ago

I think the problem here is that the patch has lost its way from the original bug report, which is really a Mac specific issue, and the same as various other reports such as #35951.

The issue is that Safari always normalizes filenames to NFC. This causes problems if files are uploaded using other browsers (which don't normalize) and then viewed in Safari. The "correct" fix would be to likewise normalize filenames to NFC in sanitize_file_name(). As core doesn't have this facility (yet!), the workaround of using remove_accents() as a sort of poor man's normalizer seems good.

Re the encoding of filenames in something other than UTF-8, I found the only way I could get current browsers to do this was by specifying the accept-charset attribute in the <form>, which is pretty unnatural to say the least (and even here Safari ignored it). So non-UTF-8 filenames should be treated cursorily I think and just reduced to ASCII.

Re the other changes, they don't make much sense to me as given, without going to a full-blown ASCII transliteration, as has been suggested in various places (eg above and in #15955), and which would have major portability and interoperability advantages, but would obviously be a major change and presumably deserve its own ticket.

FWIW, I'll upload a patch which apart from the reduce-non-UTF-8-to-ASCII change just replaces the U+00A0 preg_replace with a straight str_replace (actually lumps it in with other subs to '-' but it amounts to the same), adjusting the unit tests accordingly.

@gitlost
9 years ago

A minimal patch.

#89 in reply to: ↑ 88 @swissspidy
9 years ago

The issue is that Safari always normalizes filenames to NFC. This causes problems if files are uploaded using other browsers (which don't normalize) and then viewed in Safari. The "correct" fix would be to likewise normalize filenames to NFC in sanitize_file_name(). As core doesn't have this facility (yet!), the workaround of using remove_accents() as a sort of poor man's normalizer seems good.

Related: #30130

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 years ago

#92 follow-up: @swissspidy
9 years ago

The latest patch does what it says it does, but I'm actually having a hard time reproducing this on Safari 10 (macOS Sierra), so I can't tell for sure. Images with diacritics are displayed just fine in the media library and the front end. The only problem Safari has is with the permalinks, i.e.. example.com/aaâãäåçeeêëiiîïñooôõöuuûüyÿ/ works in Safari, but in the other browsers it's example.com/aaa%CC%82a%CC%83a%CC%88a%CC%8Ac%CC%A7eee%CC%82e%CC%88iii%CC%82i%CC%88n%CC%83ooo%CC%82o%CC%83o%CC%88uuu%CC%82u%CC%88yy%CC%88/, which is still a problem with the current patch.

#93 @gitlost
9 years ago

Note in order for the OP to be fixed, #24661 needs to be fixed also, using eg patch 24661.5.patch.

Maybe Safari's behaviour has changed in version 10 but I can reproduce it with the latest nightly (4.7-alpha-38766) running on Ubuntu 16.04 using Safari (version 6) and Firefox (48.0.1) or Chrome (49.0.2623.112) on OS X 10.8 Mountain Lion (VirtualBox Hackintosh).

Upload a file with diacritics eg "forêt.png" using Firefox or Chrome on OS X. The file will be uploaded as "2016/10/fore\xcc\x82t.png" (NFD). Then view the uploaded file in Safari ("wp-admin/upload.php"). Safari converts the filename to "2016/10/for\xc3\xaat.png" (NFC) and it doesn't display. If the two patches 22363.12.patch and 24661.5.patch are applied then the file is uploaded as "2016/10/foret.png" so the issue doesn't arise.

#94 @gitlost
9 years ago

Just tried it on Safari version 10.0 on macOS Sierra (TechReviews' VirtualBox) and the behaviour's the same.

#95 in reply to: ↑ 92 @kirasong
9 years ago

Replying to swissspidy:

The latest patch does what it says it does, but I'm actually having a hard time reproducing this on Safari 10 (macOS Sierra), so I can't tell for sure. Images with diacritics are displayed just fine in the media library and the front end. The only problem Safari has is with the permalinks, i.e.. example.com/aaâãäåçeeêëiiîïñooôõöuuûüyÿ/ works in Safari, but in the other browsers it's example.com/aaa%CC%82a%CC%83a%CC%88a%CC%8Ac%CC%A7eee%CC%82e%CC%88iii%CC%82i%CC%88n%CC%83ooo%CC%82o%CC%83o%CC%88uuu%CC%82u%CC%88yy%CC%88/, which is still a problem with the current patch.

I also had a hard time reproducing it until I realized that the upload had to happen with a browser other than Safari, as a believe @gitlost is referencing above. Once I uploaded with Chrome, then followed up with a test on Safari, I saw the bug.

This ticket was mentioned in Slack in #core by mike. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by mike. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-images by swissspidy. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


9 years ago

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


9 years ago

#102 @pento
9 years ago

Sorry about the lack of feedback on this ticket!

My concerns are less about this ticket, and more about the changes proposed in #24661. There are some complex changes proposed that need to be tested and verified.

This ticket was mentioned in Slack in #core by helen. View the logs.


8 years ago

#104 @helen
8 years ago

  • Milestone changed from 4.7 to Future Release

Seeing as I punted #24661, I suppose this one should go, too. :(

#105 @SergeyBiryukov
8 years ago

#39588 was marked as a duplicate.

@programmin
8 years ago

Another file with special (hidden) character

#106 @Vayu
8 years ago

I really would love for a fix on this issue. I manage several website and use BackupBuddy to create backups. Whenever a file is named with diacritics characters or others of the likes (æøå), BackupBuddy will not back it up on their sync servers. If I make a manual backup via an FTP client, the file names will become corrupt during the download and I will not be able to upload the files again to a new or same server.
Every time I install a new WordPress website, I also install a plugin (File Renaming on upload) that will rename the file during upload. This works for me. But sometimes I am asked to manage a WordPress website that is setup by others, and this is where I have trouble doing backups.

#107 @MaximeCulea
7 years ago

We have made a plugin for this issue resuming all this discussion, maybe a workaround for merge proposal :

#108 @madvic
7 years ago

This little improve is scheduled for 2026 in the core ?

This ticket was mentioned in Slack in #core by benoitchantre. View the logs.


7 years ago

#110 follow-up: @lucasbustamante
7 years ago

As a professional PHP developer, I've lost count on how many times my client's websites broke during migration because of images with accents being converted to symbols upon unzipping on Linux. Sanitizing filenames is a must.

#111 @kirasong
7 years ago

  • Owner mikeschroder deleted

Unassigning myself from this to make it more clear that it's open for folks to work on.

@nilovelez
5 years ago

added remove_accents to sanitize_file_name() in /wp-includes/formatting.php

#112 @nilovelez
5 years ago

I have just updating a working patch.

I have applied the remove_accents() function to the sanitize_file_name() in /wp-includes/formatting.php just after $special_chars are removed.

Test filename: Bolígrafo Ecológico 2€.jpg
Before patch: Bolígrafo Ecológico 2€.jpg (Bol%C3%ADgrafo%20Ecol%C3%B3gico%202%E2%82%AC.jpg)
After patch: Boligrafo-Ecologico-2E.jpg

This ticket was mentioned in Slack in #core by nilovelez. View the logs.


5 years ago

#114 @SergeyBiryukov
5 years ago

  • Owner set to SergeyBiryukov
  • Status changed from assigned to reviewing

#115 @SergeyBiryukov
5 years ago

  • Milestone changed from Future Release to 5.5

#116 in reply to: ↑ 110 @RavanH
5 years ago

Replying to lucasbustamante:

As a professional PHP developer, I've lost count on how many times my client's websites broke during migration because of images with accents being converted to symbols upon unzipping on Linux. Sanitizing filenames is a must.

Completely agree. Hope this will get into 5.5 at last!

@audrasjb
5 years ago

Media: Remove accent from media file names (patch refreshed against trunk)

#117 follow-up: @audrasjb
5 years ago

  • Keywords needs-testing removed

This is definitely a long standing issue we should fix asap.
I refreshed the patch against trunk.
I tested it and it works fine.

Not directly related, but FYI, I'll open another one to add other missing characters to $special_chars array.

#118 @audrasjb
5 years ago

  • Keywords needs-dev-note added

Probably worth adding some words about this in the "Miscellaneous changes" dev note.

#119 in reply to: ↑ 117 @zodiac1978
5 years ago

Replying to audrasjb:

I tested it and it works fine.

This does not fix the problem with NFD characters, correct?

So if a file with NFD characters in the name is uploaded on Safari, but displayed on Chrome/Firefox it will still fail. Or not?

#120 @audrasjb
5 years ago

Yes @zodiac1978 it's a first step to help this ticket to move forward. It fixes the issue for at least some use cases…

Worth noting there is also #24661.

#121 @RavanH
5 years ago

Hi @audrasjb hope you are doing fine after the "confinement" :)

About issue #24661 : I did run into this after moving a site from OVH to o2switch where somehow filenames with combination accents are changed in admin to filenames with normal accented characters. This makes images mysteriously appear blank/missing in the media library sometimes even though they are correctly uploaded and present in the uploads directory. Still not sure why/how this is happening exactly (there are plugins like WPML and Shortpixel running on this particular site, so it might be combination of factors) but converting accented characters and combination characters in filenames would certainly prevent such issues...

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


5 years ago

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


5 years ago

#124 @audrasjb
5 years ago

  • Keywords needs-refresh added

My partial patch doesn't apply anymore. Refreshing.

@audrasjb
5 years ago

Partial patch refreshed

#125 @audrasjb
5 years ago

  • Keywords needs-refresh removed

22363.4.diff is a partial patch for the issue.

It doesn't fully address the issue (doesn't affect NFD characters), but it would be a great step forward this long pending issue 🙏

#126 @nilovelez
5 years ago

22363.4.diff is more than enough for most cases. It uses the san function that convert a post title into a slug.

It doesn't address every single use case, but it is a huge step forward and it is a completely safe fix.

I propose to include this patch in WP 5.5, close the ticket and create a new one to study if it is possible to address special cases.

#127 @SergeyBiryukov
5 years ago

In 48603:

Media: Remove accents in sanitize_file_name().

This brings some consistency with sanitize_title() and sanitize_user().

Props tar.gz, NumidWasNotAvailable, juliobox, p_enrique, cristovaov, zodiac1978, mikeschroder, markoheijnen, chriscct7, swissspidy, DrProtocols, pento, gitlost, joemcgill, dustinbolton, programmin, Vayu, MaximeCulea, lucasbustamante, nilovelez, RavanH, audrasjb, SergeyBiryukov.
See #22363.

#128 @SergeyBiryukov
5 years ago

In 48604:

Tests: Update wp_unique_filename() unit tests to account for sanitize_file_name() removing accents.

Follow-up to [48603].

See #22363.

#129 @SergeyBiryukov
5 years ago

  • Milestone changed from 5.5 to 5.6

Moving to 5.6 for investigating the remaining pieces here.

This ticket was mentioned in Slack in #core by justinahinon. View the logs.


5 years ago

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


5 years ago

#132 @antpb
5 years ago

I've noticed a partial fix was merged already. @SergeyBiryukov is there anything the Media team can help with here to bring this across the line before Beta 2?

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


4 years ago

This ticket was mentioned in Slack in #core-media by hellofromtonya. View the logs.


4 years ago

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


4 years ago

#136 @antpb
4 years ago

Hi @SergeyBiryukov ! Can the Media team help here in any way to get things across the line before the next release?

This ticket was mentioned in Slack in #core by daisyo. View the logs.


4 years ago

#138 follow-up: @leemon
4 years ago

Can you merge this in 5.6, please? It's a long standing issue when migrating sites between some hosts. Thanks.

#139 @madvic
4 years ago

For pity's sake! PLEASE

#140 in reply to: ↑ 138 @SergeyBiryukov
4 years ago

  • Milestone changed from 5.6 to 5.5
  • Resolution set to fixed
  • Status changed from reviewing to closed

Replying to leemon:

Can you merge this in 5.6, please? It's a long standing issue when migrating sites between some hosts. Thanks.

This was merged in [48603] for 5.5.

The ticket was only left open to go through the comments and see if any of them should be split into a new ticket.

On second thought, let's close this one as fixed in 5.5. If someone has any follow-up ideas, fixes or enhancements, please create a new ticket. Thanks!

Note: See TracTickets for help on using tickets.