WordPress.org

Make WordPress Core

Opened 4 years ago

Last modified 2 days ago

#22363 new defect (bug)

Accents in attachment filenames should be sanitized

Reported by: tar.gz Owned by:
Milestone: 4.7 Priority: normal
Severity: critical Version: 3.4
Component: Permalinks Keywords: needs-testing has-patch
Focuses: Cc:

Description

There is an inconsistency in the way WP is sanitizing post slugs and attachment filenames.

Sanitizing the post slugs is a Good Thing(tm) for non-english users who use diacritics in their post titles.

Example: If I write a post with the title "Moiré patterns", the actual page slug will be: "moire-patterns". The space is replaced with a hyphen, the "é" becomes "e". Even if I try to change the slug manually into "moiré", WP won't let me (for my own good, since that URL would break in lesser capable browsers).

For some reason, WP doesn't apply that error-correction to attachment filenames.

Example: If I attach a file named "moiré pattern.png", it gets renamed into "moiré-pattern.png".

We can see that the space (and some other forbidden characters) are corrected by sanitize_file_name(), but diacritics such as "é" or "ä" are left as they are.

Currently, most modern browsers are capable of displaying files with diacritics, but some of them still fail (most prominently, Safari).

For the sake of cross-browser compatibility, attachment filenames should benefit from the same safety measures that we apply for the post slugs (I guess that's the remove_accents() function).

Attachments (18)

diacritics-ff-safari.png (22.3 KB) - added by tar.gz 4 years ago.
Diacritics resulting in broken images in Safari.
22363.patch (505 bytes) - added by SergeyBiryukov 4 years ago.
22363.2.patch (658 bytes) - added by SergeyBiryukov 4 years ago.
møiré pättern.png (78.8 KB) - added by tar.gz 4 years ago.
A test file with diacritics. The title has been typed under MacOSX.
img-src-Trac-WP.png (36.2 KB) - added by tar.gz 4 years ago.
The img src code, in Trac and in WP
22363.3.rev24527.patch (2.5 KB) - added by NumidWasNotAvailable 3 years ago.
22363.4.rev24527.patch (2.7 KB) - added by NumidWasNotAvailable 3 years ago.
22363.5.diff (704 bytes) - added by tar.gz 3 years ago.
Same as 22363.2.patch, but updated to match the current context of default-filters.php
sanitize_file_name_breaking_characters_utf8.txt (22.5 KB) - added by p_enrique 3 years ago.
A list of Unicode characters that break sanitize_file_name()
22363.5.patch (3.8 KB) - added by p_enrique 3 years ago.
Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower
22363.6.patch (4.9 KB) - added by p_enrique 3 years ago.
Better check for PCRE UTF-8, updated comment block
test-SanitizeFileName.patch (3.0 KB) - added by p_enrique 3 years ago.
Updated unit test for sanitize_file_name()
22363.7.patch (5.0 KB) - added by p_enrique 3 years ago.
Check for ASCII-only instead of seems_utf8() before strtolower()
22363.8.patch (5.2 KB) - added by p_enrique 3 years ago.
Be sensitive to original encoding that might differ from UTF-8
22363.9.patch (5.2 KB) - added by p_enrique 3 years ago.
Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding
22363.10.patch (4.7 KB) - added by p_enrique 3 years ago.
Put the filters back in the same place
22363.11.patch (8.1 KB) - added by markoheijnen 8 months ago.
22363.diff (8.9 KB) - added by swissspidy 5 weeks ago.

Download all attachments as: .zip

Change History (103)

@tar.gz
4 years ago

Diacritics resulting in broken images in Safari.

#1 @tar.gz
4 years ago

  • Summary changed from Accentes in attachement filenames should be sanitized to Accents in attachement filenames should be sanitized

#2 @tar.gz
4 years ago

  • Cc code@… added
  • Summary changed from Accents in attachement filenames should be sanitized to Accents in attachment filenames should be sanitized

#3 @tar.gz
4 years ago

Possibly related: #15955, #16191, #16330

#5 @SergeyBiryukov
4 years ago

  • Version changed from trunk to 3.4

#6 @knutsp
4 years ago

  • Cc knut@… added

#7 @tar.gz
4 years ago

Ok, so I attempted a rudimentary patch, by adding a remove_accents() step at the beginning of wp_unique_filename(), in wp-includes/functions.php.

Like, adding

$filename = remove_accents($filename);

just after line 1606.

But no luck, this doesn't work, it doesn't change anything. A file named "moiré.jpg" still keeps the accent after being uploaded.

#8 @SergeyBiryukov
4 years ago

#23029 was marked as a duplicate.

#9 @tar.gz
4 years ago

By the way, I'm offering a free drink to the developer who patches this issue:

http://www.freedomsponsors.org/core/issue/121/accents-in-attachment-filenames-should-be-sanitized

#10 @SergeyBiryukov
4 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.6

22363.patch calls remove_accents() in sanitize_file_name(), like we do in sanitize_title().

22363.2.patch adds it as a filter instead.

#11 @tar.gz
4 years ago

I applied 22363.patch and 22363.2.patch on WP 3.6-alpha-23288, but it's not working for me: when uploading "moiré.jpg" and "höhö.jpg", they don't get corrected.

What am I doing wrong ?

Last edited 4 years ago by tar.gz (previous) (diff)

#12 @SergeyBiryukov
4 years ago

Hmm, they both work for me.

Tested in PHP 5.2.14 on Windows and PHP 5.2.17 on Linux. When uploading "moiré.jpg" and "höhö.jpg", the resulting files are "moire.jpg" and "hoho.jpg", respectively.

#13 @tar.gz
4 years ago

Indeed, the results are different in Linux, Windows, OSX.

Here is the current status of my testing:

Viewing of images with accents ("moiré.jpg") is broken in:

  • Safari 5.1.7 on OSX (wasn't able to test Safari 6 yet).
  • Safari on iOS 5.1.1 (test device: iPod)
  • Safari on iOS 6 (test device: iPad)

With the patch you provided, when uploading a file named "Forêt.jpg":

  • Windows Vista / IE7 : works, saves as "Foret.jpg".
  • Ubuntu 11.04 / Firefox 18 : works, saves as "Foret.jpg".
  • Mac OSX 10.6 / Firefox 18 : doesn't work, saves as "Forêt.jpg".

Then I tested with some other characters, and found out that it's even more complicated:

Uploading on Ubuntu/Firefox: ê, ç, ä get converted (to e, c, a) but the "ö" remains as it is. A file named "höhö.jpg" does not get renamed.

Uploading a file named "møiré pättern.png" under OSX/Firefox:

  • On an unpatched WP 3.5, only the blank space is converted into hyphen, the file is saved as: "møiré-pättern.png".
  • On WP 3.5 with your patch applied, the file is saved as "moiré-pättern.png" - the nordic "ø" has been converted into "o".

So something is working, it's just that some accented characters aren't correctly recognized! I hope this brings us on the right track.

One thing that comes to my mind is that there are two different ways to generate those accented characters, one of them being "combined". For instance, there is a "single glyph" version of "é" that a hex editor displays as "C3 A9", and a combined (base+diacritic) version that displays as "65 CC 81".

And indeed, if I paste the filename of "moiré-pättern.png" into a text file and open it with some hex editor, I see that the ø is a singly glyph (= gets converted correctly), while the é and ä are combined characters.

I imagine that this could be the source of the inconsistencies? So the result actually depends upon the OS, and perhaps even the type of keyboard, on which the filenames have been typed.

One more test. If I copy-paste that "møiré pättern" string from the filename into the title field of a new post, WP generates the following permalink: "moire-pättern". That's interesting: the combined-character é has been fixed by WP, but the combined-character ä hasn't.

FYI, my test server is running PHP 5.3.10.

And by the way, congrats on your Bug Gardener nomination :)

@tar.gz
4 years ago

A test file with diacritics. The title has been typed under MacOSX.

@tar.gz
4 years ago

The img src code, in Trac and in WP

#14 @tar.gz
4 years ago

Above you can see a quick demonstration of how Trac handles the same file.

Trac uses a different approach: all the characters get converted into html, so the resulting file is actually named "m%C3%B8ir%C3%A9%20p%C3%A4ttern.png". This works fine in Safari.

Personally, I prefer the "human-readable" approach of WP, which in theory would result in the file being renamed "moire-pattern.png". But currently it isn't able to catch all those ugly diacritics.

#15 @tar.gz
4 years ago

One more thing: during my tests, I came across a possible workaround.

We can make those files viewable in safari if we URL-encode the filenames.

When embedding the image into a post, if we link to it with this path (the code given by the Media Uploader):

<img src="/wp-content/uploads/moiré-pättern.png" alt="møiré pättern" />

It will be broken in Safari. But if we write the link like this:

<img src="/wp-content/uploads/moire%CC%81-pa%CC%88ttern.png" alt="møiré pättern" />

then it will display just fine, although it's exactly the same file.

Last edited 4 years ago by tar.gz (previous) (diff)

#16 @tar.gz
4 years ago

Some more information after further research:

  • The same issue obviously appears for any type of attachment, such as a PDF that would be linked from within a post. In Safari, linking to a PDF file with diacritics will lead to the 404 page.
  • On WordPress.com, a solution has been implemented: when uploading the "moiré-pättern.png" file from above, it gets renamed into "mc3b8irecc81-pacc88ttern.png". So there is some sanitization in place, which is lacking in WordPress.org. And it works in OSX.

Here are two more links from support forums, showing how this issue affects users around the world:

All this testing actually leads me to think that I should open another ticket, not about renaming the files, but about generating URL-encoded links when embedding files into a post (or generating galleries with the shortcode). According to RFC 3986 (URI Generic Syntax), accented characters should be percent-encoded. I guess the simple solution would be to cleanse those filenames with the rawurlencode() PHP function when generating the link.

With that strategy, we would have a pretty good fix, even without touching the upload / file renaming process.

#17 follow-up: @ryan
3 years ago

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

Last edited 3 years ago by ryan (previous) (diff)

#18 @ryan
3 years ago

  • Milestone changed from 3.6 to Future Release

#19 in reply to: ↑ 17 @tar.gz
3 years ago

Replying to ryan:

See #9416. We go through some lengths to not do these sorts of manipulations to file names.

On the other hand, such manipulations are applied when files are uploaded to WordPress.com.

Also, what do you think about percent-encoding the file URLs when inserting images into a post? (or when storing the file URL into the database - not sure how the process works internally). That would pretty much solve the issue, without changing the actual filename.

#20 @NumidWasNotAvailable
3 years ago

About sanitizing filenames

Modern browsers handle Unicode characters in URLs. Actually they perform an automatic translation using percent encoding. In some configurations it is quite transparent for the users.

RFC 3986 says that non ASCII charcaters are not allowed in URIs.

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser. This is understandable, just have a look at the statistics at StatCounter.

However, some users may encounter an issue since URLs are not RFC 3986 compliant. Furthermore, I would be interested to know what configurations are concerned by this issue so far.

I am coming up with the following patch: filenames are sanitized using sanitize_title as suggested by tar.gz.

About remove_accents

In this ticket we discovered that remove_accents was not handling Unicode combining characters at all.

I created a new ticket for this issue: #24661.

Last edited 3 years ago by NumidWasNotAvailable (previous) (diff)

#21 @NumidWasNotAvailable
3 years ago

  • Cc NumidWasNotAvailable added

#22 @tar.gz
3 years ago

Thanks for those patches!

I just tested 22363.3.rev24527.patch in combination with 24661.0.rev24527.patch from #24661 - works great, it fixes the problem. Here is my test page: http://sandbox.4o4.ch/wp-git/?p=506

By allowing accented characters in URLs, Wordpress relies on the fact that nowadays everyone is using a modern browser.

This is not entirely true, since for *Post URLs*, WP *does* remove accented characters.

Secondly, the reason I brought up this issue, is that one of the "modern browsers", Safari (on OSX and iOS) is failing to display files with combined characters in their name (unless the url is %-encoded). Depending on the country where you live, that can be a significant segment - for Switzerland, StatCounter ranks Safari at 20% (and 60% for mobile).

Finally, those files with combined diacritics are also poorly supported by FTP clients. Whenever I have to migrate a WP site, in a French or German speaking area, some files will allways be broken during FTP transfer, and will need manual replacement. Fixing them at upload would simply eliminate all those potential future issues.

#23 @NumidWasNotAvailable
3 years ago

I updated the patch because filenames without extension were not processed properly.

#24 @SergeyBiryukov
3 years ago

  • Keywords needs-unit-tests added

#25 @numidwasnotavailable
3 years ago

Tests fail to pass with 22363.4.rev24527.patch. Indeed sanitize_file_name no longer performs extension munging.

I propose a rollback to 22363.2.patch) considering the fix in #24661.

#26 @tar.gz
3 years ago

Is there anything I can do to advance this bug, in order to see the issue fixed in WP 3.7? More testing needed?

#27 @SergeyBiryukov
3 years ago

  • Milestone changed from Future Release to 3.7

Looks like 22363.2.patch is the way to go here, and we should look into removing combined accents in #24661.

@tar.gz
3 years ago

Same as 22363.2.patch, but updated to match the current context of default-filters.php

#28 @tar.gz
3 years ago

For the record, Julio Poitier, well respected French WP coder, is also advocating this fix, and gets much positive feedback for it:
http://www.geekpress.fr/wordpress/astuce/suppression-accents-media-1903/

#29 @juliobox
3 years ago

Thanks tar.gz ;)

#30 @SergeyBiryukov
3 years ago

  • Milestone changed from 3.7 to 3.8

#31 @GregLone
3 years ago

  • Cc greg@… added

#32 @p_enrique
3 years ago

I've been experimenting with this issue. Now, I know I'm repeating most of the things said above, but I'd like to confirm the following:

  1. As for filenames:
    • Non-ASCII filenames will break on Windows. This has to do with how PHP interacts with the Windows file system. There might be a workaround that allows filenames to contain characters contained on the current "code page", but other characters will break filenames.
    • *nix filesystems allow any characters in filenames (minus the reserved characters, of course).
    • WordPress allows uploaded files to have any characters in their names (minus some reserved characters).
    • UTF-8 filenames work OK on WordPress on a *nix platform.
    • URL-encoded (percent-encoded) filenames work on any platform, but make the filename 6 times longer in certain scripts, such as Cyrillic ('%D0%B6' for the character 'ж', for example).
    • The maximum file name length under NTFS is 259 characters including the path.
    • URL-encoded filenames are not a solution. <img src='ca%C3%B1o.jpg'/> will 404 even though the file 'ca%C3%B1o.jpg' exists, since the server decodes the request to 'caño.jpg'.
    • remove_accents() is only a partial solution since 1) it has no support at all for non-Latin based scripts and 2) it doesn't do anything with such things as curly quotes, dashes, copyright symbols or other similar things (which are included in sanitize_title_with_dashes())
    • sanitize_title() as such is no good for filenames, since it URL-encodes the non-ASCII parts.
  1. As for URLs:
    • The only allowed characters in URLs according to RFC3986 are only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes. Everything else should be encoded.
    • The attachment anchor href and img src attributes are not encoded on WP.
    • Not encoding the above may break things in some browsers.
    • Not encoding an URI may break other software that recognizes pasted URIs.
    • An encoded anchor href is handled transparently by modern browsers. For examples, see The Russian Wikipedia, where hovering on a link shows Cyrillic text but in the source, the URI is %-encoded.
Last edited 3 years ago by p_enrique (previous) (diff)

#33 @p_enrique
3 years ago

  • Cc heikki@… added

#34 @p_enrique
3 years ago

I'm not sure if I should open a new ticket for this, maybe someone will advise me. Anyway, I've been testing sanitize_file_name( 'X.jpg' ) where X is an Unicode character that is a number or a letter (matching regex /[\p{L}\p{N}]/u). Alarmingly, there are many rather common characters that will result in a malformed, broken string being returned:

(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er

A complete list is attached.

Edit:
The culprit is the preg_replace without a Unicode pattern modifier. I found that the U-modifier is used here and there in WordPress sources, usually without any checking. I'd write some patches now, but I don't know how to do it & don't have time to learn just now.

Last edited 3 years ago by p_enrique (previous) (diff)

@p_enrique
3 years ago

A list of Unicode characters that break sanitize_file_name()

#35 @tar.gz
3 years ago

Thanks p_enrique for the feedback.

If those characters you mention present an issue, how comes that they don't produce malformed post slugs when using them in a post title?

Basically, the aim of this ticket is that file names should be sanitized in the same manner as post slugs.

Another question: based on your observations on percent-encoding, would it make sense to open another ticket, asking for all permalinks in WP (including the file permalinks) to be percent-encoded? See #comment:16 above.

#36 @p_enrique
3 years ago

tar.gz: They don't produce malformed slugs because sanitize_post_title() urlencodes everything that's not ASCII. This works great for anything that's an URL. But the mechanism cannot be used for file names, since the browser cannot access a file that has a urlencoded file name as I point out in #comment:32.

#38 @p_enrique
3 years ago

My recent patch uses wp_strip_all_tags, adds filter remove_accents and relies on preg_replace (with possible UTF-8 modifier) to remove other characters with support for the sanitize_file_name_chars filter. Also converting the filename to lower case if mb_string extension is available.

Here are some test results with my recent patch:

sanitize_file_name( "Posyłają\tSzczegóły\r\n(Październik).jpg" );
 -> string(34) "posylaja-szczegoly-pazdziernik.jpg"

sanitize_file_name( "<strong>Mọi người đều có quyền <foo> tự do tham gia vào đời sống văn hoá của cộng đồn</strong>" );
 -> string(97) "mọi-nguòi-dèu-có-quyèn-tụ-do-tham-gia-vào-dòi-sóng-van-hoá-của-cọng-dòn"

sanitize_file_name( "-ÉŁè Šíæ 20 % —40° “über/Þøß” βαΣιΛειος à © Ñœijç.õs¿ & Kö.yr.ä = 10 €-.gif.gif" );
 -> string(75) "ele-siae-20-40-uber-thos-βασιλειος-a-noeijc.os-ko.yr_.a-10.gif.gif"

sanitize_file_name( "---Все люди 1) рождаются свободными и 2) равными в своем 2A) достоинстве и 2B) ПРАВАХ!!!---.jpg" );
 -> string(140) "все-люди-1-рождаются-свободными-и-2-равными-в-своем-2a-достоинстве-и-2b-правах.jpg"

sanitize_file_name( "ณ ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ โลกจะใช้เพียง Unicode เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน .jpg" );
 -> string(246) "ณ-ยามที่โลกต้องการเอ่ยถ้อยคำใดๆ-โลกจะใช้เพียง-unicode-เราจึงขอเชิญชวนท่านรีบลงทะเบียนงาน.jpg"

sanitize_file_name( "Invalid Unicode string: abc-\xff-def"
 -> string(0) ""

(There seem to be some Vietnamese characters not captured by remove_accents.)

Last edited 3 years ago by p_enrique (previous) (diff)

#39 @p_enrique
3 years ago

I have two more comments on this matter:

  1. I think there are plugins like cyr2lat, arabic-to-lat etc. that convert characters in Non-Latin scripts to ASCII. These will work if they hook to sanitize_file_name.
  1. There should be an extra filter for Windows-based servers that tries to convert everything possible to ASCII and then strips out the rest. Otherwise, the attachments and images will break under Windows. But such a filter would need a fallback filename for a case where converting/stripping leaves (practically) no ASCII at all. For example, the Russian file name (example 4 above) when stripping all non-ASCII and multiple dashes leaves 1-2-2a-2b.jpg How should such a case be handled?
Last edited 3 years ago by p_enrique (previous) (diff)

#40 @p_enrique
3 years ago

Patch updated. This now gives correct results:

sanitize_file_name( "A.String.With.Many.Dots.gif.gif.gif.ÖÑÉ.twõ.très.fö.ür" );
 --> string(61) "a.string.with_.many_.dots_.gif.gif.gif.one_.two_.tres_.fo_.ur"}}}

@p_enrique
3 years ago

Use preg_replace with possible UTF-8 modifier to strip characters, decode html entities, use remove_accents and mb_strtolower

#41 @p_enrique
3 years ago

Patch updated again. Added a call to html_entity_decode() if PCRE UTF-8 is available, fixed some typos, consolidated preg statements.

@p_enrique
3 years ago

Better check for PCRE UTF-8, updated comment block

@p_enrique
3 years ago

Updated unit test for sanitize_file_name()

#42 @p_enrique
3 years ago

  • Keywords needs-testing added; needs-unit-tests removed

I don't know if you're supposed to write unit tests for your own contributions, but I did anyway. Here they are. Please note that I modified one of the earlier tests since the earlier behaviour was to strip illegal characters, now they are replaced with dashes.

@p_enrique
3 years ago

Check for ASCII-only instead of seems_utf8() before strtolower()

@p_enrique
3 years ago

Be sensitive to original encoding that might differ from UTF-8

#43 @cristovaov
3 years ago

hi, I've been following this ticket and previously plugged patch 22363.2 (http://wp.me/KmdR)
Trying to apply patch 22363.8 also as a plugin but my technical knowledge is rather limited: getting 'Fatal error: Cannot redeclare sanitize_file_name()'

Could someone have a look? I've put my code as a gist here:
https://gist.github.com/cristovaov/7674225 --thanks

#44 @cristovaov
3 years ago

  • Cc cristovaov added

#45 @NumidWasNotAvailable
3 years ago

  • Cc NumidWasNotAvailable removed

@p_enrique
3 years ago

Correct typo: use get_bloginfo() instead of bloginfo() to get fallback encoding

#46 follow-up: @p_enrique
3 years ago

@cristovaov: You can't make a plugin out of a patch that modifies core WP files. In this case, my patch modifies the function sanitize_file_name() in /wp-includes/formatting.php. If you put the code in a plugin, you'll get precisely that error message since the function is already declared in the file I mentioned. If you want to experiment, you can try the SVN install as described in the Contributor Handbook and then applying the patch. I'd be grateful for any feedback about my patch.

#47 @SergeyBiryukov
3 years ago

The sanitize_file_name filter should stay where it is now. It should be possible to override the sanitization if needed.

@p_enrique
3 years ago

Put the filters back in the same place

#48 @p_enrique
3 years ago

  • Keywords dev-feedback added

#49 in reply to: ↑ 46 @cristovaov
3 years ago

@p_enrique: Thanks for clearing that, I was blindly assuming it was possible.
I'll look at doing an svn install--

Replying to p_enrique:

@cristovaov: You can't make a plugin out of a patch that modifies core WP files.

#50 @nacin
3 years ago

  • Keywords 3.9-early added

I think this is great but it needs a lot of time to be soaked. Let's try it out in 3.9.

#51 @nacin
3 years ago

  • Milestone changed from 3.8 to Future Release

#52 @markoheijnen
3 years ago

  • Milestone changed from Future Release to 3.9

Nacin can we give this a try this release?

This ticket was mentioned in IRC in #wordpress-dev by SergeyBiryukov. View the logs.


3 years ago

#54 follow-up: @mikeschroder
3 years ago

  • Keywords 4.0-early needs-unit-tests added; 3.9-early removed
  • Milestone changed from 3.9 to Future Release

We definitely want to do this, but it needs unit tests.

Of note, there is a lot more we can probably learn with regards to UTF8/PCRE from #22692

#55 @chriscct7
12 months ago

  • Keywords 4.0-early removed

#56 in reply to: ↑ 54 ; follow-up: @zodiac1978
9 months ago

Replying to mikeschroder:

We definitely want to do this, but it needs unit tests.

There *are* unit tests. Or not?
https://core.trac.wordpress.org/attachment/ticket/22363/test-SanitizeFileName.patch

This ticket was mentioned in Slack in #core-images by mike. View the logs.


8 months ago

This ticket was mentioned in Slack in #core by paaljoachim. View the logs.


8 months ago

#59 in reply to: ↑ 56 @mikeschroder
8 months ago

  • Keywords needs-refresh added; has-patch needs-unit-tests removed

Replying to zodiac1978:

Thre *are* unit tests. Or not?
https://core.trac.wordpress.org/attachment/ticket/22363/test-SanitizeFileName.patch

Right you are; my very late apologies. I'd like to take another look at this, but it needs a refresh by someone more knowledgable in formatting.php than I.

cc: @miqrogroove @pento Is this up your alley (in addition to those already on this ticket)?

This ticket was mentioned in Slack in #core-images by paaljoachim. View the logs.


8 months ago

#61 @markoheijnen
8 months ago

  • Keywords needs-refresh removed

I have updated it and changed some tests.

The filter sanitize_file_name_chars is almost not needed anymore with PCRE. The moment PCRE isn't available a lot of things still need to cleaned up. I assume this is a bug because output should always be the same (or close enough).

This ticket was mentioned in Slack in #core-images by paaljoachim. View the logs.


8 months ago

This ticket was mentioned in Slack in #core by chriscct7. View the logs.


8 months ago

#64 @chriscct7
8 months ago

From the core-images chat:
Marko:

First thing I want to do is split the unit tests so it runs with and without PCRE. Secondly looking at what to do with sanitize_file_name_chars. New code almost completely ignores it. Which could be fine.

https://wordpress.slack.com/archives/core-images/p1455214794000335

#65 @ocean90
8 months ago

#35809 was marked as a duplicate.

#66 @SergeyBiryukov
7 months ago

#35931 was marked as a duplicate.

#67 @Mte90
7 months ago

any hope for this in 4.5? yesterday we found a site with that problem on safari.

This ticket was mentioned in Slack in #core by mte90. View the logs.


6 months ago

#69 @SergeyBiryukov
5 months ago

#36645 was marked as a duplicate.

#70 @Mimergt
3 months ago

  • Component changed from Upload to Permalinks
  • Severity changed from normal to critical

Some patch or similar... on 4.5.3 the problem back to safari :(

I try to use the sanitize_file_name but nothing works.

This ticket was mentioned in Slack in #core-images by ocean90. View the logs.


3 months ago

#72 @jorbin
5 weeks ago

@swissspidy What are your thoughts on this one?

This ticket was mentioned in Slack in #core-images by mike. View the logs.


5 weeks ago

#74 @joemcgill
5 weeks ago

  • Milestone changed from Future Release to 4.7

Let's try to address this early in 4.7.

@swissspidy
5 weeks ago

#75 @swissspidy
5 weeks ago

  • Keywords has-patch added

I'd really like to get this into 4.7.

22363.diff makes the last patch apply cleanly against trunk again, so that should get us going. Worked well on my local install.

Replying to chriscct7:

From the core-images chat:
Marko:

First thing I want to do is split the unit tests so it runs with and without PCRE. Secondly looking at what to do with sanitize_file_name_chars. New code almost completely ignores it. Which could be fine.

https://wordpress.slack.com/archives/core-images/p1455214794000335

I agree with that. Unfortunately I'm not really familiar with all the PCRE stuff in PHP so I'd need some help there. IIRC @SergeyBiryukov has some knowledge there.

Looking at an earlier comment, @miqrogroove and @pento should be quite familiar with the formatting component, which can't hurt either.

#76 @DrProtocols
4 weeks ago

Whilst a little late to the game I would say resolving this issue (and perhaps if the resolution also extended to transliterating all international characters in a uploaded filenames and ensuring all lower-case) would be a massive win in terms of site portability since issues can often arise with character set support on different platforms and also case-sensitive vs case-insensitive filesystems.

The IEEE Portable Filename Character Set (http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html section 3.276) but with the added condition of lower-case only is an ideal goal for any uploaded media file.

To paraphrase Shakespeare, "A file by any other name still displays the same".

#77 @pento
4 weeks ago

  • Keywords dev-feedback removed

Playing with these parts of Core can cause things to break in strange ways. Let's find out what breaks!

22363.diff looks reasonable, I'm okay with #yolofriday-ing it.

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


3 weeks ago

#79 @gitlost
3 weeks ago

I think there's a number of problems with the latest proposed patch.

The encoding of the filename can only be guessed at as far as I know - it's just a string of bytes with no encoding info from the client and is not connected to the page's (or blog's) charset, so there should be no defaulting to the blog's charset. The default should just be nothing.

It's not clear to me why wp_strip_all_tags() and html_entity_decode() and entity stripping are being done, as opposed to the current solution of just stripping angle brackets and ampersands via special_chars, which seems to be as effective and is less drastic.

A small point is that the conversion of the special_chars filter to a concatenated character class regex is not strictly backward compatible, as people could have been doing multi-byte stripping, which would fail now.

The use of the $utf8_modifier is only needed when the regex contains a multi-byte pattern, and there's only 2 places in the proposed patch where that is (intentionally) the case, one of which is problematic [\s-]+ and the other arguable (?!\.)[^\p{L}\p{Nd}]+. A third use on ^[a-zA-Z]{2,5}\d?$ looks unintentional as \d will match \p{Nd} if UTF-8 mode is set and PCRE has UCP support (which unfortunately are not necessarily the same thing - see #22692#comment:39). Other uses are unneeded.

The [\s-]+ use is problematic and as it stands could actually re-introduce the bug that corrupts UTF-8 that was fixed up above (see #26094) if the filename is in UTF-8 and PCRE UTF-8 mode is unavailable (as $utf9_modifier would then be blank). Even if that's fixed it still has issues due to the difficulty in predicting what it will match - in single-byte mode it's PHP locale dependent, and in UTF-8 mode it's UCP dependent. Its introduction appears to be driven by wanting to encompass both the U+00A0 substitution (see #27281 but it's not particularly clear from that bug report whether the fix was appropriate) and the simple substitution [\r\n\t -]+. I think it would be better and more targeted to leave the current [\r\n\t -]+ as is, and to replace the current preg_match( "#\x{00a0}#siu", ' ', $filename ) with a straightforward str_replace( "\xc2\xa0", ' ', $filename ) (to be done only when the filename is in UTF-8).

The (?!\.)[^\p{L}\p{Nd}]+ use should only be done if the filename is in UTF-8, and also technically should check that UCP support is available, as a previous patch 22363.5.patch above does. This is the substantive change that fixes up the bug in remove_accents(), and works. However I think the bug in remove_accents() - which could be renamed transliterate_commonly_used_latin_to_ascii() - should be fixed in remove_accents(), as specified in the branched off bug #24661. Also to be picky(!) it removes too much, not just decomposed latin accents. The solutions mentioned in #24661 are more targeted.

The use of mb_strtolower() should probably also only be done if the filename is in UTF-8, and then explicitly pass UTF-8 as the encoding (see sanitize_title_with_dashes()), rather than using mb_detect_encoding(), which has issues.

#80 @joemcgill
2 weeks ago

Related: #37989. Looks like we need to consider use cases where people are expecting multi-word file names to be respected as the title in the media editor. 22363.diff converts the title of an attachment named This is a file.jpg to this-is-a-file-jpg, which is unexpected.

This ticket was mentioned in Slack in #core-images by mike. View the logs.


2 weeks ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


9 days ago

#83 @dustinbolton
9 days ago

While also a little late to the game, I also wanted to chime in that we regularly encounter users attempting to move their site to different servers only to be met with difficulty due to certain characters in filenames not playing nicely with their new host. It's an ongoing problem that we are seeing in greater numbers as international use of WordPress grows and more 'offending' files are uploaded into users' sites every day. Would love to see this make it in as it's been a growing known issue for 4+ years now. :)

Thanks,
Dustin Bolton
BackupBuddy Developer

This ticket was mentioned in Slack in #core-images by markoheijnen. View the logs.


8 days ago

This ticket was mentioned in Slack in #core-images by joemcgill. View the logs.


2 days ago

Note: See TracTickets for help on using tickets.