Make WordPress Core

Opened 14 years ago

Last modified 6 years ago

#15955 accepted defect (bug)

move_uploaded_file mangles non-ascii characters on Windows platforms

Reported by: t31os_'s profile t31os_ Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: Awaiting Review Priority: normal
Severity: major Version: 2.0
Component: Upload Keywords: close
Focuses: Cc:

Description

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

wp-includes/formatting - Line 677

$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", chr(0));

This array is not dealing with invalid entities that could be used in a filename, and the regular expression further down is not catching these either.

wp-includes/formatting - Line 700

if ( preg_match("/^[a-zA-Z]{2,5}\d?$/", $part) ) {

See attached images, i used 4 varying names with unusual entities in them(each a copy of a sample jpg image).

Using a filter on the valid chars array results in the extension getting stripped off but the file still makes it through the upload routine however(which is worrying).

I'm no file validation expert, so i'm not sure if this is a critical problem(marked as normal), i'll leave this for you chaps to decide.

NOTE: Ignore my hostname in the screenies, it's a 3.0.3 installation, i'm just lazy with updating my virtual host settings.

See screenshots for steps to reproduce(just create a file with some dodgy character entities and upload it basically).

Attachments (10)

image5.jpg (41.1 KB) - added by t31os_ 14 years ago.
image4.jpg (17.2 KB) - added by t31os_ 14 years ago.
image3.jpg (34.1 KB) - added by t31os_ 14 years ago.
image2.jpg (67.6 KB) - added by t31os_ 14 years ago.
image1.jpg (77.8 KB) - added by t31os_ 14 years ago.
15955.patch (986 bytes) - added by solarissmoke 14 years ago.
On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled
15955.2.patch (1001 bytes) - added by solarissmoke 14 years ago.
allow-utf8-filenames-on-windows.php (3.5 KB) - added by SergeyBiryukov 14 years ago.
A workaround for correct uploading of files with UTF-8 names on Windows systems.
allow-utf8-filenames-on-windows.2.php (3.9 KB) - added by Perepandel 12 years ago.
Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.
Remove_Special_Chars_from_filename__first_error_.patch (1.0 KB) - added by sebastian.pisula 8 years ago.

Download all attachments as: .zip

Change History (66)

@t31os_
14 years ago

@t31os_
14 years ago

@t31os_
14 years ago

@t31os_
14 years ago

@t31os_
14 years ago

#1 @morten.hauan
14 years ago

I have the same problems with image names containing the norwegian special characters æ, ø, å.

#2 in reply to: ↑ description @solarissmoke
14 years ago

Replying to t31os_:

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

I don't think this is where the problem is - sanitize_file_name only filters OS-reserved characters. All others are valid - including things like æ, ø, å - and will work fine on UNIX platforms.

The problem is a bug in PHP itself: http://bugs.php.net/bug.php?id=47096. On Windows platforms, the PHP function move_uploaded_file() converts non-ascii characters bitwise into CP1251 format. So the resultant filename isn't what WP asked it to be.

@solarissmoke
14 years ago

On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled

#3 @solarissmoke
14 years ago

  • Component changed from Media to Upload
  • Keywords has-patch needs-testing added; needs-patch removed

I'm not sure if blog_charset is what I should be using here. Otherwise this works for me.

#4 @solarissmoke
14 years ago

  • Summary changed from sanitize_file_name not removing invalid characters properly to move_uploaded_file mangles non-ascii characters on Windows platforms

#5 follow-up: @SergeyBiryukov
14 years ago

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms (at least in CP1251).

#6 in reply to: ↑ 5 @solarissmoke
14 years ago

Replying to SergeyBiryukov:

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms

Yup. Now it does.

#7 @SergeyBiryukov
14 years ago

The file with a Cyrillic name now appears in the uploads folder, however WordPress is unable to show a proper link for it (e.g. on Edit Media screen). I guess it still expects the name to be in UTF-8, at least in the database.

BTW, I'm not sure if it's always CP1251. There are different codepages on different systems. For Norway (as in the comment above), it's CP1252. See http://en.wikipedia.org/wiki/Windows_code_page.

Last edited 14 years ago by SergeyBiryukov (previous) (diff)

#8 @solarissmoke
14 years ago

Hmm, this is getting to be tricky. Maybe the best solution is to force ASCII in such situations (as per first patch) - because trying to cater for each Windows encoding looks quite impossible. Doing it this way would also avoid the database issue.

#9 follow-up: @SergeyBiryukov
14 years ago

There's actually no problem with regular characters, just with some special ones, I guess. Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

#10 in reply to: ↑ 9 @solarissmoke
14 years ago

Replying to SergeyBiryukov:

There's actually no problem with regular characters, just with some special ones, I guess.

Problem is, depending on what encoding Windows is using, the special ones could be pretty much anything - chinese, arabic, indian characters, all of these would potentially fall over here.

Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

remove_accents() looks like a throwback to the days when file names *had* to be ASCII - and when the list of non-ascii characters was small. Neither of those is true any more... I think there may actually be a case for removing it from sanitize_title at some point.

IMO sanitize_file_name should only strip OS-reserved characters. The rest are legal - and it's only because of the PHP bug that it doesn't work on Windows.

@SergeyBiryukov
14 years ago

A workaround for correct uploading of files with UTF-8 names on Windows systems.

#11 @SergeyBiryukov
14 years ago

Took more time to investigate. Actually it's not just some characters and not just move_uploaded_file(). The characters on the screenshots are perfectly valid on Windows, since NTFS uses UTF-16.

The problem is that on Windows, there's currently no way to tell PHP that we use UTF-8 in file names. It defaults to single-byte ANSI code page (Windows-1251, etc.) for console applications. Therefore abcdef²³.jpg becomes abcdefВІВі.jpg. Files with Cyrillic names get mangled too: Картинка.jpg becomes Картинка.jpg.

The names need to be converted to that ANSI charset not just in wp_handle_upload(), but also in other places like wp_generate_attachment_metadata(), wp_delete_attachment(), etc.

As a workaround, I wrote an experimental plugin that leverages current API.
Tested file uploading, deleting and image editing (the most tricky part) on the examples from the screenshots.

#12 @solarissmoke
14 years ago

Nice! This will need thorough testing but from trying a few things just now it looks to be working well.

#13 @SergeyBiryukov
13 years ago

Closed #18634 as a duplicate.

#14 @willmot
13 years ago

  • Cc willmot added

Can we not just call remove_accents in sanitize_file_name and be done with it?

#15 @pavelevap
13 years ago

  • Cc pavelevap@… added

#16 @SergeyBiryukov
13 years ago

Closed #19842 as a duplicate.

Replying to willmot:

Can we not just call remove_accents in sanitize_file_name and be done with it?

No, since remove_accents() doesn't affect Cyrillic, Arabic, Chinese characters, etc.

allow-utf8-filenames-on-windows.php still seems to be working. Would like to turn it into a patch.

#17 @SergeyBiryukov
13 years ago

  • Keywords has-patch needs-testing removed

#18 @knutsp
13 years ago

  • Cc knut@… added

#19 follow-up: @tar.gz
12 years ago

  • Cc code@… added

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP. It would be safer to sanitize those filenames during upload, with accented characters being turned into the non-accented version: èéê becoming e, à becoming a, etc.

For instance an image named image pièce-montée-236x177.jpg will display fine in Firefox or Chrome, as those browsers translate it into pie%CC%80ce-monte%CC%81e-236x177.jpg, but it's broken in Safari 5 (OSX).

#20 in reply to: ↑ 19 @SergeyBiryukov
12 years ago

Replying to tar.gz:

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP.

Not the same issue, but related. I couldn't find an existing ticket for removing accents from file names, feel free to create one.

#21 @tar.gz
12 years ago

New, related ticket: #22363 (Accents in attachment filenames should be sanitized).

#22 @Perepandel
12 years ago

Hi all,

I was testing the patch proposed by SergeyBiryukov but didn't work for me. I made some debugging and searched a bit here and there and found that maybe it's related to my PHP version. I use 5.4.8 in Win7 x64 and in http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php someone says it was working in PHP < 5.2.

So I made the proper modifications and now it's working fine for me. I'm attaching the modified file.

@Perepandel
12 years ago

Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.

#23 @Perepandel
12 years ago

I've discovered this patch has a side effect after applying it to another of my sites: my theme stopped displaying featured images of size 'thumbnail', although it was specified in the theme's code, thumbnails for the proper sizes were still being generated and it was working properly until then. I disabled the plugin and voilà - it works again. I suspect it has todo something with autfw_generate_attachment_metadata($metadata, $attachment_id) and autfw_update_attachment_metadata($metadata) but need more time to fix it.

#24 @SergeyBiryukov
12 years ago

#23588 was marked as a duplicate.

#25 @SergeyBiryukov
12 years ago

  • Keywords needs-patch added

#26 @alexvorn2
12 years ago

  • Cc alexvornoffice@… added

#27 @SergeyBiryukov
12 years ago

  • Version changed from 3.0.3 to 2.0

#28 @SergeyBiryukov
10 years ago

#28808 was marked as a duplicate.

#29 follow-up: @SergeyBiryukov
9 years ago

#32887 was marked as a duplicate.

#30 in reply to: ↑ 29 @odie2
9 years ago

Replying to SergeyBiryukov:

#32887 was marked as a duplicate.

So when it's going to be fixed? 5 years are really really much.
WordPress is well known globally, so this is little annoying to many people.

In my opinion just remove Latin characters (way from the post's and page's title slug - replace with simple character) before place entry with URL to database and rename uploaded filename.
This is the simplest way and easy to do (at least with plain PHP).

#31 @SergeyBiryukov
9 years ago

#35207 was marked as a duplicate.

#32 @swissspidy
9 years ago

#35316 was marked as a duplicate.

#33 @markoheijnen
9 years ago

Never knew about this. Would be great if we can find a solution for this in 4.5

#34 @oekarlsson
9 years ago

I have tried the latest code in this ticket on IIS8.5, PHP 7.01 and a multisite WordPress installation. It doesn't work as expected. The sanitize_file_name function that receives the reencoded filename from this code returns an empty string. I can make it all work if I comment out this line from the sanitize_file_name function:

$filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );

But that is not a practical solution...

Last edited 9 years ago by oekarlsson (previous) (diff)

#35 @Compute
8 years ago

This is really a problem in terms of sharing images on Facebook as they do not accept æøå in their filename.

#36 @odie2
8 years ago

  • Severity changed from normal to major

How long can we wait for this so important fix...?

So as @Compute saw, this is also the reason why some images uploaded by my author's have no image when sharing on Facebook...

Last edited 8 years ago by odie2 (previous) (diff)

#37 @Clorith
8 years ago

#37163 was marked as a duplicate.

#38 @sebastian.pisula
8 years ago

I have two problems during upload:

  1. I have filename: FitEasy-Konkurs-Koktajle-po-obróbce-6070.jpg. After upload file is saved in wp-content/uploads with invalid filename: FitEasy-Konkurs-Koktajle-po-obróbce-6070-300x199.jpg. Very important: this char: (ASCII: 111) not is a polish ó (ASCII: 195).
  1. So problem is with filename with polish chars: ĘÓĄŚŁŻŹĆŃęóąśłżźćń.jpg. After upload in uploads dir I can see ĘÓĄŚŁŻŹĆŃęóąśłżźćń.jpg. So I cant see preview because this can't find file in serwer.

Problem is in Windows 10, PHP Version 5.5.9, Apache/2.4.7 (Win32) OpenSSL/1.0.1e PHP/5.5.9

If I use Filenames to latin plugin second problem not exists but exists first problem.

I think that WordPress should take care that uploaded filenames should be valid without use other plugins. I think that this is important error with upload.

#39 @zosolino
8 years ago

I have the same problem.
Editor uploads a filename titled http://www.sereniti.ro/wp-content/uploads/2016/05/Talking-farm-horse-©-Anke-Van-Wyk-Dreamstime.com_.jpg, the server uploads it and then does not displays it. All weird characters must be removed.

#40 @SergeyBiryukov
8 years ago

  • Owner set to SergeyBiryukov
  • Status changed from new to accepted

#41 @SergeyBiryukov
8 years ago

#37745 was marked as a duplicate.

#42 @mostafas
8 years ago

Hi i have same issue on Persian(Farsi) language.
When upload image , for english name , image show good, but for persian name image, image uploaded , but nothings show.

I link my example uploaded Photo:
http://i.stack.imgur.com/Ub1Ig.png

  • wordpress.png (English)
  • وردپرس.png (Farsi-Persian)

#43 @r-a-y
7 years ago

Hi,

Came across this issue recently with uploaded files.

Perhaps we can consider the URLify library to help translit utf-8 filenames to ASCII?
https://github.com/jbroadway/urlify#usage

#44 @timon33
7 years ago

This has been fixed in php 7.1.

Version 0, edited 7 years ago by timon33 (next)

#45 @alexvorn2
7 years ago

I see in Wikipedia that 7.1.6 PHP version is the latest stable release so if this is fixed and everything is working we can close, for me I need to test and to see for my self.

#46 @timon33
7 years ago

Yes I tested under PHP 7.1.6 and Win10 with different locales.

Every time the result was correct.

Last edited 7 years ago by timon33 (previous) (diff)

#47 follow-up: @Clorith
7 years ago

Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.

#48 in reply to: ↑ 47 ; follow-up: @alexvorn2
7 years ago

Replying to Clorith:

Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.

So what is the solution then?

The problem is in Windows only with old php version, so the developers that use Windows should update their php version on their local web-server.

#49 in reply to: ↑ 48 @timon33
7 years ago

Replying to alexvorn2:

Replying to Clorith:

Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.

So what is the solution then?

The problem is in Windows only with old php version, so the developers that use Windows should update their php version on their local web-server.

This was a PHP bug under Windows in the first place.
The bug is fixed, so ...

#50 follow-up: @alexvorn2
7 years ago

  • Keywords close added; needs-patch removed

I tested with Cyrillic filename, uploading an image and it seems working, nice!

#51 in reply to: ↑ 50 @timon33
7 years ago

Replying to alexvorn2:

I tested with Cyrillic filename, uploading an image and it seems working, nice!

I've tested German, Greek, Nordic, Russian almost any language and all seems to work just fine(I'm migrating websites from Joomla to WP and was looking for this a long time ago).

I cannot confirm Right-to-Left Languages.

#52 @SergeyBiryukov
7 years ago

#43234 was marked as a duplicate.

#53 @SergeyBiryukov
7 years ago

#43610 was marked as a duplicate.

This ticket was mentioned in Slack in #core by clorith. View the logs.


6 years ago

#55 @SergeyBiryukov
6 years ago

#44547 was marked as a duplicate.

#56 @miyauchi
6 years ago

Hi,

A plugin WP Multibyte Patch that is shipped with Japanese package converts the filename in Japanese with the md5().

https://ja.wordpress.org/plugins/wp-multibyte-patch/

Actually, the Japanese version of the Windows is using SJIS charset.
If we download files which has UTF-8 non-ascii chars filename into Windows, the filename will be broken.

Any chance to send a patch based on WP Multibyte Patch?
The original filename will be displayed fine on wp-admin.

I'm working on removing this plugin from Japanese package, so this ticket is very important for Japanese users too.

Note: See TracTickets for help on using tickets.