Opened 14 years ago
Last modified 6 years ago
#15955 accepted defect (bug)
move_uploaded_file mangles non-ascii characters on Windows platforms
Reported by: | t31os_ | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | major | Version: | 2.0 |
Component: | Upload | Keywords: | close |
Focuses: | Cc: |
Description
The sanitize_file_name
function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.
wp-includes/formatting - Line 677
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", chr(0));
This array is not dealing with invalid entities that could be used in a filename, and the regular expression further down is not catching these either.
wp-includes/formatting - Line 700
if ( preg_match("/^[a-zA-Z]{2,5}\d?$/", $part) ) {
See attached images, i used 4 varying names with unusual entities in them(each a copy of a sample jpg image).
Using a filter on the valid chars array results in the extension getting stripped off but the file still makes it through the upload routine however(which is worrying).
I'm no file validation expert, so i'm not sure if this is a critical problem(marked as normal), i'll leave this for you chaps to decide.
NOTE: Ignore my hostname in the screenies, it's a 3.0.3 installation, i'm just lazy with updating my virtual host settings.
See screenshots for steps to reproduce(just create a file with some dodgy character entities and upload it basically).
Attachments (10)
Change History (66)
#2
in reply to:
↑ description
@
14 years ago
Replying to t31os_:
The
sanitize_file_name
function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.
I don't think this is where the problem is - sanitize_file_name
only filters OS-reserved characters. All others are valid - including things like æ, ø, å - and will work fine on UNIX platforms.
The problem is a bug in PHP itself: http://bugs.php.net/bug.php?id=47096. On Windows platforms, the PHP function move_uploaded_file()
converts non-ascii characters bitwise into CP1251 format. So the resultant filename isn't what WP asked it to be.
@
14 years ago
On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled
#3
@
14 years ago
- Component changed from Media to Upload
- Keywords has-patch needs-testing added; needs-patch removed
I'm not sure if blog_charset
is what I should be using here. Otherwise this works for me.
#4
@
14 years ago
- Summary changed from sanitize_file_name not removing invalid characters properly to move_uploaded_file mangles non-ascii characters on Windows platforms
#5
follow-up:
↓ 6
@
14 years ago
The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms (at least in CP1251).
#6
in reply to:
↑ 5
@
14 years ago
Replying to SergeyBiryukov:
The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms
Yup. Now it does.
#7
@
14 years ago
The file with a Cyrillic name now appears in the uploads folder, however WordPress is unable to show a proper link for it (e.g. on Edit Media screen). I guess it still expects the name to be in UTF-8, at least in the database.
BTW, I'm not sure if it's always CP1251. There are different codepages on different systems. For Norway (as in the comment above), it's CP1252. See http://en.wikipedia.org/wiki/Windows_code_page.
#8
@
14 years ago
Hmm, this is getting to be tricky. Maybe the best solution is to force ASCII in such situations (as per first patch) - because trying to cater for each Windows encoding looks quite impossible. Doing it this way would also avoid the database issue.
#9
follow-up:
↓ 10
@
14 years ago
There's actually no problem with regular characters, just with some special ones, I guess. Since we already have remove_accents()
in sanitize_title()
, perhaps we can use something similar in sanitize_file_name()
too.
#10
in reply to:
↑ 9
@
14 years ago
Replying to SergeyBiryukov:
There's actually no problem with regular characters, just with some special ones, I guess.
Problem is, depending on what encoding Windows is using, the special ones could be pretty much anything - chinese, arabic, indian characters, all of these would potentially fall over here.
Since we already have
remove_accents()
insanitize_title()
, perhaps we can use something similar insanitize_file_name()
too.
remove_accents()
looks like a throwback to the days when file names *had* to be ASCII - and when the list of non-ascii characters was small. Neither of those is true any more... I think there may actually be a case for removing it from sanitize_title
at some point.
IMO sanitize_file_name
should only strip OS-reserved characters. The rest are legal - and it's only because of the PHP bug that it doesn't work on Windows.
#11
@
14 years ago
Took more time to investigate. Actually it's not just some characters and not just move_uploaded_file()
. The characters on the screenshots are perfectly valid on Windows, since NTFS uses UTF-16.
The problem is that on Windows, there's currently no way to tell PHP that we use UTF-8 in file names. It defaults to single-byte ANSI code page (Windows-1251, etc.) for console applications. Therefore abcdef²³.jpg
becomes abcdefВІВі.jpg
. Files with Cyrillic names get mangled too: Картинка.jpg
becomes Картинка.jpg
.
The names need to be converted to that ANSI charset not just in wp_handle_upload()
, but also in other places like wp_generate_attachment_metadata()
, wp_delete_attachment()
, etc.
As a workaround, I wrote an experimental plugin that leverages current API.
Tested file uploading, deleting and image editing (the most tricky part) on the examples from the screenshots.
#12
@
14 years ago
Nice! This will need thorough testing but from trying a few things just now it looks to be working well.
#14
@
13 years ago
- Cc willmot added
Can we not just call remove_accents in sanitize_file_name and be done with it?
#16
@
13 years ago
Closed #19842 as a duplicate.
Replying to willmot:
Can we not just call remove_accents in sanitize_file_name and be done with it?
No, since remove_accents()
doesn't affect Cyrillic, Arabic, Chinese characters, etc.
allow-utf8-filenames-on-windows.php still seems to be working. Would like to turn it into a patch.
#19
follow-up:
↓ 20
@
12 years ago
- Cc code@… added
Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP. It would be safer to sanitize those filenames during upload, with accented characters being turned into the non-accented version: èéê becoming e, à becoming a, etc.
For instance an image named image pièce-montée-236x177.jpg will display fine in Firefox or Chrome, as those browsers translate it into pie%CC%80ce-monte%CC%81e-236x177.jpg, but it's broken in Safari 5 (OSX).
#20
in reply to:
↑ 19
@
12 years ago
Replying to tar.gz:
Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP.
Not the same issue, but related. I couldn't find an existing ticket for removing accents from file names, feel free to create one.
#21
@
12 years ago
New, related ticket: #22363 (Accents in attachment filenames should be sanitized).
#22
@
12 years ago
Hi all,
I was testing the patch proposed by SergeyBiryukov but didn't work for me. I made some debugging and searched a bit here and there and found that maybe it's related to my PHP version. I use 5.4.8 in Win7 x64 and in http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php someone says it was working in PHP < 5.2.
So I made the proper modifications and now it's working fine for me. I'm attaching the modified file.
@
12 years ago
Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.
#23
@
12 years ago
I've discovered this patch has a side effect after applying it to another of my sites: my theme stopped displaying featured images of size 'thumbnail', although it was specified in the theme's code, thumbnails for the proper sizes were still being generated and it was working properly until then. I disabled the plugin and voilà - it works again. I suspect it has todo something with autfw_generate_attachment_metadata($metadata, $attachment_id) and autfw_update_attachment_metadata($metadata) but need more time to fix it.
#30
in reply to:
↑ 29
@
9 years ago
Replying to SergeyBiryukov:
#32887 was marked as a duplicate.
So when it's going to be fixed? 5 years are really really much.
WordPress is well known globally, so this is little annoying to many people.
In my opinion just remove Latin characters (way from the post's and page's title slug - replace with simple character) before place entry with URL to database and rename uploaded filename.
This is the simplest way and easy to do (at least with plain PHP).
#34
@
9 years ago
I have tried the latest code in this ticket on IIS8.5, PHP 7.01 and a multisite WordPress installation. It doesn't work as expected. The sanitize_file_name function that receives the reencoded filename from this code returns an empty string. I can make it all work if I comment out this line from the sanitize_file_name function:
$filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );
But that is not a practical solution...
#35
@
8 years ago
This is really a problem in terms of sharing images on Facebook as they do not accept æøå in their filename.
#36
@
8 years ago
- Severity changed from normal to major
How long can we wait for this so important fix...?
So as @Compute saw, this is also the reason why some images uploaded by my author's have no image when sharing on Facebook...
#38
@
8 years ago
I have two problems during upload:
- I have filename:
FitEasy-Konkurs-Koktajle-po-obróbce-6070.jpg
. After upload file is saved in wp-content/uploads with invalid filename:FitEasy-Konkurs-Koktajle-po-obroĚbce-6070-300x199.jpg
. Very important: this char:ó
(ASCII: 111) not is a polishó
(ASCII: 195).
- So problem is with filename with polish chars:
ĘÓĄŚŁŻŹĆŃęóąśłżźćń.jpg
. After upload in uploads dir I can seeÄÓĄŚůŹĆĹęóąśłżźćń.jpg
. So I cant see preview because this can't find file in serwer.
Problem is in Windows 10, PHP Version 5.5.9, Apache/2.4.7 (Win32) OpenSSL/1.0.1e PHP/5.5.9
If I use Filenames to latin
plugin second problem not exists but exists first problem.
I think that WordPress should take care that uploaded filenames should be valid without use other plugins. I think that this is important error with upload.
#39
@
8 years ago
I have the same problem.
Editor uploads a filename titled http://www.sereniti.ro/wp-content/uploads/2016/05/Talking-farm-horse-©-Anke-Van-Wyk-Dreamstime.com_.jpg, the server uploads it and then does not displays it. All weird characters must be removed.
#42
@
8 years ago
Hi i have same issue on Persian(Farsi) language.
When upload image , for english name , image show good, but for persian name image, image uploaded , but nothings show.
I link my example uploaded Photo:
http://i.stack.imgur.com/Ub1Ig.png
- wordpress.png (English)
- وردپرس.png (Farsi-Persian)
#43
@
7 years ago
Hi,
Came across this issue recently with uploaded files.
Perhaps we can consider the URLify library to help translit utf-8 filenames to ASCII?
https://github.com/jbroadway/urlify#usage
#45
@
7 years ago
I see in Wikipedia that 7.1.6 PHP version is the latest stable release so if this is fixed and everything is working we can close, for me I need to test and to see for my self.
#46
@
7 years ago
Yes I tested under PHP 7.1.6 and Win10 with different locales.
Every time the result was correct.
#47
follow-up:
↓ 48
@
7 years ago
Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.
#48
in reply to:
↑ 47
;
follow-up:
↓ 49
@
7 years ago
Replying to Clorith:
Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.
So what is the solution then?
The problem is in Windows only with old php version, so the developers that use Windows should update their php version on their local web-server.
#49
in reply to:
↑ 48
@
7 years ago
Replying to alexvorn2:
Replying to Clorith:
Please note that just because it is fixed in PHP 7.1, WordPress currently supports PHP back to 5.2. PHP 7.1 installs are only at about 1.7% (at this time) of all WordPress installs, so closing this out as resolved because of that is not an option.
So what is the solution then?
The problem is in Windows only with old php version, so the developers that use Windows should update their php version on their local web-server.
This was a PHP bug under Windows in the first place.
The bug is fixed, so ...
#50
follow-up:
↓ 51
@
7 years ago
- Keywords close added; needs-patch removed
I tested with Cyrillic filename, uploading an image and it seems working, nice!
#51
in reply to:
↑ 50
@
7 years ago
Replying to alexvorn2:
I tested with Cyrillic filename, uploading an image and it seems working, nice!
I've tested German, Greek, Nordic, Russian almost any language and all seems to work just fine(I'm migrating websites from Joomla to WP and was looking for this a long time ago).
I cannot confirm Right-to-Left Languages.
This ticket was mentioned in Slack in #core by clorith. View the logs.
6 years ago
#56
@
6 years ago
Hi,
A plugin WP Multibyte Patch that is shipped with Japanese package converts the filename in Japanese with the md5()
.
https://ja.wordpress.org/plugins/wp-multibyte-patch/
Actually, the Japanese version of the Windows is using SJIS charset.
If we download files which has UTF-8 non-ascii chars filename into Windows, the filename will be broken.
Any chance to send a patch based on WP Multibyte Patch?
The original filename will be displayed fine on wp-admin.
I'm working on removing this plugin from Japanese package, so this ticket is very important for Japanese users too.
I have the same problems with image names containing the norwegian special characters æ, ø, å.