WordPress.org

Make WordPress Core

Opened 5 years ago

Last modified 5 weeks ago

#15955 new defect (bug)

move_uploaded_file mangles non-ascii characters on Windows platforms

Reported by: t31os_ Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 2.0
Component: Upload Keywords: needs-patch
Focuses: Cc:

Description

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

wp-includes/formatting - Line 677

$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", chr(0));

This array is not dealing with invalid entities that could be used in a filename, and the regular expression further down is not catching these either.

wp-includes/formatting - Line 700

if ( preg_match("/^[a-zA-Z]{2,5}\d?$/", $part) ) {

See attached images, i used 4 varying names with unusual entities in them(each a copy of a sample jpg image).

Using a filter on the valid chars array results in the extension getting stripped off but the file still makes it through the upload routine however(which is worrying).

I'm no file validation expert, so i'm not sure if this is a critical problem(marked as normal), i'll leave this for you chaps to decide.

NOTE: Ignore my hostname in the screenies, it's a 3.0.3 installation, i'm just lazy with updating my virtual host settings.

See screenshots for steps to reproduce(just create a file with some dodgy character entities and upload it basically).

Attachments (9)

image5.jpg (41.1 KB) - added by t31os_ 5 years ago.
image4.jpg (17.2 KB) - added by t31os_ 5 years ago.
image3.jpg (34.1 KB) - added by t31os_ 5 years ago.
image2.jpg (67.6 KB) - added by t31os_ 5 years ago.
image1.jpg (77.8 KB) - added by t31os_ 5 years ago.
15955.patch (986 bytes) - added by solarissmoke 5 years ago.
On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled
15955.2.patch (1001 bytes) - added by solarissmoke 5 years ago.
allow-utf8-filenames-on-windows.php (3.5 KB) - added by SergeyBiryukov 5 years ago.
A workaround for correct uploading of files with UTF-8 names on Windows systems.
allow-utf8-filenames-on-windows.2.php (3.9 KB) - added by Perepandel 3 years ago.
Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.

Download all attachments as: .zip

Change History (43)

@t31os_
5 years ago

@t31os_
5 years ago

@t31os_
5 years ago

@t31os_
5 years ago

@t31os_
5 years ago

#1 @morten.hauan
5 years ago

I have the same problems with image names containing the norwegian special characters æ, ø, å.

#2 in reply to: ↑ description @solarissmoke
5 years ago

Replying to t31os_:

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

I don't think this is where the problem is - sanitize_file_name only filters OS-reserved characters. All others are valid - including things like æ, ø, å - and will work fine on UNIX platforms.

The problem is a bug in PHP itself: http://bugs.php.net/bug.php?id=47096. On Windows platforms, the PHP function move_uploaded_file() converts non-ascii characters bitwise into CP1251 format. So the resultant filename isn't what WP asked it to be.

@solarissmoke
5 years ago

On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled

#3 @solarissmoke
5 years ago

  • Component changed from Media to Upload
  • Keywords has-patch needs-testing added; needs-patch removed

I'm not sure if blog_charset is what I should be using here. Otherwise this works for me.

#4 @solarissmoke
5 years ago

  • Summary changed from sanitize_file_name not removing invalid characters properly to move_uploaded_file mangles non-ascii characters on Windows platforms

#5 follow-up: @SergeyBiryukov
5 years ago

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms (at least in CP1251).

#6 in reply to: ↑ 5 @solarissmoke
5 years ago

Replying to SergeyBiryukov:

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms

Yup. Now it does.

#7 @SergeyBiryukov
5 years ago

The file with a Cyrillic name now appears in the uploads folder, however WordPress is unable to show a proper link for it (e.g. on Edit Media screen). I guess it still expects the name to be in UTF-8, at least in the database.

BTW, I'm not sure if it's always CP1251. There are different codepages on different systems. For Norway (as in the comment above), it's CP1252. See http://en.wikipedia.org/wiki/Windows_code_page.

Last edited 5 years ago by SergeyBiryukov (previous) (diff)

#8 @solarissmoke
5 years ago

Hmm, this is getting to be tricky. Maybe the best solution is to force ASCII in such situations (as per first patch) - because trying to cater for each Windows encoding looks quite impossible. Doing it this way would also avoid the database issue.

#9 follow-up: @SergeyBiryukov
5 years ago

There's actually no problem with regular characters, just with some special ones, I guess. Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

#10 in reply to: ↑ 9 @solarissmoke
5 years ago

Replying to SergeyBiryukov:

There's actually no problem with regular characters, just with some special ones, I guess.

Problem is, depending on what encoding Windows is using, the special ones could be pretty much anything - chinese, arabic, indian characters, all of these would potentially fall over here.

Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

remove_accents() looks like a throwback to the days when file names *had* to be ASCII - and when the list of non-ascii characters was small. Neither of those is true any more... I think there may actually be a case for removing it from sanitize_title at some point.

IMO sanitize_file_name should only strip OS-reserved characters. The rest are legal - and it's only because of the PHP bug that it doesn't work on Windows.

@SergeyBiryukov
5 years ago

A workaround for correct uploading of files with UTF-8 names on Windows systems.

#11 @SergeyBiryukov
5 years ago

Took more time to investigate. Actually it's not just some characters and not just move_uploaded_file(). The characters on the screenshots are perfectly valid on Windows, since NTFS uses UTF-16.

The problem is that on Windows, there's currently no way to tell PHP that we use UTF-8 in file names. It defaults to single-byte ANSI code page (Windows-1251, etc.) for console applications. Therefore abcdef²³.jpg becomes abcdefВІВі.jpg. Files with Cyrillic names get mangled too: Картинка.jpg becomes Картинка.jpg.

The names need to be converted to that ANSI charset not just in wp_handle_upload(), but also in other places like wp_generate_attachment_metadata(), wp_delete_attachment(), etc.

As a workaround, I wrote an experimental plugin that leverages current API.
Tested file uploading, deleting and image editing (the most tricky part) on the examples from the screenshots.

#12 @solarissmoke
5 years ago

Nice! This will need thorough testing but from trying a few things just now it looks to be working well.

#13 @SergeyBiryukov
4 years ago

Closed #18634 as a duplicate.

#14 @willmot
4 years ago

  • Cc willmot added

Can we not just call remove_accents in sanitize_file_name and be done with it?

#15 @pavelevap
4 years ago

  • Cc pavelevap@… added

#16 @SergeyBiryukov
4 years ago

Closed #19842 as a duplicate.

Replying to willmot:

Can we not just call remove_accents in sanitize_file_name and be done with it?

No, since remove_accents() doesn't affect Cyrillic, Arabic, Chinese characters, etc.

allow-utf8-filenames-on-windows.php still seems to be working. Would like to turn it into a patch.

#17 @SergeyBiryukov
4 years ago

  • Keywords has-patch needs-testing removed

#18 @knutsp
4 years ago

  • Cc knut@… added

#19 follow-up: @tar.gz
3 years ago

  • Cc code@… added

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP. It would be safer to sanitize those filenames during upload, with accented characters being turned into the non-accented version: èéê becoming e, à becoming a, etc.

For instance an image named image pièce-montée-236x177.jpg will display fine in Firefox or Chrome, as those browsers translate it into pie%CC%80ce-monte%CC%81e-236x177.jpg, but it's broken in Safari 5 (OSX).

#20 in reply to: ↑ 19 @SergeyBiryukov
3 years ago

Replying to tar.gz:

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP.

Not the same issue, but related. I couldn't find an existing ticket for removing accents from file names, feel free to create one.

#21 @tar.gz
3 years ago

New, related ticket: #22363 (Accents in attachment filenames should be sanitized).

#22 @Perepandel
3 years ago

Hi all,

I was testing the patch proposed by SergeyBiryukov but didn't work for me. I made some debugging and searched a bit here and there and found that maybe it's related to my PHP version. I use 5.4.8 in Win7 x64 and in http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php someone says it was working in PHP < 5.2.

So I made the proper modifications and now it's working fine for me. I'm attaching the modified file.

@Perepandel
3 years ago

Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.

#23 @Perepandel
3 years ago

I've discovered this patch has a side effect after applying it to another of my sites: my theme stopped displaying featured images of size 'thumbnail', although it was specified in the theme's code, thumbnails for the proper sizes were still being generated and it was working properly until then. I disabled the plugin and voilà - it works again. I suspect it has todo something with autfw_generate_attachment_metadata($metadata, $attachment_id) and autfw_update_attachment_metadata($metadata) but need more time to fix it.

#24 @SergeyBiryukov
3 years ago

#23588 was marked as a duplicate.

#25 @SergeyBiryukov
3 years ago

  • Keywords needs-patch added

#26 @alexvorn2
3 years ago

  • Cc alexvornoffice@… added

#27 @SergeyBiryukov
3 years ago

  • Version changed from 3.0.3 to 2.0

#28 @SergeyBiryukov
19 months ago

#28808 was marked as a duplicate.

#29 follow-up: @SergeyBiryukov
7 months ago

#32887 was marked as a duplicate.

#30 in reply to: ↑ 29 @odie2
7 months ago

Replying to SergeyBiryukov:

#32887 was marked as a duplicate.

So when it's going to be fixed? 5 years are really really much.
WordPress is well known globally, so this is little annoying to many people.

In my opinion just remove Latin characters (way from the post's and page's title slug - replace with simple character) before place entry with URL to database and rename uploaded filename.
This is the simplest way and easy to do (at least with plain PHP).

#31 @SergeyBiryukov
8 weeks ago

#35207 was marked as a duplicate.

#32 @swissspidy
6 weeks ago

#35316 was marked as a duplicate.

#33 @markoheijnen
6 weeks ago

Never knew about this. Would be great if we can find a solution for this in 4.5

#34 @oekarlsson
5 weeks ago

I have tried the latest code in this ticket on IIS8.5, PHP 7.01 and a multisite WordPress installation. It doesn't work as expected. The sanitize_file_name function that receives the reencoded filename from this code returns an empty string. I can make it all work if I comment out this line from the sanitize_file_name function:

$filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );

But that is not a practical solution...

Last edited 5 weeks ago by oekarlsson (previous) (diff)
Note: See TracTickets for help on using tickets.