WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 14 months ago

#15955 new defect (bug)

move_uploaded_file mangles non-ascii characters on Windows platforms

Reported by: t31os_ Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 2.0
Component: Upload Keywords: needs-patch
Focuses: Cc:

Description

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

wp-includes/formatting - Line 677

$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", chr(0));

This array is not dealing with invalid entities that could be used in a filename, and the regular expression further down is not catching these either.

wp-includes/formatting - Line 700

if ( preg_match("/^[a-zA-Z]{2,5}\d?$/", $part) ) {

See attached images, i used 4 varying names with unusual entities in them(each a copy of a sample jpg image).

Using a filter on the valid chars array results in the extension getting stripped off but the file still makes it through the upload routine however(which is worrying).

I'm no file validation expert, so i'm not sure if this is a critical problem(marked as normal), i'll leave this for you chaps to decide.

NOTE: Ignore my hostname in the screenies, it's a 3.0.3 installation, i'm just lazy with updating my virtual host settings.

See screenshots for steps to reproduce(just create a file with some dodgy character entities and upload it basically).

Attachments (9)

image5.jpg (41.1 KB) - added by t31os_ 3 years ago.
image4.jpg (17.2 KB) - added by t31os_ 3 years ago.
image3.jpg (34.1 KB) - added by t31os_ 3 years ago.
image2.jpg (67.6 KB) - added by t31os_ 3 years ago.
image1.jpg (77.8 KB) - added by t31os_ 3 years ago.
15955.patch (986 bytes) - added by solarissmoke 3 years ago.
On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled
15955.2.patch (1001 bytes) - added by solarissmoke 3 years ago.
allow-utf8-filenames-on-windows.php (3.5 KB) - added by SergeyBiryukov 3 years ago.
A workaround for correct uploading of files with UTF-8 names on Windows systems.
allow-utf8-filenames-on-windows.2.php (3.9 KB) - added by Perepandel 16 months ago.
Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.

Download all attachments as: .zip

Change History (36)

t31os_3 years ago

t31os_3 years ago

t31os_3 years ago

t31os_3 years ago

t31os_3 years ago

comment:1 morten.hauan3 years ago

I have the same problems with image names containing the norwegian special characters æ, ø, å.

comment:2 in reply to: ↑ description solarissmoke3 years ago

Replying to t31os_:

The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.

I don't think this is where the problem is - sanitize_file_name only filters OS-reserved characters. All others are valid - including things like æ, ø, å - and will work fine on UNIX platforms.

The problem is a bug in PHP itself: http://bugs.php.net/bug.php?id=47096. On Windows platforms, the PHP function move_uploaded_file() converts non-ascii characters bitwise into CP1251 format. So the resultant filename isn't what WP asked it to be.

solarissmoke3 years ago

On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled

comment:3 solarissmoke3 years ago

  • Component changed from Media to Upload
  • Keywords has-patch needs-testing added; needs-patch removed

I'm not sure if blog_charset is what I should be using here. Otherwise this works for me.

comment:4 solarissmoke3 years ago

  • Summary changed from sanitize_file_name not removing invalid characters properly to move_uploaded_file mangles non-ascii characters on Windows platforms

comment:5 follow-up: SergeyBiryukov3 years ago

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms (at least in CP1251).

comment:6 in reply to: ↑ 5 solarissmoke3 years ago

Replying to SergeyBiryukov:

The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms

Yup. Now it does.

solarissmoke3 years ago

comment:7 SergeyBiryukov3 years ago

The file with a Cyrillic name now appears in the uploads folder, however WordPress is unable to show a proper link for it (e.g. on Edit Media screen). I guess it still expects the name to be in UTF-8, at least in the database.

BTW, I'm not sure if it's always CP1251. There are different codepages on different systems. For Norway (as in the comment above) it's CP1252. See http://en.wikipedia.org/wiki/Windows_code_page.

Version 1, edited 3 years ago by SergeyBiryukov (previous) (next) (diff)

comment:8 solarissmoke3 years ago

Hmm, this is getting to be tricky. Maybe the best solution is to force ASCII in such situations (as per first patch) - because trying to cater for each Windows encoding looks quite impossible. Doing it this way would also avoid the database issue.

comment:9 follow-up: SergeyBiryukov3 years ago

There's actually no problem with regular characters, just with some special ones, I guess. Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

comment:10 in reply to: ↑ 9 solarissmoke3 years ago

Replying to SergeyBiryukov:

There's actually no problem with regular characters, just with some special ones, I guess.

Problem is, depending on what encoding Windows is using, the special ones could be pretty much anything - chinese, arabic, indian characters, all of these would potentially fall over here.

Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.

remove_accents() looks like a throwback to the days when file names *had* to be ASCII - and when the list of non-ascii characters was small. Neither of those is true any more... I think there may actually be a case for removing it from sanitize_title at some point.

IMO sanitize_file_name should only strip OS-reserved characters. The rest are legal - and it's only because of the PHP bug that it doesn't work on Windows.

SergeyBiryukov3 years ago

A workaround for correct uploading of files with UTF-8 names on Windows systems.

comment:11 SergeyBiryukov3 years ago

Took more time to investigate. Actually it's not just some characters and not just move_uploaded_file(). The characters on the screenshots are perfectly valid on Windows, since NTFS uses UTF-16.

The problem is that on Windows, there's currently no way to tell PHP that we use UTF-8 in file names. It defaults to single-byte ANSI code page (Windows-1251, etc.) for console applications. Therefore abcdef²³.jpg becomes abcdefВІВі.jpg. Files with Cyrillic names get mangled too: Картинка.jpg becomes Картинка.jpg.

The names need to be converted to that ANSI charset not just in wp_handle_upload(), but also in other places like wp_generate_attachment_metadata(), wp_delete_attachment(), etc.

As a workaround, I wrote an experimental plugin that leverages current API.
Tested file uploading, deleting and image editing (the most tricky part) on the examples from the screenshots.

comment:12 solarissmoke3 years ago

Nice! This will need thorough testing but from trying a few things just now it looks to be working well.

comment:13 SergeyBiryukov3 years ago

Closed #18634 as a duplicate.

comment:14 willmot2 years ago

  • Cc willmot added

Can we not just call remove_accents in sanitize_file_name and be done with it?

comment:15 pavelevap2 years ago

  • Cc pavelevap@… added

comment:16 SergeyBiryukov2 years ago

Closed #19842 as a duplicate.

Replying to willmot:

Can we not just call remove_accents in sanitize_file_name and be done with it?

No, since remove_accents() doesn't affect Cyrillic, Arabic, Chinese characters, etc.

allow-utf8-filenames-on-windows.php still seems to be working. Would like to turn it into a patch.

comment:17 SergeyBiryukov2 years ago

  • Keywords has-patch needs-testing removed

comment:18 knutsp2 years ago

  • Cc knut@… added

comment:19 follow-up: tar.gz19 months ago

  • Cc code@… added

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP. It would be safer to sanitize those filenames during upload, with accented characters being turned into the non-accented version: èéê becoming e, à becoming a, etc.

For instance an image named image pièce-montée-236x177.jpg will display fine in Firefox or Chrome, as those browsers translate it into pie%CC%80ce-monte%CC%81e-236x177.jpg, but it's broken in Safari 5 (OSX).

comment:20 in reply to: ↑ 19 SergeyBiryukov18 months ago

Replying to tar.gz:

Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP.

Not the same issue, but related. I couldn't find an existing ticket for removing accents from file names, feel free to create one.

comment:21 tar.gz18 months ago

New, related ticket: #22363 (Accents in attachment filenames should be sanitized).

comment:22 Perepandel16 months ago

Hi all,

I was testing the patch proposed by SergeyBiryukov but didn't work for me. I made some debugging and searched a bit here and there and found that maybe it's related to my PHP version. I use 5.4.8 in Win7 x64 and in http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php someone says it was working in PHP < 5.2.

So I made the proper modifications and now it's working fine for me. I'm attaching the modified file.

Perepandel16 months ago

Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.

comment:23 Perepandel15 months ago

I've discovered this patch has a side effect after applying it to another of my sites: my theme stopped displaying featured images of size 'thumbnail', although it was specified in the theme's code, thumbnails for the proper sizes were still being generated and it was working properly until then. I disabled the plugin and voilà - it works again. I suspect it has todo something with autfw_generate_attachment_metadata($metadata, $attachment_id) and autfw_update_attachment_metadata($metadata) but need more time to fix it.

comment:24 SergeyBiryukov14 months ago

#23588 was marked as a duplicate.

comment:25 SergeyBiryukov14 months ago

  • Keywords needs-patch added

comment:26 alexvorn214 months ago

  • Cc alexvornoffice@… added

comment:27 SergeyBiryukov14 months ago

  • Version changed from 3.0.3 to 2.0
Note: See TracTickets for help on using tickets.