Opened 2 years ago
Last modified 3 months ago
#15955 new defect (bug)
move_uploaded_file mangles non-ascii characters on Windows platforms
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | Upload | Version: | 2.0 |
| Severity: | normal | Keywords: | needs-patch |
| Cc: | willmot, pavelevap@…, knut@…, code@…, alexvornoffice@… |
Description
The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.
wp-includes/formatting - Line 677
$special_chars = array("?", "[", "]", "/", "\\", "=", "<", ">", ":", ";", ",", "'", "\"", "&", "$", "#", "*", "(", ")", "|", "~", "`", "!", "{", "}", chr(0));
This array is not dealing with invalid entities that could be used in a filename, and the regular expression further down is not catching these either.
wp-includes/formatting - Line 700
if ( preg_match("/^[a-zA-Z]{2,5}\d?$/", $part) ) {
See attached images, i used 4 varying names with unusual entities in them(each a copy of a sample jpg image).
Using a filter on the valid chars array results in the extension getting stripped off but the file still makes it through the upload routine however(which is worrying).
I'm no file validation expert, so i'm not sure if this is a critical problem(marked as normal), i'll leave this for you chaps to decide.
NOTE: Ignore my hostname in the screenies, it's a 3.0.3 installation, i'm just lazy with updating my virtual host settings.
See screenshots for steps to reproduce(just create a file with some dodgy character entities and upload it basically).
Attachments (9)
Change History (36)
comment:1
morten.hauan — 2 years ago
comment:2
in reply to:
↑ description
solarissmoke — 2 years ago
Replying to t31os_:
The sanitize_file_name function is not filtering alot of character entities like the degree symbol, this results in invalid media item paths, see the attached images.
I don't think this is where the problem is - sanitize_file_name only filters OS-reserved characters. All others are valid - including things like æ, ø, å - and will work fine on UNIX platforms.
The problem is a bug in PHP itself: http://bugs.php.net/bug.php?id=47096. On Windows platforms, the PHP function move_uploaded_file() converts non-ascii characters bitwise into CP1251 format. So the resultant filename isn't what WP asked it to be.
solarissmoke — 2 years ago
On Windows platforms, transliterate non-ascii characters into ascii best-match so that file names aren't mangled
comment:3
solarissmoke — 2 years ago
- Component changed from Media to Upload
- Keywords has-patch needs-testing added; needs-patch removed
I'm not sure if blog_charset is what I should be using here. Otherwise this works for me.
comment:4
solarissmoke — 2 years ago
- Summary changed from sanitize_file_name not removing invalid characters properly to move_uploaded_file mangles non-ascii characters on Windows platforms
comment:5
follow-up:
↓ 6
SergeyBiryukov — 2 years ago
The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms (at least in CP1251).
comment:6
in reply to:
↑ 5
solarissmoke — 2 years ago
Replying to SergeyBiryukov:
The patch doesn't allow Cyrillic characters, which are still valid on Windows platforms
Yup. Now it does.
solarissmoke — 2 years ago
comment:7
SergeyBiryukov — 2 years ago
The file with a Cyrillic name now appears in the uploads folder, however WordPress is unable to show a proper link for it (e.g. on Edit Media screen). I guess it still expects the name to be in UTF-8, at least in the database.
BTW, I'm not sure if it's always CP1251. There are different codepages on different systems. For Norway (as in the comment above) it's CP1252. See http://en.wikipedia.org/wiki/Windows_code_page.
comment:8
solarissmoke — 2 years ago
Hmm, this is getting to be tricky. Maybe the best solution is to force ASCII in such situations (as per first patch) - because trying to cater for each Windows encoding looks quite impossible. Doing it this way would also avoid the database issue.
comment:9
follow-up:
↓ 10
SergeyBiryukov — 2 years ago
There's actually no problem with regular characters, just with some special ones, I guess. Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.
comment:10
in reply to:
↑ 9
solarissmoke — 2 years ago
Replying to SergeyBiryukov:
There's actually no problem with regular characters, just with some special ones, I guess.
Problem is, depending on what encoding Windows is using, the special ones could be pretty much anything - chinese, arabic, indian characters, all of these would potentially fall over here.
Since we already have remove_accents() in sanitize_title(), perhaps we can use something similar in sanitize_file_name() too.
remove_accents() looks like a throwback to the days when file names *had* to be ASCII - and when the list of non-ascii characters was small. Neither of those is true any more... I think there may actually be a case for removing it from sanitize_title at some point.
IMO sanitize_file_name should only strip OS-reserved characters. The rest are legal - and it's only because of the PHP bug that it doesn't work on Windows.
SergeyBiryukov — 2 years ago
A workaround for correct uploading of files with UTF-8 names on Windows systems.
comment:11
SergeyBiryukov — 2 years ago
Took more time to investigate. Actually it's not just some characters and not just move_uploaded_file(). The characters on the screenshots are perfectly valid on Windows, since NTFS uses UTF-16.
The problem is that on Windows, there's currently no way to tell PHP that we use UTF-8 in file names. It defaults to single-byte ANSI code page (Windows-1251, etc.) for console applications. Therefore abcdef²³.jpg becomes abcdefВІВі.jpg. Files with Cyrillic names get mangled too: Картинка.jpg becomes Картинка.jpg.
The names need to be converted to that ANSI charset not just in wp_handle_upload(), but also in other places like wp_generate_attachment_metadata(), wp_delete_attachment(), etc.
As a workaround, I wrote an experimental plugin that leverages current API.
Tested file uploading, deleting and image editing (the most tricky part) on the examples from the screenshots.
comment:12
solarissmoke — 2 years ago
Nice! This will need thorough testing but from trying a few things just now it looks to be working well.
Closed #18634 as a duplicate.
comment:14
willmot — 17 months ago
- Cc willmot added
Can we not just call remove_accents in sanitize_file_name and be done with it?
comment:15
pavelevap — 17 months ago
- Cc pavelevap@… added
Closed #19842 as a duplicate.
Replying to willmot:
Can we not just call remove_accents in sanitize_file_name and be done with it?
No, since remove_accents() doesn't affect Cyrillic, Arabic, Chinese characters, etc.
allow-utf8-filenames-on-windows.php still seems to be working. Would like to turn it into a patch.
- Keywords has-patch needs-testing removed
comment:18
knutsp — 16 months ago
- Cc knut@… added
comment:19
follow-up:
↓ 20
tar.gz — 8 months ago
- Cc code@… added
Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP. It would be safer to sanitize those filenames during upload, with accented characters being turned into the non-accented version: èéê becoming e, à becoming a, etc.
For instance an image named image pièce-montée-236x177.jpg will display fine in Firefox or Chrome, as those browsers translate it into pie%CC%80ce-monte%CC%81e-236x177.jpg, but it's broken in Safari 5 (OSX).
comment:20
in reply to:
↑ 19
SergeyBiryukov — 7 months ago
Replying to tar.gz:
Not sure if this is the same issue but: filenames with french accents in them - such as pièce-montée.jpg - are left as they are by WP.
Not the same issue, but related. I couldn't find an existing ticket for removing accents from file names, feel free to create one.
comment:21
tar.gz — 7 months ago
New, related ticket: #22363 (Accents in attachment filenames should be sanitized).
comment:22
Perepandel — 5 months ago
Hi all,
I was testing the patch proposed by SergeyBiryukov but didn't work for me. I made some debugging and searched a bit here and there and found that maybe it's related to my PHP version. I use 5.4.8 in Win7 x64 and in http://stackoverflow.com/questions/13788415/how-to-retrieve-the-current-windows-codepage-in-php someone says it was working in PHP < 5.2.
So I made the proper modifications and now it's working fine for me. I'm attaching the modified file.
Perepandel — 5 months ago
Modification to SergeyBiryukov workaround in order to work when setlocale is returning "C" as Windows code page, which seems to be the case in recent PHP versions.
comment:23
Perepandel — 4 months ago
I've discovered this patch has a side effect after applying it to another of my sites: my theme stopped displaying featured images of size 'thumbnail', although it was specified in the theme's code, thumbnails for the proper sizes were still being generated and it was working properly until then. I disabled the plugin and voilà - it works again. I suspect it has todo something with autfw_generate_attachment_metadata($metadata, $attachment_id) and autfw_update_attachment_metadata($metadata) but need more time to fix it.
comment:24
SergeyBiryukov — 3 months ago
#23588 was marked as a duplicate.
comment:25
SergeyBiryukov — 3 months ago
- Keywords needs-patch added
comment:26
alexvorn2 — 3 months ago
- Cc alexvornoffice@… added
comment:27
SergeyBiryukov — 3 months ago
- Version changed from 3.0.3 to 2.0

I have the same problems with image names containing the norwegian special characters æ, ø, å.