Opened 2 years ago
Last modified 2 months ago
#16330 reopened defect (bug)
media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Future Release |
| Component: | Media | Version: | 3.1 |
| Severity: | major | Keywords: | has-patch needs-testing 3.5-early needs-unit-tests |
| Cc: | Azuaron, d@…, krembo99, bananastalktome@…, ryelle, davouch, alexvornoffice@… |
Description
I'm using the media_sideload_image() method in the upcoming version of my e107 Importer script (see: http://github.com/kdeldycke/e107-importer/blob/b7925fdac6aa43db4be5b7925265a83d95fc62ad/e107-importer.php#L277 ) to upload remote images into WordPress.
This method work as expected with lots of images from a lot of different sources, but fail on URLs containing spaces.
Let me illustrate this bug with an example.
When trying to upload the image located at
http://home.nordnet.fr/francois.jankowski/pochette avant thumb.jpg
the result looks like this on the file system: http://twitpic.com/3s0dk7 . As you can see, image file names are clean. But in the Media Manager, here is what you have: http://twitpic.com/3s0e5d . No thumbnails seems to have been created.
Now, trying to fix this, I modified the original URL before calling media_sideload_image() with the following code:
$img_url = str_replace(' ', '%20', html_entity_decode($img_url));
$new_tag = media_sideload_image($img_url, $post_id);
With this patch, here is the result on the filesystem: http://twitpic.com/3s0ets . I was surprised by WordPress not sanitizing URLs. Is that normal ?
But the most surprising stuff is in the Media Manager: http://twitpic.com/3s0hup . It looks like thanks to this hack, WordPress somehow succeeded downloading the remote file but messed with filesystem naming. What let me think this ? The Media Manager, get the right image thumbnail dimensions but not the binary payload of the thumbnail (contrary to the case above were no binary nor dimensions are available about the thumbnail).
All of this was tested in WordPress 3.1-RC2.
As for the idea of the patch above, it come from a very old version of my plugin (v0.9) that was based on WordPress 2.3.2. There, I somehow found the root cause of the problem, according the comment I wrote 3 years ago:
// The fopen() function in wp_remote_fopen() don't like URLs with space chars not translated to html entities
I should have posted this bug report sooner, as now I've forgotten everything about this issue... :(
Attachments (4)
Change History (32)
comment:2
Coolkevman — 2 years ago
For information, and after looking at WordPress source code:
- media_sideload_image() call download_url(),
- download_url() call wp_remote_get(),
- wp_remote_get() call (here I'm lost).
comment:3
Coolkevman — 2 years ago
It looks like spaces in URLs are allowed, but considered unsafe according RFC 1738 (source: http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them/497972#497972 ), and should be percent-encoded.
Anyway, if the WordPress project consider URLs containing spaces as invalids, what's left of this bug report seems to be: does URLs containing %20 are supported by WordPress ?. In which case it doesn't seems to work (which may be a problem on the remote server side and not on WordPress's).
- Milestone Awaiting Review deleted
- Resolution set to invalid
- Status changed from new to closed
WordPress will support encoding the space as %20. Per RFC 1738, spaces are deemed unsafe and as such "All unsafe characters must always be encoded within a URL". Consider using urlencode or similar to encode unsafe characters in parts of the url.
Thank you for taking the time to report the issue.
As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.
However as you wrote you might still have a related problem.
To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.
If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.
comment:7
in reply to:
↑ 5
;
follow-up:
↓ 8
Coolkevman — 2 years ago
- Keywords reporter-feedback removed
- Resolution invalid deleted
- Status changed from closed to reopened
Replying to hakre:
As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.
Thanks for this clarification. From now on I'll assume URLs with spaces are invalid and should be expressed with percent encoding.
However as you wrote you might still have a related problem.
To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.
Yes, that's exactly what I've done as a second test in my original bug report. That's the part when I use these PHP statements:
$img_url = str_replace(' ', '%20', html_entity_decode($img_url));
$new_tag = media_sideload_image($img_url, $post_id);
I just retried this test with the latest WordPress 3.1 (the final release, not the RCs) and it still have the same behaviour (look at my screenshots: http://twitpic.com/3s0ets and http://twitpic.com/3s0hup ).
If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.
I just tried to use this plugin. It complained of missing Curl. So I just installed the curl and php5-curl package (I'm using kubuntu 10.10) and restarted my Apache to have it work.
Then I've redone my tests (the one replacing spaces by %20) and I still have the exact same error.
Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...
comment:8
in reply to:
↑ 7
Coolkevman — 2 years ago
Replying to Coolkevman:
Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...
Here is why I say this. Look at this empty pochette%20avant%20thumb.tmp temporary file lying in wp-content folder:
kevin@kev-laptop$ ls -lah ./wp-content total 24K drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:54 ./ drwxrwxrwx 7 www-data www-data 4.0K 2011-02-28 14:38 ../ -rwxr-xr-x 1 www-data www-data 30 2007-05-04 23:48 index.php* drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:33 plugins/ -rw-r--r-- 1 www-data www-data 0 2011-02-28 14:46 pochette%20avant%20thumb.tmp drwxr-xr-x 3 www-data www-data 4.0K 2011-02-23 15:53 themes/ drwxrwxrwx 3 www-data www-data 4.0K 2011-01-23 23:35 uploads/ $
comment:10
kawauso — 2 years ago
- Keywords has-patch added; close removed
I just ran into this and traced the issue back to the base filename passed by media_sideload_image() to media_handle_sideload(), which is still URL encoded. As such, you get the final image filename with URL encoding (%20) in it which is then double-encoded by the attachment functions, breaking things rather badly.
To reproduce, try loading a image URL with a correctly encoded space through Press This.
Following patch may not address the fundamental issue, but it appears to work at least.
comment:11
Azuaron — 2 years ago
- Cc Azuaron added
The same error occurs when other special characters are present in the URL (I've been sideloading images from a site which has the unfortunate habit of putting a + in the URLs).
- Cc d@… added
comment:13
Azuaron — 19 months ago
- Component changed from HTTP to Media
- Keywords needs-patch added; has-patch removed
- Severity changed from normal to major
- Summary changed from media_sideload_image() broken with URLs containing spaces to media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
- Version changed from 3.1 to 3.2.1
From what I can tell, WP has no problem sideloading images with strange (% +) characters, but has a problem serving an image with a strange character that has been sideloaded. I believe the easiest fix would be to add a couple additional lines at line 467 of wp-admin/includes/file.php
465 $filename = str_replace('?','-', $filename);
466 $filename = str_replace('&','-', $filename);
467 $filename = str_replace('%','-', $filename); //New line
468 $filename = str_replace('+','-', $filename); //New line
A more complete solution would be to use preg_replace to remove all non-approved characters from the filename instead of using str_replace:
465 //Remove line: $filename = str_replace('?','-', $filename);
466 //Remove line: $filename = str_replace('&','-', $filename);
467 $filename = preg_replace("/[^a-zA-Z0-9_\-.]/", '-', $filename); //New line
This should ensure the file, once saved, has a filename with only compatible characters.
- Version changed from 3.2.1 to 3.1
Version number is used to track when the bug was initially introduced/reported.
comment:15
follow-up:
↓ 16
kawauso — 19 months ago
- Keywords has-patch needs-testing added; needs-patch removed
Sanitization is already partly addressed by sanitize_file_name(), run as part of wp_unique_filename() on line 462. It looks like adding % and + to that covers all the test files I've tried, though it will have a wider impact as well.
Attaching patch that adds those characters, decodes sideloaded filenames (else you end up with sanitized encoded characters) and removes the two str_replace() lines which as far as I can tell serve no purpose given that sanitize_file_name() already removes those characters (see r8192).
comment:16
in reply to:
↑ 15
dancriel — 16 months ago
kawauso's patch worked for me in WP 3.3.1, thanks
comment:17
khag7 — 15 months ago
I have been using kawauso's patch on WP 3.3.1 with no problems. Thank you!
comment:18
kawauso — 15 months ago
- Milestone changed from Awaiting Review to 3.4
comment:19
nacin — 13 months ago
- Keywords 3.5-early needs-unit-tests added
- Milestone changed from 3.4 to Future Release
#21116 closed as a duplicate.
Closed #21244 as a duplicate.
comment:22
krembo99 — 10 months ago
- Cc krembo99 added
Closed #16226 as duplicate
The problem is not only in media_sideload_image() but in every upload. it begins and ends with the sanitize_file_name() , which is not as extensive as it should be .
It will also create another problem with files that have dots ( "." ) inside the filename (example : file.name.3.4.dif.old.jpg)
In that case , and depending on other characters in the name , it will trim the extension.
The following code ,even if little elegant , resolves 90% of the problems on my tests.
However , I would recommend maybe think of a a different strategy , maybe with php filters like in the code example (commented)
add_filter('sanitize_file_name', 'k99_sanitize_file_name', 1);
function k99_sanitize_file_name($filename){
$filename = preg_replace( '/[^a-z0-9_.\-]/','-',$filename);
//I used Preg_replace . but maybe the whole function should go to filters ... example :
// $filename = preg_replace('/[^a-zA-Z0-9._-]/','',$filename);
// $filename = filter_var($filename, FILTER_SANITIZE_EMAIL);// very powerful for this purpose.
// $filename= sanitize_title_with_dashes($filename);
//$filename = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
//$filename = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
// important in some countries ! If anyone knows any others - please add.
$filename = strtr($filename, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyyAAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
// important in some other countries ! If anyone knows any others - please add.
$filename = strtr($filename, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));
return $filename;
}
Related: #21241
Refreshed 16330.2.diff after [21219].
ticket:16226:16226.2.diff might be worth testing as well.
- Cc bananastalktome@… added
comment:25
ryelle — 6 months ago
- Cc ryelle added
comment:26
davouch — 4 months ago
- Cc davouch added
comment:27
SergeyBiryukov — 2 months ago
#23777 was marked as a duplicate.
comment:28
alexvorn2 — 2 months ago
- Cc alexvornoffice@… added

Spaces aren't valid in a URL, so that part makes sense. If we discard that aspect of the bug report, what's left? The conflict in the media library?