Opened 14 years ago
Closed 10 years ago
#16330 closed defect (bug) (fixed)
media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
Reported by: | anonymized_154007 | Owned by: | wonderboymusic |
---|---|---|---|
Milestone: | 4.0 | Priority: | normal |
Severity: | major | Version: | 3.1 |
Component: | Media | Keywords: | commit has-patch has-unit-tests |
Focuses: | Cc: |
Description
I'm using the media_sideload_image()
method in the upcoming version of my e107 Importer script (see: http://github.com/kdeldycke/e107-importer/blob/b7925fdac6aa43db4be5b7925265a83d95fc62ad/e107-importer.php#L277 ) to upload remote images into WordPress.
This method work as expected with lots of images from a lot of different sources, but fail on URLs containing spaces.
Let me illustrate this bug with an example.
When trying to upload the image located at
http://home.nordnet.fr/francois.jankowski/pochette avant thumb.jpg
the result looks like this on the file system: http://twitpic.com/3s0dk7 . As you can see, image file names are clean. But in the Media Manager, here is what you have: http://twitpic.com/3s0e5d . No thumbnails seems to have been created.
Now, trying to fix this, I modified the original URL before calling media_sideload_image()
with the following code:
$img_url = str_replace(' ', '%20', html_entity_decode($img_url)); $new_tag = media_sideload_image($img_url, $post_id);
With this patch, here is the result on the filesystem: http://twitpic.com/3s0ets . I was surprised by WordPress not sanitizing URLs. Is that normal ?
But the most surprising stuff is in the Media Manager: http://twitpic.com/3s0hup . It looks like thanks to this hack, WordPress somehow succeeded downloading the remote file but messed with filesystem naming. What let me think this ? The Media Manager, get the right image thumbnail dimensions but not the binary payload of the thumbnail (contrary to the case above were no binary nor dimensions are available about the thumbnail).
All of this was tested in WordPress 3.1-RC2.
As for the idea of the patch above, it come from a very old version of my plugin (v0.9) that was based on WordPress 2.3.2. There, I somehow found the root cause of the problem, according the comment I wrote 3 years ago:
// The fopen() function in wp_remote_fopen() don't like URLs with space chars not translated to html entities
I should have posted this bug report sooner, as now I've forgotten everything about this issue... :(
Attachments (8)
Change History (56)
#2
@
14 years ago
For information, and after looking at WordPress source code:
media_sideload_image()
calldownload_url()
,download_url()
callwp_remote_get()
,wp_remote_get()
call (here I'm lost).
#3
@
14 years ago
It looks like spaces in URLs are allowed, but considered unsafe according RFC 1738 (source: http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them/497972#497972 ), and should be percent-encoded.
Anyway, if the WordPress project consider URLs containing spaces as invalids, what's left of this bug report seems to be: does URLs containing %20
are supported by WordPress ?. In which case it doesn't seems to work (which may be a problem on the remote server side and not on WordPress's).
#4
@
14 years ago
- Milestone Awaiting Review deleted
- Resolution set to invalid
- Status changed from new to closed
WordPress will support encoding the space as %20. Per RFC 1738, spaces are deemed unsafe and as such "All unsafe characters must always be encoded within a URL". Consider using urlencode or similar to encode unsafe characters in parts of the url.
#5
follow-up:
↓ 7
@
14 years ago
Thank you for taking the time to report the issue.
As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.
However as you wrote you might still have a related problem.
To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.
If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.
#7
in reply to:
↑ 5
;
follow-up:
↓ 8
@
14 years ago
- Keywords reporter-feedback removed
- Resolution invalid deleted
- Status changed from closed to reopened
Replying to hakre:
As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.
Thanks for this clarification. From now on I'll assume URLs with spaces are invalid and should be expressed with percent encoding.
However as you wrote you might still have a related problem.
To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.
Yes, that's exactly what I've done as a second test in my original bug report. That's the part when I use these PHP statements:
$img_url = str_replace(' ', '%20', html_entity_decode($img_url)); $new_tag = media_sideload_image($img_url, $post_id);
I just retried this test with the latest WordPress 3.1 (the final release, not the RCs) and it still have the same behaviour (look at my screenshots: http://twitpic.com/3s0ets and http://twitpic.com/3s0hup ).
If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.
I just tried to use this plugin. It complained of missing Curl. So I just installed the curl
and php5-curl
package (I'm using kubuntu 10.10) and restarted my Apache to have it work.
Then I've redone my tests (the one replacing spaces by %20
) and I still have the exact same error.
Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads
folder. That's just a guess, I still have to find hard evidences in the code...
#8
in reply to:
↑ 7
@
14 years ago
Replying to Coolkevman:
Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the
uploads
folder. That's just a guess, I still have to find hard evidences in the code...
Here is why I say this. Look at this empty pochette%20avant%20thumb.tmp
temporary file lying in wp-content
folder:
kevin@kev-laptop$ ls -lah ./wp-content total 24K drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:54 ./ drwxrwxrwx 7 www-data www-data 4.0K 2011-02-28 14:38 ../ -rwxr-xr-x 1 www-data www-data 30 2007-05-04 23:48 index.php* drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:33 plugins/ -rw-r--r-- 1 www-data www-data 0 2011-02-28 14:46 pochette%20avant%20thumb.tmp drwxr-xr-x 3 www-data www-data 4.0K 2011-02-23 15:53 themes/ drwxrwxrwx 3 www-data www-data 4.0K 2011-01-23 23:35 uploads/ $
#10
@
14 years ago
- Keywords has-patch added; close removed
I just ran into this and traced the issue back to the base filename passed by media_sideload_image()
to media_handle_sideload()
, which is still URL encoded. As such, you get the final image filename with URL encoding (%20
) in it which is then double-encoded by the attachment functions, breaking things rather badly.
To reproduce, try loading a image URL with a correctly encoded space through Press This.
Following patch may not address the fundamental issue, but it appears to work at least.
#11
@
14 years ago
- Cc Azuaron added
The same error occurs when other special characters are present in the URL (I've been sideloading images from a site which has the unfortunate habit of putting a + in the URLs).
#13
@
13 years ago
- Component changed from HTTP to Media
- Keywords needs-patch added; has-patch removed
- Severity changed from normal to major
- Summary changed from media_sideload_image() broken with URLs containing spaces to media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
- Version changed from 3.1 to 3.2.1
From what I can tell, WP has no problem sideloading images with strange (% +) characters, but has a problem serving an image with a strange character that has been sideloaded. I believe the easiest fix would be to add a couple additional lines at line 467 of wp-admin/includes/file.php
465 $filename = str_replace('?','-', $filename);
466 $filename = str_replace('&','-', $filename);
467 $filename = str_replace('%','-', $filename); //New line
468 $filename = str_replace('+','-', $filename); //New line
A more complete solution would be to use preg_replace to remove all non-approved characters from the filename instead of using str_replace:
465 //Remove line: $filename = str_replace('?','-', $filename);
466 //Remove line: $filename = str_replace('&','-', $filename);
467 $filename = preg_replace("/[^a-zA-Z0-9_-\.]/", '-', $filename); //New line
This should ensure the file, once saved, has a filename with only compatible characters.
#14
@
13 years ago
- Version changed from 3.2.1 to 3.1
Version number is used to track when the bug was initially introduced/reported.
#15
follow-up:
↓ 16
@
13 years ago
- Keywords has-patch needs-testing added; needs-patch removed
Sanitization is already partly addressed by sanitize_file_name()
, run as part of wp_unique_filename()
on line 462. It looks like adding %
and +
to that covers all the test files I've tried, though it will have a wider impact as well.
Attaching patch that adds those characters, decodes sideloaded filenames (else you end up with sanitized encoded characters) and removes the two str_replace()
lines which as far as I can tell serve no purpose given that sanitize_file_name()
already removes those characters (see r8192).
#19
@
13 years ago
- Keywords 3.5-early needs-unit-tests added
- Milestone changed from 3.4 to Future Release
#22
@
12 years ago
- Cc krembo99 added
Closed #16226 as duplicate
The problem is not only in media_sideload_image() but in every upload. it begins and ends with the sanitize_file_name() , which is not as extensive as it should be .
It will also create another problem with files that have dots ( "." ) inside the filename (example : file.name.3.4.dif.old.jpg)
In that case , and depending on other characters in the name , it will trim the extension.
The following code ,even if little elegant , resolves 90% of the problems on my tests.
However , I would recommend maybe think of a a different strategy , maybe with php filters like in the code example (commented)
add_filter('sanitize_file_name', 'k99_sanitize_file_name', 1); function k99_sanitize_file_name($filename){ $filename = preg_replace( '/[^a-z0-9_.\-]/','-',$filename); //I used Preg_replace . but maybe the whole function should go to filters ... example : // $filename = preg_replace('/[^a-zA-Z0-9._-]/','',$filename); // $filename = filter_var($filename, FILTER_SANITIZE_EMAIL);// very powerful for this purpose. // $filename= sanitize_title_with_dashes($filename); //$filename = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH); //$filename = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW); // important in some countries ! If anyone knows any others - please add. $filename = strtr($filename, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyyAAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo'); // important in some other countries ! If anyone knows any others - please add. $filename = strtr($filename, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u')); return $filename; }
#23
@
12 years ago
Related: #21241
Refreshed 16330.2.diff after [21219].
ticket:16226:16226.2.diff might be worth testing as well.
#30
@
11 years ago
It's been almost two years since any activity on this ticket and the bug still exists. If I upload a file with a plus sign in the file name, it fails. Any planned progress on this?
#31
@
11 years ago
I arrived here after experiencing the same issue with + signs as well. I can upload the file, WordPress creates images of sizes as expected, but none of them are actually visible on the site (though pulling them down lets me see them on my machine).
Note that the problem, for me, only exists on subsites in my MultiSite network. That is, if site id 1 uploads photo+1.jpg, and it goes to /wp-content/uploads/ and is accessed directly, the image is served with no issue.
If I upload the same file to any other site, and thus it ends up in /wp-content/blogs.dir/ and is served via ms-files.php, the image is not served, and rather, I'm given a 404.
I'd happily help test a patch, but am weary of spending time on a patch that's 2 years old now.
#33
@
11 years ago
I have run into this bug a couple of times trying to write a custom importers for old content full of file urls with spaces in them.
- In order to sideload an image that has a space in the filename, you need to correctly encode the spaces first.
- WordPress will then create a file with a filename that contains
%20;
- Problem is... when you try and visit this, the browser interprets this as a space. And image is broken.
I tested SergeyBiryukov's patch (16330.3.diff) This works well and solves the problem. I updated the patch to merge cleanly with current trunk, and updated sanitizeFileName unit test.
This ticket was mentioned in IRC in #wordpress-dev by Matth_eu. View the logs.
11 years ago
This ticket was mentioned in IRC in #wordpress-dev by doughamlin. View the logs.
11 years ago
#36
@
10 years ago
I've run into this trying to copy a site from a test machine to a production machine. Any media files with +
the filename do not appear on the production machine. They work fine on the test machine. I suspect it's a difference in my Apache server's security levels between test and production.
#37
@
10 years ago
chatted to johnbillion about this - decided it might be better to replace %20
and +
with dashes along with other whitespace.
I have updated the regex that handles replacing spaces and dashes to do this. Also updated unit tests.
Note that the removal of the // Strip the query strings.
code from file.php
Looked through the codebase for possible backwards compatablity issues. This is used by XML-RPC when uploading files to check whether the file exists already. This would break this however in that case the image itself would be broken anyway.
#38
@
10 years ago
I have updated the regex that handles replacing spaces and dashes to do this. Also updated unit tests.
Actually - used str_replace to handle this before the regex to handle dashes/whitespace
#39
follow-up:
↓ 42
@
10 years ago
- Keywords commit added; has-patch needs-testing 3.5-early removed
- Milestone changed from Future Release to 4.0
Replacing +
and %20
with a dash is the best approach. As mattheu mentioned, the backwards compat issue with the overwrite
flag in wp.uploadFile
is a moot issue because the image will be broken anyway.
#42
in reply to:
↑ 39
@
10 years ago
Replying to johnbillion:
Keywords commit added;
So what exactly is holding this ticket up at this point?
This ticket was mentioned in IRC in #wordpress-dev by ericmann. View the logs.
10 years ago
This ticket was mentioned in IRC in #wordpress-dev by stephdau. View the logs.
10 years ago
#45
@
10 years ago
- Keywords commit removed
Removing "commit" keyword since better tests are still needed.
#46
@
10 years ago
- Keywords commit has-patch has-unit-tests added; needs-unit-tests removed
Latest patch builds on previous ones - replacing both %20
and +
with -
. It adds multiple test conditions for spaces, though, testing unencoded spaces, encoded spaces, and a mix of all three possible versions (%20
, +
, and
).
Spaces aren't valid in a URL, so that part makes sense. If we discard that aspect of the bug report, what's left? The conflict in the media library?