Make WordPress Core

Opened 14 years ago

Closed 10 years ago

#16330 closed defect (bug) (fixed)

media_sideload_image() broken with filenames containing strange characters (e.g., +, %)

Reported by: anonymized_154007's profile anonymized_154007 Owned by: wonderboymusic's profile wonderboymusic
Milestone: 4.0 Priority: normal
Severity: major Version: 3.1
Component: Media Keywords: commit has-patch has-unit-tests
Focuses: Cc:

Description

I'm using the media_sideload_image() method in the upcoming version of my e107 Importer script (see: http://github.com/kdeldycke/e107-importer/blob/b7925fdac6aa43db4be5b7925265a83d95fc62ad/e107-importer.php#L277 ) to upload remote images into WordPress.

This method work as expected with lots of images from a lot of different sources, but fail on URLs containing spaces.

Let me illustrate this bug with an example.

When trying to upload the image located at

http://home.nordnet.fr/francois.jankowski/pochette avant thumb.jpg

the result looks like this on the file system: http://twitpic.com/3s0dk7 . As you can see, image file names are clean. But in the Media Manager, here is what you have: http://twitpic.com/3s0e5d . No thumbnails seems to have been created.

Now, trying to fix this, I modified the original URL before calling media_sideload_image() with the following code:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

With this patch, here is the result on the filesystem: http://twitpic.com/3s0ets . I was surprised by WordPress not sanitizing URLs. Is that normal ?

But the most surprising stuff is in the Media Manager: http://twitpic.com/3s0hup . It looks like thanks to this hack, WordPress somehow succeeded downloading the remote file but messed with filesystem naming. What let me think this ? The Media Manager, get the right image thumbnail dimensions but not the binary payload of the thumbnail (contrary to the case above were no binary nor dimensions are available about the thumbnail).

All of this was tested in WordPress 3.1-RC2.

As for the idea of the patch above, it come from a very old version of my plugin (v0.9) that was based on WordPress 2.3.2. There, I somehow found the root cause of the problem, according the comment I wrote 3 years ago:

 // The fopen() function in wp_remote_fopen() don't like URLs with space chars not translated to html entities

I should have posted this bug report sooner, as now I've forgotten everything about this issue... :(

Attachments (8)

16330.diff (578 bytes) - added by kawauso 13 years ago.
urldecode() the end filename
16330.2.diff (2.0 KB) - added by kawauso 13 years ago.
% and + to sanitize_file_name(), urldecode() sideloaded filenames
sideload-test.php (1.6 KB) - added by kawauso 13 years ago.
Quick sideload test script
16330.3.diff (1.9 KB) - added by SergeyBiryukov 12 years ago.
Refreshed
16330.4.diff (1.9 KB) - added by mattheu 11 years ago.
Refreshed to merge cleanly
16330.tests.php (805 bytes) - added by mattheu 11 years ago.
Update SanitizeFileName test
16330.5.diff (2.0 KB) - added by mattheu 10 years ago.
16330.6.diff (2.1 KB) - added by ericmann 10 years ago.
Refresh patch, add multiple unit test cases.

Download all attachments as: .zip

Change History (56)

#1 @nacin
14 years ago

Spaces aren't valid in a URL, so that part makes sense. If we discard that aspect of the bug report, what's left? The conflict in the media library?

#2 @anonymized_154007
14 years ago

For information, and after looking at WordPress source code:

  1. media_sideload_image() call download_url(),
  2. download_url() call wp_remote_get(),
  3. wp_remote_get() call (here I'm lost).

#3 @anonymized_154007
14 years ago

It looks like spaces in URLs are allowed, but considered unsafe according RFC 1738 (source: http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them/497972#497972 ), and should be percent-encoded.

Anyway, if the WordPress project consider URLs containing spaces as invalids, what's left of this bug report seems to be: does URLs containing %20 are supported by WordPress ?. In which case it doesn't seems to work (which may be a problem on the remote server side and not on WordPress's).

#4 @sivel
14 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to invalid
  • Status changed from new to closed

WordPress will support encoding the space as %20. Per RFC 1738, spaces are deemed unsafe and as such "All unsafe characters must always be encoded within a URL". Consider using urlencode or similar to encode unsafe characters in parts of the url.

#5 follow-up: @hakre
14 years ago

Thank you for taking the time to report the issue.

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

#6 @hakre
14 years ago

  • Keywords reporter-feedback added

#7 in reply to: ↑ 5 ; follow-up: @anonymized_154007
14 years ago

  • Keywords reporter-feedback removed
  • Resolution invalid deleted
  • Status changed from closed to reopened

Replying to hakre:

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

Thanks for this clarification. From now on I'll assume URLs with spaces are invalid and should be expressed with percent encoding.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

Yes, that's exactly what I've done as a second test in my original bug report. That's the part when I use these PHP statements:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

I just retried this test with the latest WordPress 3.1 (the final release, not the RCs) and it still have the same behaviour (look at my screenshots: http://twitpic.com/3s0ets and http://twitpic.com/3s0hup ).

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

I just tried to use this plugin. It complained of missing Curl. So I just installed the curl and php5-curl package (I'm using kubuntu 10.10) and restarted my Apache to have it work.

Then I've redone my tests (the one replacing spaces by %20) and I still have the exact same error.

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

#8 in reply to: ↑ 7 @anonymized_154007
14 years ago

Replying to Coolkevman:

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

Here is why I say this. Look at this empty pochette%20avant%20thumb.tmp temporary file lying in wp-content folder:

kevin@kev-laptop$ ls -lah ./wp-content
total 24K
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:54 ./
drwxrwxrwx 7 www-data www-data 4.0K 2011-02-28 14:38 ../
-rwxr-xr-x 1 www-data www-data   30 2007-05-04 23:48 index.php*
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:33 plugins/
-rw-r--r-- 1 www-data www-data    0 2011-02-28 14:46 pochette%20avant%20thumb.tmp
drwxr-xr-x 3 www-data www-data 4.0K 2011-02-23 15:53 themes/
drwxrwxrwx 3 www-data www-data 4.0K 2011-01-23 23:35 uploads/
$ 

#9 @nacin
14 years ago

  • Keywords close added
  • Milestone set to Awaiting Review

#10 @kawauso
13 years ago

  • Keywords has-patch added; close removed

I just ran into this and traced the issue back to the base filename passed by media_sideload_image() to media_handle_sideload(), which is still URL encoded. As such, you get the final image filename with URL encoding (%20) in it which is then double-encoded by the attachment functions, breaking things rather badly.

To reproduce, try loading a image URL with a correctly encoded space through Press This.

Following patch may not address the fundamental issue, but it appears to work at least.

@kawauso
13 years ago

urldecode() the end filename

#11 @Azuaron
13 years ago

  • Cc Azuaron added

The same error occurs when other special characters are present in the URL (I've been sideloading images from a site which has the unfortunate habit of putting a + in the URLs).

#12 @danielbachhuber
13 years ago

  • Cc d@… added

#13 @Azuaron
13 years ago

  • Component changed from HTTP to Media
  • Keywords needs-patch added; has-patch removed
  • Severity changed from normal to major
  • Summary changed from media_sideload_image() broken with URLs containing spaces to media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
  • Version changed from 3.1 to 3.2.1

From what I can tell, WP has no problem sideloading images with strange (% +) characters, but has a problem serving an image with a strange character that has been sideloaded. I believe the easiest fix would be to add a couple additional lines at line 467 of wp-admin/includes/file.php

465 $filename = str_replace('?','-', $filename);
466 $filename = str_replace('&','-', $filename);
467 $filename = str_replace('%','-', $filename); //New line
468 $filename = str_replace('+','-', $filename); //New line

A more complete solution would be to use preg_replace to remove all non-approved characters from the filename instead of using str_replace:

465 //Remove line: $filename = str_replace('?','-', $filename);
466 //Remove line: $filename = str_replace('&','-', $filename);
467 $filename = preg_replace("/[^a-zA-Z0-9_\-.]/", '-', $filename); //New line

This should ensure the file, once saved, has a filename with only compatible characters.

Last edited 13 years ago by Azuaron (previous) (diff)

#14 @SergeyBiryukov
13 years ago

  • Version changed from 3.2.1 to 3.1

Version number is used to track when the bug was initially introduced/reported.

#15 follow-up: @kawauso
13 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Sanitization is already partly addressed by sanitize_file_name(), run as part of wp_unique_filename() on line 462. It looks like adding % and + to that covers all the test files I've tried, though it will have a wider impact as well.

Attaching patch that adds those characters, decodes sideloaded filenames (else you end up with sanitized encoded characters) and removes the two str_replace() lines which as far as I can tell serve no purpose given that sanitize_file_name() already removes those characters (see r8192).

@kawauso
13 years ago

% and + to sanitize_file_name(), urldecode() sideloaded filenames

@kawauso
13 years ago

Quick sideload test script

#16 in reply to: ↑ 15 @dancriel
13 years ago

kawauso's patch worked for me in WP 3.3.1, thanks

#17 @khag7
13 years ago

I have been using kawauso's patch on WP 3.3.1 with no problems. Thank you!

#18 @kawauso
13 years ago

  • Milestone changed from Awaiting Review to 3.4

#19 @nacin
12 years ago

  • Keywords 3.5-early needs-unit-tests added
  • Milestone changed from 3.4 to Future Release

#20 @SergeyBiryukov
12 years ago

#21116 closed as a duplicate.

#21 @SergeyBiryukov
12 years ago

Closed #21244 as a duplicate.

#22 @krembo99
12 years ago

  • Cc krembo99 added

Closed #16226 as duplicate

The problem is not only in media_sideload_image() but in every upload. it begins and ends with the sanitize_file_name() , which is not as extensive as it should be .
It will also create another problem with files that have dots ( "." ) inside the filename (example : file.name.3.4.dif.old.jpg)
In that case , and depending on other characters in the name , it will trim the extension.

The following code ,even if little elegant , resolves 90% of the problems on my tests.

However , I would recommend maybe think of a a different strategy , maybe with php filters like in the code example (commented)

add_filter('sanitize_file_name', 'k99_sanitize_file_name', 1);

function k99_sanitize_file_name($filename){

			$filename = preg_replace( '/[^a-z0-9_.\-]/','-',$filename);
			
			//I used Preg_replace . but maybe the whole function should go to filters ... example :
			
			// $filename = preg_replace('/[^a-zA-Z0-9._-]/','',$filename);
			
			// $filename = filter_var($filename, FILTER_SANITIZE_EMAIL);// very powerful for this purpose.
		
			// $filename= sanitize_title_with_dashes($filename);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
			
			
			// important in some countries ! If anyone knows any others - please add.
			
			$filename = strtr($filename, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyyAAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
               
                        // important in some other countries ! If anyone knows any others - please add.
			$filename = strtr($filename, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

			return $filename;
}

 
Last edited 12 years ago by krembo99 (previous) (diff)

@SergeyBiryukov
12 years ago

Refreshed

#23 @SergeyBiryukov
12 years ago

Related: #21241

Refreshed 16330.2.diff after [21219].

ticket:16226:16226.2.diff might be worth testing as well.

Last edited 12 years ago by SergeyBiryukov (previous) (diff)

#24 @bananastalktome
12 years ago

  • Cc bananastalktome@… added

#25 @ryelle
12 years ago

  • Cc ryelle added

#26 @davouch
12 years ago

  • Cc davouch added

#27 @SergeyBiryukov
12 years ago

#23777 was marked as a duplicate.

#28 @alexvorn2
12 years ago

  • Cc alexvornoffice@… added

#29 @Otto42
11 years ago

  • Cc Otto42 added

#30 @Justin.Chris.Cook
11 years ago

It's been almost two years since any activity on this ticket and the bug still exists. If I upload a file with a plus sign in the file name, it fails. Any planned progress on this?

#31 @MadtownLems
11 years ago

I arrived here after experiencing the same issue with + signs as well. I can upload the file, WordPress creates images of sizes as expected, but none of them are actually visible on the site (though pulling them down lets me see them on my machine).

Note that the problem, for me, only exists on subsites in my MultiSite network. That is, if site id 1 uploads photo+1.jpg, and it goes to /wp-content/uploads/ and is accessed directly, the image is served with no issue.

If I upload the same file to any other site, and thus it ends up in /wp-content/blogs.dir/ and is served via ms-files.php, the image is not served, and rather, I'm given a 404.

I'd happily help test a patch, but am weary of spending time on a patch that's 2 years old now.

Last edited 11 years ago by MadtownLems (previous) (diff)

@mattheu
11 years ago

Refreshed to merge cleanly

@mattheu
11 years ago

Update SanitizeFileName test

#33 @mattheu
11 years ago

I have run into this bug a couple of times trying to write a custom importers for old content full of file urls with spaces in them.

  • In order to sideload an image that has a space in the filename, you need to correctly encode the spaces first.
  • WordPress will then create a file with a filename that contains %20;
  • Problem is... when you try and visit this, the browser interprets this as a space. And image is broken.

I tested SergeyBiryuko's patch (16330.3.diff) This works well and solves the problem. I updated the patch to merge cleanly with current trunk, and updated sanitizeFileName unit test.

Version 0, edited 11 years ago by mattheu (next)

This ticket was mentioned in IRC in #wordpress-dev by Matth_eu. View the logs.


11 years ago

This ticket was mentioned in IRC in #wordpress-dev by doughamlin. View the logs.


10 years ago

#36 @pjohanneson
10 years ago

I've run into this trying to copy a site from a test machine to a production machine. Any media files with + the filename do not appear on the production machine. They work fine on the test machine. I suspect it's a difference in my Apache server's security levels between test and production.

#37 @mattheu
10 years ago

chatted to johnbillion about this - decided it might be better to replace %20 and + with dashes along with other whitespace.

I have updated the regex that handles replacing spaces and dashes to do this. Also updated unit tests.

Note that the removal of the // Strip the query strings. code from file.php

Looked through the codebase for possible backwards compatablity issues. This is used by XML-RPC when uploading files to check whether the file exists already. This would break this however in that case the image itself would be broken anyway.

@mattheu
10 years ago

#38 @mattheu
10 years ago

I have updated the regex that handles replacing spaces and dashes to do this. Also updated unit tests.

Actually - used str_replace to handle this before the regex to handle dashes/whitespace

#39 follow-up: @johnbillion
10 years ago

  • Keywords commit added; has-patch needs-testing 3.5-early removed
  • Milestone changed from Future Release to 4.0

Replacing + and %20 with a dash is the best approach. As mattheu mentioned, the backwards compat issue with the overwrite flag in wp.uploadFile is a moot issue because the image will be broken anyway.

#40 @ericlewis
10 years ago

Probably related: #18412

#41 @johnbillion
10 years ago

#18412 was marked as a duplicate.

#42 in reply to: ↑ 39 @ericmann
10 years ago

Replying to johnbillion:

Keywords commit added;

So what exactly is holding this ticket up at this point?

This ticket was mentioned in IRC in #wordpress-dev by ericmann. View the logs.


10 years ago

This ticket was mentioned in IRC in #wordpress-dev by stephdau. View the logs.


10 years ago

#45 @ericmann
10 years ago

  • Keywords commit removed

Removing "commit" keyword since better tests are still needed.

@ericmann
10 years ago

Refresh patch, add multiple unit test cases.

#46 @ericmann
10 years ago

  • Keywords commit has-patch has-unit-tests added; needs-unit-tests removed

Latest patch builds on previous ones - replacing both %20 and + with -. It adds multiple test conditions for spaces, though, testing unencoded spaces, encoded spaces, and a mix of all three possible versions (%20, +, and ).

#47 @SergeyBiryukov
10 years ago

The replacements in _wp_handle_upload() (in the first part of the patch) are redundant since [11178], so removing them makes sense.

16330.6.diff looks good to me.

#48 @wonderboymusic
10 years ago

  • Owner set to wonderboymusic
  • Resolution set to fixed
  • Status changed from reopened to closed

In 29290:

In sanitize_file_name(), replace %20 and + with dashes. Remove unnecessary code from _wp_handle_upload().

Adds unit tests.

Props ericmann.
Fixes #16330.

Note: See TracTickets for help on using tickets.