WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 9 days ago

#16330 reopened defect (bug)

media_sideload_image() broken with filenames containing strange characters (e.g., +, %)

Reported by: Coolkevman Owned by:
Milestone: Future Release Priority: normal
Severity: major Version: 3.1
Component: Media Keywords: has-patch needs-testing 3.5-early needs-unit-tests
Focuses: Cc:

Description

I'm using the media_sideload_image() method in the upcoming version of my e107 Importer script (see: http://github.com/kdeldycke/e107-importer/blob/b7925fdac6aa43db4be5b7925265a83d95fc62ad/e107-importer.php#L277 ) to upload remote images into WordPress.

This method work as expected with lots of images from a lot of different sources, but fail on URLs containing spaces.

Let me illustrate this bug with an example.

When trying to upload the image located at

http://home.nordnet.fr/francois.jankowski/pochette avant thumb.jpg

the result looks like this on the file system: http://twitpic.com/3s0dk7 . As you can see, image file names are clean. But in the Media Manager, here is what you have: http://twitpic.com/3s0e5d . No thumbnails seems to have been created.

Now, trying to fix this, I modified the original URL before calling media_sideload_image() with the following code:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

With this patch, here is the result on the filesystem: http://twitpic.com/3s0ets . I was surprised by WordPress not sanitizing URLs. Is that normal ?

But the most surprising stuff is in the Media Manager: http://twitpic.com/3s0hup . It looks like thanks to this hack, WordPress somehow succeeded downloading the remote file but messed with filesystem naming. What let me think this ? The Media Manager, get the right image thumbnail dimensions but not the binary payload of the thumbnail (contrary to the case above were no binary nor dimensions are available about the thumbnail).

All of this was tested in WordPress 3.1-RC2.

As for the idea of the patch above, it come from a very old version of my plugin (v0.9) that was based on WordPress 2.3.2. There, I somehow found the root cause of the problem, according the comment I wrote 3 years ago:

 // The fopen() function in wp_remote_fopen() don't like URLs with space chars not translated to html entities

I should have posted this bug report sooner, as now I've forgotten everything about this issue... :(

Attachments (6)

16330.diff (578 bytes) - added by kawauso 3 years ago.
urldecode() the end filename
16330.2.diff (2.0 KB) - added by kawauso 2 years ago.
% and + to sanitize_file_name(), urldecode() sideloaded filenames
sideload-test.php (1.6 KB) - added by kawauso 2 years ago.
Quick sideload test script
16330.3.diff (1.9 KB) - added by SergeyBiryukov 22 months ago.
Refreshed
16330.4.diff (1.9 KB) - added by mattheu 10 days ago.
Refreshed to merge cleanly
16330.tests.php (805 bytes) - added by mattheu 10 days ago.
Update SanitizeFileName test

Download all attachments as: .zip

Change History (40)

comment:1 nacin3 years ago

Spaces aren't valid in a URL, so that part makes sense. If we discard that aspect of the bug report, what's left? The conflict in the media library?

comment:2 Coolkevman3 years ago

For information, and after looking at WordPress source code:

  1. media_sideload_image() call download_url(),
  2. download_url() call wp_remote_get(),
  3. wp_remote_get() call (here I'm lost).

comment:3 Coolkevman3 years ago

It looks like spaces in URLs are allowed, but considered unsafe according RFC 1738 (source: http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them/497972#497972 ), and should be percent-encoded.

Anyway, if the WordPress project consider URLs containing spaces as invalids, what's left of this bug report seems to be: does URLs containing %20 are supported by WordPress ?. In which case it doesn't seems to work (which may be a problem on the remote server side and not on WordPress's).

comment:4 sivel3 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to invalid
  • Status changed from new to closed

WordPress will support encoding the space as %20. Per RFC 1738, spaces are deemed unsafe and as such "All unsafe characters must always be encoded within a URL". Consider using urlencode or similar to encode unsafe characters in parts of the url.

comment:5 follow-up: hakre3 years ago

Thank you for taking the time to report the issue.

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

comment:6 hakre3 years ago

  • Keywords reporter-feedback added

comment:7 in reply to: ↑ 5 ; follow-up: Coolkevman3 years ago

  • Keywords reporter-feedback removed
  • Resolution invalid deleted
  • Status changed from closed to reopened

Replying to hakre:

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

Thanks for this clarification. From now on I'll assume URLs with spaces are invalid and should be expressed with percent encoding.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

Yes, that's exactly what I've done as a second test in my original bug report. That's the part when I use these PHP statements:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

I just retried this test with the latest WordPress 3.1 (the final release, not the RCs) and it still have the same behaviour (look at my screenshots: http://twitpic.com/3s0ets and http://twitpic.com/3s0hup ).

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

I just tried to use this plugin. It complained of missing Curl. So I just installed the curl and php5-curl package (I'm using kubuntu 10.10) and restarted my Apache to have it work.

Then I've redone my tests (the one replacing spaces by %20) and I still have the exact same error.

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

comment:8 in reply to: ↑ 7 Coolkevman3 years ago

Replying to Coolkevman:

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

Here is why I say this. Look at this empty pochette%20avant%20thumb.tmp temporary file lying in wp-content folder:

kevin@kev-laptop$ ls -lah ./wp-content
total 24K
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:54 ./
drwxrwxrwx 7 www-data www-data 4.0K 2011-02-28 14:38 ../
-rwxr-xr-x 1 www-data www-data   30 2007-05-04 23:48 index.php*
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:33 plugins/
-rw-r--r-- 1 www-data www-data    0 2011-02-28 14:46 pochette%20avant%20thumb.tmp
drwxr-xr-x 3 www-data www-data 4.0K 2011-02-23 15:53 themes/
drwxrwxrwx 3 www-data www-data 4.0K 2011-01-23 23:35 uploads/
$ 

comment:9 nacin3 years ago

  • Keywords close added
  • Milestone set to Awaiting Review

comment:10 kawauso3 years ago

  • Keywords has-patch added; close removed

I just ran into this and traced the issue back to the base filename passed by media_sideload_image() to media_handle_sideload(), which is still URL encoded. As such, you get the final image filename with URL encoding (%20) in it which is then double-encoded by the attachment functions, breaking things rather badly.

To reproduce, try loading a image URL with a correctly encoded space through Press This.

Following patch may not address the fundamental issue, but it appears to work at least.

kawauso3 years ago

urldecode() the end filename

comment:11 Azuaron3 years ago

  • Cc Azuaron added

The same error occurs when other special characters are present in the URL (I've been sideloading images from a site which has the unfortunate habit of putting a + in the URLs).

comment:12 danielbachhuber3 years ago

  • Cc d@… added

comment:13 Azuaron2 years ago

  • Component changed from HTTP to Media
  • Keywords needs-patch added; has-patch removed
  • Severity changed from normal to major
  • Summary changed from media_sideload_image() broken with URLs containing spaces to media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
  • Version changed from 3.1 to 3.2.1

From what I can tell, WP has no problem sideloading images with strange (% +) characters, but has a problem serving an image with a strange character that has been sideloaded. I believe the easiest fix would be to add a couple additional lines at line 467 of wp-admin/includes/file.php

465 $filename = str_replace('?','-', $filename);
466 $filename = str_replace('&','-', $filename);
467 $filename = str_replace('%','-', $filename); //New line
468 $filename = str_replace('+','-', $filename); //New line

A more complete solution would be to use preg_replace to remove all non-approved characters from the filename instead of using str_replace:

465 //Remove line: $filename = str_replace('?','-', $filename);
466 //Remove line: $filename = str_replace('&','-', $filename);
467 $filename = preg_replace("/[^a-zA-Z0-9_-\.]/", '-', $filename); //New line

This should ensure the file, once saved, has a filename with only compatible characters.

Version 2, edited 2 years ago by Azuaron (previous) (next) (diff)

comment:14 SergeyBiryukov2 years ago

  • Version changed from 3.2.1 to 3.1

Version number is used to track when the bug was initially introduced/reported.

comment:15 follow-up: kawauso2 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Sanitization is already partly addressed by sanitize_file_name(), run as part of wp_unique_filename() on line 462. It looks like adding % and + to that covers all the test files I've tried, though it will have a wider impact as well.

Attaching patch that adds those characters, decodes sideloaded filenames (else you end up with sanitized encoded characters) and removes the two str_replace() lines which as far as I can tell serve no purpose given that sanitize_file_name() already removes those characters (see r8192).

kawauso2 years ago

% and + to sanitize_file_name(), urldecode() sideloaded filenames

kawauso2 years ago

Quick sideload test script

comment:16 in reply to: ↑ 15 dancriel2 years ago

kawauso's patch worked for me in WP 3.3.1, thanks

comment:17 khag72 years ago

I have been using kawauso's patch on WP 3.3.1 with no problems. Thank you!

comment:18 kawauso2 years ago

  • Milestone changed from Awaiting Review to 3.4

comment:19 nacin2 years ago

  • Keywords 3.5-early needs-unit-tests added
  • Milestone changed from 3.4 to Future Release

comment:20 SergeyBiryukov22 months ago

#21116 closed as a duplicate.

comment:21 SergeyBiryukov22 months ago

Closed #21244 as a duplicate.

comment:22 krembo9922 months ago

  • Cc krembo99 added

Closed #16226 as duplicate

The problem is not only in media_sideload_image() but in every upload. it begins and ends with the sanitize_file_name() , which is not as extensive as it should be .
It will also create another problem with files that have dots ( "." ) inside the filename (example : file.name.3.4.dif.old.jpg)
In that case , and depending on other characters in the name , it will trim the extension.

The following code ,even if little elegant , resolves 90% of the problems on my tests.

However , I would recommend maybe think of a a different strategy , maybe with php filters like in the code example (commented)

add_filter('sanitize_file_name', 'k99_sanitize_file_name', 1);

function k99_sanitize_file_name($filename){

			$filename = preg_replace( '/[^a-z0-9_.\-]/','-',$filename);
			
			//I used Preg_replace . but maybe the whole function should go to filters ... example :
			
			// $filename = preg_replace('/[^a-zA-Z0-9._-]/','',$filename);
			
			// $filename = filter_var($filename, FILTER_SANITIZE_EMAIL);// very powerful for this purpose.
		
			// $filename= sanitize_title_with_dashes($filename);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
			
			
			// important in some countries ! If anyone knows any others - please add.
			
			$filename = strtr($filename, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyyAAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
               
                        // important in some other countries ! If anyone knows any others - please add.
			$filename = strtr($filename, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

			return $filename;
}

 
Last edited 22 months ago by krembo99 (previous) (diff)

SergeyBiryukov22 months ago

Refreshed

comment:23 SergeyBiryukov22 months ago

Related: #21241

Refreshed 16330.2.diff after [21219].

ticket:16226:16226.2.diff might be worth testing as well.

Last edited 22 months ago by SergeyBiryukov (previous) (diff)

comment:24 bananastalktome20 months ago

  • Cc bananastalktome@… added

comment:25 ryelle18 months ago

  • Cc ryelle added

comment:26 davouch15 months ago

  • Cc davouch added

comment:27 SergeyBiryukov13 months ago

#23777 was marked as a duplicate.

comment:28 alexvorn213 months ago

  • Cc alexvornoffice@… added

comment:29 Otto4210 months ago

  • Cc Otto42 added

comment:30 Justin.Chris.Cook4 weeks ago

It's been almost two years since any activity on this ticket and the bug still exists. If I upload a file with a plus sign in the file name, it fails. Any planned progress on this?

comment:31 MadtownLems4 weeks ago

I arrived here after experiencing the same issue with + signs as well. I can upload the file, WordPress creates images of sizes as expected, but none of them are actually visible on the site (though pulling them down lets me see them on my machine).

Note that the problem, for me, only exists on subsites in my MultiSite network. That is, if site id 1 uploads photo+1.jpg, and it goes to /wp-content/uploads/ and is accessed directly, the image is served with no issue.

If I upload the same file to any other site, and thus it ends up in /wp-content/blogs.dir/ and is served via ms-files.php, the image is not served, and rather, I'm given a 404.

I'd happily help test a patch, but am weary of spending time on a patch that's 2 years old now.

Last edited 4 weeks ago by MadtownLems (previous) (diff)

mattheu10 days ago

Refreshed to merge cleanly

mattheu10 days ago

Update SanitizeFileName test

comment:33 mattheu10 days ago

I have run into this bug a couple of times trying to write a custom importers for old content full of file urls with spaces in them.

  • In order to sideload an image that has a space in the filename, you need to correctly encode the spaces first.
  • WordPress will then create a file with a filename that contains %20;
  • Problem is... when you try and visit this, the browser interprets this as a space. And image is broken.

I tested SergeyBiryuko's patch (16330.3.diff) This works well and solves the problem. I updated the patch to merge cleanly with current trunk, and updated sanitizeFileName unit test.

comment:34 ircbot9 days ago

This ticket was mentioned in IRC in #wordpress-dev by Matth_eu. View the logs.

Note: See TracTickets for help on using tickets.