WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 3 months ago

#16330 reopened defect (bug)

media_sideload_image() broken with filenames containing strange characters (e.g., +, %)

Reported by: Coolkevman Owned by:
Priority: normal Milestone: Future Release
Component: Media Version: 3.1
Severity: major Keywords: has-patch needs-testing 3.5-early needs-unit-tests
Cc: Azuaron, d@…, krembo99, bananastalktome@…, ryelle, davouch, alexvornoffice@…

Description

I'm using the media_sideload_image() method in the upcoming version of my e107 Importer script (see: http://github.com/kdeldycke/e107-importer/blob/b7925fdac6aa43db4be5b7925265a83d95fc62ad/e107-importer.php#L277 ) to upload remote images into WordPress.

This method work as expected with lots of images from a lot of different sources, but fail on URLs containing spaces.

Let me illustrate this bug with an example.

When trying to upload the image located at

http://home.nordnet.fr/francois.jankowski/pochette avant thumb.jpg

the result looks like this on the file system: http://twitpic.com/3s0dk7 . As you can see, image file names are clean. But in the Media Manager, here is what you have: http://twitpic.com/3s0e5d . No thumbnails seems to have been created.

Now, trying to fix this, I modified the original URL before calling media_sideload_image() with the following code:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

With this patch, here is the result on the filesystem: http://twitpic.com/3s0ets . I was surprised by WordPress not sanitizing URLs. Is that normal ?

But the most surprising stuff is in the Media Manager: http://twitpic.com/3s0hup . It looks like thanks to this hack, WordPress somehow succeeded downloading the remote file but messed with filesystem naming. What let me think this ? The Media Manager, get the right image thumbnail dimensions but not the binary payload of the thumbnail (contrary to the case above were no binary nor dimensions are available about the thumbnail).

All of this was tested in WordPress 3.1-RC2.

As for the idea of the patch above, it come from a very old version of my plugin (v0.9) that was based on WordPress 2.3.2. There, I somehow found the root cause of the problem, according the comment I wrote 3 years ago:

 // The fopen() function in wp_remote_fopen() don't like URLs with space chars not translated to html entities

I should have posted this bug report sooner, as now I've forgotten everything about this issue... :(

Attachments (4)

16330.diff (578 bytes) - added by kawauso 2 years ago.
urldecode() the end filename
16330.2.diff (2.0 KB) - added by kawauso 20 months ago.
% and + to sanitize_file_name(), urldecode() sideloaded filenames
sideload-test.php (1.6 KB) - added by kawauso 20 months ago.
Quick sideload test script
16330.3.diff (1.9 KB) - added by SergeyBiryukov 11 months ago.
Refreshed

Download all attachments as: .zip

Change History (32)

comment:1 nacin2 years ago

Spaces aren't valid in a URL, so that part makes sense. If we discard that aspect of the bug report, what's left? The conflict in the media library?

comment:2 Coolkevman2 years ago

For information, and after looking at WordPress source code:

  1. media_sideload_image() call download_url(),
  2. download_url() call wp_remote_get(),
  3. wp_remote_get() call (here I'm lost).

comment:3 Coolkevman2 years ago

It looks like spaces in URLs are allowed, but considered unsafe according RFC 1738 (source: http://stackoverflow.com/questions/497908/are-urls-allowed-to-have-a-space-in-them/497972#497972 ), and should be percent-encoded.

Anyway, if the WordPress project consider URLs containing spaces as invalids, what's left of this bug report seems to be: does URLs containing %20 are supported by WordPress ?. In which case it doesn't seems to work (which may be a problem on the remote server side and not on WordPress's).

comment:4 sivel2 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to invalid
  • Status changed from new to closed

WordPress will support encoding the space as %20. Per RFC 1738, spaces are deemed unsafe and as such "All unsafe characters must always be encoded within a URL". Consider using urlencode or similar to encode unsafe characters in parts of the url.

comment:5 follow-up: hakre2 years ago

Thank you for taking the time to report the issue.

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

comment:6 hakre2 years ago

  • Keywords reporter-feedback added

comment:7 in reply to: ↑ 5 ; follow-up: Coolkevman2 years ago

  • Keywords reporter-feedback removed
  • Resolution invalid deleted
  • Status changed from closed to reopened

Replying to hakre:

As sivel wrote, URLs containing spaces are considered invalid. That's by the standards, so not a wordpress issue.

Thanks for this clarification. From now on I'll assume URLs with spaces are invalid and should be expressed with percent encoding.

However as you wrote you might still have a related problem.

To further test, please replace the spaces inside the URL with %20 (that's the proper way how to write a space inside a URL) and try to fetch that file.

Yes, that's exactly what I've done as a second test in my original bug report. That's the part when I use these PHP statements:

  $img_url = str_replace(' ', '%20', html_entity_decode($img_url));
  $new_tag = media_sideload_image($img_url, $post_id);

I just retried this test with the latest WordPress 3.1 (the final release, not the RCs) and it still have the same behaviour (look at my screenshots: http://twitpic.com/3s0ets and http://twitpic.com/3s0hup ).

If this does not work, you need to find out which error it is returning and you need to find out which transport you're using. To reduce noise, you can install a plugin like 11499-curlforce.php that will enforce the use of curl for debugging purposes, so the curl based transport will be always used.

I just tried to use this plugin. It complained of missing Curl. So I just installed the curl and php5-curl package (I'm using kubuntu 10.10) and restarted my Apache to have it work.

Then I've redone my tests (the one replacing spaces by %20) and I still have the exact same error.

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

comment:8 in reply to: ↑ 7 Coolkevman2 years ago

Replying to Coolkevman:

Now I think the problem lies somewhere in remote downloading code of WordPress, when it first download the remote file to a temporary location before moving it to the uploads folder. That's just a guess, I still have to find hard evidences in the code...

Here is why I say this. Look at this empty pochette%20avant%20thumb.tmp temporary file lying in wp-content folder:

kevin@kev-laptop$ ls -lah ./wp-content
total 24K
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:54 ./
drwxrwxrwx 7 www-data www-data 4.0K 2011-02-28 14:38 ../
-rwxr-xr-x 1 www-data www-data   30 2007-05-04 23:48 index.php*
drwxr-xr-x 5 www-data www-data 4.0K 2011-02-28 14:33 plugins/
-rw-r--r-- 1 www-data www-data    0 2011-02-28 14:46 pochette%20avant%20thumb.tmp
drwxr-xr-x 3 www-data www-data 4.0K 2011-02-23 15:53 themes/
drwxrwxrwx 3 www-data www-data 4.0K 2011-01-23 23:35 uploads/
$ 

comment:9 nacin2 years ago

  • Keywords close added
  • Milestone set to Awaiting Review

comment:10 kawauso2 years ago

  • Keywords has-patch added; close removed

I just ran into this and traced the issue back to the base filename passed by media_sideload_image() to media_handle_sideload(), which is still URL encoded. As such, you get the final image filename with URL encoding (%20) in it which is then double-encoded by the attachment functions, breaking things rather badly.

To reproduce, try loading a image URL with a correctly encoded space through Press This.

Following patch may not address the fundamental issue, but it appears to work at least.

kawauso2 years ago

urldecode() the end filename

comment:11 Azuaron2 years ago

  • Cc Azuaron added

The same error occurs when other special characters are present in the URL (I've been sideloading images from a site which has the unfortunate habit of putting a + in the URLs).

comment:12 danielbachhuber23 months ago

  • Cc d@… added

comment:13 Azuaron20 months ago

  • Component changed from HTTP to Media
  • Keywords needs-patch added; has-patch removed
  • Severity changed from normal to major
  • Summary changed from media_sideload_image() broken with URLs containing spaces to media_sideload_image() broken with filenames containing strange characters (e.g., +, %)
  • Version changed from 3.1 to 3.2.1

From what I can tell, WP has no problem sideloading images with strange (% +) characters, but has a problem serving an image with a strange character that has been sideloaded. I believe the easiest fix would be to add a couple additional lines at line 467 of wp-admin/includes/file.php

465 $filename = str_replace('?','-', $filename);
466 $filename = str_replace('&','-', $filename);
467 $filename = str_replace('%','-', $filename); //New line
468 $filename = str_replace('+','-', $filename); //New line

A more complete solution would be to use preg_replace to remove all non-approved characters from the filename instead of using str_replace:

465 //Remove line: $filename = str_replace('?','-', $filename);
466 //Remove line: $filename = str_replace('&','-', $filename);
467 $filename = preg_replace("/[^a-zA-Z0-9_\-.]/", '-', $filename); //New line

This should ensure the file, once saved, has a filename with only compatible characters.

Last edited 20 months ago by Azuaron (previous) (diff)

comment:14 SergeyBiryukov20 months ago

  • Version changed from 3.2.1 to 3.1

Version number is used to track when the bug was initially introduced/reported.

comment:15 follow-up: kawauso20 months ago

  • Keywords has-patch needs-testing added; needs-patch removed

Sanitization is already partly addressed by sanitize_file_name(), run as part of wp_unique_filename() on line 462. It looks like adding % and + to that covers all the test files I've tried, though it will have a wider impact as well.

Attaching patch that adds those characters, decodes sideloaded filenames (else you end up with sanitized encoded characters) and removes the two str_replace() lines which as far as I can tell serve no purpose given that sanitize_file_name() already removes those characters (see r8192).

kawauso20 months ago

% and + to sanitize_file_name(), urldecode() sideloaded filenames

kawauso20 months ago

Quick sideload test script

comment:16 in reply to: ↑ 15 dancriel17 months ago

kawauso's patch worked for me in WP 3.3.1, thanks

comment:17 khag716 months ago

I have been using kawauso's patch on WP 3.3.1 with no problems. Thank you!

comment:18 kawauso16 months ago

  • Milestone changed from Awaiting Review to 3.4

comment:19 nacin14 months ago

  • Keywords 3.5-early needs-unit-tests added
  • Milestone changed from 3.4 to Future Release

comment:20 SergeyBiryukov12 months ago

#21116 closed as a duplicate.

comment:21 SergeyBiryukov11 months ago

Closed #21244 as a duplicate.

comment:22 krembo9911 months ago

  • Cc krembo99 added

Closed #16226 as duplicate

The problem is not only in media_sideload_image() but in every upload. it begins and ends with the sanitize_file_name() , which is not as extensive as it should be .
It will also create another problem with files that have dots ( "." ) inside the filename (example : file.name.3.4.dif.old.jpg)
In that case , and depending on other characters in the name , it will trim the extension.

The following code ,even if little elegant , resolves 90% of the problems on my tests.

However , I would recommend maybe think of a a different strategy , maybe with php filters like in the code example (commented)

add_filter('sanitize_file_name', 'k99_sanitize_file_name', 1);

function k99_sanitize_file_name($filename){

			$filename = preg_replace( '/[^a-z0-9_.\-]/','-',$filename);
			
			//I used Preg_replace . but maybe the whole function should go to filters ... example :
			
			// $filename = preg_replace('/[^a-zA-Z0-9._-]/','',$filename);
			
			// $filename = filter_var($filename, FILTER_SANITIZE_EMAIL);// very powerful for this purpose.
		
			// $filename= sanitize_title_with_dashes($filename);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
		
			//$filename  = filter_var($filename , FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_LOW);
			
			
			// important in some countries ! If anyone knows any others - please add.
			
			$filename = strtr($filename, 'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåçèéêëìíîïñòóôõöøùúûüýÿÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿĀāĂ㥹ĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħĨĩĪīĬĭĮįİıIJijĴĵĶķĹĺĻļĽľĿŀŁłŃńŅņŇňʼnŌōŎŏŐőŒœŔŕŖŗŘřŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƒƠơƯưǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǺǻǼǽǾǿ', 'SZszYAAAAAACEEEEIIIINOOOOOOUUUUYaaaaaaceeeeiiiinoooooouuuuyyAAAAAAAECEEEEIIIIDNOOOOOOUUUUYsaaaaaaaeceeeeiiiinoooooouuuuyyAaAaAaCcCcCcCcDdDdEeEeEeEeEeGgGgGgGgHhHhIiIiIiIiIiIJijJjKkLlLlLlLlllNnNnNnnOoOoOoOEoeRrRrRrSsSsSsSsTtTtTtUuUuUuUuUuUuWwYyYZzZzZzsfOoUuAaIiOoUuUuUuUuUuAaAEaeOo');
               
                        // important in some other countries ! If anyone knows any others - please add.
			$filename = strtr($filename, array('Þ' => 'TH', 'þ' => 'th', 'Ð' => 'DH', 'ð' => 'dh', 'ß' => 'ss', 'Œ' => 'OE', 'œ' => 'oe', 'Æ' => 'AE', 'æ' => 'ae', 'µ' => 'u'));

			return $filename;
}

 
Last edited 11 months ago by krembo99 (previous) (diff)

SergeyBiryukov11 months ago

Refreshed

comment:23 SergeyBiryukov11 months ago

Related: #21241

Refreshed 16330.2.diff after [21219].

ticket:16226:16226.2.diff might be worth testing as well.

Last edited 11 months ago by SergeyBiryukov (previous) (diff)

comment:24 bananastalktome9 months ago

  • Cc bananastalktome@… added

comment:25 ryelle7 months ago

  • Cc ryelle added

comment:26 davouch5 months ago

  • Cc davouch added

comment:27 SergeyBiryukov3 months ago

#23777 was marked as a duplicate.

comment:28 alexvorn23 months ago

  • Cc alexvornoffice@… added
Note: See TracTickets for help on using tickets.