Make WordPress Core

Opened 16 years ago

Closed 16 years ago

Last modified 15 years ago

#9416 closed enhancement (fixed)

Better file name sanitization for wp_unique_filename

Reported by: sivel's profile sivel Owned by: sivel's profile sivel
Milestone: 2.8 Priority: normal
Severity: normal Version: 2.7.1
Component: Upload Keywords: has-patch early tested commit
Focuses: Cc:

Description

sanitize_title_with_dashes seems to cause problems for some people who want the filename of their files to not be modified when uploaded.

Lets give those users a filter that can be removed from wp_unique_filename.

Attachments (3)

9416.1.diff (2.7 KB) - added by sivel 16 years ago.
9416.2.diff (3.3 KB) - added by sivel 16 years ago.
9416.3.diff (3.3 KB) - added by sivel 16 years ago.

Download all attachments as: .zip

Change History (40)

@sivel
16 years ago

#1 @sivel
16 years ago

  • Cc matt@… added
  • Keywords has-patch added

#2 @hakre
16 years ago

I like the Idea but I tend to create an action that has all that in by default so that it can be replaced in full because I ran over other Errors there as well, for example, UTF-8 encoding was not supported in the process to sanitize the filename.

double byte chars were reduced to a single byte "a" that was absolutely not fitting and it looks like this all not done very properly.

I vote for getting the whole "change the filename that was submitted to get the filename on disk" re-worked sothat it

a) works properly (handles character encoding well)
b) is extend-, replace- or reduceable

a) can be done by checking the used functions
b) can be done by creating an action/filter

both fit perfectly together.

#3 @sivel
16 years ago

  • Keywords tested added

#4 follow-up: @ryan
16 years ago

Talked to sivel in IRC about adding santiize_filename() that just removes some special chars, trims trailing and leading dots, and replaces spaces with underscores. This is similar to what Drupal does.

#5 @sivel
16 years ago

  • Keywords needs-patch added; has-patch tested removed
  • Owner set to sivel
  • Summary changed from Allow removal of sanitize_title_with_dashes from wp_unique_filename to Better file name sanitization for wp_unique_filename

I'm working an updated patch in the coming days. Assigning the ticket to myself.

#6 @sivel
16 years ago

  • Status changed from new to assigned

#8 in reply to: ↑ 4 @Denis-de-Bernardy
16 years ago

Replying to ryan:

Talked to sivel in IRC about adding santiize_filename() that just removes some special chars, trims trailing and leading dots, and replaces spaces with underscores.

I can understand leading dots, but why the rest? why does this need to be "sanitized" in the first place? Does'nt it get url encoded before landing into the url? I mean, as long as no two files have the same name, we're fine no?

#9 @ryan
16 years ago

The drupal sanitizer looks like it is removing shell metachars only.

#10 @sivel
16 years ago

There are numerous characters that are not allowed in file names. And in addition there are characters that are allowed in windows but not in *nix and vice versa. In addition if you use a special character that would need escaping there are certain instances where the escape character '\' would be added to the filename. I could go on, but my opinion is this is needed.

#11 @Denis-de-Bernardy
16 years ago

@Sivle: Granted. But there isn't any need need to sanitize beyond that.

For what it's worth, I've a (heavily used) plugin that deals with file names to name podcasts, and the only sanitization I did was to remove (forward- and back-) slash characters. The rest (including forbidden characters) I found were irrelevant in practice, since they get escaped by file_exists() et al anyway. So basically, really isn't any issue I can think of that should disallow the likes of:

$file_name = 'foo%&#$?\|\'"~!@*bar.mp3';

as long as it gets urlencoded, it not only works fine in flv players (which are messy in their own right), but also works fine in the browser without the slightest (baring php bugs of course) security problem.

#12 follow-up: @sivel
16 years ago

What happens if someone downloads a file that was named on Windows to a Mac or Nix box and cannot figure out how to delete the file because it had a strange character. Or cannot download and save the file because the file has characters that are illegal for other file systems?

Or the filename has a * in it and when the users deletes it from the file system they take out more files than they intended?

Another case is if the file has all common delimiters for preg_* functions and it becomes increasingly difficult for a plugin to do something.

Another case is we allow backticks and we find out there is a vulnerability that allows users to execute code.

Just a few things that I thought about when weighing the needs to sanitize in the first place.

#13 in reply to: ↑ 12 @Denis-de-Bernardy
16 years ago

Replying to sivel:

What happens if someone downloads a file that was named on Windows to a Mac or Nix box and cannot figure out how to delete the file because it had a strange character. Or cannot download and save the file because the file has characters that are illegal for other file systems?

that is why I remove the slashes

Or the filename has a * in it and when the users deletes it from the file system they take out more files than they intended?

these are escaped by file handling functions

Another case is if the file has all common delimiters for preg_* functions and it becomes increasingly difficult for a plugin to do something.

that's what preg_quote() is for

Another case is we allow backticks and we find out there is a vulnerability that allows users to execute code.

it gets escaped by file handling functions too

Just a few things that I thought about when weighing the needs to sanitize in the first place.

you're right on paper, but we're being a bit too overzealous here.

#14 follow-up: @sivel
16 years ago

I think most of what I am referring to is command line level. When the files are named and stored in the file system they won't be escaped will they?

#15 in reply to: ↑ 14 @Denis-de-Bernardy
16 years ago

Replying to sivel:

I think most of what I am referring to is command line level. When the files are named and stored in the file system they won't be escaped will they?

they definitely will.

#16 @Denis-de-Bernardy
16 years ago

or rather, they'll list, when you run, say, ls, as:

foo bar

but to use them, you'd need to go:

foo\ bar

D.

#17 @DD32
16 years ago

While Files will be created properly no matter what the characters (As long as they're valid on the system), The filename could potentially contain items which need to be escaped to be manipulated.

Its a pain to have to work out the escape sequence on *unix sometimes for files.. (i've reverted to rm foo*.ext before)

Also, many filesystems treat files differently as mentioned.. The current sanitization is pretty good from what i've seen, However, taking a cautious road could prevent issues in the future.. even if only just for 1 person...

Maybe a list of border-line cases which can occur with the current code would be useful? To explain the need for the changes to be made in the first place. (ie. outline the cases where the current sanitization fails)

#18 @sivel
16 years ago

As for the current sanitization it uses sanitize_title_with_dashes and part of this is removing periods from the filename other than the one before the file extension. In addition I have also received complaints that wp_unique_filename converts filenames to all lower case. Changing the way WordPress handles filename case should not cause issues with unique filenames.

While talking to Ryan today he suggested that we start using a function to better facilitate sanitizing filenames rather than using a function intended for another use.

There is already a function in wordpress sanitize_file_name that can be used here but may need some updating as it won't work properly with non ASCII filenames.

#19 follow-up: @Denis-de-Bernardy
16 years ago

Replying to DD32:
pretty good from what i've seen, However, taking a cautious road could prevent issues in the future.. even if only just for 1 person...

So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:

  • Isn't this neat?
  • Something on "blah"!
  • Why 1 < 2 and 2 > 1
  • Email mickey @ mouse.com
  • Make $$$ because stocks will rise 10%

As long as we get those right, we've a few thousand podcasters who can vouch things work great.

I've never seen * used in a title, and I suspect it's due to the fact that Windows will reject it. I assume other windows special characters apply here too (\ in particular). I *have* seen all sorts of weird punctuation characters though.

Just my $.02. :-)

#20 @Denis-de-Bernardy
16 years ago

oh, and... btw. I may be wrong, but I'm 99% that those who complained about the sanitization were using titles that fit in either or all of the categories I outlined (quotes, greater/lesser than, @, dollar and percent).

#21 follow-up: @sivel
16 years ago

I do plan on adding a filter that passes both the sanitized and raw filenames so it sanitization could always be "removed" by a plugin without much issue.

#22 @DD32
16 years ago

+1 for replacing a slug sanitizer with a filename sanitizer, But like i said, It'll need some exact character tests, Dont want to take a step back from what we already have.

#23 in reply to: ↑ 21 @Denis-de-Bernardy
16 years ago

Replying to sivel:

I do plan on adding a filter that passes both the sanitized and raw filenames so it sanitization could always be "removed" by a plugin without much issue.

Imo, we'll be better off with the "correct" filter out of the box: what the user reasonably expects to work (my above examples are, I think, quite representative of english speaking audiences).

#24 in reply to: ↑ 19 ; follow-up: @filosofo
16 years ago

Replying to Denis-de-Bernardy:

So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:

But those file names wouldn't be rejected--just possibly sanitized to something that they wouldn't even interact with directly. Why should any users care what the actual sanitized file name is, so long as they have the UI to locate it by an arbitrary title (such as in your examples)?

#25 in reply to: ↑ 24 @Denis-de-Bernardy
16 years ago

Replying to filosofo:

Replying to Denis-de-Bernardy:

So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:

But those file names wouldn't be rejected--just possibly sanitized to something that they wouldn't even interact with directly. Why should any users care what the actual sanitized file name is, so long as they have the UI to locate it by an arbitrary title (such as in your examples)?

oh, in the url, it's not much of a big deal. but when you use the file name as it's title (as I currently do in mediacaster), it makes all the difference in the world.

my main interest in this ticket is I'd like to rewrite that plugin in such a way that it uses the WP uploader instead.

#26 @sivel
16 years ago

Here is my first go. I've updated the phpdoc comments for the sanitize_file_name function so read the comments for what we are doing. wp_unique_filename no longer converts filenames to all lowercase.

sanitize_file_name was previously only used by xmlrpc and APP. I don't use those functionalities so it would be appreciated if someone could test and make sure the changes to sanitize_file_name don't break anything.

2 filters have been added, 1 for filtering the sanitization characters and one for the returned file name.

@sivel
16 years ago

#27 @Denis-de-Bernardy
16 years ago

isn't there an issue with this regex?

$filename = 'foo - bar';
$filename = preg_replace('(\s+|-+)', '-', $filename); // error, foo---bar if it had delimiters

potential replacements:

$filename = preg_replace('/[\s\-]+/', '-', $filename);
$filename = preg_replace('/(?:\s|-)+/', '-', $filename);

also, wouldn't it be more desirable to turn chars such as punctation and dashes into dashes, rather than eliminating them outright?

lastly, there are potential issues here if the file name contains accents or chinese chars, no?

#28 @sivel
16 years ago

What potential problems do you see with foreign characters? Another thing that I have been asked to do by a user was keep foreign characters in the filename. I could of course run remove_accents on it to take care of the accents. I seem to have no problem with foreign characters in filenames on my system.

My only thought about replacing punctuation with dashes is that I would like to keep the number of dashes limited. My personal opinion is that it makes for better readability if punctuation is removed as opposed to being replaced with dashes.

And yes I do see the error in the regex there. I'll fix that soon.

#29 @filosofo
16 years ago

Don't need to escape a hyphen if it's the last or first part of a character class.
{{{/[\s-]+/}}

@sivel
16 years ago

#30 @sivel
16 years ago

  • Keywords has-patch early dev-feedback needs-testing added; needs-patch removed

#31 @ryan
16 years ago

Maintaining strtolower on the extension might be desirable. JPG always becomes jpg. A possible way to avoid dupes that differe only in case is:

WHERE meta_key = '_wp_attached_file' AND LCASE( meta_value ) = LCASE( %s )

Suggested by Mark.

#32 @hakre
16 years ago

Last patch looks good to me. If you want to handle strtolower properly for 7bit ascii chars (I assume that _is_ safe), checkout the split_utf8() function I used in a patch for utf8 css classnames. you can strtolower all single-byte squences (= ascii 7bit) with ease with that function #8446 / 8446-valid-utf8-classnames-generation.patch.

#33 @Denis-de-Bernardy
16 years ago

  • Keywords dev-feedback removed

#34 @Denis-de-Bernardy
16 years ago

  • Keywords tested commit added; needs-testing removed

commit? wontfix?

#35 @ryan
16 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [11178]) sanitize_file_name() improvements. Props sivel. fixes #9416

#37 @hakre
15 years ago

The to this ticket related changeset [11178] did break a test of the testsuite (wp_unique_filename) because it:

  1. removes dot in front
  2. does not removes % any longer
  3. does not remove chars out of the 7bit ascii range any longer

The problem has been reporte to sivel as well. I might open a new ticket related to that problem.

Note: See TracTickets for help on using tickets.