#9416 closed enhancement (fixed)
Better file name sanitization for wp_unique_filename
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | 2.8 | Priority: | normal |
Severity: | normal | Version: | 2.7.1 |
Component: | Upload | Keywords: | has-patch early tested commit |
Focuses: | Cc: |
Description
sanitize_title_with_dashes seems to cause problems for some people who want the filename of their files to not be modified when uploaded.
Lets give those users a filter that can be removed from wp_unique_filename.
Attachments (3)
Change History (40)
#4
follow-up:
↓ 8
@
16 years ago
Talked to sivel in IRC about adding santiize_filename() that just removes some special chars, trims trailing and leading dots, and replaces spaces with underscores. This is similar to what Drupal does.
#5
@
16 years ago
- Keywords needs-patch added; has-patch tested removed
- Owner set to sivel
- Summary changed from Allow removal of sanitize_title_with_dashes from wp_unique_filename to Better file name sanitization for wp_unique_filename
I'm working an updated patch in the coming days. Assigning the ticket to myself.
#8
in reply to:
↑ 4
@
16 years ago
Replying to ryan:
Talked to sivel in IRC about adding santiize_filename() that just removes some special chars, trims trailing and leading dots, and replaces spaces with underscores.
I can understand leading dots, but why the rest? why does this need to be "sanitized" in the first place? Does'nt it get url encoded before landing into the url? I mean, as long as no two files have the same name, we're fine no?
#10
@
16 years ago
There are numerous characters that are not allowed in file names. And in addition there are characters that are allowed in windows but not in *nix and vice versa. In addition if you use a special character that would need escaping there are certain instances where the escape character '\' would be added to the filename. I could go on, but my opinion is this is needed.
#11
@
16 years ago
@Sivle: Granted. But there isn't any need need to sanitize beyond that.
For what it's worth, I've a (heavily used) plugin that deals with file names to name podcasts, and the only sanitization I did was to remove (forward- and back-) slash characters. The rest (including forbidden characters) I found were irrelevant in practice, since they get escaped by file_exists() et al anyway. So basically, really isn't any issue I can think of that should disallow the likes of:
$file_name = 'foo%&#$?\|\'"~!@*bar.mp3';
as long as it gets urlencoded, it not only works fine in flv players (which are messy in their own right), but also works fine in the browser without the slightest (baring php bugs of course) security problem.
#12
follow-up:
↓ 13
@
16 years ago
What happens if someone downloads a file that was named on Windows to a Mac or Nix box and cannot figure out how to delete the file because it had a strange character. Or cannot download and save the file because the file has characters that are illegal for other file systems?
Or the filename has a * in it and when the users deletes it from the file system they take out more files than they intended?
Another case is if the file has all common delimiters for preg_* functions and it becomes increasingly difficult for a plugin to do something.
Another case is we allow backticks and we find out there is a vulnerability that allows users to execute code.
Just a few things that I thought about when weighing the needs to sanitize in the first place.
#13
in reply to:
↑ 12
@
16 years ago
Replying to sivel:
What happens if someone downloads a file that was named on Windows to a Mac or Nix box and cannot figure out how to delete the file because it had a strange character. Or cannot download and save the file because the file has characters that are illegal for other file systems?
that is why I remove the slashes
Or the filename has a * in it and when the users deletes it from the file system they take out more files than they intended?
these are escaped by file handling functions
Another case is if the file has all common delimiters for preg_* functions and it becomes increasingly difficult for a plugin to do something.
that's what preg_quote() is for
Another case is we allow backticks and we find out there is a vulnerability that allows users to execute code.
it gets escaped by file handling functions too
Just a few things that I thought about when weighing the needs to sanitize in the first place.
you're right on paper, but we're being a bit too overzealous here.
#14
follow-up:
↓ 15
@
16 years ago
I think most of what I am referring to is command line level. When the files are named and stored in the file system they won't be escaped will they?
#15
in reply to:
↑ 14
@
16 years ago
Replying to sivel:
I think most of what I am referring to is command line level. When the files are named and stored in the file system they won't be escaped will they?
they definitely will.
#16
@
16 years ago
or rather, they'll list, when you run, say, ls, as:
foo bar
but to use them, you'd need to go:
foo\ bar
D.
#17
@
16 years ago
While Files will be created properly no matter what the characters (As long as they're valid on the system), The filename could potentially contain items which need to be escaped to be manipulated.
Its a pain to have to work out the escape sequence on *unix sometimes for files.. (i've reverted to rm foo*.ext before)
Also, many filesystems treat files differently as mentioned.. The current sanitization is pretty good from what i've seen, However, taking a cautious road could prevent issues in the future.. even if only just for 1 person...
Maybe a list of border-line cases which can occur with the current code would be useful? To explain the need for the changes to be made in the first place. (ie. outline the cases where the current sanitization fails)
#18
@
16 years ago
As for the current sanitization it uses sanitize_title_with_dashes and part of this is removing periods from the filename other than the one before the file extension. In addition I have also received complaints that wp_unique_filename converts filenames to all lower case. Changing the way WordPress handles filename case should not cause issues with unique filenames.
While talking to Ryan today he suggested that we start using a function to better facilitate sanitizing filenames rather than using a function intended for another use.
There is already a function in wordpress sanitize_file_name that can be used here but may need some updating as it won't work properly with non ASCII filenames.
#19
follow-up:
↓ 24
@
16 years ago
Replying to DD32:
pretty good from what i've seen, However, taking a cautious road could prevent issues in the future.. even if only just for 1 person...
So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:
- Isn't this neat?
- Something on "blah"!
- Why 1 < 2 and 2 > 1
- Email mickey @ mouse.com
- Make $$$ because stocks will rise 10%
As long as we get those right, we've a few thousand podcasters who can vouch things work great.
I've never seen * used in a title, and I suspect it's due to the fact that Windows will reject it. I assume other windows special characters apply here too (\ in particular). I *have* seen all sorts of weird punctuation characters though.
Just my $.02. :-)
#20
@
16 years ago
oh, and... btw. I may be wrong, but I'm 99% that those who complained about the sanitization were using titles that fit in either or all of the categories I outlined (quotes, greater/lesser than, @, dollar and percent).
#21
follow-up:
↓ 23
@
16 years ago
I do plan on adding a filter that passes both the sanitized and raw filenames so it sanitization could always be "removed" by a plugin without much issue.
#22
@
16 years ago
+1 for replacing a slug sanitizer with a filename sanitizer, But like i said, It'll need some exact character tests, Dont want to take a step back from what we already have.
#23
in reply to:
↑ 21
@
16 years ago
Replying to sivel:
I do plan on adding a filter that passes both the sanitized and raw filenames so it sanitization could always be "removed" by a plugin without much issue.
Imo, we'll be better off with the "correct" filter out of the box: what the user reasonably expects to work (my above examples are, I think, quite representative of english speaking audiences).
#24
in reply to:
↑ 19
;
follow-up:
↓ 25
@
16 years ago
Replying to Denis-de-Bernardy:
So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:
But those file names wouldn't be rejected--just possibly sanitized to something that they wouldn't even interact with directly. Why should any users care what the actual sanitized file name is, so long as they have the UI to locate it by an arbitrary title (such as in your examples)?
#25
in reply to:
↑ 24
@
16 years ago
Replying to filosofo:
Replying to Denis-de-Bernardy:
So, user input time. My customers (many podcasters) would hate it if the following podcast titles get rejected:
But those file names wouldn't be rejected--just possibly sanitized to something that they wouldn't even interact with directly. Why should any users care what the actual sanitized file name is, so long as they have the UI to locate it by an arbitrary title (such as in your examples)?
oh, in the url, it's not much of a big deal. but when you use the file name as it's title (as I currently do in mediacaster), it makes all the difference in the world.
my main interest in this ticket is I'd like to rewrite that plugin in such a way that it uses the WP uploader instead.
#26
@
16 years ago
Here is my first go. I've updated the phpdoc comments for the sanitize_file_name function so read the comments for what we are doing. wp_unique_filename no longer converts filenames to all lowercase.
sanitize_file_name was previously only used by xmlrpc and APP. I don't use those functionalities so it would be appreciated if someone could test and make sure the changes to sanitize_file_name don't break anything.
2 filters have been added, 1 for filtering the sanitization characters and one for the returned file name.
#27
@
16 years ago
isn't there an issue with this regex?
$filename = 'foo - bar'; $filename = preg_replace('(\s+|-+)', '-', $filename); // error, foo---bar if it had delimiters
potential replacements:
$filename = preg_replace('/[\s\-]+/', '-', $filename); $filename = preg_replace('/(?:\s|-)+/', '-', $filename);
also, wouldn't it be more desirable to turn chars such as punctation and dashes into dashes, rather than eliminating them outright?
lastly, there are potential issues here if the file name contains accents or chinese chars, no?
#28
@
16 years ago
What potential problems do you see with foreign characters? Another thing that I have been asked to do by a user was keep foreign characters in the filename. I could of course run remove_accents on it to take care of the accents. I seem to have no problem with foreign characters in filenames on my system.
My only thought about replacing punctuation with dashes is that I would like to keep the number of dashes limited. My personal opinion is that it makes for better readability if punctuation is removed as opposed to being replaced with dashes.
And yes I do see the error in the regex there. I'll fix that soon.
#29
@
16 years ago
Don't need to escape a hyphen if it's the last or first part of a character class.
{{{/[\s-]+/}}
#31
@
16 years ago
Maintaining strtolower on the extension might be desirable. JPG always becomes jpg. A possible way to avoid dupes that differe only in case is:
WHERE meta_key = '_wp_attached_file' AND LCASE( meta_value ) = LCASE( %s )
Suggested by Mark.
#32
@
16 years ago
Last patch looks good to me. If you want to handle strtolower properly for 7bit ascii chars (I assume that _is_ safe), checkout the split_utf8() function I used in a patch for utf8 css classnames. you can strtolower all single-byte squences (= ascii 7bit) with ease with that function #8446 / 8446-valid-utf8-classnames-generation.patch.
#37
@
15 years ago
The to this ticket related changeset [11178] did break a test of the testsuite (wp_unique_filename) because it:
- removes dot in front
- does not removes % any longer
- does not remove chars out of the 7bit ascii range any longer
The problem has been reporte to sivel as well. I might open a new ticket related to that problem.
I like the Idea but I tend to create an action that has all that in by default so that it can be replaced in full because I ran over other Errors there as well, for example, UTF-8 encoding was not supported in the process to sanitize the filename.
double byte chars were reduced to a single byte "a" that was absolutely not fitting and it looks like this all not done very properly.
I vote for getting the whole "change the filename that was submitted to get the filename on disk" re-worked sothat it
a) works properly (handles character encoding well)
b) is extend-, replace- or reduceable
a) can be done by checking the used functions
b) can be done by creating an action/filter
both fit perfectly together.