Opened 8 years ago
Last modified 7 years ago
#39791 new enhancement
sanitize_file_name() optimizations
Reported by: | mgutt | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | 4.8 |
Component: | Media | Keywords: | |
Focuses: | Cc: |
Description (last modified by )
This changeset: [29290]
added this line:
$filename = str_replace( array( '%20', '+' ), '-', $filename );
But because of this changeset it can be removed as those chars aren't present anymore: [35122]
Additional proposals
1.) After many years new special characters are added step-by-step to sanitize_file_name(). Now almost all characters of the reserved file system, reserved URI and unsafe URL characters lists are part of it, except of:
reserved file system chars (https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words)
chr(0), ..., chr(32)
the reserved URI char (https://tools.ietf.org/html/rfc3986#section-2.2):
@
the unsafe URL char (https://www.ietf.org/rfc/rfc1738.txt):
^
non-printing DEL:
chr(127)
Finally you should add all these chars to avoid future bug reports:
$special_chars = array(
// file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
'<', '>', ':', '"', '/', '\\', '|', '?', '*',
// control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
// note: \t, \n and \r are chr(9), chr(10) and chr(13)
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing character <DEL>
chr(127),
// non-breaking space
chr(160),
// URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
'#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '+', ',', ';', '=',
// URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
'{', '}', '^', '~', '`'
);
If you do that, do not forget to change this line:
$filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );
to that (because we replaced the other chars already):
$filename = preg_replace( '/[ -]+/', '-', $filename );
and remove this line because we cover it already through chr(160):
$filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );
Source: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
2.) mb_strtolower() could be used to raise windows/unix interoperability (when downloading ftp backups or moving the host) because of their different behaviour in case-sensitivity.
Change History (6)
#3
@
8 years ago
This one should be added as well:
// maximum filename length of 255 bytes http://serverfault.com/a/9548/44086 // note: mb_strcut() relies as strlen() on bytes and not chars if ($ext = pathinfo($filename, PATHINFO_EXTENSION)) { $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - strlen($ext) - 1, mb_detect_encoding($filename)) . '.' . $ext; } else { $filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255, mb_detect_encoding($filename)); }
But it should be executed at the end after all the filename changes have been done (removing whitespace, removing consecutive characters, etc.) so it contains as much as possible before we cut it.
Could be added, too:
// lowercase for better windows/unix interoperability http://support.microsoft.com/kb/100625 $filename = mb_strtolower($filename, mb_detect_encoding($filename));
I benchmarked a str_replace() with the $special_chars. If it is realized it should use an preg_replace() instead. Its 3x times faster: