WordPress.org

Make WordPress Core

Opened 17 months ago

Last modified 5 months ago

#39791 new enhancement

sanitize_file_name() optimizations

Reported by: mgutt Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 4.8
Component: Media Keywords:
Focuses: Cc:

Description (last modified by SergeyBiryukov)

This changeset: [29290]

added this line:

$filename = str_replace( array( '%20', '+' ), '-', $filename );

But because of this changeset it can be removed as those chars aren't present anymore: [35122]

Additional proposals

1.) After many years new special characters are added step-by-step to sanitize_file_name(). Now almost all characters of the reserved file system, reserved URI and unsafe URL characters lists are part of it, except of:

reserved file system chars (https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words)

chr(0), ..., chr(32)

the reserved URI char (https://tools.ietf.org/html/rfc3986#section-2.2):

@

the unsafe URL char (https://www.ietf.org/rfc/rfc1738.txt):

^

non-printing DEL:

chr(127)

Finally you should add all these chars to avoid future bug reports:

$special_chars = array(
        // file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
        '<', '>', ':', '"', '/', '\\', '|', '?', '*',
        // control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
        // note: \t, \n and \r are chr(9), chr(10) and chr(13)
        chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
        chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
        chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
        chr(31),
        // non-printing character <DEL>
        chr(127),
        // non-breaking space
        chr(160),
        // URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
        '#', '[', ']', '@', '!', '$', '&', "'", '(', ')', '+', ',', ';', '=',
        // URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
        '{', '}', '^', '~', '`'
);

If you do that, do not forget to change this line:

$filename = preg_replace( '/[\r\n\t -]+/', '-', $filename );

to that (because we replaced the other chars already):

$filename = preg_replace( '/[ -]+/', '-', $filename );

and remove this line because we cover it already through chr(160):

$filename = preg_replace( "#\x{00a0}#siu", ' ', $filename );

Source: https://en.wikipedia.org/wiki/Whitespace_character#Unicode

2.) mb_strtolower() could be used to raise windows/unix interoperability (when downloading ftp backups or moving the host) because of their different behaviour in case-sensitivity.

Change History (6)

#1 @SergeyBiryukov
17 months ago

  • Component changed from General to Media
  • Description modified (diff)

#2 @mgutt
17 months ago

I benchmarked a str_replace() with the $special_chars. If it is realized it should use an preg_replace() instead. Its 3x times faster:

$filename = ltrim(preg_replace(
		'~
		[<>:"/\\|?*]|            # file system reserved https://en.wikipedia.org/wiki/Filename#Reserved_characters_and_words
		[\x00-\x1F]|             # control characters http://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
		[\x7F\xA0\xAD]|          # non-printing characters DEL, NO-BREAK SPACE, SOFT HYPHEN
		[#\[\]@!$&\'()+,;=]|     # URI reserved https://tools.ietf.org/html/rfc3986#section-2.2
		[{}^\~`]                 # URL unsafe characters https://www.ietf.org/rfc/rfc1738.txt
		~x',
	// last step: ltrim() avoids ".", ".." or ".hiddenFiles"
	'-', $filename), '.-'));
}
Last edited 17 months ago by mgutt (previous) (diff)

#3 @mgutt
17 months ago

This one should be added as well:

// maximum filename length of 255 bytes http://serverfault.com/a/9548/44086
// note: mb_strcut() relies as strlen() on bytes and not chars
if ($ext = pathinfo($filename, PATHINFO_EXTENSION)) {
	$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255 - strlen($ext) - 1, mb_detect_encoding($filename)) . '.' . $ext;
}
else {
	$filename = mb_strcut(pathinfo($filename, PATHINFO_FILENAME), 0, 255, mb_detect_encoding($filename));
}

But it should be executed at the end after all the filename changes have been done (removing whitespace, removing consecutive characters, etc.) so it contains as much as possible before we cut it.

Could be added, too:

// lowercase for better windows/unix interoperability http://support.microsoft.com/kb/100625
$filename = mb_strtolower($filename, mb_detect_encoding($filename));

This ticket was mentioned in Slack in #core-media by paaljoachim. View the logs.


5 months ago

This ticket was mentioned in Slack in #core-media by paaljoachim. View the logs.


5 months ago

This ticket was mentioned in Slack in #core-i18n by paaljoachim. View the logs.


5 months ago

Note: See TracTickets for help on using tickets.