Make WordPress Core

#26094 closed defect (bug) (fixed)

sanitize_file_name() breaks some UTF-8 strings

Reported by: p_enrique Owned by: SergeyBiryukov
Milestone: 4.1 Priority: normal
Severity: normal Version: 2.1
Component: Formatting Keywords: has-patch commit 4.1-early
Focuses: Cc:


I've been testing sanitize_file_name( 'X.jpg' ) where X is an Unicode character that is a number or a letter (matching regex /[\p{L}\p{N}]/u). Alarmingly, there are many rather common characters that will result in a malformed, broken string being returned:

(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er

The problem seems to be caused by the preg_replace function without a Unicode pattern modifier.

Attachments (3)

26094.patch (1.1 KB) - added by p_enrique 23 months ago.
Add /u pattern modifier to preg_replace
26094.2.patch (1.1 KB) - added by p_enrique 23 months ago.
Add /u pattern modifier to preg_replace with error suppression
26094.3.patch (602 bytes) - added by p_enrique 22 months ago.
use '/[\r\n\t -]+/' instead of '/[\s-]+/'

Download all attachments as: .zip

Change History (13)

@p_enrique23 months ago

Add /u pattern modifier to preg_replace

comment:1 @p_enrique23 months ago

  • Keywords has-patch added
  • Version set to trunk

comment:3 @SergeyBiryukov23 months ago

I think \p{Z} might cause a warning if PCRE was compiled without Unicode property support:

preg_match() [function.preg-match]: Compilation failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X.

#22692 has a related discussion: ticket:22692:39.

@p_enrique23 months ago

Add /u pattern modifier to preg_replace with error suppression

@p_enrique22 months ago

use '/[\r\n\t -]+/' instead of '/[\s-]+/'

comment:4 @p_enrique22 months ago

  • Cc heikki@… added
  • Keywords dev-feedback added

The latest patch has the same solution as in #24001: Avoid the '\s' in the regexp.

Version 0, edited 22 months ago by p_enrique (next)

comment:5 @SergeyBiryukov22 months ago

  • Keywords commit 3.9-early added; dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release
  • Version changed from trunk to 2.7

Introduced in [4710].

comment:6 @SergeyBiryukov22 months ago

  • Version changed from 2.7 to 2.1

comment:7 @lenasterg13 months ago

I just want to remind that the problem in sanitize_file_name for the
(U+03A0) : Π Greek capital letter pi

still exists in wordpress version 3.9.2
It can be solved by replacing the

$filename = preg_replace( '/[\s-]+/' , '-' , $filename ) ;


$filename = preg_replace( '/[\s-]+/u' , '-' , $filename ) ;

but I can't be sure what side effects in has for other special cases.

comment:8 @miqrogroove13 months ago

  • Keywords 4.1-early added; 3.9-early removed

comment:9 @SergeyBiryukov13 months ago

  • Milestone changed from Future Release to 4.1

comment:10 @SergeyBiryukov13 months ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 29715:

Check for [\r\n\t ] instead of \s in sanitize_file_name() to avoid UTF-8 issues.

props p_enrique.
fixes #26094.

Note: See TracTickets for help on using tickets.