Make WordPress Core

Opened 11 years ago

Closed 10 years ago

#26094 closed defect (bug) (fixed)

sanitize_file_name() breaks some UTF-8 strings

Reported by: p_enrique's profile p_enrique Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 4.1 Priority: normal
Severity: normal Version: 2.1
Component: Formatting Keywords: has-patch commit 4.1-early
Focuses: Cc:

Description

I've been testing sanitize_file_name( 'X.jpg' ) where X is an Unicode character that is a number or a letter (matching regex /[\p{L}\p{N}]/u). Alarmingly, there are many rather common characters that will result in a malformed, broken string being returned:

(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er

The problem seems to be caused by the preg_replace function without a Unicode pattern modifier.

Attachments (3)

26094.patch (1.1 KB) - added by p_enrique 11 years ago.
Add /u pattern modifier to preg_replace
26094.2.patch (1.1 KB) - added by p_enrique 11 years ago.
Add /u pattern modifier to preg_replace with error suppression
26094.3.patch (602 bytes) - added by p_enrique 11 years ago.
use '/[\r\n\t -]+/' instead of '/[\s-]+/'

Download all attachments as: .zip

Change History (13)

@p_enrique
11 years ago

Add /u pattern modifier to preg_replace

#1 @p_enrique
11 years ago

  • Keywords has-patch added
  • Version set to trunk

#3 @SergeyBiryukov
11 years ago

I think \p{Z} might cause a warning if PCRE was compiled without Unicode property support:

preg_match() [function.preg-match]: Compilation failed: PCRE does not support \L, \l, \N, \P, \p, \U, \u, or \X.

#22692 has a related discussion: ticket:22692:39.

@p_enrique
11 years ago

Add /u pattern modifier to preg_replace with error suppression

@p_enrique
11 years ago

use '/[\r\n\t -]+/' instead of '/[\s-]+/'

#4 @p_enrique
11 years ago

  • Cc heikki@… added
  • Keywords dev-feedback added

The latest patch has the same solution as in #24001: Avoid the '\s' in the regexp.

Last edited 11 years ago by SergeyBiryukov (previous) (diff)

#5 @SergeyBiryukov
11 years ago

  • Keywords commit 3.9-early added; dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release
  • Version changed from trunk to 2.7

Introduced in [4710].

#6 @SergeyBiryukov
11 years ago

  • Version changed from 2.7 to 2.1

#7 @lenasterg
10 years ago

Hi.
I just want to remind that the problem in sanitize_file_name for the
(U+03A0) : Π Greek capital letter pi

still exists in wordpress version 3.9.2
It can be solved by replacing the

$filename = preg_replace( '/[\s-]+/' , '-' , $filename ) ;

with

$filename = preg_replace( '/[\s-]+/u' , '-' , $filename ) ;

but I can't be sure what side effects in has for other special cases.

#8 @miqrogroove
10 years ago

  • Keywords 4.1-early added; 3.9-early removed

#9 @SergeyBiryukov
10 years ago

  • Milestone changed from Future Release to 4.1

#10 @SergeyBiryukov
10 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 29715:

Check for [\r\n\t ] instead of \s in sanitize_file_name() to avoid UTF-8 issues.

props p_enrique.
fixes #26094.

Note: See TracTickets for help on using tickets.