Make WordPress Core

Opened 6 years ago

Closed 3 years ago

#47763 closed defect (bug) (fixed)

Uploaded files that meet certain conditions do not hit in media search

Reported by: dxd5001
Owned by: audrasjb
Milestone: 6.1
Priority: normal
Severity: normal
Version: 5.2.2
Component: Media
Keywords: has-patch
Focuses: administration
Cc:

Description

After I upload a media file, it does not appear on the Media page when I search for it by filename.

In my observation, it only happens when all of the following conditions are met:

  1. The file was created on macOS
  2. The file has a Japanese name such as “ワードプレス.pdf”
  3. The name includes a voiced sound mark (dakuten) and/or a semi-voiced, P-sound mark (handakuten)
  4. The file is uploaded from a web browser other than Safari

I think this is caused by the interaction between Unicode normalization and APFS and/or HFS+ (both are macOS file systems).

The file system uses Normalization Form D (decomposition) for file names, but when I type the name into the search box in a browser other than Safari, the text is sent in Normalization Form C (composition), so the characters don't match.

プ - The character with P‐sound consonant mark added in a filename

KATAKANA LETTER HU + COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
Unicode: U+30D5 U+309A, UTF-8: E3 83 95 E3 82 9A

プ - The character with P‐sound consonant mark typed in the search window from a browser (Chrome)

KATAKANA LETTER PU
Unicode: U+30D7, UTF-8: E3 83 97

These characters look the same but are not the same.
You can check this easily by copying and pasting the characters above into the macOS Character Viewer.
Right-click the character and copy to get detailed information.
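
The mismatch can also be reproduced in a few lines of PHP. This is a minimal sketch, assuming PHP 7.0+ for the `\u{...}` escapes; the variable names are illustrative only, and the last step runs only when the intl extension is loaded.

```php
<?php
// The two spellings of the character described above, written as code point escapes.
$from_filename = "\u{30D5}\u{309A}"; // NFD: KATAKANA LETTER HU + combining semi-voiced sound mark
$from_search   = "\u{30D7}";         // NFC: KATAKANA LETTER PU

// The UTF-8 bytes differ, so a byte-wise comparison or substring search fails.
$hex_filename = bin2hex( $from_filename ); // e38395e3829a
$hex_search   = bin2hex( $from_search );   // e38397
$bytes_match  = ( $from_filename === $from_search ); // false

// With the intl extension, normalizing both sides to NFC makes them comparable.
$match_after_nfc = null; // stays null when intl is not available
if ( class_exists( 'Normalizer' ) ) {
	$match_after_nfc = Normalizer::normalize( $from_filename, Normalizer::FORM_C )
		=== Normalizer::normalize( $from_search, Normalizer::FORM_C );
}
```

The hex strings correspond to the UTF-8 byte sequences listed above.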

Fortunately, PHP has a Normalizer class (https://www.php.net/manual/en/class.normalizer.php), provided by the intl extension.

So I tried using this class in the function wp_unique_filename() (wp-includes/functions.php), and the results are good.

I added this code in wp-includes/functions.php line 2257:

<?php
        // Unicode Normalization: Normalize Form D (decomposition) to Form C (composition).
        if ( Normalizer::isNormalized( $filename, Normalizer::FORM_D ) ) {
                $filename = Normalizer::normalize( $filename, Normalizer::FORM_C );
        }

With this change, the file appears in search results on the Media page. A page that has the file attached in its content area is also found by text search from the front-end search box.

Although we can work around this problem with the “wp_unique_filename” filter and the class above, I think it’s better to handle it in core.
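
For reference, the filter-based workaround might look roughly like the sketch below. The function name `myplugin_normalize_filename_to_nfc` is hypothetical, and the code assumes the intl extension may or may not be present. Unlike the snippet above, it checks against Form C rather than Form D, so a name that mixes both forms is also converted.

```php
<?php
/**
 * Hypothetical callback: convert a filename to NFC when the intl extension is available.
 */
function myplugin_normalize_filename_to_nfc( $filename ) {
	if ( ! class_exists( 'Normalizer' ) ) {
		return $filename; // intl is missing: leave the name unchanged.
	}
	if ( Normalizer::isNormalized( $filename, Normalizer::FORM_C ) ) {
		return $filename; // Already NFC (this includes plain ASCII names).
	}
	$normalized = Normalizer::normalize( $filename, Normalizer::FORM_C );
	// normalize() returns false on invalid input such as malformed UTF-8.
	return ( false === $normalized ) ? $filename : $normalized;
}

// Register only inside WordPress, so the file can also be run standalone.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'wp_unique_filename', 'myplugin_normalize_filename_to_nfc' );
}
```

One caveat, as far as I can tell: the filter runs after the uniqueness check, so a name changed here is not rechecked against existing files.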


Test Environment:

WordPress 5.2.2
PHP 7.2.17
MySQL 5.7.16
macOS 10.14.5 (MacBook Air)
File system: Apple File System (APFS)
Chrome 75.0.3770.142
Safari 12.1.1 (14607.2.6.1.1)
Firefox 68.0.1

Attachments (2)

functions.php (214.5 KB) - added by dxd5001 6 years ago.
Added Unicode Normalization code to wp_unique_filename function.
ワードプレス.pdf (12.8 KB) - added by dxd5001 6 years ago.
A sample of a media file.

Change History (7)

@dxd5001
6 years ago

Added Unicode Normalization code to wp_unique_filename function.

@dxd5001
6 years ago

A sample of a media file.

#3 @azaozz
3 years ago

  • Milestone changed from Awaiting Review to 6.1

Setting the milestone to 6.1, as the PR on https://core.trac.wordpress.org/ticket/24661 will (most likely) fix this too.

Please test it!

#4 @azaozz
3 years ago

  • Keywords has-patch added

#5 @audrasjb
3 years ago

  • Owner set to audrasjb
  • Resolution set to fixed
  • Status changed from new to closed

In 53754:

Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC, via the normalizer_normalize() PHP function which is available with the recommended intl PHP extension.

This fixes an issue where NFD characters were not properly sanitized. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).
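
To illustrate why the order matters, here is a toy sketch (not the code from the changeset). It assumes a lookup table keyed on precomposed characters; the one-entry `$map` is a hypothetical stand-in for a much larger table.

```php
<?php
// Toy lookup table with a precomposed (NFC) key.
$map = array( "\u{00E9}" => 'e' ); // LATIN SMALL LETTER E WITH ACUTE => "e"

// "cafe" followed by COMBINING ACUTE ACCENT, i.e. the NFD spelling.
$nfd_input = "cafe\u{0301}";

// Without normalization the precomposed key never matches, so the input is unchanged.
$without = strtr( $nfd_input, $map );

// Normalize to NFC first (only if intl is loaded), then apply the same table.
$with = null; // stays null when intl is not available
if ( function_exists( 'normalizer_normalize' ) ) {
	$nfc = normalizer_normalize( $nfd_input ); // default form is NFC
	if ( false !== $nfc ) {
		$with = strtr( $nfc, $map ); // "cafe"
	}
}
```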

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.
