Make WordPress Core

Opened 8 years ago

Closed 2 years ago

Last modified 21 months ago

#35951 closed defect (bug) (fixed)

remove_accents() doesn't escape Unicode NFD characters

Reported by: onnimonni's profile onnimonni Owned by: audrasjb's profile audrasjb
Milestone: 6.1 Priority: normal
Severity: normal Version:
Component: Charset Keywords: has-patch add-to-field-guide
Focuses: Cc:

Description

OS X filesystem HFS uses unicode NFD instead of NFC. This causes all sorts of problems when uploaded files with accents are moved between environments or if the site is developed in OS X machine and then pushed to production.

I'm trying to solve this problem using remove_accents() function and sanitizing all uploaded files. But in my test machine remove_accents() doesn't do anything for NFD characters.

It should use something like Normalizer::normalize() to avoid this. But sadly Normalizer isn't available in all systems.

I included zip file which contains nfd characters. If you open it in linux machine you can see a small difference between the characters and "normal" utf-8 accented characters like: öäå.

Try to copy the contents and run it through remove_accents('content') and you can see that nothing is changed.

If you have Normalizer available you can test that remove_accent() if characters are first filtered by running Normalizer for example: remove_accents(Normalizer::normalize('content'))

I realize this doesn't concern native english speaking countries but it's really big annoyance for the rest of us.

Attachments (2)

nfd-characters.zip (199 bytes) - added by onnimonni 8 years ago.
ZIP which contains txt file with NFD encoded characters
normalize_before_removing_accents.patch (486 bytes) - added by rodrigosevero 3 years ago.
Patch to solve expanded characters not being substituted using PHP's own normalizer_normalize function.

Download all attachments as: .zip

Change History (24)

@onnimonni
8 years ago

ZIP which contains txt file with NFD encoded characters

#1 follow-up: @johnbillion
8 years ago

  • Focuses accessibility removed
  • Keywords reporter-feedback added
  • Version 4.4.2 deleted

Out of interest, how are you interacting with an HFS filesystem? Are you copying files from a volume that uses HFS? AFAIK, you cannot run OS X on HFS.

I would hazard a guess that this is a limited edge case (not saying it's not a bug, just that it's likely a very uncommon situation to find yourself in).

#2 @ocean90
8 years ago

Related: #30130

#4 in reply to: ↑ 1 @zodiac1978
8 years ago

Replying to johnbillion:

Out of interest, how are you interacting with an HFS filesystem? Are you copying files from a volume that uses HFS? AFAIK, you cannot run OS X on HFS.

I would hazard a guess that this is a limited edge case (not saying it's not a bug, just that it's likely a very uncommon situation to find yourself in).

The problem with NFD is true for HFS and HFS+, see:
https://en.wikipedia.org/wiki/HFS_Plus#Design

And HFS+ is the default filesystem for all Mac OS X computers.

This patch would fix it, as it adds the mentioned normalizer function from PHP (if it exists) to the sanitize_filename filter:
https://core.trac.wordpress.org/attachment/ticket/30130/30130.2.diff

#5 @johnbillion
8 years ago

  • Keywords needs-patch added; reporter-feedback removed
  • Milestone changed from Awaiting Review to 4.7
  • Owner set to johnbillion
  • Status changed from new to assigned

#6 @jorbin
8 years ago

  • Keywords needs-unit-tests added

In addition to a patch, this will need some unit tests if it is going to be fixed in 4.7

#7 @gitlost
8 years ago

This is a duplicate really of #24661.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 years ago

#9 @johnbillion
8 years ago

  • Milestone changed from 4.7 to Future Release

#10 @gitlost
8 years ago

I think it would be more apt to mark this as a duplicate of #24661.

#12 @zodiac1978
6 years ago

  • Status changed from assigned to reopened

Like all the other tickets mentioned in this ticket this is still true and needs to be fixed.

I've submitted a talk for WordCamp Europe where I will try to summarize all these tickets, possible patches and why this still is a problem for most if not all European languages.

#13 @SergeyBiryukov
5 years ago

  • Milestone set to Future Release

#14 @johnbillion
4 years ago

  • Owner johnbillion deleted
  • Status changed from reopened to assigned

#15 @rodrigosevero
3 years ago

Well, we are at 5.7.2 and it seems that this issue haven't been solved yet.

Either the solution proposed here or the solution proposed at https://core.trac.wordpress.org/ticket/24661 seems to work.

How can I help move this issue forward?

@rodrigosevero
3 years ago

Patch to solve expanded characters not being substituted using PHP's own normalizer_normalize function.

#16 @azaozz
2 years ago

  • Keywords has-patch added; needs-patch needs-unit-tests removed
  • Milestone changed from Future Release to 6.1

Milestone to 6.1 as the PR on https://core.trac.wordpress.org/ticket/24661 will (most likely) fix this too.

Please test it!

#17 @audrasjb
2 years ago

  • Owner set to audrasjb
  • Resolution set to fixed
  • Status changed from assigned to closed

In 53754:

Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC, via the normalizer_normalize() PHP function which is available with the recommended intl PHP extension.

This fixes an issue where NFD characters were not properly sanitized. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.

#18 @milana_cap
22 months ago

  • Keywords add-to-field-guide added

#19 follow-up: @peterfabian1000
21 months ago

Hi, I just wanted to leave a comment here that it seems some people are having problems with not having Normalizer class available in their environment, which is problematic for some plugins that call remove_accents more often (like WooCommerce, ref issue https://github.com/woocommerce/woocommerce/issues/35471).

Besides, there are some core workflows that may end up with a fatal error as well. E.g. when the admin tries to create a new user with accented characters, sanitize_user calls remove_accents and if they don't have Normalizer, it will throw a fatal.

I suppose it should be possible to fail more gracefully, or only use Normalizer if get_loaded_extensions includes intl?

#20 in reply to: ↑ 19 @zodiac1978
21 months ago

Replying to peterfabian1000:

I suppose it should be possible to fail more gracefully, or only use Normalizer if get_loaded_extensions includes intl?

There is a function exist check there:
https://github.com/WordPress/WordPress/blob/823517e1deabef111a8a1b49bce466eada1162c9/wp-includes/formatting.php#L1604

<?php
if ( function_exists( 'normalizer_normalize' ) ) {

If this check is true but the function call produces a fatal error, we maybe need to switch to check with get_loaded_extensions and intl instead.

I would be interested in the environment (which PHP version and how does the check can be true without a working normalize function).

#21 follow-up: @peterfabian1000
21 months ago

I agree it's weird. Maybe some polyfill is adding the function?

#22 in reply to: ↑ 21 ; follow-up: @zodiac1978
21 months ago

Replying to peterfabian1000:

I agree it's weird. Maybe some polyfill is adding the function?

The reporter has Updraft Plus installed and this is indeed using a polyfill.
See: https://wpdirectory.net/search/01GGYGZXMMWWG4FDD3781EKW0W

Could the polyfill be not accessible and still producing a true on a function exists check!?

#23 in reply to: ↑ 22 @SergeyBiryukov
21 months ago

Replying to zodiac1978:

The reporter has Updraft Plus installed and this is indeed using a polyfill.
See: https://wpdirectory.net/search/01GGYGZXMMWWG4FDD3781EKW0W

Could the polyfill be not accessible and still producing a true on a function exists check!?

It appears that the plugin polyfills the normalizer_normalize() function, but the Normalizer class has a different namespace there, p\Normalizer, so Normalizer::FORM_C produces a fatal error.

Since this ticket was closed on a completed milestone, I've created a new one for 6.1.1: #56980, let's continue the discussion there.

Note: See TracTickets for help on using tickets.