Make WordPress Core

Opened 3 months ago

Last modified 2 months ago

#54415 new defect (bug)

remove_accent does not remove accents

Reported by: malthert Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

remove_accents function removes some special characters only conditionally.

There is a condition like:

<?php
if ( in_array( $locale, array( 'de_DE', 'de_DE_formal', 'de_CH', 'de_CH_informal', 'de_AT' ), true ) ) {

Why would some accents only be removed conditionally based on the current language?

This makes zero sense and causes unexpected results in general, but especially when using one of the locales that should be there but is not, e.g. de_AT_formal/informal or sv_SE (only da_DK is there even though the characters are used in Swedish just the same)

Change History (11)

#1 @johnbillion
3 months ago

  • Keywords reporter-feedback added; has-patch removed
  • Version trunk deleted

Thanks for the report @malthert . What specific problems are you seeing? And which language(s) are you using?

This ticket was mentioned in PR #1860 on WordPress/wordpress-develop by kkmuffme.


3 months ago

  • Keywords has-patch added

#3 @malthert
3 months ago

@johnbillion added PR

There are a variety of issues we encounter:

  • we use sv_SE locale, where we'd expect same accents to be removed as for da_DK locale
  • we use de_AT_formal (copy of de_DE_formal) which is not in this list
  • we use remove_accents in general for search that does not support special/diacritics - e.g. we sell Würth products (screws,...) and if somebody on our english page searches for Würth the search does not return anything, because get_locale returns "en_US" in this case and "ü" is not changed to "ue"

#4 in reply to: ↑ description ; follow-up: @SergeyBiryukov
3 months ago

Hi there, thanks for the report!

Replying to malthert:

Why would some accents only be removed conditionally based on the current language?

Because the replacement for some characters depends on the locale. For example:

  • Ä is replaced with Ae for German, and just A for other locales.
  • Đ is replaced with DJ for Serbian and Bosnian, and just D for other locales.

This was originally added in [23361] / #3782, there is some discussion on the ticket about the implementation.

The list was further adjusted in [26585], [33027], [37698], [37853], [38646], [39939], [49967].

This makes zero sense and causes unexpected results in general, but especially when using one of the locales that should be there but is not, e.g. de_AT_formal/informal or sv_SE (only da_DK is there even though the characters are used in Swedish just the same)

I think we should add those locales to the list, instead of removing the conditions altogether.

Last edited 3 months ago by SergeyBiryukov (previous) (diff)

#5 @knutsp
3 months ago

only da_DK is there even though the characters are used in Swedish just the same

Not entirely correct. Danish and Norwegian use the exact same alphabet, but Swedish use ä and ö instead of æ and ø.

#6 in reply to: ↑ 4 ; follow-up: @malthert
3 months ago

Replying to SergeyBiryukov:
Thanks for the feedback!

To sum this up, I will change the following things:
We need to keep the language distinction, since some languages have them transliterated differently.

1) add the missing locales to the existing function
or: remove _formal/_informal from locale string and just check for de_DE, de_AT, de_CH so all those cases are covered
=> best: I think we should just use the first 2 letters of the locale here, so it will be consistent for all "de" locale, as this is what is originally intended here.

2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss

3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)

so developers can pass a locale they want to use for this use case instead of the locale of the current installation.

If that sounds good for you, I will update the PR

#7 follow-up: @ocean90
3 months ago

  • Severity changed from major to normal

locales that should be there but is not, e.g. de_AT_formal/informal

These locales don't exist so they are not part of the list.

#8 in reply to: ↑ 7 @malthert
3 months ago

Replying to ocean90:

locales that should be there but is not, e.g. de_AT_formal/informal

These locales don't exist so they are not part of the list.

They do exist, WP just doesn't have a native translation for it yet.

Why do you think there is no distinction between formal/informal in Austrian German?

#9 in reply to: ↑ 6 ; follow-up: @SergeyBiryukov
3 months ago

Replying to malthert:

1) add the missing locales to the existing function
or: remove _formal/_informal from locale string and just check for de_DE, de_AT, de_CH so all those cases are covered
=> best: I think we should just use the first 2 letters of the locale here, so it will be consistent for all "de" locale, as this is what is originally intended here.

Sounds good to me, let's just check for any de_* locale.

2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss

I think I'd prefer for this to be added to the conditial too, for clarity and consistency, to keep the replacements relevant to German locales in a single place.

3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)

I think this can already be achieved by using switch_to_locale() and restore_previous_locale():

$switched_locale = switch_to_locale( '...' );

$string = remove_accents( $string );

if ( $switched_locale ) {
	restore_previous_locale();
}

The same approach is used when sending emails in user's language instead of site language, without the need for passing any additional parameters to the functions.

#10 in reply to: ↑ 9 @malthert
2 months ago

Replying to SergeyBiryukov:

Sounds good to me, let's just check for any de_* locale.

Ok, will check for "de" locale (since "de" without trailing underscore is technically valid too, and there are no 3 letter locales starting with "de" so it's safe in this regard too https://www.localeplanet.com/icu/)

2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss

I think I'd prefer for this to be added to the conditial too, for clarity and consistency, to keep the replacements relevant to German locales in a single place.

But you already have "ß" in the general part - where it's transliterated to "s". However this is wrong - "ß" is always transliterated to "ss". I wasn't able to find ANY reliable sources, that transliterated "ß" => "s", like WP core does.

Nvm, makes sense Latin 1 does indeed do it like this.

3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)

I think this can already be achieved by using switch_to_locale() and restore_previous_locale():

$switched_locale = switch_to_locale( '...' );

$string = remove_accents( $string );

if ( $switched_locale ) {
	restore_previous_locale();
}

The same approach is used when sending emails in user's language instead of site language, without the need for passing any additional parameters to the functions.

Good point, I should have mentioned before (since I was aware of this):
Calling switch_to_locale will load all strings of that locale - this has a major impact on performance and memory-consumption (up to the point of OOM errors, which can be quickly reached if we have a website that serves multiple European markets/languages).

For the given use case of "remove_accents" this is breaking a fly on the wheel (which WP is struggling with in loads of other areas already, don't want to make it even worse)

I think adding an optional $locale parameter would be a non-invasive change, that would not impact performance and not break anything either. Additionally it would make any user directly aware, that remove_accents is a locale aware function, where locale decides what "result" you get.

Last edited 2 months ago by malthert (previous) (diff)

#11 @malthert
2 months ago

  • Keywords reporter-feedback removed

Updated the pull request now to accomodate your suggestions @SergeyBiryukov

Note: See TracTickets for help on using tickets.