#54415 closed defect (bug) (fixed)
remove_accent does not remove accents
Reported by: | malthert | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | 6.0 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Formatting | Keywords: | has-patch |
Focuses: | Cc: |
Description
remove_accents function removes some special characters only conditionally.
There is a condition like:
<?php if ( in_array( $locale, array( 'de_DE', 'de_DE_formal', 'de_CH', 'de_CH_informal', 'de_AT' ), true ) ) {
Why would some accents only be removed conditionally based on the current language?
This makes zero sense and causes unexpected results in general, but especially when using one of the locales that should be there but is not, e.g. de_AT_formal/informal or sv_SE (only da_DK is there even though the characters are used in Swedish just the same)
Change History (15)
This ticket was mentioned in PR #1860 on WordPress/wordpress-develop by kkmuffme.
3 years ago
#2
- Keywords has-patch added
Trac ticket: https://core.trac.wordpress.org/ticket/54415
#3
@
3 years ago
@johnbillion added PR
There are a variety of issues we encounter:
- we use sv_SE locale, where we'd expect same accents to be removed as for da_DK locale
- we use de_AT_formal (copy of de_DE_formal) which is not in this list
- we use remove_accents in general for search that does not support special/diacritics - e.g. we sell Würth products (screws,...) and if somebody on our english page searches for Würth the search does not return anything, because get_locale returns "en_US" in this case and "ü" is not changed to "ue"
#4
in reply to:
↑ description
;
follow-up:
↓ 6
@
3 years ago
Hi there, thanks for the report!
Replying to malthert:
Why would some accents only be removed conditionally based on the current language?
Because the replacement for some characters depends on the locale. For example:
Ä
is replaced withAe
for German, and justA
for other locales.Đ
is replaced withDJ
for Serbian and Bosnian, and justD
for other locales.
This was originally added in [23361] / #3782, there is some discussion on the ticket about the implementation.
The list was further adjusted in [26585], [33027], [37698], [37853], [38646], [39939], [49967].
This makes zero sense and causes unexpected results in general, but especially when using one of the locales that should be there but is not, e.g. de_AT_formal/informal or sv_SE (only da_DK is there even though the characters are used in Swedish just the same)
I think we should add those locales to the list, instead of removing the conditions altogether.
#5
@
3 years ago
only da_DK is there even though the characters are used in Swedish just the same
Not entirely correct. Danish and Norwegian use the exact same alphabet, but Swedish use ä
and ö
instead of æ
and ø
.
#6
in reply to:
↑ 4
;
follow-up:
↓ 9
@
3 years ago
Replying to SergeyBiryukov:
Thanks for the feedback!
To sum this up, I will change the following things:
We need to keep the language distinction, since some languages have them transliterated differently.
1) add the missing locales to the existing function
or: remove _formal/_informal from locale string and just check for de_DE, de_AT, de_CH so all those cases are covered
=> best: I think we should just use the first 2 letters of the locale here, so it will be consistent for all "de" locale, as this is what is originally intended here.
2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss
3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)
so developers can pass a locale they want to use for this use case instead of the locale of the current installation.
If that sounds good for you, I will update the PR
#7
follow-up:
↓ 8
@
3 years ago
- Severity changed from major to normal
locales that should be there but is not, e.g. de_AT_formal/informal
These locales don't exist so they are not part of the list.
#8
in reply to:
↑ 7
@
3 years ago
Replying to ocean90:
locales that should be there but is not, e.g. de_AT_formal/informal
These locales don't exist so they are not part of the list.
They do exist, WP just doesn't have a native translation for it yet.
Why do you think there is no distinction between formal/informal in Austrian German?
#9
in reply to:
↑ 6
;
follow-up:
↓ 10
@
3 years ago
Replying to malthert:
1) add the missing locales to the existing function
or: remove _formal/_informal from locale string and just check for de_DE, de_AT, de_CH so all those cases are covered
=> best: I think we should just use the first 2 letters of the locale here, so it will be consistent for all "de" locale, as this is what is originally intended here.
Sounds good to me, let's just check for any de_*
locale.
2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss
I think I'd prefer for this to be added to the conditial too, for clarity and consistency, to keep the replacements relevant to German locales in a single place.
3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)
I think this can already be achieved by using switch_to_locale()
and restore_previous_locale()
:
$switched_locale = switch_to_locale( '...' ); $string = remove_accents( $string ); if ( $switched_locale ) { restore_previous_locale(); }
The same approach is used when sending emails in user's language instead of site language, without the need for passing any additional parameters to the functions.
#10
in reply to:
↑ 9
@
3 years ago
Replying to SergeyBiryukov:
Sounds good to me, let's just check for any de_* locale.
Ok, will check for "de" locale (since "de" without trailing underscore is technically valid too, and there are no 3 letter locales starting with "de" so it's safe in this regard too https://www.localeplanet.com/icu/)
2) the letter "ß" is only used in German languages anyway and there is no transliteration of ß => s to be found anywhere.
This will be removed from the conditional and update the "general" transliteration of ß to ss
I think I'd prefer for this to be added to the conditial too, for clarity and consistency, to keep the replacements relevant to German locales in a single place.
But you already have "ß" in the general part - where it's transliterated to "s". However this is wrong - "ß" is always transliterated to "ss". I wasn't able to find ANY reliable sources, that transliterated "ß" => "s", like WP core does.
Nvm, makes sense Latin 1 does indeed do it like this.
3) add a 2nd arg to the function remove_accents: $locale = false (@param string $locale optionally set a locale if you do not want to use the locale of the current language)
I think this can already be achieved by using
switch_to_locale()
andrestore_previous_locale()
:
$switched_locale = switch_to_locale( '...' ); $string = remove_accents( $string ); if ( $switched_locale ) { restore_previous_locale(); }The same approach is used when sending emails in user's language instead of site language, without the need for passing any additional parameters to the functions.
Good point, I should have mentioned before (since I was aware of this):
Calling switch_to_locale will load all strings of that locale - this has a major impact on performance and memory-consumption (up to the point of OOM errors, which can be quickly reached if we have a website that serves multiple European markets/languages).
For the given use case of "remove_accents" this is breaking a fly on the wheel (which WP is struggling with in loads of other areas already, don't want to make it even worse)
I think adding an optional $locale parameter would be a non-invasive change, that would not impact performance and not break anything either. Additionally it would make any user directly aware, that remove_accents is a locale aware function, where locale decides what "result" you get.
#11
@
3 years ago
- Keywords reporter-feedback removed
Updated the pull request now to accomodate your suggestions @SergeyBiryukov
#13
@
3 years ago
- Owner set to SergeyBiryukov
- Resolution set to fixed
- Status changed from new to closed
In 52809:
SergeyBiryukov commented on PR #1860:
3 years ago
#15
Thanks for the PR! Merged in https://core.trac.wordpress.org/changeset/52809.
Thanks for the report @malthert . What specific problems are you seeing? And which language(s) are you using?