Make WordPress Core

Opened 7 years ago

Closed 15 months ago

#3782 closed enhancement (fixed)

Use Digraphs for Umlauts in German Permalinks

Reported by: transalpin Owned by:
Milestone: 3.6 Priority: normal
Severity: normal Version: 2.1
Component: I18N Keywords: has-patch commit
Focuses: Cc:


There’s this pretty useful plugin:

To quote from the author’s site:

In German, the convention is to transliterate umlauts into two letters (ä=ae, ö=oe, ü=ue). The CVS code has some two-character conversions, so I added the umlauts to that list in my plugin, and the result was schroeder-sucks, which is just what I wanted.

I wouldn’t want Schröder to suck, but I think that it shouldn’t take a plugin to fix this bug. There are more characters that are wrongly sanitized (e.g., å = aa, æ = ae, …)

(Note that in Spanish and French, the ü is not transliterated in this matter. Not sure about Turkish. Maybe WordPress should check the locale before creating the post slug)

Attachments (6)

3782.patch (672 bytes) - added by SergeyBiryukov 16 months ago.
3782.2.patch (906 bytes) - added by ocean90 16 months ago.
3782.2-unit-test.patch (649 bytes) - added by ocean90 16 months ago.
3782.3.patch (733 bytes) - added by SergeyBiryukov 15 months ago.
3782.4.patch (719 bytes) - added by SergeyBiryukov 15 months ago.
3782.5.patch (742 bytes) - added by SergeyBiryukov 15 months ago.

Download all attachments as: .zip

Change History (26)

comment:1 link927 years ago

Another option would be to use IRIs, and then normalise them to URIs for compatibility with UAs that don't support IRIs.

comment:2 Nazgul7 years ago

  • Type changed from defect to enhancement

I think this should stay a plugin, because you stated yourself that there's differences between languages/regions on how to interpret/translate these characters.

We can't assume everybody wants to use the German convention.

comment:3 rob1n7 years ago

  • Resolution set to wontfix
  • Status changed from new to closed

Yeah, this is definitely plugin material. Adding locale-specific checks (and maintaining them !!) would be waaay too much to put into the core.

comment:4 foolswisdom7 years ago

  • Milestone 2.2 deleted

comment:5 Bernhard Reiter16 months ago

  • Cc Bernhard Reiter added
  • Resolution wontfix deleted
  • Status changed from closed to reopened

Now that Wordpress seems to become more and more i18n/l10n aware, isn't it time to reconsider this? Having an umlaut normalise to only one letter seriously pisses of most German wordpress users I know of (including myself) that care about their permalinks, which is maybe something that latin-1-charset-language developers don't realise.

comment:6 nacin16 months ago

  • Milestone set to 3.6

I agree, this is doable by changing the remove_accents() mapping based on the locale.

We need a full list of all affected characters, and what their sanitized characters should be.

comment:7 ocean9016 months ago

  • Cc ocean90 added

comment:8 toscho16 months ago

  • Cc info@… added

SergeyBiryukov16 months ago

comment:9 SergeyBiryukov16 months ago

  • Keywords has-patch added

comment:10 SergeyBiryukov16 months ago

3782.patch handles the characters from WP Permalauts plugin.

comment:11 SergeyBiryukov16 months ago

In 1178/tests:

remove_accents() test for German umlauts. see #3782.

comment:12 SergeyBiryukov16 months ago

In 1179/tests:

Add capital characters to remove_accents() test for German umlauts. see #3782.

comment:13 nacin16 months ago

Let's assign get_locale() to a variable so if/when we add more locale-specific rules, we don't call it additional times, the diff is a bit cleaner, and so the commit here looks like it is written to be extended.

ocean9016 months ago

comment:14 ocean9016 months ago

Ä in Äpfel should be Ae. Of course, when all characters are uppercased AE seems fine. But that's usually not the case.

ocean9016 months ago

comment:15 SergeyBiryukov16 months ago

In 1180/tests:

Correct the expected results in remove_accents() test for German umlauts. props ocean90. see #3782.

comment:16 SergeyBiryukov15 months ago

  • Keywords commit added

comment:17 nacin15 months ago

Looks quite solid. Three things:

Probably no need to ever call get_locale() if ! seems_utf8().

I wonder if array_merge() is particularly slow or inefficient here, as this isn't a small array. The union operator (+) has a bit less overhead, as might just adding on the individual key/value pairs.

I would tend to agree that A should become Ae, not AE. remove_accents() does follow the same uppercase pattern elsewhere in the file. Since this is a language-specific rule, I am not too worried about the inconsistency. Note the only other usage of remove_accents() beyond permalinks, is sanitize_user(), which does not enforce strtolower (it used to in multisite).

SergeyBiryukov15 months ago

SergeyBiryukov15 months ago

SergeyBiryukov15 months ago

comment:18 SergeyBiryukov15 months ago

Performed a few tests with 20,000 iterations of remove_accents().

Without the patch10.290s
array_merge() (3782.3.patch)14.624s
Union operator (+) (3782.4.patch)11.342s
Direct assignment (3782.5.patch)10.619s

So, 3782.5.patch shows the best performance.

comment:19 nacin15 months ago

Good to go.

comment:20 SergeyBiryukov15 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 23361:

Use digraphs for German umlauts in remove_accents(). props SergeyBiryukov, ocean90. fixes #3782.

Note: See TracTickets for help on using tickets.