Make WordPress Core

Opened 15 years ago

Closed 9 years ago

#3782 closed enhancement (fixed)

Use Digraphs for Umlauts in German Permalinks

Reported by: transalpin Owned by:
Milestone: 3.6 Priority: normal
Severity: normal Version: 2.1
Component: I18N Keywords: has-patch commit
Focuses: Cc:


There’s this pretty useful plugin:

To quote from the author’s site:

In German, the convention is to transliterate umlauts into two letters (ä=ae, ö=oe, ü=ue). The CVS code has some two-character conversions, so I added the umlauts to that list in my plugin, and the result was schroeder-sucks, which is just what I wanted.

I wouldn’t want Schröder to suck, but I think that it shouldn’t take a plugin to fix this bug. There are more characters that are wrongly sanitized (e.g., å = aa, æ = ae, …)

(Note that in Spanish and French, the ü is not transliterated in this matter. Not sure about Turkish. Maybe WordPress should check the locale before creating the post slug)

Attachments (6)

3782.patch (672 bytes) - added by SergeyBiryukov 9 years ago.
3782.2.patch (906 bytes) - added by ocean90 9 years ago.
3782.2-unit-test.patch (649 bytes) - added by ocean90 9 years ago.
3782.3.patch (733 bytes) - added by SergeyBiryukov 9 years ago.
3782.4.patch (719 bytes) - added by SergeyBiryukov 9 years ago.
3782.5.patch (742 bytes) - added by SergeyBiryukov 9 years ago.

Download all attachments as: .zip

Change History (26)

#1 @link92
15 years ago

Another option would be to use IRIs, and then normalise them to URIs for compatibility with UAs that don't support IRIs.

#2 @Nazgul
15 years ago

  • Type changed from defect to enhancement

I think this should stay a plugin, because you stated yourself that there's differences between languages/regions on how to interpret/translate these characters.

We can't assume everybody wants to use the German convention.

#3 @rob1n
15 years ago

  • Resolution set to wontfix
  • Status changed from new to closed

Yeah, this is definitely plugin material. Adding locale-specific checks (and maintaining them !!) would be waaay too much to put into the core.

#4 @foolswisdom
15 years ago

  • Milestone 2.2 deleted

#5 @Bernhard Reiter
9 years ago

  • Cc Bernhard Reiter added
  • Resolution wontfix deleted
  • Status changed from closed to reopened

Now that Wordpress seems to become more and more i18n/l10n aware, isn't it time to reconsider this? Having an umlaut normalise to only one letter seriously pisses of most German wordpress users I know of (including myself) that care about their permalinks, which is maybe something that latin-1-charset-language developers don't realise.

#6 @nacin
9 years ago

  • Milestone set to 3.6

I agree, this is doable by changing the remove_accents() mapping based on the locale.

We need a full list of all affected characters, and what their sanitized characters should be.

#7 @ocean90
9 years ago

  • Cc ocean90 added

#8 @toscho
9 years ago

  • Cc info@… added

9 years ago

#9 @SergeyBiryukov
9 years ago

  • Keywords has-patch added

#10 @SergeyBiryukov
9 years ago

3782.patch handles the characters from WP Permalauts plugin.

#11 @SergeyBiryukov
9 years ago

In 1178/tests:

remove_accents() test for German umlauts. see #3782.

#12 @SergeyBiryukov
9 years ago

In 1179/tests:

Add capital characters to remove_accents() test for German umlauts. see #3782.

#13 @nacin
9 years ago

Let's assign get_locale() to a variable so if/when we add more locale-specific rules, we don't call it additional times, the diff is a bit cleaner, and so the commit here looks like it is written to be extended.

9 years ago

#14 @ocean90
9 years ago

Ä in Äpfel should be Ae. Of course, when all characters are uppercased AE seems fine. But that's usually not the case.

#15 @SergeyBiryukov
9 years ago

In 1180/tests:

Correct the expected results in remove_accents() test for German umlauts. props ocean90. see #3782.

#16 @SergeyBiryukov
9 years ago

  • Keywords commit added

#17 @nacin
9 years ago

Looks quite solid. Three things:

Probably no need to ever call get_locale() if ! seems_utf8().

I wonder if array_merge() is particularly slow or inefficient here, as this isn't a small array. The union operator (+) has a bit less overhead, as might just adding on the individual key/value pairs.

I would tend to agree that A should become Ae, not AE. remove_accents() does follow the same uppercase pattern elsewhere in the file. Since this is a language-specific rule, I am not too worried about the inconsistency. Note the only other usage of remove_accents() beyond permalinks, is sanitize_user(), which does not enforce strtolower (it used to in multisite).

#18 @SergeyBiryukov
9 years ago

Performed a few tests with 20,000 iterations of remove_accents().

Without the patch10.290s
array_merge() (3782.3.patch)14.624s
Union operator (+) (3782.4.patch)11.342s
Direct assignment (3782.5.patch)10.619s

So, 3782.5.patch shows the best performance.

#19 @nacin
9 years ago

Good to go.

#20 @SergeyBiryukov
9 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 23361:

Use digraphs for German umlauts in remove_accents(). props SergeyBiryukov, ocean90. fixes #3782.

Note: See TracTickets for help on using tickets.