WordPress.org

Make WordPress Core

Opened 2 years ago

Closed 16 months ago

#23907 closed defect (bug) (fixed)

Scandinavian ligatures transcribed wrong in remove_accents()

Reported by: dnusim
Owned by: nacin
Milestone: 3.8
Priority: normal
Severity: minor
Version: 3.6
Component: I18N
Keywords: has-patch needs-testing 3.8-early
Focuses:
Cc:

Description

remove_accents() transcribes the old Scandinavian ligatures incorrectly: ÆØÅøå should be transcribed as 'Ae', 'Oe', 'Aa', 'oe' and 'aa' respectively.

The transcription rules are the same in Swedish, Danish and Norwegian. In trunk the characters are transcribed as 'AE', 'O', 'A', 'o', 'a'.
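
The difference between the two behaviours can be sketched in a few lines. This is only a Python illustration of the mappings quoted above, not the PHP code of remove_accents(); the names `TRUNK`, `PROPOSED` and `transliterate` are hypothetical, and the tables cover only the five characters listed in this description.

```python
# Illustrative sketch only; this is not WordPress's PHP implementation.
# TRUNK and PROPOSED are hypothetical names for the two mappings quoted above.
TRUNK = {"Æ": "AE", "Ø": "O", "Å": "A", "ø": "o", "å": "a"}
PROPOSED = {"Æ": "Ae", "Ø": "Oe", "Å": "Aa", "ø": "oe", "å": "aa"}


def transliterate(text, table):
    """Replace each character found in `table`; leave all others unchanged."""
    return "".join(table.get(ch, ch) for ch in text)
```

With these tables, `transliterate("Knut Ås", TRUNK)` gives "Knut As", while the proposed table gives "Knut Aas".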

See #4739, #9591 for history.

Attachments (8)

23907.patch (1.9 KB) - added by dnusim 2 years ago.
23907.2.patch (626 bytes) - added by tlamedia 18 months ago.
23907.3.diff (2.2 KB) - added by atimmer 18 months ago.
23907.4.diff (731 bytes) - added by tlamedia 18 months ago.
23907.4.unit-test.diff (1.9 KB) - added by tlamedia 18 months ago.
23907.5.diff (2.7 KB) - added by atimmer 18 months ago.
23907.6.diff (2.3 KB) - added by tlamedia 16 months ago.
23907.7.diff (1.4 KB) - added by tlamedia 16 months ago.


Change History (51)

@dnusim 2 years ago

comment:1 @dnusim 2 years ago

  • Keywords has-patch added

comment:2 @SergeyBiryukov 2 years ago

  • Component changed from Permalinks to I18N
  • Milestone changed from Awaiting Review to 3.6

comment:3 @knutsp 2 years ago

Æ (æ), Ø (ø) and Å (å) are letters in the Norwegian and Danish alphabets (Å is also in the Finnish alphabet). They are not ligatures in these languages.

The letter Ø (ø) does not exist in the Swedish alphabet. Swedish (and Finnish) uses Ö (ö), which is a separate letter written as an accented character. Swedish has Ä, Ö and Å in its alphabet.

I would prefer these transcriptions:
Æ -> ae (like a ligature)
Ø -> o (like for ö, not to oe)
Å -> a (not aa)

Transcribing to a two letter sequence, like for a true ligature, gives longer strings and no better readability, imo. (Norwegian speaking)

comment:4 @dnusim 2 years ago

Yes, these characters are letters in their own right (though they have developed as semi-ligatures). Precisely because Å is a letter in its own right, and not simply 'A' with a diacritical mark, it makes more sense to transliterate it to 'aa' instead of 'a'. The same goes for 'Ø': since it is a character by itself, and not merely an 'O' with a dash through it, it should be transliterated properly, that is, to 'oe'.

Correct Scandinavian alphabetization would treat 'aa' like 'å', because they denote the same letter, so that "Knut Ås" and "Knut Aass" would appear next to each other in places like phonebooks. (It is perhaps more common in Denmark than in Norway to use old-fashioned spelling of names with 'aa' in them.)

The English Wikipedia has a couple of sections on the subject in its article on the letter Å.
Airlines recommend transliterating to 'aa' and 'oe' in this article (Norwegian).

(Norwegian speaking myself, btw.)

comment:5 @markjaquith 22 months ago

  • Milestone changed from 3.6 to Future Release

comment:6 @SergeyBiryukov 22 months ago

  • Keywords 3.7-early added

comment:7 @wonderboymusic 20 months ago

  • Milestone changed from Future Release to 3.7

these are all marked 3.7-early

comment:8 @tlamedia 20 months ago

In the Danish language the characters are always transcribed as follows:
Æ -> ae
Ø -> oe
Å -> aa

Google and Bing recognize these transcriptions.

(I'm Danish speaking)

comment:9 @nacin 18 months ago

  • Milestone changed from 3.7 to Future Release

Seems like there is not an agreement here. We should toss it to the translation teams as well as seek additional input from Scandinavian users and developers.

comment:10 @knutsp 18 months ago

There is no disagreement on what the standard transcription rules are.

I would just prefer the current way, "Nærbøås" to "naerboas", over "naerboeaas", because I haven't seen any problems with that.

But if search engines recognize this standard better than the current and simpler WordPress way, then I'll also go for it.

The Scandinavian lobby at WordCamp Europe could have a chat about it over the weekend, and report back here. I'll be there, too. I'm quite sure nobody wants this ticket lying around unresolved because of minor differences in preferences.

Last edited 18 months ago by knutsp

comment:11 @knutsp 18 months ago

Due to communication problems at WCE, we didn't manage to have talks about this question. :-(

Currently, in WordPress, ä becomes a, æ becomes ae, ø and ö become o and å becomes a.

ä is the Swedish equivalent of Danish/Norwegian æ, and Swedish ö is the equivalent of Danish/Norwegian ø. Ø is an accented/modified O, where the accent/modifier is "/"; å is an accented a where the special accent is a small ring above; and æ may be seen as having a ligature of ae as its origin, even if it's an alphabetic character in its own right in Danish/Norwegian. Likewise, å, ä and ö are part of the Swedish alphabet.

Changing how ö transliterates will affect not only Swedish, but other languages, too. If we don't want that changed, then ø should not be transliterated to anything else either.

WordPress tradition here is to just remove all accents, as the name of the function in question, remove_accents(), indicates. This is the simplest way to make a normalized post_name and URL. Hence, the Swedish truly accented characters should (still) be treated like that. The corresponding Norwegian/Danish characters should transliterate to the same base as the corresponding Swedish ones, even if "/" and "small ring above" are not always regarded as true accents, or modifiers.

I have tried to find some documentation showing that the official standard transliterations (ae, oe, aa) have an advantage for SEO, but I haven't found anything that indicates this. Correct me if I'm wrong and please point to some documents, research or views, if there are any. I know other CMSes just replace all non-ASCII characters with either nothing, an underscore or a hyphen, and that is bad, both for SEO and readability.

Transliterating with semi-unique character couples (oe, aa) may be useful when having to make a reverse transliteration, and that is the case for the use of this standard on things like passenger names on airline tickets, or more generally when forcing Scandinavian names and words into (old) computer systems/software only supporting ASCII. For URLs a correct reverse is not that important, I think. And I think readability doesn't suffer either.
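
The reverse-transliteration point can be made concrete with a small sketch. It is written in Python for illustration and is not part of WordPress; `FORWARD`, `to_ascii` and `from_ascii` are hypothetical names, and only lowercase æ, ø and å are handled. It shows why the digraphs are only "semi-unique": a round trip works for many words, but a word that naturally contains "oe" is reversed incorrectly.

```python
# Illustrative sketch only; the names below are hypothetical, not a WordPress API.
FORWARD = {"æ": "ae", "ø": "oe", "å": "aa"}


def to_ascii(text):
    """Transliterate æ, ø and å to the digraphs ae, oe and aa."""
    return "".join(FORWARD.get(ch, ch) for ch in text)


def from_ascii(text):
    """Naive reverse: turn every ae, oe and aa back into a single letter."""
    for letter, digraph in FORWARD.items():
        text = text.replace(digraph, letter)
    return text
```

Here `from_ascii(to_ascii("blåbærsyltetøy"))` restores the original word, whereas "poesi" (which contains no special letters) comes back as "pøsi". Stripping to the base letter (å to a) cannot be reversed at all, since the result is indistinguishable from an ordinary a.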

So my suggestion for WordPress, and accented Latin characters in general, is: just remove any accent or modifier and transliterate to the basic character (or characters, in case of a ligature origin).

This means I quite strongly suggest a wontfix for this ticket, letting æ still be transliterated to ae, and the other ones to their base, as they are now, plain and simple.

comment:12 follow-up: @solsikkehaven 18 months ago

In Danish, convention (normal practice) is:

æ - Æ -> ae
ø - Ø -> oe
å - Å -> aa

so I agree with thread starter, that

ÆØÅæøå should be transcribed as 'Ae', 'Oe', 'Aa', 'ae', 'oe' and 'aa' respectively.

Last edited 18 months ago by solsikkehaven

comment:13 in reply to: ↑ 12 @Compute 18 months ago

+1 for:

æ - Æ -> ae
ø - Ø -> oe
å - Å -> aa

Without a doubt the correct way to transliterate those characters in Danish, but I think it's different for the other Scandinavian countries as they use different letters.

Last edited 18 months ago by Compute

comment:14 @tlamedia 18 months ago

If you do a search on Bing.dk that contains æ, ø or å, Bing will match ae, oe and aa in the URL and emphasize them with a bold font. So if you search for æblegrød, Bing will look for aeblegroed in the URL and emphasize it with a bold font on the result page. Google only matches å to aa on the result page.

Keyword match in the URL counts as 1% in the Google algorithm and even though it's not a lot, best SEO practice is to rewrite to ae, oe, aa. I think all Danish SEOs would agree that it's best to use ae, oe, aa.

I doubt that there is a transcription that will make all Scandinavians happy, and I suggest that we make different transcriptions based on the WPLANG constant. We can either do that by implementing a hook at the beginning of the remove_accents function that can be used by plugins, or insert a simple check for WPLANG=='da_DK' and transcribe æ, ø, å if we are on a Danish site.
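
The locale check suggested above could look roughly like the following. This is a hedged Python sketch of the idea, not the patch attached to this ticket and not WordPress code: `BASE`, `LOCALE_OVERRIDES` and `remove_accents_sketch` are hypothetical names, the `locale` argument stands in for the WPLANG constant, and the base table contains only the six characters discussed here with the current behaviour described in this ticket.

```python
# Illustrative sketch only; names and structure are assumptions, not WordPress code.
# BASE mirrors the current behaviour described in this ticket for these characters.
BASE = {"Æ": "AE", "æ": "ae", "Ø": "O", "ø": "o", "Å": "A", "å": "a"}

# Per-locale overrides applied on top of BASE (here: Danish only).
LOCALE_OVERRIDES = {
    "da_DK": {"Æ": "Ae", "æ": "ae", "Ø": "Oe", "ø": "oe", "Å": "Aa", "å": "aa"},
}


def remove_accents_sketch(text, locale=""):
    """Transliterate using BASE, adjusted by any overrides for `locale`."""
    table = dict(BASE)
    table.update(LOCALE_OVERRIDES.get(locale, {}))
    return "".join(table.get(ch, ch) for ch in text)
```

Under these assumptions "æblegrød" becomes "aeblegroed" for da_DK and "aeblegrod" for any other locale, so sites in other languages keep their existing slugs.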

comment:15 follow-up: @dnusim 18 months ago

-1 to fixing this for Danish only. The use of these characters is exactly the same in Norwegian, and, as far as I know, in every other language using these characters as well.

In the Swedish Wikipedia article about the letter Å, it says that Å is transliterated to Aa in Swedish passports. It would be good if a Swede could comment; so far we've only heard from Danes and Norwegians.

If I understand you correctly, knutsp, everyone agrees on what is the correct way to transliterate æøå in Danish and Norwegian (though you find the correct way to be impractical).

Last edited 18 months ago by dnusim

comment:16 in reply to: ↑ 15 ; follow-up: @knutsp 18 months ago

Replying to dnusim:

If I understand you correctly, knutsp, everyone agrees on what is the correct way to transliterate æøå in Danish and Norwegian (though you find the correct way to be impractical).

I find that the "correct", "official" way to transliterate consists of rules made for situations other than what remove_accents() is used for in WordPress.

I also expect that remove_accents() should do just that, when possible. ø and å have origins as accented characters with special accent types.

When you search the internet for words, search engines will find your content based on titles and content. There is rarely a need to go the other way, recovering a correct word from a URL path. Removing accents is a way to make characters safe to be used in URLs.

Fixing this language specific is not an option. Introducing a filter, making everybody happy, is the WordPress way. Easy to agree on that.

What this ticket is about is how WordPress should transliterate these characters by default, no matter what language is set. Special SEO considerations for each language, or personal preferences, should be handled through plugins (filters).

Swedish users should weigh in on å.

Norwegians should weigh in on all three, as Danish and Norwegian share these characters and have exactly the same alphabet.

The few people I have talked to about this have all been quite clear on not wanting to change the current transliterations, and react to possible positive SEO effects in this case by shaking their heads. I don't know if the "correct" transliterations are more hated/unwanted in Norway than in Denmark, but it may be the case. Sad, if so, because we should be able to reach a consensus based on factual arguments, not feelings.

If positive, consistent and language-independent SEO impacts of a change can be demonstrated, I am willing to reconsider, at least if there is a filter that will let me keep the old way. But I think that SEO considerations that are more debatable, or language specific, should be handled by plugins.
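
The filter idea mentioned in this comment can be sketched as follows. The sketch is in Python and only mimics the general shape of a filter hook; it is not WordPress's filter API, and `add_chars_filter`, `remove_accents_with_filters` and `keep_digraphs` are hypothetical names introduced for illustration. The default table holds only the three lowercase characters and their current transliterations as described in this ticket.

```python
# Illustrative sketch only; this imitates a filter hook and is not WordPress's API.
_filters = []  # callbacks that may adjust the character table


def add_chars_filter(callback):
    """Register a callback taking (table, locale) and returning a table."""
    _filters.append(callback)


def remove_accents_with_filters(text, locale=""):
    """Transliterate with a default table after passing it through all filters."""
    table = {"æ": "ae", "ø": "o", "å": "a"}  # defaults: current behaviour
    for callback in _filters:
        table = callback(dict(table), locale)
    return "".join(table.get(ch, ch) for ch in text)


def keep_digraphs(table, locale):
    """Example plugin callback: prefer oe and aa regardless of locale."""
    table.update({"ø": "oe", "å": "aa"})
    return table
```

With no filter registered, "øl" becomes "ol"; after `add_chars_filter(keep_digraphs)` it becomes "oel". The default stays the same for everyone, and a site that wants the other convention opts in through a plugin.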

comment:17 follow-up: @dnusim 18 months ago

ø and å have origins as accented characters with special accent types.

Then this seems to be where we disagree. Though it may be a somewhat common misconception, ø and å don't have origins as accented characters. These characters' origins are as ligatures.

You are obviously right in that more people should weigh in.

Edit: Also, the readability of the URL is what makes it a pretty one. Being able to differentiate between words like OL and øl would be an advantage.

Last edited 18 months ago by dnusim

comment:18 in reply to: ↑ 16 ; follow-up: @SergeyBiryukov 18 months ago

Replying to knutsp:

Fixing this language specific is not an option.

That might actually be an option, see the precedent in #3782.

comment:19 in reply to: ↑ 17 @knutsp 18 months ago

Replying to dnusim:

ø and å have origins as accented characters with special accent types.

Then this seems to be where we disagree. Though it may be a somewhat common misconception, ø and å don't have origins as accented characters. These characters' origins are as ligatures.

My bad. The historical origins are a special kind of ligatures, at least the å as aa-ligature. For ø the "oe" origin seems a little more speculative. I don't know, and don't care that much. I see the slash and the ring.

What I meant is that in modern use, very few/nobody sees the ligature origin, but rather a character with a simple stroke of the pen added, the slash and the small ring above. They can, today, in my view, be regarded as accented characters, or at least modified characters. The name for such is diacritics. The origin of the two dots above the o in ö is also a small e.

I don't know if the views here, the historical origins, or the more practical modern functions, should have much to say. WordPress has a practice of removing accents, and has, rightfully or not, chosen to view ø and å as accented, and just removes the accents. And I think that's fine. However, I see that transliterating æ to ae is then a bit inconsistent. I defend that also being kept as is, just because the ligature origin is more clearly visible in this case, and we avoid having both å and æ transliterate to the same character.

I have pointed out in my first comment here that use of the word "ligatures" in the summary of this ticket is wrong. These characters are not to be regarded as ligatures today, even if their history and origin is such. This is because they are regarded as characters in their own right, in the Scandinavian alphabets. (But even non-Scandinavians can see that æ still looks like a ligature)

The Swedes invented å centuries ago, but it took a long time to be official. Norwegian adopted it early, as part of constructing a Norwegian written language in the late 19th century. Danish only accepted it officially as late as 1948, but official Danish names still use "Aa" in some cases (city of Aarhus). This may be why it seems Norwegians and Danes differ in this matter.

I mention this because understanding why we differ may be the key to resolving this, in the absence of convincing arguments.

comment:20 in reply to: ↑ 18 @dnusim 18 months ago

Replying to SergeyBiryukov:

Replying to knutsp:

Fixing this language specific is not an option.

That might actually be an option, see the precedent in #3782.

So we're already doing it right in German? That's great! The German ö corresponds to the Norwegian and Danish ø (and also to the Swedish ö, for that matter).

Responding to knutsp:

This may be why it seems Norwegians and Danes differ in this matter.

Note that I'm Norwegian, and still agree with the Danish commenters :)

comment:21 @atimmer 18 months ago

  • Cc atimmermans@… added
  • Keywords needs-unit-tests added

@tlamedia 18 months ago

comment:22 follow-up: @theorboman 18 months ago

I'm not Swedish, but I live and work in Sweden, I speak Swedish fluently and I've spoken with several of my Swedish colleagues regarding this. The general consensus is that although it's not uncommon for Swedish vowels to be replaced by a single letter, that is to say å becomes simply a, it is more correct that we follow the suggested pattern above for Norwegian and Danish transliterations, so å becomes aa.

The best argument for this is to differentiate from words that are spelt with a or å (or any other Swedish vowel for that matter).

As an example, far in Swedish means 'Father' but får means 'sheep'. If we replace å with a we can't differentiate between the two (which might make for awkward family gatherings).
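
The collision described here is easy to demonstrate. The following Python sketch is illustrative only and not WordPress code; `STRIP`, `DIGRAPH`, `slug` and `collisions` are hypothetical names, and the tables cover just the letter å.

```python
from collections import defaultdict

# Illustrative sketch only; names are hypothetical, not a WordPress API.
STRIP = {"å": "a"}     # remove the ring: får -> far
DIGRAPH = {"å": "aa"}  # digraph convention: får -> faar


def slug(word, table):
    """Lowercase the word and transliterate it with `table`."""
    return "".join(table.get(ch, ch) for ch in word.lower())


def collisions(words, table):
    """Return each slug shared by more than one word, with those words."""
    groups = defaultdict(list)
    for word in words:
        groups[slug(word, table)].append(word)
    return {s: ws for s, ws in groups.items() if len(ws) > 1}
```

`collisions(["far", "får"], STRIP)` reports that both words map to "far", while the same call with `DIGRAPH` reports no collision.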

@atimmer 18 months ago

comment:23 follow-ups: @atimmer 18 months ago

  • Keywords needs-testing added; needs-unit-tests removed

I think we should do this, WordPress should respect the meaning of characters in different languages.

23907.3.diff adds tests based on the topic starter. It also includes the second patch with norse and swedish added. I also changed AA to Aa.

comment:24 @SergeyBiryukov 18 months ago

  • Milestone changed from Future Release to 3.8

comment:25 @tlamedia 18 months ago

I know that all Danes will be happy with this solution. Any objections from the Norwegians or Swedes?

comment:26 in reply to: ↑ 22 @knutsp 18 months ago

Replying to theorboman:

The best argument for this is to differentiate from words that are spelt with a or å (or any other Swedish vowel for that matter).

Why do we need to differentiate? To me this is about making valid URLs, mostly.

comment:27 in reply to: ↑ 23 ; follow-up: @knutsp 18 months ago

Replying to atimmer:

23907.3.diff adds tests based on the topic starter. It also includes the second patch with norse and swedish added. I also changed AA to Aa.

"Norse" language? locale "no_NO"? Never heard of this language or code.

We have two forms of Norwegian:

nb_NO for Norwegian bokmål
nn_NO for Norwegian nynorsk

I also miss the unit test for æ

comment:28 in reply to: ↑ 23 @knutsp 18 months ago

(duplicate comment)

Last edited 18 months ago by knutsp

@tlamedia 18 months ago

comment:29 @tlamedia 18 months ago

  • Keywords needs-unit-tests added

I have corrected the Norwegian locales.

The unit tests should also test for æ

comment:30 @tlamedia 18 months ago

  • Cc tlamedia added
  • Keywords needs-unit-tests removed

The unit test now includes both Norwegian bokmål and Norwegian nynorsk. It is also testing æ > ae

comment:31 @knutsp 18 months ago

I don't like locale-specific substitutions when not absolutely necessary. The same substitutions should be made for ø (as for æ) regardless of the set locale code. This ø character belongs solely to Danish and Norwegian, and its presence always indicates that the word (name) it is used in is a Norwegian or Danish word/name.

When we talk about å, this may be treated differently between non-Scandinavian languages (e.g. Skolt Sami, Chamorro and Walloon) and Scandinavian ones. This would then have to be locale specific so as not to interfere with non-Scandinavian uses.

æ is not debated. It still looks like a ligature, as the origin. Currently æ -> ae.

And if a filter is added for the $chars array then I could even accept this change.

Last edited 18 months ago by knutsp

@atimmer 18 months ago

comment:32 in reply to: ↑ 27 @atimmer 18 months ago

  • Keywords 3.8-early added; 3.7-early removed

Replying to knutsp:

Replying to atimmer:

23907.3.diff adds tests based on the topic starter. It also includes the second patch with norse and swedish added. I also changed AA to Aa.

"Norse" language? locale "no_NO"? Never heard of this language or code.

We have two forms of Norwegian:

nb_NO for Norwegian bokmål
nn_NO for Norwegian nynorsk

I also miss the unit test for æ

Whoops, should have done my research better.

23907.5.diff is a combined patch.

comment:33 follow-up: @Anderton 16 months ago

The recommendation from the membership organization ”Svenska datatermgruppen” (Apple, IBM, Microsoft, Radio Sweden (like BBC Radio), universities and government organisations, the Language Council of Sweden and Statistics Sweden, to mention a few) regarding use of Swedish ÅÄÖåäö in data is:

Å = A
Ä = A
Ö = O
å = a
ä = a
ö = o

(http://www.datatermgruppen.se/fragor-och-svar.html#f1) (You need to run it thru Google Translate)

It's also the de facto standard in Sweden (I know, born and raised here and have been using computers since 1980). I know for a fact that it is a pattern used by developers working with systems at the Swedish Parliament.

Last edited 16 months ago by Anderton

comment:34 in reply to: ↑ 33 @tlamedia 16 months ago

Replying to Anderton:

The recommendation from the membership organization ”Svenska datatermgruppen” (Apple, IBM, Microsoft, Radio Sweden (like BBC Radio), universities and government organisations, the Language Council of Sweden and Statistics Sweden, to mention a few) regarding use of Swedish ÅÄÖåäö in data is:

Å = A
Ä = A
Ö = O
å = a
ä = a
ö = o

(http://www.datatermgruppen.se/fragor-och-svar.html#f1) (You need to run it thru Google Translate)

It's also the de facto standard in Sweden (I know, born and raised here and have been using computers since 1980). I know for a fact that it is a pattern used by developers working with systems at the Swedish Parliament.

OK, so let's just do the transcription for da_DK, nb_NO and nn_NO

@tlamedia 16 months ago

comment:35 @tlamedia 16 months ago

23907.6.diff is a combined patch with unit test

comment:36 follow-up: @knutsp 16 months ago

In domains, Norwegian municipalities transliterate ø to o. "Løten" has domain loten.kommune.no. This continues even after IDN was introduced.

The Swedish character ö is similar to the Norwegian/Danish ø. When ö becomes o then ø should also become o. If Swedish å should become a then Norwegian/Danish å (exact same character as the Swedish one) should also become a.

ä -> a (Swedish)
æ -> ae (Danish/Norwegian)
ö -> o (Swedish)
ø -> o (Danish/Norwegian)
å -> a (Swedish/Danish/Norwegian all common)

This is still a WONTFIX to me.

comment:37 in reply to: ↑ 36 ; follow-up: @tlamedia 16 months ago

Replying to knutsp:

In domains, Norwegian municipalities transliterate ø to o. "Løten" has domain loten.kommune.no. This continues even after IDN was introduced.

The Swedish character ö is similar to the Norwegian/Danish ø. When ö becomes o then ø should also become o. If Swedish å should become a then Norwegian/Danish å (exact same character as the Swedish one) should also become a.

ä -> a (Swedish)
æ -> ae (Danish/Norwegian)
ö -> o (Swedish)
ø -> o (Danish/Norwegian)
å -> a (Swedish/Danish/Norwegian all common)

This is still a WONTFIX to me.

So to make everyone happy we need to limit it to just da_DK

@tlamedia 16 months ago

comment:38 in reply to: ↑ 37 @knutsp 16 months ago

Replying to tlamedia:

So to make everyone happy we need to limit it to just da_DK

Seems so, but it's kind of sad, because Norwegian and Danish written languages are basically the same, as is the alphabet.

But if it makes da_DK users happy, go for it and finish this ticket.

comment:39 @tlamedia 16 months ago

23907.7.diff is a combined patch with unit tests and it is now limited to da_DK.

The Danish community is happy with this solution. If there is still disagreement in the Norwegian or Swedish communities I suggest that they discuss it with their translation teams.

For now we have an optimal solution for da_DK

comment:40 @solsikkehaven 16 months ago

->tlamedia: 23907.7.diff looks 100% OK.
We need to have someone commit this patch, as it would be nice to have it included in 3.8.

->knutsp: I can't see why it should make anyone sad, because it does seem that the Danish "normal practice" for converting æøåÆØÅ is different from what you would normally do in Norway.
So agreed, let's make this a Danish-only patch.

comment:41 follow-up: @dnusim 16 months ago

Going through Norwegian voices in this thread, I'm counting 1 strongly for (myself) and 1 strongly against (knutsp) this commit. Two people disagreeing hardly seems like a consensus.

If any other Norwegians are reading this: I would still love for you to chime in, as the two of us obviously disagree on which practice we should follow.

Even though there isn't a consensus for Norwegian, one of the Danish-only patches should still be committed as soon as possible.

comment:42 in reply to: ↑ 41 @tlamedia 16 months ago

Replying to dnusim:

Going through Norwegian voices in this thread, I'm counting 1 strongly for (myself) and 1 strongly against (knutsp) this commit. Two people disagreeing hardly seems like a consensus.

If any other Norwegians are reading this: I would still love for you to chime in, as the two of us obviously disagree on which practice we should follow.

Even though there isn't a consensus for Norwegian, one of the Danish-only patches should still be committed as soon as possible.

I suggest that you open a new ticket for the Norwegian language so we can commit 23907.7 and close this ticket

comment:43 @nacin 16 months ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 26585:

Remove certain accents in the Danish language.

props tlamedia.
fixes #23907.
