Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#33511 closed defect (bug) (fixed)

Bad value for attribute lang on element html

Reported by: Chouby
Owned by: ocean90
Milestone: 4.5
Priority: normal
Severity: normal
Version: 4.3
Component: I18N
Keywords: has-patch commit
Focuses: accessibility
Cc:
PR Number:

Description

An RFC 5646 language tag consists of hyphen-separated ASCII-alphanumeric subtags. The primary subtag identifies a natural language by its shortest ISO 639 language code (e.g. en for English); it is followed by zero or more additional subtags adding precision. The most common additional subtag is a region subtag, which is most commonly a two-letter ISO 3166 country code (e.g. GB for the United Kingdom). IANA maintains a registry of permissible subtags.
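
To make the tag structure concrete, here is a minimal Python sketch of a syntax check. It assumes a simplified grammar (language, optional script, optional region, optional variants, optional private-use sequence) and is not the full RFC 5646 grammar; the names LANGTAG and is_well_formed are made up for illustration. Note that it only checks the shape of a tag: a tag can be well-formed and still be rejected by a validator because a subtag is not in the IANA registry.

```python
import re

# Simplified shape of a language tag. Deliberately omits extlang,
# extensions, and grandfathered tags, so it is only an approximation.
LANGTAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"                  # primary language subtag
    r"(?:-(?P<script>[A-Za-z]{4}))?"                 # optional script, e.g. Hant
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"        # optional region, e.g. GB
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
    r"(?:-[xX](?P<private>(?:-[A-Za-z0-9]{1,8})+))?$"  # optional private use
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the simplified shape above."""
    return LANGTAG.match(tag) is not None


for candidate in ("en", "en-GB", "de_DE_formal", "de-DE-formal", "de-DE-x-formal"):
    print(candidate, is_well_formed(candidate))
```

With this sketch, de-DE-formal passes the shape check because "formal" looks like a variant subtag; it fails validation only because no such variant is registered.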

WP 4.3 introduced two new language packs with locales that do not validate at https://validator.w3.org/:
'de-DE-formal' and 'oci'.

I have no suggestion for de-DE-formal, but we should use the ISO 639-1 language code 'oc' instead of 'oci'.

Then I reviewed all locales at https://translate.wordpress.org/ that use an ISO 639-2 code. Several of them do not validate because an ISO 639-1 code is available:

bel -> be
bre -> br
dzo -> dz
ido -> io
kin -> rw
mri -> mi
roh -> rm
srd -> sc
tuk -> tk
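
The replacements above, plus oci -> oc from earlier in this description, could be captured in a small lookup table. This is only a sketch in Python; the names ISO_639_2_TO_1 and shortest_language_code are hypothetical and the table covers just the codes listed in this ticket, not the whole ISO 639 registry.

```python
# Three-letter codes from this ticket that have a shorter ISO 639-1 form.
ISO_639_2_TO_1 = {
    "bel": "be",
    "bre": "br",
    "dzo": "dz",
    "ido": "io",
    "kin": "rw",
    "mri": "mi",
    "oci": "oc",
    "roh": "rm",
    "srd": "sc",
    "tuk": "tk",
}


def shortest_language_code(code: str) -> str:
    """Return the two-letter code when one is known, else the input unchanged."""
    return ISO_639_2_TO_1.get(code.lower(), code)


print(shortest_language_code("kin"))
print(shortest_language_code("en"))
```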

Finally, 'bal' is misused: it is the code for the Balochi language, not for Catalan (Balear).

Attachments (3)

33511.patch (550 bytes) - added by SergeyBiryukov 4 years ago.
33511.2.patch (906 bytes) - added by SergeyBiryukov 4 years ago.
33511.3.patch (914 bytes) - added by SergeyBiryukov 4 years ago.

Change History (25)

#1 follow-up: @Chouby
4 years ago

I continued my investigations. If I understood BCP 47 correctly, we could use private use subtags. Private use subtags are introduced by an 'x'.

As such, we could keep a different locale for formal German, represented by: de-DE-x-formal
If we need to keep a separate locale for Catalan (Balear), we could use something like: ca-x-ES-IB (ES-IB being the ISO 3166-2 code for the Balearic Islands).

Both new proposed codes validate at https://validator.w3.org/
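
As a rough illustration of the private-use idea, the following Python sketch converts a WordPress-style locale into a tag where the writing-style suffix is moved behind an 'x' singleton. The function name locale_to_private_use_tag and the list of known styles are assumptions made for this example, not anything that exists in WordPress.

```python
def locale_to_private_use_tag(locale: str, known_styles=("formal", "informal")) -> str:
    """Move writing-style parts of a locale into a private-use sequence."""
    parts = locale.split("_")
    # Parts such as "de" and "DE" stay in the standard portion of the tag.
    standard = [p for p in parts if p.lower() not in known_styles]
    # Parts such as "formal" are not registered subtags, so they go after "x".
    private = [p for p in parts if p.lower() in known_styles]
    tag = "-".join(standard)
    if private:
        tag += "-x-" + "-".join(private)
    return tag


print(locale_to_private_use_tag("de_DE_formal"))
print(locale_to_private_use_tag("nl_NL"))
```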

#2 @samuelsidler
4 years ago

This is... fun. :)

In general, I think we should continue to use the ISO 639-3 codes for our subdomains and internal tracking, but perhaps output lang attributes with ISO 639-2 codes if such a code exists for that locale. In the case of de-de-formal, perhaps that locale should output de-de as the lang attribute and not de-de-formal, but I am sure experts will disagree. :)

AFAICT, Catalan (Balear) has never been used and could probably be removed / recreated with the correct locale code.

(Note that the polyglots team handles locale codes and it's probably best to post over there with a link to this ticket.)

#3 @petya
4 years ago

It should be ok to just remove the Catalan (Balear) locale. It's not active and doesn't seem like it would be. The last activity is from the week of WordCamp Barcelona, but it doesn't seem those contributors will be back to translating it.

#4 in reply to: ↑ 1 @timwhitlock
4 years ago

IANA have a subtag registration process (section 3.5 of RFC 5646). Perhaps WordPress could submit a request to make formal a recognised "variant" type. Otherwise I am in favour of the private use extension for non-standard tags.

#5 @reitermarkus
4 years ago

Well, de-DE-formal is still exactly the same language as de-DE, so I think it should output de-DE. Unless you can think of any benefits you get from something like de-DE-x-formal.

#6 @rianrietveld
4 years ago

The same occurs for Dutch lang="nl-NL" and lang="nl-NL-formal".
nl-NL-formal gives a validation error: "Bad value nl-NL-formal for attribute lang on element html: Bad variant subtag formal."
These translations are significantly different in the way users are addressed, but both are very much Dutch.
So lang="nl-NL" would validate for both of them.

#7 @rianrietveld
4 years ago

  • Focuses accessibility added

#8 follow-up: @SergeyBiryukov
4 years ago

  • Milestone changed from Awaiting Review to 4.5

Seems like we should strip -formal in get_bloginfo(), see 33511.patch.

#10 in reply to: ↑ 8 @ocean90
4 years ago

  • Focuses accessibility removed
  • Milestone changed from 4.5 to Awaiting Review

Replying to SergeyBiryukov:

Seems like we should strip -formal in get_bloginfo(), see 33511.patch.

Then we should strip -informal too, even if there is currently no such language.

But still, this doesn't solve the real issue: Using the wp_locale for the lang attribute is just wrong.
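
For reference, the stripping approach discussed in this exchange could look roughly like the Python sketch below. It is not the content of 33511.patch (which is not reproduced in this ticket text); the function name lang_attribute_from_locale is hypothetical, and it assumes the lang value is otherwise derived by replacing underscores with hyphens.

```python
import re

# Matches a trailing writing-style suffix such as "_formal" or "_informal".
_STYLE_SUFFIX = re.compile(r"_(?:formal|informal)$")


def lang_attribute_from_locale(locale: str) -> str:
    """Drop a trailing style suffix, then turn underscores into hyphens."""
    return _STYLE_SUFFIX.sub("", locale).replace("_", "-")


print(lang_attribute_from_locale("de_DE_formal"))
print(lang_attribute_from_locale("oci"))
```

A locale such as oci passes through unchanged, which illustrates why stripping suffixes alone does not fix locales whose primary code is itself the problem.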

#11 @afercia
4 years ago

  • Focuses accessibility added

Since the lang attribute affects the way screen readers read out web pages, I'd recommend keeping the accessibility focus on this ticket. It does no harm, and it helps the accessibility team track this issue :) See: http://adrianroselli.com/2015/01/on-use-of-lang-attribute.html

This ticket was mentioned in Slack in #accessibility by rianrietveld.
4 years ago

#13 @SergeyBiryukov
4 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 4.5

Following the Slack discussion, seems like 33511.2.patch should work.

This ticket was mentioned in Slack in #core-i18n by ocean90.
4 years ago

#15 @SergeyBiryukov
4 years ago

  • Keywords commit added

33511.3.patch includes a stricter check in case the string is (incorrectly) translated literally.

#16 @markoheijnen
4 years ago

As I mentioned when this was implemented, the issue in my opinion is how it is implemented. It also broke backward compatibility in the admin, where the body class is also based on get_locale and probably should use the language value as well. I still believe in storing the locale and the writing style, dialect, etc. separately, so that the information can be put to better use than by stripping values.

#17 @ocean90
4 years ago

  • Owner set to ocean90
  • Resolution set to fixed
  • Status changed from new to closed

In 36802:

I18N: Don't use the locale for the HTML language attribute.

Locales are codes to identify a language in WordPress which can be different from the specification for language tags, see https://www.w3.org/International/articles/language-tags/.
An example is de_DE_formal or nl_NL_formal where the subtag formal isn't officially supported.

To give translators the possibility to specify the language tag of their language introduce a string html_lang_attribute which can be translated into the language tag which conforms to the specification.

Props SergeyBiryukov.
Fixes #33511.
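
The mechanism described in this commit message, combined with the stricter check mentioned for 33511.3.patch, can be sketched in Python as follows. This is an illustration under assumptions, not the committed code: the function name html_lang is hypothetical, and "stricter check" is assumed here to mean rejecting any translation that contains characters other than ASCII letters, digits, and hyphens.

```python
import re

# The source string translators are asked to replace with a language tag.
PLACEHOLDER = "html_lang_attribute"
# Anything outside letters, digits, and hyphens cannot be part of a language tag.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9-]")


def html_lang(translated: str, locale: str) -> str:
    """Use the translated tag if it looks plausible, else fall back to the locale."""
    untranslated = translated == PLACEHOLDER
    looks_wrong = _INVALID_CHARS.search(translated) is not None
    if untranslated or looks_wrong:
        # Fallback (assumed): derive the value from the locale as before.
        return locale.replace("_", "-")
    return translated


print(html_lang("de-DE", "de_DE_formal"))
print(html_lang("html_lang_attribute", "fr_FR"))
```

A literal translation of the string, for example one containing spaces, would be caught by the character check and replaced with the locale-derived fallback.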

#18 @afercia
4 years ago

For posterity: here's what happens using a screen reader when the language attribute is wrong:

https://www.youtube.com/watch?v=0uzxu9dQnuU

Video reported by @rianrietveld on Slack, courtesy of Mr. Steve Faulkner.

Last edited 4 years ago by afercia

#19 follow-ups: @Chouby
4 years ago

@SergeyBiryukov, @ocean90 In theory, your solution should work. I wonder how it will be handled by translators in practice.

Besides this new html_lang_attribute, we already have ltr, number_format_decimal_point and number_format_thousands_sep, which must not be translated in the usual way. I checked a few locales and this seems to be misunderstood by some translators (comments do not seem to be sufficient). Ex: bel, dzo

@afercia it's even worse than what I would have expected ;-)

Version 0, edited 4 years ago by Chouby

#20 in reply to: ↑ 19 ; follow-up: @SergeyBiryukov
4 years ago

Replying to Chouby:

I checked a few locales and this seems to be misunderstood by some translators (comments do not seem to be sufficient). Ex: bel, dzo

I've checked the ltr, number_format_decimal_point, and number_format_thousands_sep strings in those locales. They're not translated in dzo, and only the last two are translated in bel, which doesn't surprise me as both locales are only ~78% complete. Is there any other issue I've missed?

#21 in reply to: ↑ 19 @ocean90
4 years ago

Replying to Chouby:

I rejected the wrong translations. I also have a script which I usually run a few days before a release which catches those cases.

#22 in reply to: ↑ 20 @Chouby
4 years ago

Replying to SergeyBiryukov:

They're not translated in dzo

I guess that @ocean90 acted meanwhile. Thanks for fixing the link.

Do you mean that some automatic check could be planned for this case too? It may be difficult to catch cases such as oci where the string was already translated to 'oci' instead of 'oc'.

Last edited 4 years ago by Chouby