Ticket #4739 (closed defect (bug): fixed)

Opened 4 years ago

Last modified 16 months ago

Some icelandic/Norwegian/Danish letters do not work in page slugs

Reported by: einare Owned by: westi
Priority: high Milestone:
Component: Permalinks Version: 2.2.1
Severity: major Keywords: needs-patch early 2nd-opinion dev-feedback
Cc: janbrasna

Description (last modified by westi)

When the page slug is generated from the post title, three Icelandic letters are not converted correctly. These three letters are Ð ð, Þ þ and Æ æ. They should be converted to D d, TH th and AE ae but are not.

For instance, when I made a post with the title ‘Þátturinn’ the post slug became ‘þatturinn’, and when I tried to enter that address in my address bar it changed to ‘%c3%beatturinn’ and I got a ‘page not found’ error from WordPress.
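To illustrate where ‘%c3%be’ comes from: ‘þ’ is stored in UTF-8 as the two bytes 0xC3 0xBE, and percent-encoding those bytes yields the string seen in the address bar. The following is only a sketch; the helper name percent_encode_slug_sketch is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper (not a WordPress function): percent-encodes a slug
// the way it appears in the report. rawurlencode() emits uppercase hex,
// so the result is lowercased to match the quoted form.
function percent_encode_slug_sketch($slug) {
    return strtolower(rawurlencode($slug));
}

// 'þ' (U+00FE) built from its two UTF-8 bytes, independent of file encoding.
$slug = chr(195) . chr(190) . 'atturinn';

echo percent_encode_slug_sketch($slug), "\n"; // prints %c3%beatturinn
```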

This can be fixed by adding the following six lines to formatting.php, in the function remove_accents, inside the if (seems_utf8($string)) { condition.

chr(195).chr(144) => 'D', 
chr(195).chr(176) => 'd',
chr(195).chr(158) => 'TH',
chr(195).chr(190) => 'th',
chr(195).chr(134) => 'AE',
chr(195).chr(166) => 'ae',
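As a rough illustration of what these entries do, here is a stand-alone sketch that applies the same six mappings with strtr(). It assumes the entries are used as a byte-sequence lookup table; the function name nordic_to_ascii_sketch is hypothetical, and the real change would go into the existing array inside remove_accents rather than a new function.

```php
<?php
// Hypothetical stand-alone helper; not the actual remove_accents() code.
// Each key is the two-byte UTF-8 encoding of the letter in the comment.
function nordic_to_ascii_sketch($string) {
    $chars = array(
        chr(195).chr(144) => 'D',  // Ð (U+00D0)
        chr(195).chr(176) => 'd',  // ð (U+00F0)
        chr(195).chr(158) => 'TH', // Þ (U+00DE)
        chr(195).chr(190) => 'th', // þ (U+00FE)
        chr(195).chr(134) => 'AE', // Æ (U+00C6)
        chr(195).chr(166) => 'ae', // æ (U+00E6)
    );
    // strtr() with an array replaces every occurrence of each key.
    return strtr($string, $chars);
}

// 'Þ' followed by plain ASCII, built from bytes to avoid encoding issues.
echo nordic_to_ascii_sketch(chr(195).chr(158) . 'atturinn'), "\n"; // prints THatturinn
```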

Also (from #5952): when the post slug is generated from the post title, the letter 'Å' 'å' converts to 'a'; it should convert to 'aa', which is the general practice in countries using this character (see Wikipedia).

Furthermore, the Norwegian/Danish characters 'Æ' 'æ' and 'Ø' 'ø' should be converted to 'ae' and 'oe' respectively. As of now, these convert to '%c3%a6' and '%c3%b8'.

Attachments

4739.patch (2.8 KB) - added by einare 4 years ago.
Fix for the ticket

Change History

einare 4 years ago

Fix for the ticket

  • Keywords has-patch added
  • Milestone changed from 2.2.3 to 2.3 (trunk)
  • Keywords dev-reviewed added
  • Owner changed from anonymous to westi
  • Status changed from new to assigned

+1

  • Status changed from assigned to closed
  • Resolution set to fixed

(In [5969]) Add utf8->ascii mappings for icelandic letters. Fixes #4739 props einare

  • Status changed from closed to reopened
  • Resolution fixed deleted

This commit breaks the permalinks of posts that contain these characters and were published using the old version of this function.

We should either revert it or pass all permalinks that haven't been manually edited through the new sanitize_title. In order to achieve this we have to compare the output of the old and the new remove_accents functions.

Or maybe we should change the query post name matching, so that it uses the raw post name from the url, not the decoded one. If we don't do this we should be very careful in modifying sanitize_title's behaviour.
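The comparison suggested above (old versus new remove_accents output) could look roughly like the sketch below. Both transliteration functions are hypothetical stand-ins: the "old" one assumes the Icelandic letters pass through untouched, and the "new" one assumes only the six mappings from the description are added. It does not touch the database or any real WordPress API.

```php
<?php
// Hypothetical stand-in for the behaviour before [5969]:
// the Icelandic letters are left as they are.
function old_transliterate_sketch($title) {
    return $title;
}

// Hypothetical stand-in for the behaviour after [5969].
function new_transliterate_sketch($title) {
    $chars = array(
        chr(195).chr(144) => 'D',  chr(195).chr(176) => 'd',
        chr(195).chr(158) => 'TH', chr(195).chr(190) => 'th',
        chr(195).chr(134) => 'AE', chr(195).chr(166) => 'ae',
    );
    return strtr($title, $chars);
}

// A post is affected only if the two versions disagree on its title.
function slug_would_change_sketch($title) {
    return old_transliterate_sketch($title) !== new_transliterate_sketch($title);
}

var_dump(slug_would_change_sketch(chr(195).chr(158) . 'atturinn')); // bool(true)
var_dump(slug_would_change_sketch('Hello world'));                  // bool(false)
```

Only titles flagged this way would need their stored slug regenerated or redirected; everything else is unaffected by the new mappings.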

  • Keywords developer-feedback added; has-patch dev-reviewed removed
  • Priority changed from normal to high
  • Severity changed from minor to major

comment:7   ryan 4 years ago

Affected posts can be fixed by resaving them. The old slug redirector will handle redirecting the old URL. But, that's not very friendly. For 2.3 we should probably revert the change.

comment:8   ryan 4 years ago

(In [6150]) Revert [5969]. It can break permalinks. see #4739

comment:9   ryan 4 years ago

  • Milestone changed from 2.3 to 2.4

Reverted for 2.3. We'll try to fix it properly for 2.4.

  • Keywords needs=patch early added; developer-feedback removed

I guess we need to make sure that any changes we make to the slug generation code don't affect old posts the way this one currently does.

We should always be checking against the string we use to generate the permalink, not a re-sanitized one.

westi, we aren't always generating the permalink based on information we have in the database. Usually the title is used, but users are allowed to enter their own slugs and we don't keep the original slug -- only the sanitized one.

  • Keywords needs-patch added; needs=patch removed
  • Summary changed from Some icelandic letters do not work in page slugs to Some icelandic/Norwegian/Danish letters do not work in page slugs
  • Description modified
  • Milestone changed from 2.5 to 2.6

Closed #5952 as a dupe of this and updated bug with more characters to fix.

Moving to 2.6 as this needs fixing early and lots of testing so we can be sure we don't break things.

Is there any way I can help without actually coding?

comment:15 follow-up: ↓ 16   snakefoot 3 years ago

Duplicate #4273 ?

comment:16 in reply to: ↑ 15   westi 3 years ago

Replying to snakefoot:

Duplicate #4273 ?

I think that is a similar issue; not sure if it's a dupe, though.

This problem seems to be something quite easy to fix. If I understand correctly you only have to add a few lines to formatting.php.

Why has this then not already been fixed?

I had similar problems: a Page titled "Bøger" (books) screwed up the permalink (slug?). After reading this, I created (wow) a few more lines fixing it for Å/å and Ø/ø. With the fix at the top plus these lines, my problems with æ/ø/å are solved.

chr(195).chr(133) => 'Aa',
chr(195).chr(165) => 'aa',
chr(195).chr(152) => 'Oe',
chr(195).chr(184) => 'oe',

Please incorporate it in the official version :)

/svendk

  • Cc janbrasna added
  • Keywords 2nd-opinion dev-feedback added
  • Component changed from i18n to Permalinks
  • Milestone changed from 2.9 to 2.8

It used to work in non-UTF processing at some point in the past (see http://trac.wordpress.org/browser/trunk/wp-includes/formatting.php?rev=10150#L401 for the old Latin 1 transliteration code) but was apparently omitted when the UTF transliteration segment was written.

Anyway, the main problem is the sanitize_title line at http://trac.wordpress.org/browser/trunk/wp-includes/query.php?rev=10150#L1671, which makes it effectively impossible to change the transliteration array at any point. I can't seem to find the point in comment:11, because the post_name field in the DB is always matched. Regarding comment:5, I think the sanitize_title at that point may be overboard. Wouldn't just escaping it for SQL be enough? Otherwise, comment:7 sounds fine for that matter.

  • Status changed from reopened to closed
  • Resolution set to duplicate

Merging into #9591.

  • Milestone 2.8 deleted
  • Resolution changed from duplicate to fixed

(In [15930]) remove_accents(): Nordic characters fixes. Props einare. Fixes #4739. See #9591
