Ticket #9591 (closed defect (bug): fixed)

Opened 3 years ago

Last modified 3 months ago

remove_accents() improvements for i18n permalinks

Reported by: Denis-de-Bernardy Owned by: ryan
Priority: high Milestone: 3.3
Component: Permalinks Version: 2.8
Severity: minor Keywords: has-patch
Cc: Ken@…, tj@…, shaneiseminger, Pierre_02

Description

Opening a master-ticket, so as to trim the 1.3k bug list:

  • #8765: Strip ° characters from permalink (closed as dup of this)
  • #3843: Smart quote apostrophe ’ results in a permalink URL with %e2%80%99 (closed as dup of this, has patch)
  • #6973: Problems with š, ž, č, ć, đ (closed as dup of this)
  • #4739: Problems with Ð ð, Þ þ and Æ æ (closed as dup of this, has patch)
  • #5554: Problems with ÆØÅæøå (closed as dup of this, has patch)
  • #3206: ¡ and ¿ should get stripped (closed as dup of this)

Related:

  • #3727: Permalink problems with chinese/japanese chars (this one probably contain the fix that led the fix to #4739 from getting reversed)
  • #8446: Post class problem with chinese/japanese chars

Attachments

9591.diff Download (5.5 KB) - added by scribu 2 years ago.
separate contexts for sanitize_title()
9591.patch Download (421 bytes) - added by tosak 17 months ago.
urldecode before match
9591.2.diff Download (4.0 KB) - added by scribu 16 months ago.
refresh
9591.3.diff Download (4.7 KB) - added by scribu 16 months ago.
Introduce _qv_basename()
9591.4.diff Download (3.8 KB) - added by scribu 16 months ago.
refresh
9591.30.diff Download (600 bytes) - added by jose.geraldo 15 months ago.
correcting ordinal characters in portuguese and spanish langiuages slug urls
9591.2.patch Download (3.0 KB) - added by SergeyBiryukov 5 months ago.
9591.tests.patch Download (11.1 KB) - added by ampt 5 months ago.
9591.3.patch Download (956 bytes) - added by SergeyBiryukov 3 months ago.

Change History

  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n to sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks

Also related: #5918 -- cannot register with a username containing arabic characters

Also related: #8934 -- strip fancy single and double quotes, triple periods from permalinks

also: #9127 -- suggests "dashing" apostrophes

also: #9534 -- ș and ț not caught by remove accents, with patch

See also #9017 -- permalink related upload problem

See also: #4836 -- ellipses problem

see also: #9866 -- problems with cyrillic.

  1. Enable permalink - /%category%/%postname%
  1. Create post with slug "тест" - then you can access it trough the link from the side bar and the URL contains "тест" at the end
  1. Create Page with slug "страница" - the link "Страница" appears in the sidebar, but when you click it - 404 page not found, and the url
  • Status changed from new to closed
  • Resolution set to fixed

All listed tickets are closed.

  • Status changed from closed to reopened
  • Resolution fixed deleted

Indeed... I closed them all as dups of this one, so we can reference all of the issues from a single ticket.

  • Milestone changed from 2.9 to 3.0
  • Keywords needs-patch, early added; needs-patch removed
  • Keywords needs-patch added; needs-patch, removed
  • Keywords early removed
  • Milestone changed from 3.0 to Future Release

to fix this, we need to ensure proper migration of permalinks.

Closing #10797: handle curly quotes and em/en-dashes properly

Might be related: #11724

#10451 closed as a duplicate.

  • Priority changed from normal to high
  • Milestone changed from Future Release to 3.0

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

comment:20 in reply to: ↑ 19   scribu2 years ago

Replying to miqrogroove:

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

The reason why that's difficult:

ryan, on #9534:

Incoming slugs (from permalink requests) are run through the sanitizer and they won't match the old slug.

scribu2 years ago

separate contexts for sanitize_title()

  • Keywords has-patch added; needs-patch removed

Currently, the two slugs bellow match the same post:

le-slug

lé-șlüg

Depending on the language, this might or might not make sense.

To solve the accents issue, we need to have two contexts for sanitize_title: one for storing a slug and one for retrieving a post, based on a slug.

See 9591.diff

Note that if some users want the old behaviour (lé-șlüg matches le-slug), they can have it back with a single line:

add_filter('sanitize_title', 'remove_accents');

comment:23 in reply to: ↑ 19   Denis-de-Bernardy2 years ago

Replying to miqrogroove:

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

My understanding of the problem is that the same sanitization flow comes in when looking up posts in the database. In other words, if abc previously resulted in abd, and abd now results abe, the abc post would no longer be reachable. Or something like that.

For the gory details, see the comments from Ryan in the multitudes of tickets I've closed as dup of this one.

  • Keywords early added
  • Milestone changed from 3.0 to 3.1

Early 3.1 per the dev chat today.

Adding #12373 to the list.

And also #12822.

  • Cc Ken@… added
  • Milestone changed from Awaiting Triage to 3.1

Back to early 3.1 per wpdevel and previous IRC chat.

Guillemets also need attention. Try typing «Hello» for a title and the slug/permalink will bet set to «hello».

I got tired of trying to understand the root of this problem, so I just attached what works for me.. A $match = urldecode($match) in WP::parse_request()

Again this might be a completely wrong fix, and not backwards compatible, but it seems to work for me with all tested characters (scandinavian, kanji, hebrew and cyrillic). «hello» works too.

tosak, I tried your patch and it failed on a simple URL like this:

 http://localhost/wp/index.php/category/category-b/

  • Cc tj@… added

tosak17 months ago

urldecode before match

Scribu, the patch worked for pages only. I'm sorry about that.

Urldecode correctly decoded the string, but replaced + with space characters, messing up the regex. I've updated the patch to use rawurldecode instead, and as far as I can see both pages, categories and posts are working correctly now (fingers crossed).

Please test it again!

I don't really understand what your patch fixes.

The problem here (or at least that which I'm trying to fix) is that you end up with ugly permalinks, due to poor slug sanitization:

http://example.com/slug-with-foreign-characterș-that-gețs-url-encoded

If you save a page with say Kanji letters:  http://localhost/wp/漢字

WP saves the regex as: (%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$

Then when you call the page, WP decodes the request, but not the regex, so it tries to match 漢字 against (%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$ resulting in a 404

if you use rawurlencode, WP correctly matches 漢字 against (漢字)(/[0-9]+)?/?$

I'm sorry if I misunderstood what we're trying to do here, this just fixed the issue with foreign characters in URL's and 404's for me, so wanted to share it.

Ok, thanks for the clarification.

I tracked down an interesting changeset from 5 years ago: [2191] - it first introduced the urldecode( urlencode() ) stuff.

refresh

Found out why [2191] was needed:

echo basename( 'misc/știri' ); // Result: 'tiri'

Introduce _qv_basename()

9591.3.diff:

  • introduce sanitize_title_for_query() and use it where appropriate
  • introduce _qv_basename() helper method and use it for hierarchical taxonomies too.

From my testing, old permalinks still work, but new slugs can't contain odd characters, which is ideal.

(In [15923]) Introduce _qv_basename() and apply it to hierarchical taxonomies. See #9591.

refresh

(In [15929]) Introduce sanitize_title_for_query(). See #9591

(In [15930]) remove_accents(): Nordic characters fixes. Props einare. Fixes #4739. See #9591

(In [15931]) Also convert uppercase letters in Latin Extended-B. See #9591

  • Version changed from 2.8 to 3.0.1
  • Type changed from defect (bug) to enhancement
  • Severity changed from normal to minor
  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks to remove_accents() improvements for i18n permalinks

Suggestion to improve remove_accents

insert at line 537 in wp-includes/formatting.php (inside remove_accents function) the following code

Special ordinals chr(194).chr(170) => 'a', chr(194).chr(176) => 'o', chr(194).chr(186) => 'o',

It deals with ordinals in portuguese/spanish post/page titles in order to create a correct slug url (404 page not found error instead).

Related: #10249 - Page slug in cyrillic = Error 404 - Not Found! (Duplicate?)

As i wrote in http://core.trac.wordpress.org/ticket/10249#comment:35

An example.. "примерно" you can use these russian chars in permalinks but "şŞ İı Ğğ Üü Öö çÇ" becoming "ss-ii-gg-uu-oo-cc" and this is a problem actually. You can try this yourself too. those chars are UTF8.

If its working as intended, plz give me a custom way to use those chars.

Opened a suggestion ticket about it: http://core.trac.wordpress.org/ticket/15248

  • Milestone changed from 3.1 to Future Release

Done with this for 3.1. Have some ideas about how to do migrations for 3.2.

  • Keywords early has-patch removed

What sort of migrations?

correcting ordinal characters in portuguese and spanish langiuages slug urls

  • Keywords 3.2-early added

(In [16832]) Make get_term_by() use sanitize_title_for_query() too. See #9591

jose.geraldo: Could you point to the UTF block those characters belong to?

Unicode block I mean.

If I'm reading that correctly, he's patching

U+00AA ª Feminine Ordinal Indicator
U+00B0 ° Degree symbol
U+00BA º Masculine ordinal indicator

I think the degree symbol should be stripped, rather than be converted to 'o'.

Also, the code is duplicated.

  • Cc shaneiseminger added

As of 3.03: when a title has smart quotes, the sanitization happens halfway: they are converted to dumb quotes but are not stripped out. The permalink then contains dumb quotes which of course breaks it.

Dumb quotes are stripped correctly.

A little experimention also indicates that remove_accents() does not convert ø -- chr(216) and chr(248) -- I know it's not technically an accented character but it seems like this would be the intent.

(In [17357]) Revert [16832]. see #9591. fixes #16282

Have a patch at Related: #16905 that makes remove_accents filterable.

Probably it's better to normalize the data in remove_accents() and make the transliteration table overload-able - Related: #16908

comment:63 follow-up: ↓ 65   Pierre_029 months ago

  • Cc Pierre_02 added
  • Keywords needs-patch added
  • Version changed from 3.0.1 to 3.1
  • Type changed from enhancement to defect (bug)

Hi all,

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination. i'm currently developing a website where i use my own custom post type for accomodation/restaurant. So a user can select a type of accomodation/restaurant (eg : Hôtel, Restaurant, Camping, etc.) by clicking on a tag then it can refine its choice by click on a tag of the town name (eg : Villers-Cotterêts). The problem is that, if the name of the town contains an emphasis when i click on "Next page" i got a 301 redirect and my emphasis caracters are removed during the redirection :(... So the name of town (Villers-Cotterêts) becomes "Villers-Cotterts" and no record is found...

I know about URL sanitization of WP and it's why i used a querystring. Hopping that WP let it untouched. Of course, i've tried to hard code the name of the town in my 'meta_value' and then, everything works fine. Despite the redirection, WP show me the next pages for the selected criteria ; so this is not a programming problem.

In my code, i use a syntax with querystring to generate my url. example :  http://mywebsite.fr/commerce/type/restaurant/?ville=Villers-Cotter%C3%AAts so the next page link (generated by the 'next_posts_link' function) will be  http://mywebsite.fr/commerce/type/restaurant/page/2/?ville=Villers-Cotter%C3%AAts but if i click on it the WP redirect me (by a 301 error) to  http://mywebsite.fr/commerce/type/hotel/page/2/?ville=Villers-Cotterts (notice the removing of ê). WP removes space in names too...

So my question is : Why, when it's the first page, WP let untouched my querystring and as soon as i paginate i'm victim of 301 redirect ?

PS. Sorry this a not a problem from version 3.0.1 to 3.1 but a problem in the 3.1.2 release.

TIA.

Amicably,

Pierre.

Last edited 9 months ago by Pierre_02 (previous) (diff)
  • Keywords has-patch added; 3.2-early needs-patch removed
  • Version changed from 3.1 to 2.8
  • Milestone changed from Future Release to 3.3

I've reviewed all the tickets linked from here and checked them for remaining issues.

9591.2.patch Download fixes:

  • #5554: Problems with ÆØÅæøå (only Ø, actually)
  • ticket:9591:47: Ordinal indicators in Portuguese/Spanish

Only 3 characters are actually added, the rest is formatting.

Also uploaded a new patch in #10797.

comment:65 in reply to: ↑ 63   SergeyBiryukov5 months ago

Replying to Pierre_02:

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.

This should be discussed in another ticket, as it has little to do with remove_accents(), which sanitizes "Villers-Cotterêts" correctly. You probably should use post slugs instead of titles.

Still without a solution for opening question and exclamation marks ¡ and ¿ (#12373)

That's handled in #10797, see my comment there.

comment:68 in reply to: ↑ 67   bi0xid5 months ago

Replying to SergeyBiryukov:

That's handled in #10797, see my comment there.

Thank you :). Awesome job :)

  • Keywords needs-unit-tests added

comment:70 follow-up: ↓ 71   unsalkorkmaz5 months ago

After i saw "Good idea! We're working on it" status at  http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love and while SergeyBiryukov working on this ticket, want to ask if there is any update on my comment: http://core.trac.wordpress.org/ticket/9591#comment:49 ?

comment:71 in reply to: ↑ 70 ; follow-up: ↓ 72   SergeyBiryukov5 months ago

Replying to unsalkorkmaz:

want to ask if there is any update on my comment

I've attached a plugin to allow accented characters in slugs in #15248.

comment:72 in reply to: ↑ 71 ; follow-up: ↓ 73   unsalkorkmaz5 months ago

Replying to SergeyBiryukov:

I've attached a plugin to allow accented characters in slugs in #15248.

It basically uses raw_title and completely disabling all filtering right? is this safe? if its safe, why we filtering title then? :S

comment:73 in reply to: ↑ 72 ; follow-up: ↓ 74   SergeyBiryukov5 months ago

Replying to unsalkorkmaz:

It basically uses raw_title and completely disabling all filtering right? is this safe?

It only cancels remove_accents(), while the other filtering, such as sanitize_title_with_dashes(), is still performed. So I guess if you want to use accented characters in slugs, that's the proper solution.

If you only want to remove accents from one characters and keep them on others, you can create your own version of remove_accents() and hook it to sanitize_title. A filter in remove_accents() itself would probably be a bit cleaner solution, but it's not necessary.

I've attached ticket:15248:allow-turkish-accents-in-slugs.php Download. It should do exactly what you want: allow "şŞ İı Ğğ Üü Öö çÇ" characters and remove accents from all others. It also allows these characters in usernames, as per your comment there.

comment:74 in reply to: ↑ 73   unsalkorkmaz5 months ago

Replying to SergeyBiryukov:

It also allows these characters in usernames, as per your comment there.

Thanks for trying to help for usernames too :) But i am getting "Only lowercase letters (a-z) and numbers are allowed" error from function wpmu_validate_user_signup @ wp-includes/ms-functions.php

ampt5 months ago

Add unit tests. The tests also replace some existing test cases under jacob/TestFormatting. Perhaps another set of eyes on this would be good to make sure its on the right track.

Closed #18903 as a duplicate of this ticket and #5554.

comment:77 follow-ups: ↓ 78 ↓ 79   ryan3 months ago

Committed 9591.tests.patch to unit-tests repo:

 http://unit-tests.trac.wordpress.org/changeset/471

comment:78 in reply to: ↑ 77   SergeyBiryukov3 months ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

remove_accents_from_* files are now empty:
 http://unit-tests.trac.wordpress.org/browser/wp-testdata/jacob

They can probably be removed.

comment:79 in reply to: ↑ 77   SergeyBiryukov3 months ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

wp-testdata/formatting/remove_accents.01.input.txt file was not committed: http://core.trac.wordpress.org/attachment/ticket/9591/9591.tests.patch#L111

1) TestRemoveAccents::test_remove_accents_iso8859
file_get_contents(./wp-testdata\formatting\remove_accents.01.input.txt): failed to open stream: No such file or directory

Cool. Before patch: Tests: 6, Assertions: 7, Failures: 1.

With 9591.2.patch Download: OK (6 tests, 7 assertions)

  • Status changed from reopened to closed
  • Resolution set to fixed

In [19125]:

Add a few characters to remove_accents(). props SergeyBiryukov. props ampt for  [UT471]. fixes #9591.

  • Keywords needs-unit-tests removed
Note: See TracTickets for help on using tickets.