WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 2 years ago

Last modified 13 months ago

#9591 closed defect (bug) (fixed)

remove_accents() improvements for i18n permalinks

Reported by: Denis-de-Bernardy Owned by: ryan
Milestone: 3.3 Priority: high
Severity: minor Version: 2.8
Component: Permalinks Keywords: has-patch
Focuses: Cc:

Description

Opening a master-ticket, so as to trim the 1.3k bug list:

  • #8765: Strip ° characters from permalink (closed as dup of this)
  • #3843: Smart quote apostrophe ’ results in a permalink URL with %e2%80%99 (closed as dup of this, has patch)
  • #6973: Problems with š, ž, č, ć, đ (closed as dup of this)
  • #4739: Problems with Ð ð, Þ þ and Æ æ (closed as dup of this, has patch)
  • #5554: Problems with ÆØÅæøå (closed as dup of this, has patch)
  • #3206: ¡ and ¿ should get stripped (closed as dup of this)

Related:

  • #3727: Permalink problems with chinese/japanese chars (this one probably contain the fix that led the fix to #4739 from getting reversed)
  • #8446: Post class problem with chinese/japanese chars

Attachments (9)

9591.diff (5.5 KB) - added by scribu 4 years ago.
separate contexts for sanitize_title()
9591.patch (421 bytes) - added by tosak 4 years ago.
urldecode before match
9591.2.diff (4.0 KB) - added by scribu 3 years ago.
refresh
9591.3.diff (4.7 KB) - added by scribu 3 years ago.
Introduce _qv_basename()
9591.4.diff (3.8 KB) - added by scribu 3 years ago.
refresh
9591.30.diff (600 bytes) - added by jose.geraldo 3 years ago.
correcting ordinal characters in portuguese and spanish langiuages slug urls
9591.2.patch (3.0 KB) - added by SergeyBiryukov 3 years ago.
9591.tests.patch (11.1 KB) - added by ampt 3 years ago.
9591.3.patch (956 bytes) - added by SergeyBiryukov 2 years ago.

Download all attachments as: .zip

Change History (93)

comment:1 Denis-de-Bernardy5 years ago

  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n to sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks

comment:3 Denis-de-Bernardy5 years ago

Also related: #5918 -- cannot register with a username containing arabic characters

comment:4 Denis-de-Bernardy5 years ago

Also related: #8934 -- strip fancy single and double quotes, triple periods from permalinks

comment:5 Denis-de-Bernardy5 years ago

also: #9127 -- suggests "dashing" apostrophes

comment:6 Denis-de-Bernardy5 years ago

also: #9534 -- ș and ț not caught by remove accents, with patch

comment:7 Denis-de-Bernardy5 years ago

See also #9017 -- permalink related upload problem

comment:8 Denis-de-Bernardy5 years ago

See also: #4836 -- ellipses problem

comment:9 Denis-de-Bernardy5 years ago

see also: #9866 -- problems with cyrillic.

  1. Enable permalink - /%category%/%postname%
  1. Create post with slug "тест" - then you can access it trough the link from the side bar and the URL contains "тест" at the end
  1. Create Page with slug "страница" - the link "Страница" appears in the sidebar, but when you click it - 404 page not found, and the url

comment:10 azaozz4 years ago

  • Resolution set to fixed
  • Status changed from new to closed

All listed tickets are closed.

comment:11 Denis-de-Bernardy4 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Indeed... I closed them all as dups of this one, so we can reference all of the issues from a single ticket.

comment:12 lloydbudd4 years ago

  • Milestone changed from 2.9 to 3.0

comment:13 lloydbudd4 years ago

  • Keywords early added

comment:14 nacin4 years ago

  • Keywords changed from needs-patch, early to needs-patch early

comment:15 Denis-de-Bernardy4 years ago

  • Keywords early removed
  • Milestone changed from 3.0 to Future Release

to fix this, we need to ensure proper migration of permalinks.

comment:16 Denis-de-Bernardy4 years ago

Closing #10797: handle curly quotes and em/en-dashes properly

comment:17 hakre4 years ago

Might be related: #11724

comment:18 nacin4 years ago

#10451 closed as a duplicate.

comment:19 follow-ups: miqrogroove4 years ago

  • Milestone changed from Future Release to 3.0
  • Priority changed from normal to high

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

comment:20 in reply to: ↑ 19 scribu4 years ago

Replying to miqrogroove:

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

The reason why that's difficult:

ryan, on #9534:

Incoming slugs (from permalink requests) are run through the sanitizer and they won't match the old slug.

scribu4 years ago

separate contexts for sanitize_title()

comment:21 scribu4 years ago

  • Keywords has-patch added; needs-patch removed

Currently, the two slugs bellow match the same post:

le-slug

lé-șlüg

Depending on the language, this might or might not make sense.

To solve the accents issue, we need to have two contexts for sanitize_title: one for storing a slug and one for retrieving a post, based on a slug.

See 9591.diff

comment:22 scribu4 years ago

Note that if some users want the old behaviour (lé-șlüg matches le-slug), they can have it back with a single line:

add_filter('sanitize_title', 'remove_accents');

comment:23 in reply to: ↑ 19 Denis-de-Bernardy4 years ago

Replying to miqrogroove:

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

My understanding of the problem is that the same sanitization flow comes in when looking up posts in the database. In other words, if abc previously resulted in abd, and abd now results abe, the abc post would no longer be reachable. Or something like that.

For the gory details, see the comments from Ryan in the multitudes of tickets I've closed as dup of this one.

comment:24 nacin4 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1

Early 3.1 per the dev chat today.

comment:25 bi0xid4 years ago

Adding #12373 to the list.

comment:26 scribu4 years ago

And also #12822.

comment:28 WraithKenny4 years ago

  • Cc Ken@… added

comment:29 nacin4 years ago

  • Milestone changed from Awaiting Triage to 3.1

Back to early 3.1 per wpdevel and previous IRC chat.

comment:30 ilovecolors4 years ago

Guillemets also need attention. Try typing «Hello» for a title and the slug/permalink will bet set to «hello».

comment:31 tosak4 years ago

I got tired of trying to understand the root of this problem, so I just attached what works for me.. A $match = urldecode($match) in WP::parse_request()

Again this might be a completely wrong fix, and not backwards compatible, but it seems to work for me with all tested characters (scandinavian, kanji, hebrew and cyrillic). «hello» works too.

comment:32 scribu4 years ago

tosak, I tried your patch and it failed on a simple URL like this:

http://localhost/wp/index.php/category/category-b/

comment:33 intlect4 years ago

  • Cc tj@… added

tosak4 years ago

urldecode before match

comment:34 tosak4 years ago

Scribu, the patch worked for pages only. I'm sorry about that.

Urldecode correctly decoded the string, but replaced + with space characters, messing up the regex. I've updated the patch to use rawurldecode instead, and as far as I can see both pages, categories and posts are working correctly now (fingers crossed).

Please test it again!

comment:35 scribu4 years ago

I don't really understand what your patch fixes.

The problem here (or at least that which I'm trying to fix) is that you end up with ugly permalinks, due to poor slug sanitization:

http://example.com/slug-with-foreign-characterș-that-gețs-url-encoded

comment:36 tosak4 years ago

If you save a page with say Kanji letters:
http://localhost/wp/漢字

WP saves the regex as:
(%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$

Then when you call the page, WP decodes the request, but not the regex, so it tries to match 漢字 against (%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$ resulting in a 404

if you use rawurlencode, WP correctly matches 漢字 against (漢字)(/[0-9]+)?/?$

I'm sorry if I misunderstood what we're trying to do here, this just fixed the issue with foreign characters in URL's and 404's for me, so wanted to share it.

comment:37 scribu4 years ago

Ok, thanks for the clarification.

comment:38 scribu3 years ago

I tracked down an interesting changeset from 5 years ago: [2191] - it first introduced the urldecode( urlencode() ) stuff.

scribu3 years ago

refresh

comment:39 scribu3 years ago

Found out why [2191] was needed:

echo basename( 'misc/știri' ); // Result: 'tiri'

scribu3 years ago

Introduce _qv_basename()

comment:40 scribu3 years ago

9591.3.diff:

  • introduce sanitize_title_for_query() and use it where appropriate
  • introduce _qv_basename() helper method and use it for hierarchical taxonomies too.

comment:41 scribu3 years ago

From my testing, old permalinks still work, but new slugs can't contain odd characters, which is ideal.

comment:42 scribu3 years ago

(In [15923]) Introduce _qv_basename() and apply it to hierarchical taxonomies. See #9591.

scribu3 years ago

refresh

comment:43 scribu3 years ago

(In [15929]) Introduce sanitize_title_for_query(). See #9591

comment:44 scribu3 years ago

(In [15930]) remove_accents(): Nordic characters fixes. Props einare. Fixes #4739. See #9591

comment:46 scribu3 years ago

(In [15931]) Also convert uppercase letters in Latin Extended-B. See #9591

comment:47 jose.geraldo3 years ago

  • Severity changed from normal to minor
  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks to remove_accents() improvements for i18n permalinks
  • Type changed from defect (bug) to enhancement
  • Version changed from 2.8 to 3.0.1

Suggestion to improve remove_accents

insert at line 537 in wp-includes/formatting.php (inside remove_accents function) the following code

Special ordinals
chr(194).chr(170) => 'a',
chr(194).chr(176) => 'o', chr(194).chr(186) => 'o',

It deals with ordinals in portuguese/spanish post/page titles in order to create a correct slug url (404 page not found error instead).

comment:48 hakre3 years ago

Related: #10249 - Page slug in cyrillic = Error 404 - Not Found! (Duplicate?)

comment:49 unsalkorkmaz3 years ago

As i wrote in
http://core.trac.wordpress.org/ticket/10249#comment:35

An example.. "примерно" you can use these russian chars in permalinks but "şŞ İı Ğğ Üü Öö çÇ" becoming "ss-ii-gg-uu-oo-cc" and this is a problem actually. You can try this yourself too. those chars are UTF8.

If its working as intended, plz give me a custom way to use those chars.

comment:51 markjaquith3 years ago

  • Milestone changed from 3.1 to Future Release

Done with this for 3.1. Have some ideas about how to do migrations for 3.2.

comment:52 scribu3 years ago

  • Keywords early has-patch removed

What sort of migrations?

jose.geraldo3 years ago

correcting ordinal characters in portuguese and spanish langiuages slug urls

comment:53 nacin3 years ago

  • Keywords 3.2-early added

comment:54 scribu3 years ago

(In [16832]) Make get_term_by() use sanitize_title_for_query() too. See #9591

comment:55 scribu3 years ago

jose.geraldo: Could you point to the UTF block those characters belong to?

comment:56 scribu3 years ago

Unicode block I mean.

comment:57 miqrogroove3 years ago

If I'm reading that correctly, he's patching

U+00AA ª Feminine Ordinal Indicator
U+00B0 ° Degree symbol
U+00BA º Masculine ordinal indicator

comment:58 scribu3 years ago

I think the degree symbol should be stripped, rather than be converted to 'o'.

Also, the code is duplicated.

comment:59 shaneiseminger3 years ago

  • Cc shaneiseminger added

As of 3.03: when a title has smart quotes, the sanitization happens halfway: they are converted to dumb quotes but are not stripped out. The permalink then contains dumb quotes which of course breaks it.

Dumb quotes are stripped correctly.

A little experimention also indicates that remove_accents() does not convert ø -- chr(216) and chr(248) -- I know it's not technically an accented character but it seems like this would be the intent.

comment:60 markjaquith3 years ago

(In [17357]) Revert [16832]. see #9591. fixes #16282

comment:61 hakre3 years ago

Have a patch at Related: #16905 that makes remove_accents filterable.

comment:62 hakre3 years ago

Probably it's better to normalize the data in remove_accents() and make the transliteration table overload-able - Related: #16908

comment:63 follow-up: Pierre_023 years ago

  • Cc Pierre_02 added
  • Keywords needs-patch added
  • Type changed from enhancement to defect (bug)
  • Version changed from 3.0.1 to 3.1

Hi all,

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.
i'm currently developing a website where i use my own custom post type for accomodation/restaurant.
So a user can select a type of accomodation/restaurant (eg : Hôtel, Restaurant, Camping, etc.) by clicking on a tag then it can refine its choice by click on a tag of the town name (eg : Villers-Cotterêts).
The problem is that, if the name of the town contains an emphasis when i click on "Next page" i got a 301 redirect and my emphasis caracters are removed during the redirection :(... So the name of town (Villers-Cotterêts) becomes "Villers-Cotterts" and no record is found...

I know about URL sanitization of WP and it's why i used a querystring. Hopping that WP let it untouched.
Of course, i've tried to hard code the name of the town in my 'meta_value' and then, everything works fine. Despite the redirection, WP show me the next pages for the selected criteria ; so this is not a programming problem.

In my code, i use a syntax with querystring to generate my url.
example : http://mywebsite.fr/commerce/type/restaurant/?ville=Villers-Cotter%C3%AAts so the next page link (generated by the 'next_posts_link' function) will be http://mywebsite.fr/commerce/type/restaurant/page/2/?ville=Villers-Cotter%C3%AAts but if i click on it the WP redirect me (by a 301 error) to http://mywebsite.fr/commerce/type/hotel/page/2/?ville=Villers-Cotterts (notice the removing of ê). WP removes space in names too...

So my question is : Why, when it's the first page, WP let untouched my querystring and as soon as i paginate i'm victim of 301 redirect ?

TIA.

Amicably,

Pierre.

Version 0, edited 3 years ago by Pierre_02 (next)

SergeyBiryukov3 years ago

comment:64 SergeyBiryukov3 years ago

  • Keywords has-patch added; 3.2-early needs-patch removed
  • Milestone changed from Future Release to 3.3
  • Version changed from 3.1 to 2.8

I've reviewed all the tickets linked from here and checked them for remaining issues.

9591.2.patch fixes:

  • #5554: Problems with ÆØÅæøå (only Ø, actually)
  • ticket:9591:47: Ordinal indicators in Portuguese/Spanish

Only 3 characters are actually added, the rest is formatting.

Also uploaded a new patch in #10797.

comment:65 in reply to: ↑ 63 SergeyBiryukov3 years ago

Replying to Pierre_02:

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.

This should be discussed in another ticket, as it has little to do with remove_accents(), which sanitizes "Villers-Cotterêts" correctly. You probably should use post slugs instead of titles.

comment:66 bi0xid3 years ago

Still without a solution for opening question and exclamation marks ¡ and ¿ (#12373)

comment:67 follow-up: SergeyBiryukov3 years ago

That's handled in #10797, see my comment there.

comment:68 in reply to: ↑ 67 bi0xid3 years ago

Replying to SergeyBiryukov:

That's handled in #10797, see my comment there.

Thank you :). Awesome job :)

comment:69 SergeyBiryukov3 years ago

  • Keywords needs-unit-tests added

comment:70 follow-up: unsalkorkmaz3 years ago

After i saw "Good idea! We're working on it" status at
http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love
and while SergeyBiryukov working on this ticket, want to ask if there is any update on my comment:
http://core.trac.wordpress.org/ticket/9591#comment:49
?

comment:71 in reply to: ↑ 70 ; follow-up: SergeyBiryukov3 years ago

Replying to unsalkorkmaz:

want to ask if there is any update on my comment

I've attached a plugin to allow accented characters in slugs in #15248.

comment:72 in reply to: ↑ 71 ; follow-up: unsalkorkmaz3 years ago

Replying to SergeyBiryukov:

I've attached a plugin to allow accented characters in slugs in #15248.

It basically uses raw_title and completely disabling all filtering right? is this safe? if its safe, why we filtering title then? :S

comment:73 in reply to: ↑ 72 ; follow-up: SergeyBiryukov3 years ago

Replying to unsalkorkmaz:

It basically uses raw_title and completely disabling all filtering right? is this safe?

It only cancels remove_accents(), while the other filtering, such as sanitize_title_with_dashes(), is still performed. So I guess if you want to use accented characters in slugs, that's the proper solution.

If you only want to remove accents from one characters and keep them on others, you can create your own version of remove_accents() and hook it to sanitize_title. A filter in remove_accents() itself would probably be a bit cleaner solution, but it's not necessary.

I've attached ticket:15248:allow-turkish-accents-in-slugs.php. It should do exactly what you want: allow "şŞ İı Ğğ Üü Öö çÇ" characters and remove accents from all others. It also allows these characters in usernames, as per your comment there.

comment:74 in reply to: ↑ 73 unsalkorkmaz3 years ago

Replying to SergeyBiryukov:

It also allows these characters in usernames, as per your comment there.

Thanks for trying to help for usernames too :) But i am getting "Only lowercase letters (a-z) and numbers are allowed" error from function wpmu_validate_user_signup @ wp-includes/ms-functions.php

ampt3 years ago

comment:75 ampt3 years ago

Add unit tests. The tests also replace some existing test cases under jacob/TestFormatting. Perhaps another set of eyes on this would be good to make sure its on the right track.

comment:76 SergeyBiryukov3 years ago

Closed #18903 as a duplicate of this ticket and #5554.

comment:77 follow-ups: ryan2 years ago

Committed 9591.tests.patch to unit-tests repo:

http://unit-tests.trac.wordpress.org/changeset/471

comment:78 in reply to: ↑ 77 SergeyBiryukov2 years ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

remove_accents_from_* files are now empty:
http://unit-tests.trac.wordpress.org/browser/wp-testdata/jacob

They can probably be removed.

comment:79 in reply to: ↑ 77 SergeyBiryukov2 years ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

wp-testdata/formatting/remove_accents.01.input.txt file was not committed:
http://core.trac.wordpress.org/attachment/ticket/9591/9591.tests.patch#L111

1) TestRemoveAccents::test_remove_accents_iso8859
file_get_contents(./wp-testdata\formatting\remove_accents.01.input.txt): failed to open stream: No such file or directory

comment:81 SergeyBiryukov2 years ago

Cool. Before patch: Tests: 6, Assertions: 7, Failures: 1.

With 9591.2.patch: OK (6 tests, 7 assertions)

SergeyBiryukov2 years ago

comment:82 nacin2 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In [19125]:

Add a few characters to remove_accents(). props SergeyBiryukov. props ampt for [UT471]. fixes #9591.

comment:83 SergeyBiryukov2 years ago

  • Keywords needs-unit-tests removed

comment:84 dnusim13 months ago

#4739 was marked as a duplicate.

Note: See TracTickets for help on using tickets.