Make WordPress Core

Opened 15 years ago

Closed 13 years ago

Last modified 12 years ago

#9591 closed defect (bug) (fixed)

remove_accents() improvements for i18n permalinks

Reported by: denis-de-bernardy's profile Denis-de-Bernardy Owned by: ryan's profile ryan
Milestone: 3.3 Priority: high
Severity: minor Version: 2.8
Component: Permalinks Keywords: has-patch
Focuses: Cc:

Description

Opening a master-ticket, so as to trim the 1.3k bug list:

  • #8765: Strip ° characters from permalink (closed as dup of this)
  • #3843: Smart quote apostrophe ’ results in a permalink URL with %e2%80%99 (closed as dup of this, has patch)
  • #6973: Problems with š, ž, č, ć, đ (closed as dup of this)
  • #4739: Problems with Ð ð, Þ þ and Æ æ (closed as dup of this, has patch)
  • #5554: Problems with ÆØÅæøå (closed as dup of this, has patch)
  • #3206: ¡ and ¿ should get stripped (closed as dup of this)

Related:

  • #3727: Permalink problems with chinese/japanese chars (this one probably contain the fix that led the fix to #4739 from getting reversed)
  • #8446: Post class problem with chinese/japanese chars

Attachments (9)

9591.diff (5.5 KB) - added by scribu 15 years ago.
separate contexts for sanitize_title()
9591.patch (421 bytes) - added by tosak 14 years ago.
urldecode before match
9591.2.diff (4.0 KB) - added by scribu 14 years ago.
refresh
9591.3.diff (4.7 KB) - added by scribu 14 years ago.
Introduce _qv_basename()
9591.4.diff (3.8 KB) - added by scribu 14 years ago.
refresh
9591.30.diff (600 bytes) - added by jose.geraldo 14 years ago.
correcting ordinal characters in portuguese and spanish langiuages slug urls
9591.2.patch (3.0 KB) - added by SergeyBiryukov 13 years ago.
9591.tests.patch (11.1 KB) - added by ampt 13 years ago.
9591.3.patch (956 bytes) - added by SergeyBiryukov 13 years ago.

Download all attachments as: .zip

Change History (93)

#1 @Denis-de-Bernardy
15 years ago

  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n to sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks

#3 @Denis-de-Bernardy
15 years ago

Also related: #5918 -- cannot register with a username containing arabic characters

#4 @Denis-de-Bernardy
15 years ago

Also related: #8934 -- strip fancy single and double quotes, triple periods from permalinks

#5 @Denis-de-Bernardy
15 years ago

also: #9127 -- suggests "dashing" apostrophes

#6 @Denis-de-Bernardy
15 years ago

also: #9534 -- ș and ț not caught by remove accents, with patch

#7 @Denis-de-Bernardy
15 years ago

See also #9017 -- permalink related upload problem

#8 @Denis-de-Bernardy
15 years ago

See also: #4836 -- ellipses problem

#9 @Denis-de-Bernardy
15 years ago

see also: #9866 -- problems with cyrillic.

  1. Enable permalink - /%category%/%postname%
  1. Create post with slug "тест" - then you can access it trough the link from the side bar and the URL contains "тест" at the end
  1. Create Page with slug "страница" - the link "Страница" appears in the sidebar, but when you click it - 404 page not found, and the url

#10 @azaozz
15 years ago

  • Resolution set to fixed
  • Status changed from new to closed

All listed tickets are closed.

#11 @Denis-de-Bernardy
15 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Indeed... I closed them all as dups of this one, so we can reference all of the issues from a single ticket.

#12 @lloydbudd
15 years ago

  • Milestone changed from 2.9 to 3.0

#13 @lloydbudd
15 years ago

  • Keywords early added

#14 @nacin
15 years ago

  • Keywords changed from needs-patch, early to needs-patch early

#15 @Denis-de-Bernardy
15 years ago

  • Keywords early removed
  • Milestone changed from 3.0 to Future Release

to fix this, we need to ensure proper migration of permalinks.

#16 @Denis-de-Bernardy
15 years ago

Closing #10797: handle curly quotes and em/en-dashes properly

#17 @hakre
15 years ago

Might be related: #11724

#18 @nacin
15 years ago

#10451 closed as a duplicate.

#19 follow-ups: @miqrogroove
15 years ago

  • Milestone changed from Future Release to 3.0
  • Priority changed from normal to high

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

#20 in reply to: ↑ 19 @scribu
15 years ago

Replying to miqrogroove:

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

The reason why that's difficult:

ryan, on #9534:

Incoming slugs (from permalink requests) are run through the sanitizer and they won't match the old slug.

@scribu
15 years ago

separate contexts for sanitize_title()

#21 @scribu
15 years ago

  • Keywords has-patch added; needs-patch removed

Currently, the two slugs bellow match the same post:

le-slug

lé-șlüg

Depending on the language, this might or might not make sense.

To solve the accents issue, we need to have two contexts for sanitize_title: one for storing a slug and one for retrieving a post, based on a slug.

See 9591.diff

#22 @scribu
15 years ago

Note that if some users want the old behaviour (lé-șlüg matches le-slug), they can have it back with a single line:

add_filter('sanitize_title', 'remove_accents');

#23 in reply to: ↑ 19 @Denis-de-Bernardy
15 years ago

Replying to miqrogroove:

Please also strip em-dash.

I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.

My understanding of the problem is that the same sanitization flow comes in when looking up posts in the database. In other words, if abc previously resulted in abd, and abd now results abe, the abc post would no longer be reachable. Or something like that.

For the gory details, see the comments from Ryan in the multitudes of tickets I've closed as dup of this one.

#24 @nacin
15 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1

Early 3.1 per the dev chat today.

#25 @bi0xid
15 years ago

Adding #12373 to the list.

#26 @scribu
15 years ago

And also #12822.

#28 @WraithKenny
14 years ago

  • Cc Ken@… added

#29 @nacin
14 years ago

  • Milestone changed from Awaiting Triage to 3.1

Back to early 3.1 per wpdevel and previous IRC chat.

#30 @ilovecolors
14 years ago

Guillemets also need attention. Try typing «Hello» for a title and the slug/permalink will bet set to «hello».

#31 @tosak
14 years ago

I got tired of trying to understand the root of this problem, so I just attached what works for me.. A $match = urldecode($match) in WP::parse_request()

Again this might be a completely wrong fix, and not backwards compatible, but it seems to work for me with all tested characters (scandinavian, kanji, hebrew and cyrillic). «hello» works too.

#32 @scribu
14 years ago

tosak, I tried your patch and it failed on a simple URL like this:

http://localhost/wp/index.php/category/category-b/

#33 @intlect
14 years ago

  • Cc tj@… added

@tosak
14 years ago

urldecode before match

#34 @tosak
14 years ago

Scribu, the patch worked for pages only. I'm sorry about that.

Urldecode correctly decoded the string, but replaced + with space characters, messing up the regex. I've updated the patch to use rawurldecode instead, and as far as I can see both pages, categories and posts are working correctly now (fingers crossed).

Please test it again!

#35 @scribu
14 years ago

I don't really understand what your patch fixes.

The problem here (or at least that which I'm trying to fix) is that you end up with ugly permalinks, due to poor slug sanitization:

http://example.com/slug-with-foreign-characterș-that-gețs-url-encoded

#36 @tosak
14 years ago

If you save a page with say Kanji letters:
http://localhost/wp/漢字

WP saves the regex as:
(%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$

Then when you call the page, WP decodes the request, but not the regex, so it tries to match 漢字 against (%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$ resulting in a 404

if you use rawurlencode, WP correctly matches 漢字 against (漢字)(/[0-9]+)?/?$

I'm sorry if I misunderstood what we're trying to do here, this just fixed the issue with foreign characters in URL's and 404's for me, so wanted to share it.

#37 @scribu
14 years ago

Ok, thanks for the clarification.

#38 @scribu
14 years ago

I tracked down an interesting changeset from 5 years ago: [2191] - it first introduced the urldecode( urlencode() ) stuff.

@scribu
14 years ago

refresh

#39 @scribu
14 years ago

Found out why [2191] was needed:

echo basename( 'misc/știri' ); // Result: 'tiri'

@scribu
14 years ago

Introduce _qv_basename()

#40 @scribu
14 years ago

9591.3.diff:

  • introduce sanitize_title_for_query() and use it where appropriate
  • introduce _qv_basename() helper method and use it for hierarchical taxonomies too.

#41 @scribu
14 years ago

From my testing, old permalinks still work, but new slugs can't contain odd characters, which is ideal.

#42 @scribu
14 years ago

(In [15923]) Introduce _qv_basename() and apply it to hierarchical taxonomies. See #9591.

@scribu
14 years ago

refresh

#43 @scribu
14 years ago

(In [15929]) Introduce sanitize_title_for_query(). See #9591

#44 @scribu
14 years ago

(In [15930]) remove_accents(): Nordic characters fixes. Props einare. Fixes #4739. See #9591

#46 @scribu
14 years ago

(In [15931]) Also convert uppercase letters in Latin Extended-B. See #9591

#47 @jose.geraldo
14 years ago

  • Severity changed from normal to minor
  • Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks to remove_accents() improvements for i18n permalinks
  • Type changed from defect (bug) to enhancement
  • Version changed from 2.8 to 3.0.1

Suggestion to improve remove_accents

insert at line 537 in wp-includes/formatting.php (inside remove_accents function) the following code

Special ordinals
chr(194).chr(170) => 'a',
chr(194).chr(176) => 'o', chr(194).chr(186) => 'o',

It deals with ordinals in portuguese/spanish post/page titles in order to create a correct slug url (404 page not found error instead).

#48 @hakre
14 years ago

Related: #10249 - Page slug in cyrillic = Error 404 - Not Found! (Duplicate?)

#49 @unsalkorkmaz
14 years ago

As i wrote in
http://core.trac.wordpress.org/ticket/10249#comment:35

An example.. "примерно" you can use these russian chars in permalinks but "şŞ İı Ğğ Üü Öö çÇ" becoming "ss-ii-gg-uu-oo-cc" and this is a problem actually. You can try this yourself too. those chars are UTF8.

If its working as intended, plz give me a custom way to use those chars.

#51 @markjaquith
14 years ago

  • Milestone changed from 3.1 to Future Release

Done with this for 3.1. Have some ideas about how to do migrations for 3.2.

#52 @scribu
14 years ago

  • Keywords early has-patch removed

What sort of migrations?

@jose.geraldo
14 years ago

correcting ordinal characters in portuguese and spanish langiuages slug urls

#53 @nacin
14 years ago

  • Keywords 3.2-early added

#54 @scribu
14 years ago

(In [16832]) Make get_term_by() use sanitize_title_for_query() too. See #9591

#55 @scribu
14 years ago

jose.geraldo: Could you point to the UTF block those characters belong to?

#56 @scribu
14 years ago

Unicode block I mean.

#57 @miqrogroove
14 years ago

If I'm reading that correctly, he's patching

U+00AA ª Feminine Ordinal Indicator
U+00B0 ° Degree symbol
U+00BA º Masculine ordinal indicator

#58 @scribu
14 years ago

I think the degree symbol should be stripped, rather than be converted to 'o'.

Also, the code is duplicated.

#59 @shaneiseminger
14 years ago

  • Cc shaneiseminger added

As of 3.03: when a title has smart quotes, the sanitization happens halfway: they are converted to dumb quotes but are not stripped out. The permalink then contains dumb quotes which of course breaks it.

Dumb quotes are stripped correctly.

A little experimention also indicates that remove_accents() does not convert ø -- chr(216) and chr(248) -- I know it's not technically an accented character but it seems like this would be the intent.

#60 @markjaquith
14 years ago

(In [17357]) Revert [16832]. see #9591. fixes #16282

#61 @hakre
14 years ago

Have a patch at Related: #16905 that makes remove_accents filterable.

#62 @hakre
14 years ago

Probably it's better to normalize the data in remove_accents() and make the transliteration table overload-able - Related: #16908

#63 follow-up: @Pierre_02
13 years ago

  • Cc Pierre_02 added
  • Keywords needs-patch added
  • Type changed from enhancement to defect (bug)
  • Version changed from 3.0.1 to 3.1

Hi all,

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.
i'm currently developing a website where i use my own custom post type for accomodation/restaurant.
So a user can select a type of accomodation/restaurant (eg : Hôtel, Restaurant, Camping, etc.) by clicking on a tag then it can refine its choice by click on a tag of the town name (eg : Villers-Cotterêts).
The problem is that, if the name of the town contains an emphasis when i click on "Next page" i got a 301 redirect and my emphasis caracters are removed during the redirection :(... So the name of town (Villers-Cotterêts) becomes "Villers-Cotterts" and no record is found...

I know about URL sanitization of WP and it's why i used a querystring. Hopping that WP let it untouched.
Of course, i've tried to hard code the name of the town in my 'meta_value' and then, everything works fine. Despite the redirection, WP show me the next pages for the selected criteria ; so this is not a programming problem.

In my code, i use a syntax with querystring to generate my url.
example : http://mywebsite.fr/commerce/type/restaurant/?ville=Villers-Cotter%C3%AAts so the next page link (generated by the 'next_posts_link' function) will be http://mywebsite.fr/commerce/type/restaurant/page/2/?ville=Villers-Cotter%C3%AAts but if i click on it the WP redirect me (by a 301 error) to http://mywebsite.fr/commerce/type/hotel/page/2/?ville=Villers-Cotterts (notice the removing of ê). WP removes space in names too...

So my question is : Why, when it's the first page, WP let untouched my querystring and as soon as i paginate i'm victim of 301 redirect ?

PS. Sorry this a not a problem from version 3.0.1 to 3.1 but a problem in the 3.1.2 release.

TIA.

Amicably,

Pierre.

Last edited 13 years ago by Pierre_02 (previous) (diff)

#64 @SergeyBiryukov
13 years ago

  • Keywords has-patch added; 3.2-early needs-patch removed
  • Milestone changed from Future Release to 3.3
  • Version changed from 3.1 to 2.8

I've reviewed all the tickets linked from here and checked them for remaining issues.

9591.2.patch fixes:

  • #5554: Problems with ÆØÅæøå (only Ø, actually)
  • ticket:9591:47: Ordinal indicators in Portuguese/Spanish

Only 3 characters are actually added, the rest is formatting.

Also uploaded a new patch in #10797.

#65 in reply to: ↑ 63 @SergeyBiryukov
13 years ago

Replying to Pierre_02:

i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.

This should be discussed in another ticket, as it has little to do with remove_accents(), which sanitizes "Villers-Cotterêts" correctly. You probably should use post slugs instead of titles.

#66 @bi0xid
13 years ago

Still without a solution for opening question and exclamation marks ¡ and ¿ (#12373)

#67 follow-up: @SergeyBiryukov
13 years ago

That's handled in #10797, see my comment there.

#68 in reply to: ↑ 67 @bi0xid
13 years ago

Replying to SergeyBiryukov:

That's handled in #10797, see my comment there.

Thank you :). Awesome job :)

#69 @SergeyBiryukov
13 years ago

  • Keywords needs-unit-tests added

#70 follow-up: @unsalkorkmaz
13 years ago

After i saw "Good idea! We're working on it" status at
http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love
and while SergeyBiryukov working on this ticket, want to ask if there is any update on my comment:
http://core.trac.wordpress.org/ticket/9591#comment:49
?

#71 in reply to: ↑ 70 ; follow-up: @SergeyBiryukov
13 years ago

Replying to unsalkorkmaz:

want to ask if there is any update on my comment

I've attached a plugin to allow accented characters in slugs in #15248.

#72 in reply to: ↑ 71 ; follow-up: @unsalkorkmaz
13 years ago

Replying to SergeyBiryukov:

I've attached a plugin to allow accented characters in slugs in #15248.

It basically uses raw_title and completely disabling all filtering right? is this safe? if its safe, why we filtering title then? :S

#73 in reply to: ↑ 72 ; follow-up: @SergeyBiryukov
13 years ago

Replying to unsalkorkmaz:

It basically uses raw_title and completely disabling all filtering right? is this safe?

It only cancels remove_accents(), while the other filtering, such as sanitize_title_with_dashes(), is still performed. So I guess if you want to use accented characters in slugs, that's the proper solution.

If you only want to remove accents from one characters and keep them on others, you can create your own version of remove_accents() and hook it to sanitize_title. A filter in remove_accents() itself would probably be a bit cleaner solution, but it's not necessary.

I've attached ticket:15248:allow-turkish-accents-in-slugs.php. It should do exactly what you want: allow "şŞ İı Ğğ Üü Öö çÇ" characters and remove accents from all others. It also allows these characters in usernames, as per your comment there.

#74 in reply to: ↑ 73 @unsalkorkmaz
13 years ago

Replying to SergeyBiryukov:

It also allows these characters in usernames, as per your comment there.

Thanks for trying to help for usernames too :) But i am getting "Only lowercase letters (a-z) and numbers are allowed" error from function wpmu_validate_user_signup @ wp-includes/ms-functions.php

@ampt
13 years ago

#75 @ampt
13 years ago

Add unit tests. The tests also replace some existing test cases under jacob/TestFormatting. Perhaps another set of eyes on this would be good to make sure its on the right track.

#76 @SergeyBiryukov
13 years ago

Closed #18903 as a duplicate of this ticket and #5554.

#77 follow-ups: @ryan
13 years ago

Committed 9591.tests.patch to unit-tests repo:

http://unit-tests.trac.wordpress.org/changeset/471

#78 in reply to: ↑ 77 @SergeyBiryukov
13 years ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

remove_accents_from_* files are now empty:
http://unit-tests.trac.wordpress.org/browser/wp-testdata/jacob

They can probably be removed.

#79 in reply to: ↑ 77 @SergeyBiryukov
13 years ago

Replying to ryan:

Committed 9591.tests.patch to unit-tests repo

wp-testdata/formatting/remove_accents.01.input.txt file was not committed:
http://core.trac.wordpress.org/attachment/ticket/9591/9591.tests.patch#L111

1) TestRemoveAccents::test_remove_accents_iso8859
file_get_contents(./wp-testdata\formatting\remove_accents.01.input.txt): failed to open stream: No such file or directory

#81 @SergeyBiryukov
13 years ago

Cool. Before patch: Tests: 6, Assertions: 7, Failures: 1.

With 9591.2.patch: OK (6 tests, 7 assertions)

#82 @nacin
13 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In [19125]:

Add a few characters to remove_accents(). props SergeyBiryukov. props ampt for [UT471]. fixes #9591.

#83 @SergeyBiryukov
13 years ago

  • Keywords needs-unit-tests removed

#84 @dnusim
12 years ago

#4739 was marked as a duplicate.

Note: See TracTickets for help on using tickets.