#9591 closed defect (bug) (fixed)
remove_accents() improvements for i18n permalinks
Reported by: | Denis-de-Bernardy | Owned by: | ryan |
---|---|---|---|
Milestone: | 3.3 | Priority: | high |
Severity: | minor | Version: | 2.8 |
Component: | Permalinks | Keywords: | has-patch |
Focuses: | Cc: |
Description
Opening a master-ticket, so as to trim the 1.3k bug list:
- #8765: Strip ° characters from permalink (closed as dup of this)
- #3843: Smart quote apostrophe ’ results in a permalink URL with %e2%80%99 (closed as dup of this, has patch)
- #6973: Problems with š, ž, č, ć, đ (closed as dup of this)
- #4739: Problems with Ð ð, Þ þ and Æ æ (closed as dup of this, has patch)
- #5554: Problems with ÆØÅæøå (closed as dup of this, has patch)
- #3206: ¡ and ¿ should get stripped (closed as dup of this)
Related:
Attachments (9)
Change History (93)
#1
@
15 years ago
- Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n to sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks
#3
@
15 years ago
Also related: #5918 -- cannot register with a username containing arabic characters
#4
@
15 years ago
Also related: #8934 -- strip fancy single and double quotes, triple periods from permalinks
#9
@
15 years ago
see also: #9866 -- problems with cyrillic.
- Enable permalink - /%category%/%postname%
- Create post with slug "тест" - then you can access it trough the link from the side bar and the URL contains "тест" at the end
- Create Page with slug "страница" - the link "Страница" appears in the sidebar, but when you click it - 404 page not found, and the url
#10
@
15 years ago
- Resolution set to fixed
- Status changed from new to closed
All listed tickets are closed.
#11
@
15 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
Indeed... I closed them all as dups of this one, so we can reference all of the issues from a single ticket.
#15
@
15 years ago
- Keywords early removed
- Milestone changed from 3.0 to Future Release
to fix this, we need to ensure proper migration of permalinks.
#19
follow-ups:
↓ 20
↓ 23
@
15 years ago
- Milestone changed from Future Release to 3.0
- Priority changed from normal to high
Please also strip em-dash.
I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.
#20
in reply to:
↑ 19
@
15 years ago
Replying to miqrogroove:
I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.
The reason why that's difficult:
ryan, on #9534:
Incoming slugs (from permalink requests) are run through the sanitizer and they won't match the old slug.
#21
@
15 years ago
- Keywords has-patch added; needs-patch removed
Currently, the two slugs bellow match the same post:
le-slug
lé-șlüg
Depending on the language, this might or might not make sense.
To solve the accents issue, we need to have two contexts for sanitize_title: one for storing a slug and one for retrieving a post, based on a slug.
See 9591.diff
#22
@
15 years ago
Note that if some users want the old behaviour (lé-șlüg matches le-slug), they can have it back with a single line:
add_filter('sanitize_title', 'remove_accents');
#23
in reply to:
↑ 19
@
15 years ago
Replying to miqrogroove:
Please also strip em-dash.
I don't understand Denis's reason for punting this ticket. Existing permalinks should not be affected by a proper patch; this should only filter slugs generated for new items.
My understanding of the problem is that the same sanitization flow comes in when looking up posts in the database. In other words, if abc previously resulted in abd, and abd now results abe, the abc post would no longer be reachable. Or something like that.
For the gory details, see the comments from Ryan in the multitudes of tickets I've closed as dup of this one.
#24
@
15 years ago
- Keywords early added
- Milestone changed from 3.0 to 3.1
Early 3.1 per the dev chat today.
#29
@
14 years ago
- Milestone changed from Awaiting Triage to 3.1
Back to early 3.1 per wpdevel and previous IRC chat.
#30
@
14 years ago
Guillemets also need attention. Try typing «Hello» for a title and the slug/permalink will bet set to «hello».
#31
@
14 years ago
I got tired of trying to understand the root of this problem, so I just attached what works for me.. A $match = urldecode($match) in WP::parse_request()
Again this might be a completely wrong fix, and not backwards compatible, but it seems to work for me with all tested characters (scandinavian, kanji, hebrew and cyrillic). «hello» works too.
#34
@
14 years ago
Scribu, the patch worked for pages only. I'm sorry about that.
Urldecode correctly decoded the string, but replaced + with space characters, messing up the regex. I've updated the patch to use rawurldecode instead, and as far as I can see both pages, categories and posts are working correctly now (fingers crossed).
Please test it again!
#35
@
14 years ago
I don't really understand what your patch fixes.
The problem here (or at least that which I'm trying to fix) is that you end up with ugly permalinks, due to poor slug sanitization:
http://example.com/slug-with-foreign-characterș-that-gețs-url-encoded
#36
@
14 years ago
If you save a page with say Kanji letters:
http://localhost/wp/漢字
WP saves the regex as:
(%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$
Then when you call the page, WP decodes the request, but not the regex, so it tries to match 漢字 against (%e6%bc%a2%e5%ad%97)(/[0-9]+)?/?$ resulting in a 404
if you use rawurlencode, WP correctly matches 漢字 against (漢字)(/[0-9]+)?/?$
I'm sorry if I misunderstood what we're trying to do here, this just fixed the issue with foreign characters in URL's and 404's for me, so wanted to share it.
#38
@
14 years ago
I tracked down an interesting changeset from 5 years ago: [2191] - it first introduced the urldecode( urlencode() ) stuff.
#39
@
14 years ago
Found out why [2191] was needed:
echo basename( 'misc/știri' ); // Result: 'tiri'
#40
@
14 years ago
9591.3.diff:
- introduce sanitize_title_for_query() and use it where appropriate
- introduce _qv_basename() helper method and use it for hierarchical taxonomies too.
#41
@
14 years ago
From my testing, old permalinks still work, but new slugs can't contain odd characters, which is ideal.
#45
@
14 years ago
About 9591.patch: [4979]
#47
@
14 years ago
- Severity changed from normal to minor
- Summary changed from sanitize_title_with_dashes() and remove_accents() improvements for i18n permalinks to remove_accents() improvements for i18n permalinks
- Type changed from defect (bug) to enhancement
- Version changed from 2.8 to 3.0.1
Suggestion to improve remove_accents
insert at line 537 in wp-includes/formatting.php (inside remove_accents function) the following code
Special ordinals
chr(194).chr(170) => 'a',
chr(194).chr(176) => 'o', chr(194).chr(186) => 'o',
It deals with ordinals in portuguese/spanish post/page titles in order to create a correct slug url (404 page not found error instead).
#49
@
14 years ago
As i wrote in
http://core.trac.wordpress.org/ticket/10249#comment:35
An example.. "примерно" you can use these russian chars in permalinks but "şŞ İı Ğğ Üü Öö çÇ" becoming "ss-ii-gg-uu-oo-cc" and this is a problem actually. You can try this yourself too. those chars are UTF8.
If its working as intended, plz give me a custom way to use those chars.
#50
@
14 years ago
Opened a suggestion ticket about it: http://core.trac.wordpress.org/ticket/15248
#51
@
14 years ago
- Milestone changed from 3.1 to Future Release
Done with this for 3.1. Have some ideas about how to do migrations for 3.2.
#57
@
14 years ago
If I'm reading that correctly, he's patching
U+00AA ª Feminine Ordinal Indicator
U+00B0 ° Degree symbol
U+00BA º Masculine ordinal indicator
#58
@
14 years ago
I think the degree symbol should be stripped, rather than be converted to 'o'.
Also, the code is duplicated.
#59
@
14 years ago
- Cc shaneiseminger added
As of 3.03: when a title has smart quotes, the sanitization happens halfway: they are converted to dumb quotes but are not stripped out. The permalink then contains dumb quotes which of course breaks it.
Dumb quotes are stripped correctly.
A little experimention also indicates that remove_accents() does not convert ø -- chr(216) and chr(248) -- I know it's not technically an accented character but it seems like this would be the intent.
#62
@
14 years ago
Probably it's better to normalize the data in remove_accents() and make the transliteration table overload-able - Related: #16908
#63
follow-up:
↓ 65
@
13 years ago
- Cc Pierre_02 added
- Keywords needs-patch added
- Type changed from enhancement to defect (bug)
- Version changed from 3.0.1 to 3.1
Hi all,
i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.
i'm currently developing a website where i use my own custom post type for accomodation/restaurant.
So a user can select a type of accomodation/restaurant (eg : Hôtel, Restaurant, Camping, etc.) by clicking on a tag then it can refine its choice by click on a tag of the town name (eg : Villers-Cotterêts).
The problem is that, if the name of the town contains an emphasis when i click on "Next page" i got a 301 redirect and my emphasis caracters are removed during the redirection :(... So the name of town (Villers-Cotterêts) becomes "Villers-Cotterts" and no record is found...
I know about URL sanitization of WP and it's why i used a querystring. Hopping that WP let it untouched.
Of course, i've tried to hard code the name of the town in my 'meta_value' and then, everything works fine. Despite the redirection, WP show me the next pages for the selected criteria ; so this is not a programming problem.
In my code, i use a syntax with querystring to generate my url.
example : http://mywebsite.fr/commerce/type/restaurant/?ville=Villers-Cotter%C3%AAts so the next page link (generated by the 'next_posts_link' function) will be http://mywebsite.fr/commerce/type/restaurant/page/2/?ville=Villers-Cotter%C3%AAts but if i click on it the WP redirect me (by a 301 error) to http://mywebsite.fr/commerce/type/hotel/page/2/?ville=Villers-Cotterts (notice the removing of ê). WP removes space in names too...
So my question is : Why, when it's the first page, WP let untouched my querystring and as soon as i paginate i'm victim of 301 redirect ?
PS. Sorry this a not a problem from version 3.0.1 to 3.1 but a problem in the 3.1.2 release.
TIA.
Amicably,
Pierre.
#64
@
13 years ago
- Keywords has-patch added; 3.2-early needs-patch removed
- Milestone changed from Future Release to 3.3
- Version changed from 3.1 to 2.8
I've reviewed all the tickets linked from here and checked them for remaining issues.
9591.2.patch fixes:
- #5554: Problems with ÆØÅæøå (only Ø, actually)
- ticket:9591:47: Ordinal indicators in Portuguese/Spanish
Only 3 characters are actually added, the rest is formatting.
Also uploaded a new patch in #10797.
#65
in reply to:
↑ 63
@
13 years ago
Replying to Pierre_02:
i notice a bug with custom post type with emphasis (i use them, i'm french) and pagination.
This should be discussed in another ticket, as it has little to do with remove_accents()
, which sanitizes "Villers-Cotterêts" correctly. You probably should use post slugs instead of titles.
#66
@
13 years ago
Still without a solution for opening question and exclamation marks ¡
and ¿
(#12373)
#68
in reply to:
↑ 67
@
13 years ago
Replying to SergeyBiryukov:
That's handled in #10797, see my comment there.
Thank you :). Awesome job :)
#70
follow-up:
↓ 71
@
13 years ago
After i saw "Good idea! We're working on it" status at
http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love
and while SergeyBiryukov working on this ticket, want to ask if there is any update on my comment:
http://core.trac.wordpress.org/ticket/9591#comment:49
?
#71
in reply to:
↑ 70
;
follow-up:
↓ 72
@
13 years ago
Replying to unsalkorkmaz:
want to ask if there is any update on my comment
I've attached a plugin to allow accented characters in slugs in #15248.
#72
in reply to:
↑ 71
;
follow-up:
↓ 73
@
13 years ago
Replying to SergeyBiryukov:
I've attached a plugin to allow accented characters in slugs in #15248.
It basically uses raw_title and completely disabling all filtering right? is this safe? if its safe, why we filtering title then? :S
#73
in reply to:
↑ 72
;
follow-up:
↓ 74
@
13 years ago
Replying to unsalkorkmaz:
It basically uses raw_title and completely disabling all filtering right? is this safe?
It only cancels remove_accents()
, while the other filtering, such as sanitize_title_with_dashes()
, is still performed. So I guess if you want to use accented characters in slugs, that's the proper solution.
If you only want to remove accents from one characters and keep them on others, you can create your own version of remove_accents()
and hook it to sanitize_title
. A filter in remove_accents()
itself would probably be a bit cleaner solution, but it's not necessary.
I've attached ticket:15248:allow-turkish-accents-in-slugs.php. It should do exactly what you want: allow "şŞ İı Ğğ Üü Öö çÇ" characters and remove accents from all others. It also allows these characters in usernames, as per your comment there.
#74
in reply to:
↑ 73
@
13 years ago
Replying to SergeyBiryukov:
It also allows these characters in usernames, as per your comment there.
Thanks for trying to help for usernames too :) But i am getting "Only lowercase letters (a-z) and numbers are allowed" error from function wpmu_validate_user_signup @ wp-includes/ms-functions.php
#75
@
13 years ago
Add unit tests. The tests also replace some existing test cases under jacob/TestFormatting. Perhaps another set of eyes on this would be good to make sure its on the right track.
#78
in reply to:
↑ 77
@
13 years ago
Replying to ryan:
Committed 9591.tests.patch to unit-tests repo
remove_accents_from_*
files are now empty:
http://unit-tests.trac.wordpress.org/browser/wp-testdata/jacob
They can probably be removed.
#79
in reply to:
↑ 77
@
13 years ago
Replying to ryan:
Committed 9591.tests.patch to unit-tests repo
wp-testdata/formatting/remove_accents.01.input.txt
file was not committed:
http://core.trac.wordpress.org/attachment/ticket/9591/9591.tests.patch#L111
1) TestRemoveAccents::test_remove_accents_iso8859 file_get_contents(./wp-testdata\formatting\remove_accents.01.input.txt): failed to open stream: No such file or directory
#81
@
13 years ago
Cool. Before patch: Tests: 6, Assertions: 7, Failures: 1.
With 9591.2.patch: OK (6 tests, 7 assertions)
See also: #5663