Make WordPress Core

Opened 17 years ago

Closed 13 years ago

#3843 closed defect (bug) (duplicate)

Smart quote apostrophe ’ results in a permalink URL with %e2%80%99

Reported by: foolswisdom
Owned by: ryan
Milestone:
Priority: normal
Severity: minor
Version: 2.2
Component: Permalinks
Keywords: has-patch slug permalink dev-feedback
Focuses:
Cc:

Description

Smart quote apostrophe ’ results in a permalink URL (slug) with %e2%80%99

ENV: WP trunk r4915

smart quote apostrophe ’
Mac shortcut: Shift-Option-]
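
For reference, %e2%80%99 is the percent-encoded UTF-8 byte sequence of U+2019 (right single quotation mark). A minimal Python sketch of that encoding follows; Python is used only for illustration (WordPress itself is PHP), and the variable names are hypothetical.

```python
from urllib.parse import quote

SMART_APOSTROPHE = "\u2019"  # the smart quote apostrophe from the report

# UTF-8 encodes U+2019 as the three bytes E2 80 99.
utf8_hex = SMART_APOSTROPHE.encode("utf-8").hex()

# Percent-encoding those bytes gives the sequence seen in the slug.
percent_encoded = quote(SMART_APOSTROPHE).lower()

print(utf8_hex)         # e28099
print(percent_encoded)  # %e2%80%99
```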

ADDITIONAL DETAILS
My guess is that a solution should identify which characters are allowed and which should be translated to a hyphen (-), and strip all the others.

Attachments (2)

unicode-punctuation-removal.diff (1.6 KB) - added by noel 16 years ago.
Permalink filter for unicode punctuation.
formatting-7-2-4am.diff (1.0 KB) - added by noel 16 years ago.
Unicode fixes that do not produce a 404.


Change History (27)

#1 @rob1n
17 years ago

This is proper behavior. The curly quote isn't plain text; it's a symbol and has to be translated. The same goes for other UTF-8 symbols such as Chinese characters (some other bug was about that): they are, and should be, turned into URL-safe percent-encoded sequences.

#2 @markjaquith
17 years ago

A solution would have to deal with this case specifically. Note that the URL, while ugly, is functional. Also note that in 2.1, people should be able to edit their post slug and have the old one redirect to the current one.

#3 @jhodgdon
17 years ago

Just a little clarification: the function used to create the slug is sanitize_title_with_dashes in wp-includes/formatting.php.

The sequence of events is currently:

1) Post title becomes the slug candidate

2) Accents are removed (replaced by un-accented letters)

3) Characters that still look like they are UTF-8 are encoded with utf8_uri_encode into octets (%e2, etc.) (this is what is creating the reported behavior)

4) HTML entities and any character except letters, numbers, underscores, spaces, octets, and hyphens are removed (this is where other punctuation is removed)

5) Spaces are turned into hyphens, and the whole thing is lowercased
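
The sequence above can be sketched roughly as follows. This is an illustrative Python approximation, not the actual PHP of sanitize_title_with_dashes: the helper names are hypothetical, and the accent removal uses Unicode decomposition as a stand-in for whatever WordPress really does in step 2.

```python
import re
import unicodedata
from urllib.parse import quote


def remove_accents(text):
    # Step 2 stand-in: decompose accented letters and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def encode_non_ascii(text):
    # Step 3 stand-in: percent-encode the UTF-8 bytes of each non-ASCII character.
    return "".join(ch if ord(ch) < 128 else quote(ch).lower() for ch in text)


def slug_sketch(title):
    slug = remove_accents(title)                   # step 2
    slug = encode_non_ascii(slug)                  # step 3
    slug = re.sub(r"&.+?;", "", slug)              # step 4: drop HTML entities
    slug = re.sub(r"[^%A-Za-z0-9 _-]", "", slug)   # step 4: drop other punctuation
    slug = re.sub(r"\s+", "-", slug).lower()       # step 5
    return slug


print(slug_sketch("It's a Café"))  # its-a-cafe
print(slug_sketch("It’s a Café"))  # it%e2%80%99s-a-cafe
```

The plain apostrophe is dropped in step 4, while the smart apostrophe has already been turned into octets in step 3 and therefore survives.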

So, to fix this, we would have to add a step 2.5:

2.5: Translate into hyphens, or remove (more consistent with what happens to other punctuation), a specific list of special (but common) punctuation characters.
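
A minimal sketch of what such a step 2.5 might look like, in Python for illustration. The character list here is only an assumption (curly single and double quotes, en dash, em dash, ellipsis), and the function name is hypothetical.

```python
# Illustrative list only: curly single and double quotes, en dash,
# em dash, and ellipsis. The real list is an open question.
SPECIAL_PUNCTUATION = "\u2018\u2019\u201c\u201d\u2013\u2014\u2026"


def replace_special_punctuation(title, replacement=""):
    # Map each listed character to the replacement: "" removes it,
    # "-" translates it into a hyphen.
    table = str.maketrans({ch: replacement for ch in SPECIAL_PUNCTUATION})
    return title.translate(table)


print(replace_special_punctuation("It’s “fine”"))       # Its fine
print(replace_special_punctuation("It’s “fine”", "-"))  # It-s -fine-
```

Removing matches how a plain apostrophe is treated today, whereas translating to hyphens changes word boundaries ("It-s").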

Questions:

a) Is this worth doing, considering that the current behavior makes a usable slug, and that you can always edit your slug by hand if you want to?

b) If it is worth doing, what should the list of special punctuation characters be, and should they be removed or translated into hyphens?

#4 @foolswisdom
17 years ago

  • Milestone changed from 2.2 to 2.4

#5 @drmike
17 years ago

  • Cc drmike added

Issue exists over in wp.com land as well:

http://en.forums.wordpress.com/topic.php?id=9645

#6 @drmike
17 years ago

Also why not just strip it out?

#7 @rob1n
17 years ago

Well, we strip *regular* quotes out, but not fancy quotes. I think this is really not going to be fixed easily -- we can strip out UTF-8 quotes, but what about other encodings?

#8 @thee17
16 years ago

  • Milestone 2.5 deleted
  • Resolution set to wontfix
  • Status changed from new to closed

#9 @pishmishy
16 years ago

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Can you please leave a comment explaining why you closed the ticket?

#10 @lloydbudd
16 years ago

  • Milestone set to 2.7

#11 @noel
16 years ago

  • Priority changed from low to normal

This patch should fix the problem. We were treating all Unicode characters as equal, when we should have been distinguishing the different categories and removing the characters that fall under punctuation, etc.

This patch simply attaches to the other sanitize_title functions and will probably need to be integrated more fully in the future. For now, it works great for me on all the test cases I threw at it.

In the future, when all browsers support full Unicode characters in URLs, shouldn't we stop converting them at all? ;)
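
The category-based idea described in this comment could look something like the Python sketch below. It is not the attached patch (the patch contents are not shown here); it only illustrates filtering by Unicode general category, and the function name is hypothetical.

```python
import unicodedata


def strip_unicode_punctuation(text):
    # Keep ASCII as-is (the existing sanitizer already handles it) and drop
    # any non-ASCII character whose general category is punctuation (P*).
    return "".join(
        ch for ch in text
        if ord(ch) < 128 or not unicodedata.category(ch).startswith("P")
    )


print(strip_unicode_punctuation("It’s 中文…"))  # Its 中文
```

Letters such as Chinese characters are in category L and pass through untouched, so only the punctuation is affected.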

@noel
16 years ago

Permalink filter for unicode punctuation.

#12 @noel
16 years ago

  • Milestone changed from 2.7 to 2.6
  • Owner changed from anonymous to ryan
  • Status changed from reopened to new

#13 @noel
16 years ago

  • Keywords dev-feedback added

#14 @noel
16 years ago

  • Keywords has-patch added

#15 @ryan
16 years ago

Any changes to the sanitizer will lead to 404s for slugs made with the old sanitizer.

#16 @noel
16 years ago

  • Owner changed from ryan to noel

I'll get that sorted out and resubmit a patch.

@noel
16 years ago

Unicode fixes that do not produce a 404.

#17 @noel
16 years ago

  • Owner changed from noel to ryan

#18 @ryan
16 years ago

This will cause old permalinks that have %e2%80%99 in them to 404.
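
A minimal Python sketch of this failure mode, assuming the requested slug is run through the same sanitizer before it is matched against stored slugs (an assumption about the lookup path, not something stated in this ticket). All names and the post ID are hypothetical.

```python
APOSTROPHE_OCTETS = "%e2%80%99"

# A post saved under the old sanitizer keeps the octets in its stored slug.
stored_slugs = {"it%e2%80%99s-here": 42}


def old_sanitize(slug):
    # Old behavior: the octets are left in place.
    return slug.lower()


def new_sanitize(slug):
    # Changed behavior: the smart-apostrophe octets are stripped.
    return slug.lower().replace(APOSTROPHE_OCTETS, "")


def find_post(requested_slug, sanitize):
    # Returns the post ID, or None (which would surface as a 404).
    return stored_slugs.get(sanitize(requested_slug))


print(find_post("it%E2%80%99s-here", old_sanitize))  # 42
print(find_post("it%E2%80%99s-here", new_sanitize))  # None
```

Under these assumptions, an old permalink keeps resolving only if the stored slugs are migrated or the lookup falls back to the old form.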

#19 @noel
16 years ago

  • Milestone changed from 2.6 to 2.7

#20 @westi
16 years ago

  • Milestone changed from 2.7 to 2.9

As with #3206, I am pushing this out; we need a single solution to the whole mess ;-)

#21 @ryan
15 years ago

  • Component changed from Administration to Permalinks

#22 @Denis-de-Bernardy
15 years ago

  • Milestone 2.9 deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Merging into #9591.

#23 @scribu
13 years ago

  • Milestone set to 3.1
  • Resolution duplicate deleted
  • Status changed from closed to reopened

#24 @scribu
13 years ago

It's better to have remaining issues highlighted through open tickets.

#25 @nacin
13 years ago

  • Milestone 3.1 deleted
  • Resolution set to duplicate
  • Status changed from reopened to closed