Opened 6 years ago

Closed 2 years ago

#3843 closed defect (bug) (duplicate)

Smart quote apostrophe ’ results in a permalink URL with %e2%80%99

Reported by: foolswisdom
Owned by: ryan
Priority: normal
Milestone:
Component: Permalinks
Version: 2.2
Severity: minor
Keywords: has-patch slug permalink dev-feedback
Cc: drmike

Description

Smart quote apostrophe ’ results in a permalink URL (slug) with %e2%80%99

ENV: WP trunk r4915

smart quote apostrophe ’
Mac shortcut: Shift-Option-]

ADDITIONAL DETAILS
My guess is that a solution should identify which characters are allowed, which are translated to a hyphen (-), and strip all the others.

Attachments (2)

unicode-punctuation-removal.diff (1.6 KB) - added by noel 5 years ago.
Permalink filter for unicode punctuation.
formatting-7-2-4am.diff (1.0 KB) - added by noel 5 years ago.
Unicode fixes that do not produce a 404.


Change History (27)

This is proper behavior. The curly quote isn't plaintext -- it's a symbol and has to be translated. The same goes for other UTF-8 symbols such as Chinese characters (some other bug was about that) -- they are and should be turned into URL-safe percent-encoded sequences.

A solution would have to deal with this case specifically. Note that the URL, while ugly, is functional. Also note that in 2.1, people should be able to edit their post slug and have the old one redirect to the current one.

Just a little clarification. The function being used to create the slug is sanitize_title_with_dashes in wp-includes/formatting.php

The sequence of events is currently:

1) Post title becomes the slug candidate

2) Accents are removed (replaced by un-accented letters)

3) Characters that still look like they are UTF-8 are encoded with utf8_uri_encode into octets (%e2, etc.); this is what is creating the reported behavior

4) HTML entities and any character except letters, numbers, underscores, spaces, octets, and hyphens are removed; this is where other punctuation is removed

5) Spaces are turned into hyphens, and the whole thing is lower-cased
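
The five steps above can be sketched roughly as follows. This is an illustrative Python approximation, not the PHP code in wp-includes/formatting.php; the helper names (remove_accents, utf8_percent_encode, slugify_sketch) are made up for this example, and the accent removal and entity matching are simplified assumptions.

```python
import re
import unicodedata


def remove_accents(text):
    """Step 2 (approximation): replace accented letters with unaccented ones."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def utf8_percent_encode(text):
    """Step 3: turn every non-ASCII character into lowercase %xx UTF-8 octets."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append("".join(f"%{b:02x}" for b in ch.encode("utf-8")))
    return "".join(out)


def slugify_sketch(title):
    slug = title                                   # 1) title is the candidate
    slug = remove_accents(slug)                    # 2) accents removed
    slug = utf8_percent_encode(slug)               # 3) remaining UTF-8 -> octets
    slug = re.sub(r"&[a-zA-Z0-9#]+;", "", slug)    # 4) drop HTML entities ...
    slug = re.sub(r"[^%A-Za-z0-9 _-]", "", slug)   #    ... and other punctuation
    slug = re.sub(r"\s+", "-", slug.strip())       # 5) spaces -> hyphens
    return slug.lower()                            #    and lower-case


print(slugify_sketch("Don\u2019t Panic"))  # smart apostrophe survives as octets
print(slugify_sketch("Don't Panic"))       # plain apostrophe is stripped
```

The smart apostrophe is encoded in step 3, before step 4 gets a chance to strip it as punctuation, which is why the two titles end up with different slugs.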

So, to fix this, we would have to add a step 2.5:

2.5: Translate into hyphens, or remove (more consistent with what happens to other punctuation), a specific list of special (but common) punctuation characters.
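
A minimal sketch of what such a step could look like, in Python for illustration only. The character list and the choice between removing and hyphenating are assumptions made for this example (the ticket leaves both open), and strip_special_punctuation is a hypothetical name.

```python
# Hypothetical list: which characters belong here, and whether each is
# removed ("") or turned into a hyphen ("-"), is exactly the open question.
SPECIAL_PUNCTUATION = {
    "\u2018": "",   # left single quotation mark
    "\u2019": "",   # right single quotation mark (the reported apostrophe)
    "\u201c": "",   # left double quotation mark
    "\u201d": "",   # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}

_TABLE = str.maketrans(SPECIAL_PUNCTUATION)


def strip_special_punctuation(text):
    """Run before UTF-8 octet encoding so these characters never become %xx."""
    return text.translate(_TABLE)


print(strip_special_punctuation("Don\u2019t Panic"))
```

Because this runs before the octet encoding, the listed characters are handled like the ASCII punctuation that is already stripped later in the sequence.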

Questions:

a) Is this worth doing, considering that the current behavior makes a usable slug, and that you can always edit your slug by hand if you want to?

b) If it is worth doing, what should the list of special punctuation characters be, and should they be removed or translated into hyphens?

  • Milestone changed from 2.2 to 2.4
  • Cc drmike added

Issue exists over in wp.com land as well:

http://en.forums.wordpress.com/topic.php?id=9645

Also why not just strip it out?

Well, we strip *regular* quotes out, but not fancy quotes. I think this is really not going to be fixed easily -- we can strip out UTF-8 quotes, but what about other encodings?

  • Milestone 2.5 deleted
  • Resolution set to wontfix
  • Status changed from new to closed
  • Resolution wontfix deleted
  • Status changed from closed to reopened

Please can you leave a comment explaining why you've closed the ticket.

  • Milestone set to 2.7
  • Priority changed from low to normal

This patch should fix the problem: we were treating all Unicode as equal, when we should have been defining the different categories and removing the Unicode characters relevant to punctuation, etc.

This patch simply attaches onto the other sanitize_title functions and will probably need to be integrated more fully in the future. As for now, it works great for me on all the test cases I threw at it.

In the future, when all browsers support full Unicode characters in the URL, shouldn't we stop converting them at all? ;)
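
The category idea can be illustrated with a short Python sketch. This is not the attached patch, whose contents are not shown here; it only assumes that "categories" means Unicode general categories, and remove_unicode_punctuation is a hypothetical name.

```python
import unicodedata


def remove_unicode_punctuation(text):
    """Drop non-ASCII characters whose Unicode general category is punctuation.

    ASCII is left alone so the existing sanitizing steps keep handling it;
    letters from other scripts (category L*) are kept for later encoding.
    """
    return "".join(
        ch
        for ch in text
        if ord(ch) < 128 or not unicodedata.category(ch).startswith("P")
    )


print(remove_unicode_punctuation("Don\u2019t Panic"))
```

Filtering by category avoids maintaining a hand-written list of quotes and dashes, at the cost of also removing punctuation from other scripts.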

noel 5 years ago

Permalink filter for unicode punctuation.

  • Milestone changed from 2.7 to 2.6
  • Owner changed from anonymous to ryan
  • Status changed from reopened to new
  • Keywords dev-feedback added
  • Keywords has-patch added

Any changes to the sanitizer will lead to 404s for slugs made with the old sanitizer.
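
A toy Python sketch of why this happens, assuming the requested slug is re-sanitized before it is compared with the stored slug. The function names, the stored slug, and the post ID are all hypothetical.

```python
import re


def old_sanitize(slug):
    # Old behaviour: percent-encoded octets survive sanitizing.
    return re.sub(r"[^%a-z0-9_-]", "", slug.lower())


def new_sanitize(slug):
    # Changed behaviour: the octets for the smart apostrophe are dropped.
    return old_sanitize(slug.lower().replace("%e2%80%99", ""))


# Slug saved in the database back when the old sanitizer was in use.
STORED_SLUGS = {"don%e2%80%99t-panic": 42}  # hypothetical post ID


def find_post(requested_slug, sanitize):
    """Return the post ID for a requested slug, or None (a 404)."""
    return STORED_SLUGS.get(sanitize(requested_slug))


print(find_post("don%e2%80%99t-panic", old_sanitize))  # found
print(find_post("don%e2%80%99t-panic", new_sanitize))  # no match
```

The stored slug never changes, so once the lookup side sanitizes differently the old URL no longer matches anything.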

  • Owner changed from ryan to noel

I'll get that sorted out and resubmit a patch.

noel 5 years ago

Unicode fixes that do not produce a 404.

  • Owner changed from noel to ryan

This will cause old permalinks that have %e2%80%99 in them to 404.

  • Milestone changed from 2.6 to 2.7
  • Milestone changed from 2.7 to 2.9

As with #3206, I am pushing this out; we need a single solution to the whole mess ;-)

  • Component changed from Administration to Permalinks
  • Milestone 2.9 deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Merging into #9591.

  • Milestone set to 3.1
  • Resolution duplicate deleted
  • Status changed from closed to reopened

It's better to have remaining issues highlighted through open tickets.

  • Milestone 3.1 deleted
  • Resolution set to duplicate
  • Status changed from reopened to closed