Opened 7 years ago

Closed 3 years ago

#3843 closed defect (bug) (duplicate)

Smart quote apostrophe ’ results in a permalink URL with %e2%80%99

Reported by: foolswisdom
Owned by: ryan
Milestone:
Priority: normal
Severity: minor
Version: 2.2
Component: Permalinks
Keywords: has-patch slug permalink dev-feedback
Focuses:
Cc:

Description

Smart quote apostrophe ’ results in a permalink URL (slug) with %e2%80%99
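
For readers wondering where %e2%80%99 comes from: the smart apostrophe is U+2019, which takes three bytes in UTF-8, and percent-encoding writes each byte as %XX. A minimal Python sketch of that arithmetic (an illustration only, not WordPress's PHP code):

```python
from urllib.parse import quote

RIGHT_SINGLE_QUOTE = "\u2019"  # the smart apostrophe

# U+2019 is encoded as the three bytes E2 80 99 in UTF-8
utf8_bytes = RIGHT_SINGLE_QUOTE.encode("utf-8")
print(utf8_bytes.hex())  # e28099

# Percent-encoding emits one %XX triplet per byte; quote() uses uppercase hex,
# so lower() is applied to match the form reported in this ticket
encoded = quote(RIGHT_SINGLE_QUOTE).lower()
print(encoded)  # %e2%80%99
```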

ENV: WP trunk r4915

smart quote apostrophe ’
Mac shortcut: Shift-Option-]

ADDITIONAL DETAILS
My guess is that a solution should identify the allowed characters and the characters to be translated to a hyphen (-), and strip all the others.

Attachments (2)

unicode-punctuation-removal.diff (1.6 KB) - added by noel 6 years ago.
Permalink filter for unicode punctuation.
formatting-7-2-4am.diff (1.0 KB) - added by noel 6 years ago.
Unicode fixes that do not produce a 404.


Change History (27)

comment:1 rob1n 7 years ago

This is proper behavior. The curly quote isn't plaintext -- it's a symbol and has to be translated. The same goes for other UTF-8 symbols such as Chinese characters (some other bug was about that) -- they are, and should be, turned into URL-safe percent-encoded sequences.

comment:2 markjaquith 7 years ago

Solution would have to deal with this case specifically. Note that the URL, while ugly, is functional. Also note that in 2.1, people should be able to edit their post slug and have the old one redirect to the current one.

comment:3 jhodgdon 7 years ago

Just a little clarification. The function being used to create the slug is sanitize_title_with_dashes in wp-includes/formatting.php

The sequence of events is currently:

1) Post title becomes the slug candidate

2) Accents are removed (replaced by un-accented letters)

3) Characters that still look like they are UTF-8 are encoded with utf8_uri_encode into octets (%e2, etc.) (this is what is creating the reported behavior)

4) HTML entities and any character except letters, numbers, underscores, spaces, octets, and hyphens are removed (this is where other punctuation is removed)

5) Spaces are turned into hyphens, and whole thing is lower-cased
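
The five steps above can be sketched roughly as follows. This is a Python approximation written for this ticket, not the PHP source of sanitize_title_with_dashes; the function names (remove_accents, encode_non_ascii, slugify) and the exact regular expressions are assumptions made for illustration.

```python
import re
import unicodedata
from urllib.parse import quote


def remove_accents(text):
    # Step 2: decompose accented letters and drop the combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def encode_non_ascii(text):
    # Step 3: percent-encode whatever is still non-ASCII as lowercase octets
    return "".join(ch if ord(ch) < 128 else quote(ch).lower() for ch in text)


def slugify(title):
    slug = remove_accents(title)                    # steps 1 and 2
    slug = encode_non_ascii(slug)                   # step 3
    slug = re.sub(r"&[A-Za-z0-9#]+;", "", slug)     # step 4: HTML entities
    slug = re.sub(r"[^%A-Za-z0-9 _-]", "", slug)    # step 4: other punctuation
    slug = re.sub(r"\s+", "-", slug.strip())        # step 5: spaces to hyphens
    return slug.lower()                             # step 5: lower-case


print(slugify("Don\u2019t Panic"))  # smart apostrophe survives as octets
print(slugify("Don't Panic"))       # plain apostrophe is stripped in step 4
```

Under these assumptions the smart apostrophe reaches step 4 already encoded as octets, so it is kept, while the plain ASCII apostrophe is removed.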

So... to fix this, we would have to add a step 2.5:

2.5: Translate into hyphens, or remove (more consistent with what happens to other punctuation), a specific list of special (but common) punctuation characters.
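
A minimal sketch of what such a step 2.5 could look like, in Python rather than WordPress's PHP. The character list and the choice between removing and hyphenating are hypothetical; they are exactly what question (b) below leaves open.

```python
# Hypothetical list: quotes are removed, dashes become hyphens
SPECIAL_PUNCTUATION = {
    "\u2018": "",   # left single quotation mark
    "\u2019": "",   # right single quotation mark (smart apostrophe)
    "\u201c": "",   # left double quotation mark
    "\u201d": "",   # right double quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
}
TRANSLATION_TABLE = str.maketrans(SPECIAL_PUNCTUATION)


def strip_special_punctuation(text):
    # Would run after accent removal and before UTF-8 octet encoding
    return text.translate(TRANSLATION_TABLE)


print(strip_special_punctuation("Don\u2019t Panic"))  # Dont Panic
```

Because the characters are gone before the encoding step runs, no %e2%80%99 octets would be produced for them.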

Questions:

a) Is this worth doing, considering that the current behavior makes a usable slug, and that you can always edit your slug by hand if you want to?

b) If it is worth doing, what should the list of special punctuation characters be, and should they be removed or translated into hyphens?

comment:4 foolswisdom 7 years ago

  • Milestone changed from 2.2 to 2.4

comment:5 drmike 7 years ago

  • Cc drmike added

Issue exists over in wp.com land as well:

http://en.forums.wordpress.com/topic.php?id=9645

comment:6 drmike 7 years ago

Also why not just strip it out?

comment:7 rob1n 7 years ago

Well, we strip *regular* quotes out, but not fancy quotes. I think this is really not going to be fixed easily -- we can strip out UTF-8 quotes, but what about other encodings?

comment:8 thee17 6 years ago

  • Milestone 2.5 deleted
  • Resolution set to wontfix
  • Status changed from new to closed

comment:9 pishmishy 6 years ago

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Please can you leave a comment explaining why you've closed the ticket.

comment:10 lloydbudd 6 years ago

  • Milestone set to 2.7

comment:11 noel 6 years ago

  • Priority changed from low to normal

This patch should fix the problem. We were treating all Unicode characters alike, when we should have been distinguishing the different categories and removing the characters that belong to punctuation and similar categories.

This patch simply attaches onto the other sanitize_title functions and will probably need to be integrated more fully in the future. As for now, it works great for me on all the test cases I threw at it.
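
The category-based idea can be illustrated in Python with the standard unicodedata module. This is only a sketch of the approach described above, not a port of the attached diff; the function name is hypothetical, and treating every general category starting with "P" as removable is an assumption.

```python
import unicodedata


def remove_unicode_punctuation(text):
    # Drop non-ASCII characters whose Unicode general category is a
    # punctuation class (Pc, Pd, Ps, Pe, Pi, Pf, Po); keep everything else
    return "".join(
        ch
        for ch in text
        if ord(ch) < 128 or not unicodedata.category(ch).startswith("P")
    )


print(unicodedata.category("\u2019"))  # Pf (final quote punctuation)
print(remove_unicode_punctuation("Don\u2019t \u201cquote\u201d \u4e2d\u6587"))
```

Letters from other scripts, such as the two Chinese characters in the example, fall in category Lo and pass through untouched, so they would still be encoded as octets later.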

In the future, when all browsers support full Unicode characters in the URL, shouldn't we stop converting them at all? ;)

noel 6 years ago

Permalink filter for unicode punctuation.

comment:12 noel 6 years ago

  • Milestone changed from 2.7 to 2.6
  • Owner changed from anonymous to ryan
  • Status changed from reopened to new

comment:13 noel 6 years ago

  • Keywords dev-feedback added

comment:14 noel 6 years ago

  • Keywords has-patch added

comment:15 ryan 6 years ago

Any changes to the sanitizer will lead to 404s for slugs made with the old sanitizer.

comment:16 noel 6 years ago

  • Owner changed from ryan to noel

I'll get that sorted out and resubmit a patch.

noel 6 years ago

Unicode fixes that do not produce a 404.

comment:17 noel 6 years ago

  • Owner changed from noel to ryan

comment:18 ryan 6 years ago

This will cause old permalinks that have %e2%80%99 in them to 404.
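
One way to picture the 404 concern, as a Python sketch under stated assumptions: slugs saved by the old sanitizer stay in storage unchanged, and the requested slug is re-sanitized with the new rules before the lookup. The slug table, post ID, and function names below are all hypothetical, and the fallback is just one possible mitigation, not what either attached patch does.

```python
# Hypothetical stored slugs (slug -> post ID), as saved by the old sanitizer
stored_slugs = {"don%e2%80%99t-panic": 42}


def new_sanitize(slug):
    # Hypothetical changed rule: drop the encoded smart apostrophe
    return slug.replace("%e2%80%99", "")


def lookup_strict(requested):
    # Assumption: the requested slug is re-sanitized before the lookup
    return stored_slugs.get(new_sanitize(requested))


def lookup_with_fallback(requested):
    # Try the newly sanitized form first, then the slug exactly as requested
    post_id = stored_slugs.get(new_sanitize(requested))
    if post_id is None:
        post_id = stored_slugs.get(requested)
    return post_id


print(lookup_strict("don%e2%80%99t-panic"))         # None, i.e. a 404
print(lookup_with_fallback("don%e2%80%99t-panic"))  # 42
```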

comment:19 noel 6 years ago

  • Milestone changed from 2.6 to 2.7

comment:20 westi 6 years ago

  • Milestone changed from 2.7 to 2.9

As with #3206, I am pushing this out; we need a single solution to the whole mess ;-)

comment:21 ryan 5 years ago

  • Component changed from Administration to Permalinks

comment:22 Denis-de-Bernardy 5 years ago

  • Milestone 2.9 deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Merging into #9591.

comment:23 scribu 3 years ago

  • Milestone set to 3.1
  • Resolution duplicate deleted
  • Status changed from closed to reopened

comment:24 scribu 3 years ago

It's better to have remaining issues highlighted through open tickets.

comment:25 nacin 3 years ago

  • Milestone 3.1 deleted
  • Resolution set to duplicate
  • Status changed from reopened to closed