Make WordPress Core

Opened 11 years ago

Last modified 5 years ago

#22402 new enhancement

Stripping non-alphanumeric multi-byte characters from slugs

Reported by: johnbillion
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Formatting
Keywords: needs-patch dev-feedback
Focuses:
Cc:

Description

sanitize_title_with_dashes() strips non-alphanumeric characters from a title to create a slug. Unfortunately, it strips only ASCII non-alphanumeric characters. Apart from a few exceptions, all multi-byte characters are preserved. This means that non-Western (and plenty of Western) non-alphanumeric characters end up in the slug, because they are treated just like any other multi-byte character.

As an example, here are some common non-alphanumeric Chinese characters which would ideally be stripped from slugs, but are not:

  • 。 (U+3002, Ideographic Full Stop, %E3%80%82)
  • ， (U+FF0C, Fullwidth Comma, %EF%BC%8C)
  • ！ (U+FF01, Fullwidth Exclamation Mark, %EF%BC%81)
  • ： (U+FF1A, Fullwidth Colon, %EF%BC%9A)
  • 《 (U+300A, Left Double Angle Bracket, %E3%80%8A)
  • 》 (U+300B, Right Double Angle Bracket, %E3%80%8B)
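
For anyone who wants to double-check the code points and percent-encodings in the list above, here is a minimal sketch. It is written in Python purely for illustration (WordPress itself is PHP), and the helper name `describe` is hypothetical. It also reports each character's Unicode general category, which shows that all six are classed as punctuation.

```python
import unicodedata
from urllib.parse import quote

# The six characters listed above, written as escapes so they survive copy/paste.
CHARS = ["\u3002", "\uff0c", "\uff01", "\uff1a", "\u300a", "\u300b"]

def describe(ch):
    """Return (code point, percent-encoded UTF-8, Unicode general category)."""
    return ("U+%04X" % ord(ch), quote(ch), unicodedata.category(ch))

for ch in CHARS:
    print(ch, *describe(ch))
```

Every category returned here starts with "P" (punctuation), which is the property a broader rule could key on.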

Obviously it would be impractical to make a list of all the non-ASCII characters we want to strip from slugs. The list would be gigantic.

So the question is, would it be possible to use Unicode ranges to blacklist (or whitelist) whole ranges of characters to be stripped from (or preserved in) slugs? Is this practical or even desirable?
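
One possible shape for the ranged approach, sketched in Python for illustration only: instead of hand-maintained code point ranges, it keys on Unicode general categories and keeps letters, numbers and combining marks. The function name `strip_by_category` and the choice of categories are assumptions, not anything in core. In PHP the comparable tool would be PCRE's `\p{...}` property classes with the `u` modifier.

```python
import unicodedata

# Assumption: letters (L*), numbers (N*) and combining marks (M*) are worth
# keeping in a slug; everything else is treated as a separator.
KEEP = ("L", "N", "M")

def strip_by_category(title):
    """Whitelist by general category, then join the remaining runs with dashes."""
    kept = [
        ch if unicodedata.category(ch)[0] in KEEP else " "
        for ch in title
    ]
    return "-".join("".join(kept).lower().split())

print(strip_by_category("\u300a\u4f60\u597d\u300b\uff0c\u4e16\u754c\uff01"))
print(strip_by_category("Hello, World!"))
```

Categories are finer-grained than block ranges. For example, U+3005 (々) sits in the same CJK Symbols and Punctuation block as U+3002 but is a letter (category Lm), so stripping that whole block by range would remove it, while a category check keeps it.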

Or would it make more sense to continue using a list of just the most common multi-byte characters to be stripped?
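
For comparison, the explicit-list approach might look like the sketch below (Python for illustration; `STRIP` and `strip_listed` are hypothetical names, and the list holds only the six characters from this ticket). It is simple and predictable, but anything not on the list passes straight through.

```python
# Assumption: only the six characters named in this ticket are listed.
STRIP = "\u3002\uff0c\uff01\uff1a\u300a\u300b"

# Map each listed character to a space so it acts as a word separator.
TABLE = str.maketrans({ch: " " for ch in STRIP})

def strip_listed(title):
    """Blacklist a fixed set of characters, then join the remaining runs with dashes."""
    return "-".join(title.translate(TABLE).split())

print(strip_listed("\u300a\u4f60\u597d\u300b\uff0c\u4e16\u754c\uff01"))
# U+FF1B (fullwidth semicolon) is not on the list, so it is left in place.
print(strip_listed("\u4f60\u597d\uff1b\u4e16\u754c"))
```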

The latter is much more practical, but the former is the more complete solution.

Thoughts?

Change History (4)

#1 @toscho
11 years ago

  • Cc info@… added

#2 @knutsp
11 years ago

  • Cc knut@… added

#3 @nacin
10 years ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

I'm comfortable with a ranged whitelist or blacklist if it can be done properly. In the meantime, we should still try to blacklist individual characters as we identify them.

#4 @chriscct7
8 years ago

  • Keywords dev-feedback added