Opened 18 months ago

Last modified 3 months ago

#22402 new enhancement

Stripping non-alphanumeric multi-byte characters from slugs

Reported by: johnbillion Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Formatting Keywords: needs-patch
Focuses: Cc:

Description

sanitize_title_with_dashes() strips non-alphanumeric characters from a title to create a slug. Unfortunately it only strips ASCII non-alphanumeric characters. Apart from a few exceptions, all multi-byte characters are preserved. This means all non-Western (and plenty of Western) non-alphanumeric characters end up in the slug as they're treated just like any other multi-byte character.

As an example, here are some common non-alphanumeric Chinese characters which would ideally be stripped from slugs, but are not:

  • 。 (U+3002, Ideographic Full Stop, %E3%80%82)
  • ， (U+FF0C, Fullwidth Comma, %EF%BC%8C)
  • ！ (U+FF01, Fullwidth Exclamation Mark, %EF%BC%81)
  • ： (U+FF1A, Fullwidth Colon, %EF%BC%9A)
  • 《 (U+300A, Left Double Angle Bracket, %E3%80%8A)
  • 》 (U+300B, Right Double Angle Bracket, %E3%80%8B)
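
For anyone who wants to check the code points and percent-encodings listed here, this is a minimal sketch in Python (chosen only for illustration; `describe` is a hypothetical helper, not anything in WordPress). It derives each entry from the character itself using the standard library.

```python
import unicodedata
from urllib.parse import quote

# The six characters from the list above, written as escapes so the
# source stays unambiguous even if a tool normalizes fullwidth forms.
CHARS = "\u3002\uff0c\uff01\uff1a\u300a\u300b"


def describe(ch: str) -> str:
    """Return 'U+XXXX, Unicode Name, percent-encoded UTF-8' for one character."""
    code_point = f"U+{ord(ch):04X}"
    name = unicodedata.name(ch).title()
    encoded = quote(ch)  # quote() encodes as UTF-8 by default
    return f"{code_point}, {name}, {encoded}"


for ch in CHARS:
    print(ch, describe(ch))
```

Each character occupies three bytes in UTF-8, which is why each one shows up as three percent-encoded octets in a slug.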

Obviously it would be impractical to make a list of all the non-ASCII characters we want to strip from slugs. The list would be gigantic.

So the question is, would it be possible to use Unicode ranges to blacklist (or whitelist) whole ranges of characters to be stripped from (or preserved in) slugs? Is this practical or even desirable?
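
To make the range idea concrete, here is a rough sketch of a ranged blacklist, written in Python purely as an illustration. The ranges in `STRIP_RANGES` and the name `strip_ranges` are my own assumptions, not a vetted list or an existing API; they cover the six characters above plus their immediate neighbours.

```python
import re

# Illustrative ranges only (assumptions, not a vetted list).
# Note that the fullwidth block has to be split into sub-ranges, because
# fullwidth digits (U+FF10-U+FF19) and letters sit between the punctuation.
STRIP_RANGES = [
    (0x3001, 0x3003),  # ideographic comma, full stop, ditto mark
    (0x3008, 0x3011),  # CJK angle, corner and lenticular brackets
    (0xFF01, 0xFF0F),  # fullwidth exclamation mark through solidus
    (0xFF1A, 0xFF20),  # fullwidth colon through commercial at
]

# Build a single character class such as [\u3001-\u3003\u3008-\u3011...].
STRIP_PATTERN = re.compile(
    "[" + "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in STRIP_RANGES) + "]"
)


def strip_ranges(title: str) -> str:
    """Remove every character that falls inside one of STRIP_RANGES."""
    return STRIP_PATTERN.sub("", title)


print(strip_ranges("你好\uff0c世界\uff01"))  # fullwidth comma and exclamation mark removed
```

The need to carve the fullwidth block into pieces hints at the practical cost: whole Unicode blocks tend to mix punctuation with letters and digits, so ranges have to be chosen carefully.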

Or would it make more sense to continue using a list of just the most common multi-byte characters to be stripped?
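
For comparison, the list-based approach might look something like the sketch below (again Python for illustration only; `STRIP_CHARS` and `strip_listed_characters` are hypothetical names). It removes exactly the characters that have been identified so far and nothing else.

```python
# Hypothetical blacklist: only the six characters named in this ticket.
STRIP_CHARS = "\u3002\uff0c\uff01\uff1a\u300a\u300b"

# str.maketrans with a third argument maps each listed character to None,
# so translate() deletes it.
_DELETE_TABLE = str.maketrans("", "", STRIP_CHARS)


def strip_listed_characters(title: str) -> str:
    """Remove only the explicitly listed characters from the title."""
    return title.translate(_DELETE_TABLE)


print(strip_listed_characters("\u300a红楼梦\u300b"))
```

Anything not on the list passes through untouched. For example, the ideographic comma (U+3001) would survive until someone adds it.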

The latter is far more practical in the short term, but the former is a more complete solution.

Thoughts?

Change History (3)

comment:1 toscho 18 months ago

  • Cc info@… added

comment:2 knutsp 17 months ago

  • Cc knut@… added

comment:3 nacin 3 months ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

I'm comfortable with a ranged whitelist or blacklist if it can be done properly. In the meantime, we should still try to blacklist individual characters as we identify them.
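
One possible shape for a whitelist that avoids hand-maintained ranges is to key off Unicode general categories instead. This swaps in a different technique from the code point ranges discussed in the ticket, and the sketch below is Python for illustration only: `slugify_by_category` and the choice of categories to keep are assumptions, and it does not reproduce what sanitize_title_with_dashes() actually does.

```python
import re
import unicodedata

# Assumption: major general categories to preserve.
# L = letters, N = numbers, M = combining marks (needed by scripts
# whose vowels or tone marks are separate code points).
KEEP_MAJOR_CATEGORIES = {"L", "N", "M"}


def slugify_by_category(title: str) -> str:
    """Keep letters, numbers and marks; turn whitespace and hyphens into '-'."""
    out = []
    for ch in title.lower():
        major = unicodedata.category(ch)[0]
        if major in KEEP_MAJOR_CATEGORIES:
            out.append(ch)
        elif ch.isspace() or ch == "-":
            out.append("-")
        # Everything else (punctuation, symbols, controls) is dropped.
    slug = re.sub(r"-+", "-", "".join(out))  # collapse runs of hyphens
    return slug.strip("-")


print(slugify_by_category("Hello, World!"))
print(slugify_by_category("\u300a红楼梦\u300b\uff1a第一回"))
```

The trade-off is that the result depends on the Unicode data shipped with the runtime, so behaviour could shift slightly between versions.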
