Stripping non-alphanumeric multi-byte characters from slugs
Reported by: johnbillion
sanitize_title_with_dashes() strips non-alphanumeric characters from a title to create a slug. Unfortunately, it only strips ASCII non-alphanumeric characters: apart from a few exceptions, all multi-byte characters are preserved. As a result, non-Western (and plenty of Western) punctuation and symbols end up in the slug, because they are treated just like any other multi-byte character.
As an example, here are some common non-alphanumeric Chinese characters which would ideally be stripped from slugs, but are not:
- 。 (U+3002, Ideographic Full Stop, %E3%80%82)
- ， (U+FF0C, Fullwidth Comma, %EF%BC%8C)
- ！ (U+FF01, Fullwidth Exclamation Mark, %EF%BC%81)
- ： (U+FF1A, Fullwidth Colon, %EF%BC%9A)
- 《 (U+300A, Left Double Angle Bracket, %E3%80%8A)
- 》 (U+300B, Right Double Angle Bracket, %E3%80%8B)
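The code points and percent-encodings listed above can be checked with a short script. This is only an illustrative sketch in Python (not WordPress code); the helper name describe() is made up for this example. It also prints each character's Unicode general category, which shows that all six fall into punctuation categories (P*).

```python
import unicodedata
from urllib.parse import quote

# The six characters from the list above.
CHARS = ["。", "，", "！", "：", "《", "》"]

def describe(ch):
    """Return (code point, Unicode name, general category, percent-encoded UTF-8)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        quote(ch),  # percent-encodes the UTF-8 bytes of the character
    )

for ch in CHARS:
    print(ch, *describe(ch))
```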
Obviously it would be impractical to make a list of all the non-ASCII characters we want to strip from slugs. The list would be gigantic.
So the question is, would it be possible to use Unicode ranges to blacklist (or whitelist) whole ranges of characters to be stripped from (or preserved in) slugs? Is this practical or even desirable?
Or would it make more sense to continue using a list of just the most common multi-byte characters to be stripped?
The latter is far more practical to implement, but the former is a more complete solution.
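To make the trade-off concrete, here is a rough sketch of both options in Python (for illustration only; it is not a patch for sanitize_title_with_dashes()). The function names, the keep parameter, and the short COMMON_PUNCTUATION list are all hypothetical. The first function uses Unicode general categories as a stand-in for "whole ranges"; the second uses a fixed list of common characters.

```python
import unicodedata

# Hypothetical short list of "most common" characters to strip.
COMMON_PUNCTUATION = frozenset("。，！：《》")

def strip_by_category(title, keep="-_"):
    """Drop characters whose general category is punctuation (P*) or symbol (S*).

    Characters in `keep` survive so that existing slug separators are not lost.
    """
    return "".join(
        ch for ch in title
        if ch in keep or unicodedata.category(ch)[0] not in ("P", "S")
    )

def strip_by_list(title, blocked=COMMON_PUNCTUATION):
    """Drop only the characters that appear in the fixed list."""
    return "".join(ch for ch in title if ch not in blocked)

if __name__ == "__main__":
    for title in ("你好，世界！", "«你好»"):
        print(title, strip_by_category(title), strip_by_list(title))
```

Both functions handle the Chinese example, but only the category-based one strips « and », which are not in the list. On the other hand, the category-based approach also removes every symbol (currency signs, for instance), so it would need careful review of which categories or exceptions are appropriate.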