Percent-encoding in URLs should use uppercase hex digits instead of lowercase
Reported by: akky | Owned by: ryan
When you select [Options]-[Permalinks]-[Common Options]-[Date and name based], or use %postname% in a custom URL structure, the entry title is normalized so that it contains only characters permitted in a URL.
If the title has non-ASCII letters, they cannot be put directly in the URL, so they are percent-encoded. This is done in sanitize_title_with_dashes_original() and utf8_uri_encode().
The problem is that these two functions normalize too aggressively by converting everything to lowercase.
For example, in the unit test data, "Zhang Ziyi" (written in Chinese) is currently converted to "%e7%ab%a0%e5%ad%90%e6%80%a1", but it should be "%E7%AB%A0%E5%AD%90%E6%80%A1".
Currently, WordPress generates the title part of the URL by URL-encoding it and then applying strtolower(), so everything becomes lowercase, including the hex digits of the percent-encoded octets.
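To illustrate the order of operations described here, a minimal Python sketch follows. It is not the WordPress code; the function name is hypothetical, and Python's `urllib.parse.quote` and `str.lower` stand in for the PHP encoding step and strtolower().

```python
from urllib.parse import quote


def slug_lowercased_after_encoding(title: str) -> str:
    """Hypothetical stand-in: percent-encode first, then lowercase everything."""
    # quote() emits uppercase hex digits for each UTF-8 byte.
    encoded = quote(title, safe="")
    # Lowercasing the whole string also lowercases the hex digits.
    return encoded.lower()


print(slug_lowercased_after_encoding("章子怡"))
```

Because the lowercase step runs after the encoding step, it cannot tell hex digits apart from ordinary letters.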
The reason uppercase letters are preferred in percent-encoding is described in RFC 3986, section 2.1:
The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. '''For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.'''
According to RFC 2119, section 3, "should" means the recommendation may be ignored when there is a valid reason. So if the lowercase form is a deliberate WordPress policy, I do not think I can insist on this fix.
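The equivalence stated in RFC 3986 can be sketched in Python as a small normalizer that uppercases only the hex digits of percent-encoded octets and leaves every other character alone. The function name and the sample paths are assumptions made for illustration.

```python
import re

# Matches one percent-encoded octet such as %e7 or %E7.
_PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_percent_escapes(uri: str) -> str:
    """Uppercase the hex digits of each percent-escape; leave the rest unchanged."""
    return _PERCENT_ESCAPE.sub(lambda m: m.group(0).upper(), uri)


# Two URIs that differ only in the case of their escapes normalize to the same string.
a = uppercase_percent_escapes("/2009/01/%e7%ab%a0%e5%ad%90%e6%80%a1/")
b = uppercase_percent_escapes("/2009/01/%E7%AB%A0%E5%AD%90%E6%80%A1/")
print(a == b)
```

A consumer that compares or counts URLs could apply such a step before comparing, but a producer that already emits uppercase escapes avoids depending on every consumer doing so.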
Now, let me explain why I and other Japanese WordPress users need this fix. In the Japanese blogosphere, there is a social bookmarking service called 'Hatena Bookmark', a Japanese equivalent of del.icio.us.
This service accepts users' bookmarks in several different ways, such as a bookmarklet, a widget, the site itself, and user-made tools. Some of them use the WordPress URL as is, while others normalize the percent-encoding to uppercase.
As a result, the bookmarks for a blog entry written on WordPress tend to be split across the differently cased URLs, which makes it harder for the entry to appear among popular items compared with entries on other blog systems. (Search engines recognize encoded titles in URLs, so Japanese users need this permalink option.)
If WordPress always generated uppercase percent-encoded URLs (only for the escapes; the roman alphabet parts are better left lowercase), this problem would be solved, and we could encourage more potential users to migrate to WordPress.
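One way to get this result is to lowercase the roman letters before percent-encoding rather than after. The Python sketch below assumes that reordering; the function name is hypothetical and this is not a proposed WordPress patch.

```python
from urllib.parse import quote


def slug_with_uppercase_escapes(title: str) -> str:
    """Hypothetical sketch: lowercase ASCII letters first, then percent-encode."""
    # Lowercase only ASCII characters; non-ASCII characters are left untouched.
    lowered = "".join(c.lower() if c.isascii() else c for c in title)
    # Collapse whitespace runs into single dashes.
    dashed = "-".join(lowered.split())
    # quote() emits uppercase hex digits, and nothing lowercases them afterwards.
    return quote(dashed, safe="-")


print(slug_with_uppercase_escapes("Zhang Ziyi 章子怡"))
```

Since the lowercase step never sees the encoded output, the roman part stays lowercase while the escapes keep the uppercase form that RFC 3986 recommends.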
I do not know whether there are similar issues in other non-ASCII languages, and fixing this may break backward compatibility with the URLs of previously generated entries. However, this is a really big problem for us Japanese users. It is more of a WordPress marketing issue than one of RFC validity.
Change History (19)
- Component changed from General to Permalinks
- Owner changed from anonymous to ryan
- Keywords needs-patch added; UTF-8 URL normalization removed
- Milestone changed from 2.9 to 2.8