Opened 16 years ago
Closed 14 years ago
#6697 closed defect (bug) (maybelater)
Percent Encoding in URL should be capital alphabets instead of small letters
| Reported by: | akky | Owned by: | ryan |
|---|---|---|---|
| Milestone: | | Priority: | normal |
| Severity: | major | Version: | 2.5 |
| Component: | Permalinks | Keywords: | needs-patch |
| Focuses: | | Cc: | |
Description
wp-includes/formatting.php utf8_uri_encode()
When you use [Options]-[Permalinks]-[Common Options]-[Date and name based], or %postname% in a custom URL structure, the entry title is normalized to fit the characters permitted in URLs.
If the title has non-ASCII letters, those letters cannot be put directly in the URL, so they are percent-encoded. This is done in sanitize_title_with_dashes_original() and utf8_uri_encode().
The problem is that these two functions normalize too much by converting everything to lowercase.
For example, in the unit test data, "Zhang Ziyi" (in Chinese) is currently converted to "%e7%ab%a0%e5%ad%90%e6%80%a1", but it should be "%E7%AB%A0%E5%AD%90%E6%80%A1".
Currently, WordPress generates the title part of the URL by URL-encoding it and then applying strtolower(), so everything ends up lowercase.
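The effect described above can be reproduced outside WordPress with a minimal sketch. It assumes PHP's built-in rawurlencode() as a stand-in for WordPress's own encoder, so it illustrates the lowercasing step only and is not the actual code path in formatting.php.

```php
<?php
// "Zhang Ziyi" in Chinese, written as raw UTF-8 bytes so the file encoding does not matter.
$title = "\xE7\xAB\xA0\xE5\xAD\x90\xE6\x80\xA1";

// Assumption: rawurlencode() stands in for WordPress's encoder here.
$encoded = rawurlencode( $title );

// Lowercasing the whole slug also lowercases the hex digits of each octet.
$lowered = strtolower( $encoded );

echo $encoded, "\n"; // %E7%AB%A0%E5%AD%90%E6%80%A1
echo $lowered, "\n"; // %e7%ab%a0%e5%ad%90%e6%80%a1
```

Both strings identify the same resource under RFC 3986, but a service that compares URLs byte for byte will treat them as two different bookmarks.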
The reason percent-encoding should use uppercase letters is given in RFC 3986, section 2.1:
"The uppercase hexadecimal digits 'A' through 'F' are equivalent to the lowercase digits 'a' through 'f', respectively. If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent. For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings."
As RFC 2119, section 3 explains, "should" means you may ignore the recommendation if you have a valid reason. So if this was a deliberate WordPress policy, I do not think I can insist on the fix.
Now, let me explain why I and other Japanese WordPress users need this fix. In the Japanese blogosphere there is a social bookmarking service, Hatena Bookmark, which is the Japanese equivalent of del.icio.us.
This service accepts users' bookmarks in several different ways, such as a bookmarklet, a widget, the site itself, and user-made tools. Some of them use the WordPress URL as is, while others normalize it to uppercase.
As a result, the bookmarks for a blog entry written on WordPress tend to be split across several URLs, which keeps the entry from becoming a popular item when compared against blogs on other blog systems. (Search engines recognize encoded titles in URLs, so Japanese users need this permalink option.)
If WordPress always generated uppercase percent-encoding (but not for the roman alphabet parts, which are better left lowercase), this problem would be solved, and we could encourage more potential users to migrate to WordPress.
I do not know whether there are similar issues in other non-ASCII languages, and fixing this may break backward compatibility with previously generated entries. However, this is a really big problem for us Japanese users. It is more a WordPress marketing issue than a question of RFC validity.
Attachments (3)
Change History (19)
#2
@
16 years ago
I have been using my quick patch for months on my Japanese WordPress blog, and it works for me. The patch goes in wp-includes/formatting.php, at the end of the function sanitize_title_with_dashes():
      $title = trim($title, '-');
    + $title = preg_replace(
    +     '/%([a-fA-F0-9]{2})/e',
    +     "'%'.strtoupper('\\1')",
    +     $title
    + );
      return $title;
It only treats the symptom, though.
#4
@
15 years ago
- Keywords needs-patch added; UTF-8 URL normalization removed
- Milestone changed from 2.9 to 2.8
The suggested patch is not good. The /e modifier is being eliminated in WP in a separate ticket, in favor of preg_replace_callback. Also, the regex could just as well be this, to reduce backtracking:
"/%[a-f0-9]{2}/"
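For reference, the suggestion above could look roughly like the following sketch. The function name uppercase_percent_octets() is hypothetical and not part of WordPress, and the anonymous function requires PHP 5.3 or later, which is an assumption beyond what this ticket states. The pattern is the one proposed above, so it only matches octets that are already fully lowercase, as they are after strtolower().

```php
<?php
// Hypothetical helper, not a WordPress core function.
function uppercase_percent_octets( $title ) {
    // No /e modifier and no capture group: the callback receives the
    // whole match ("%e7") in $m[0] and returns it uppercased ("%E7").
    return preg_replace_callback(
        '/%[a-f0-9]{2}/',
        function ( $m ) {
            return strtoupper( $m[0] );
        },
        $title
    );
}

echo uppercase_percent_octets( 'post-%e7%ab%a0' ), "\n"; // post-%E7%AB%A0
```

Only the percent-encoded octets change; the roman alphabet parts of the slug stay lowercase, as the original report asks.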
#7
@
15 years ago
suggesting one slight tweak on the patch:
static $fn = create_function(bla blah);
preg_replace_callback(..., $fn, ...)
else it creates one new function per call, and it can get memory hungry.
@
15 years ago
Take 3. (Can't use create_function directly in a static initializer - otherwise changed as suggested)
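The restriction mentioned here is that a static variable's initializer must be a constant expression, so the callback has to be assigned on first use instead. The sketch below shows that general pattern only; it is not the attached patch. The function name is hypothetical, and it uses an anonymous function in place of create_function(), which was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Hypothetical helper illustrating the "build once, reuse" pattern.
function uppercase_octets_cached( $title ) {
    // A static initializer must be a constant expression, so start with null.
    static $callback = null;

    if ( null === $callback ) {
        // Built on the first call only, then reused by later calls.
        $callback = function ( $m ) {
            return strtoupper( $m[0] );
        };
    }

    return preg_replace_callback( '/%[a-f0-9]{2}/', $callback, $title );
}

echo uppercase_octets_cached( '%e5%ad%90' ), "\n"; // %E5%AD%90
```

With create_function() the caching mattered because every call registered another function for the rest of the request. With an anonymous function the saving is much smaller, so the static variable here mainly mirrors the suggestion made in the thread.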
#9
@
15 years ago
- Milestone changed from 2.8.1 to 2.9
I'm +1 for this one, but punting this back to 2.9 nonetheless. Ryan mentioned a couple of potential issues related to fixing sanitize_title_with_dashes() and its related functions. We'd want a means to enforce permalink history first.
#12
@
14 years ago
pauamma's patch above does not fix tag archive URLs etc. Adding a new patch.
http://core.trac.wordpress.org/attachment/ticket/6697/bug6697.4.patch
#13
@
14 years ago
- Keywords needs-patch added; has-patch removed
I don't understand the raw issues here, as in why we should do this.
We should avoid create_function altogether and create a new helper for it.
#14
@
14 years ago
I don't understand the raw issues here, as in why we should do this.
Neither do I. A bookmarking service converting to uppercase is not a valid reason to change it. The RFC quoted does recommend they be provided as uppercase, however lowercase is perfectly acceptable as well.
I don't think any processing overhead is worth it here honestly.
Given the 3 years and no traction, I'd suggest this be best implemented as a plugin for those that care.
Milestone 2.5.2 deleted