Make WordPress Core

Opened 16 years ago

Closed 14 years ago

#6697 closed defect (bug) (maybelater)

Percent Encoding in URL should be capital alphabets instead of small letters

Reported by: akky's profile akky Owned by: ryan's profile ryan
Milestone: Priority: normal
Severity: major Version: 2.5
Component: Permalinks Keywords: needs-patch
Focuses: Cc:

Description

wp-includes/formatting.php utf8_uri_encode()

When you use [Options]-[Permalinks]-[Common Options]-[Date and name based], or %postname% in custom URL, entry title will be normalized for fitting to URL permitted letters.

If title has non-ASCII letters, those letters cannot be directly put in URL so they are percent-encoded. This is processed in sanitize_title_with_dashes_original() and utf8_uri_encode() .

The problem is, these two functions normalizes too much for replacing to small letters.

For example, in unit test data, "Zhang Ziyi(in Chinese)" is currently converted to "%e7%ab%a0%e5%ad%90%e6%80%a1", however, this should be "%E7%AB%A0%E5%AD%90%E6%80%A1".

Currently, WordPress generates title-in-url by url-encode, then apply strtolower(), so everything goes to be small letters.

The reason capital letters in percent encoding required is described on RFC3986 section 2.1,

The uppercase hexadecimal digits 'A' through 'F' are equivalent to   the lowercase digits 'a' through 'f', respectively.  If two URIs differ only in the case of hexadecimal digits used in percent-encoded octets, they are equivalent.  '''For consistency, URI producers and normalizers should use uppercase hexadecimal digits for all percent-encodings.'''

As you see RFC2119 section 3, "should" means you may ignore it with valid reason. So if this was designed by WordPress's policy, I think I cannot force the fixation.


Now, let me explain why I, and other Japanese WordPress users needs this fix. In Japanese blogosphere, there is a Japanese del.icio.us equivalent, Social Bookmark Service 'Hatena Bookmark'.

This service accept users' bookmarks via several different ways like bookmarklet, widget, on site, user making tools. Some of them uses WordPress URL as is, whilst some others normalize to capital letters.

As a result of that, number of bookmarks for blog entries written on WordPress tend to be sparse. Interfered becoming popular items. When you compare against other blog system users blogs. (Search engine recognizes encoded URL titles so using this option is needed for Japanese users.)

If WordPress always generates capital encoded URL (not for roman alphabet parts, they are better staying small letters), this problem will be solved so that we can encourage more potential users migrates to WordPress.

I do not know if there are similar issues in other non-ASCII languages. And fixing this may make backword-incompatibility with old generated entries. However, this is a really big trouble for us Japanese users. This is rather WordPress marketing issue than RFC validness.

Attachments (3)

bug6697.2.patch (484 bytes) - added by pauamma 15 years ago.
Revised patch, should address review comments
bug6697.3.patch (786 bytes) - added by pauamma 15 years ago.
Take 3. (Can't use create_function directly in a static initializer - otherwise changed as suggested)
bug6697.4.patch (972 bytes) - added by wokamoto 14 years ago.

Download all attachments as: .zip

Change History (19)

#1 @ryan
16 years ago

  • Milestone changed from 2.5.2 to 2.9

Milestone 2.5.2 deleted

#2 @akky
16 years ago

I have been using my quick patch for months on my Japanese WordPress blog, and for me it is working. The patch is, wp-includes/formatting.php at the end of the function sanitize_title_with_dashes(),

  $title = trim($title, '-');

+  $title = preg_replace(
+    '/%([a-fA-F0-9]{2})/e',
+    "'%'.strtoupper('\\1')",
+    $title
+  );

  return $title;

It is quite a symptomatic treatment though.

#3 @ryan
16 years ago

  • Component changed from General to Permalinks
  • Owner changed from anonymous to ryan

#4 @Denis-de-Bernardy
15 years ago

  • Keywords needs-patch added; UTF-8 URL normalization removed
  • Milestone changed from 2.9 to 2.8

suggested patch is not good. /e is being eliminated in WP in a separate ticket, in favor of preg_replace_callback. also, the regex could just as well be this, to reduce backtraces:

"/%[a-f0-9]{2}/"

#5 @Denis-de-Bernardy
15 years ago

  • Milestone changed from 2.8 to Future Release

@pauamma
15 years ago

Revised patch, should address review comments

#6 @pauamma
15 years ago

  • Keywords has-patch added; needs-patch removed

#7 @Denis-de-Bernardy
15 years ago

suggesting one slight tweak on the patch:

static $fn = create_function(bla blah);

preg_replace_callback(..., $fn, ...)

else it creates one new function per call, and it can get memory hungry.

#8 @Denis-de-Bernardy
15 years ago

  • Milestone changed from Future Release to 2.8.1

@pauamma
15 years ago

Take 3. (Can't use create_function directly in a static initializer - otherwise changed as suggested)

#9 @Denis-de-Bernardy
15 years ago

  • Milestone changed from 2.8.1 to 2.9

I'm +1 for this one, but punting this back to 2.9 nonetheless. Ryan mentioned a couple of potential issues related to fixing sanitize_title_with_dashes() and its related functions. We'd want a means to enforce permalink history first.

#11 @ryan
15 years ago

  • Milestone changed from 2.9 to Future Release

@wokamoto
14 years ago

#12 @wokamoto
14 years ago

pauamma's patch above does not fix tag archie URLs etc. Adding a new patch.
http://core.trac.wordpress.org/attachment/ticket/6697/bug6697.4.patch

#13 @nacin
14 years ago

  • Keywords needs-patch added; has-patch removed

I don't understand the raw issues here, as in why we should do this.

We should avoid create_function all together and create a ne helper for it.

#14 @dd32
14 years ago

I don't understand the raw issues here, as in why we should do this.

Neither do I. A bookmarking service converting to uppercase is not a valid reason to change it. The RFC quoted does recommend they be provided as uppercase, however lowercase is perfectly acceptable as well.

I don't think any processing overhead is worth it here honestly.

Given the 3 years and no traction, I'd suggest this be best implemented as a plugin for those that care.

#15 @nacin
14 years ago

Or a localized drop-in.

#16 @westi
14 years ago

  • Milestone Future Release deleted
  • Resolution set to maybelater
  • Status changed from new to closed

I agree closing as maybelater - this is probably best in a plugin for the small subset of people who care about it.

Note: See TracTickets for help on using tickets.