URLs are not handled properly
|Reported by:||hakre||Owned by:|
While digging into #14201, #14292 and similar tickets, it came to my attention that WordPress does not filter URL input properly. This can lead to 404 responses for content that is actually available at the requested URL, as specified by HTTP (RFC 2616).
Example run against current trunk to illustrate the issue:
# curl -I http://webroot.loc/wordpress/tag/%e4%b8%80%e6%a0%b7
HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 18:53:02 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8
Making the same request with an alternative spelling of the URL leads to a 404. Note that the second character of "tag" has been percent-encoded as %41 (strictly speaking the code for an uppercase "A"; a lowercase "a" is %61):
# curl -I http://webroot.loc/wordpress/t%41g/%e4%b8%80%e6%a0%b7
HTTP/1.1 404 Not Found
Date: Sun, 18 Jul 2010 18:54:32 GMT
Server: Apache
Cache-Control: no-cache, must-revalidate, max-age=0
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Pragma: no-cache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Last-Modified: Sun, 18 Jul 2010 18:54:33 GMT
Content-Type: text/html; charset=UTF-8
RFC 2616 clearly addresses this in its section on URI comparison (3.2.3):
Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396) are equivalent to their ""%" HEX HEX" encoding.
These so-called character triplets are written in uppercase by the PHP urlencode() and rawurlencode() functions, but mostly in lowercase inside WordPress (e.g. in slug generation). The hex digits may be written in either case, or even in mixed case. The RFCs introduce the uppercase form first, but both variants are valid, and even %dD is.
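To illustrate the case-insensitivity of the hex digits outside of PHP, here is a small Python sketch using the standard library's urllib.parse. It is only an analogy to the PHP behaviour described above, not WordPress code: quote() happens to emit uppercase triplets, while unquote() accepts lowercase, uppercase and mixed case alike.

```python
from urllib.parse import quote, unquote

# The tag from the example requests, written with escapes to avoid
# any dependency on the source file encoding (U+4E00 U+6837).
tag = "\u4e00\u6837"

# Python's quote() writes the hex digits in uppercase.
encoded_upper = quote(tag)

# The lowercase spelling, as it appears in the URLs of this ticket.
encoded_lower = "%e4%b8%80%e6%a0%b7"

# A mixed-case spelling of the same triplets.
encoded_mixed = "%e4%B8%80%E6%a0%b7"

# All three spellings decode to the same string.
decoded = [unquote(s) for s in (encoded_upper, encoded_lower, encoded_mixed)]
print(all(d == tag for d in decoded))
```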
The web application should handle both URLs the same way.
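One possible way to get there is to normalize the request path before matching it. The following is a minimal Python sketch of that idea, not a patch for WordPress: normalize_path() is a hypothetical helper, and the set of characters that are safe to decode is assumed to be the "unreserved" set of RFC 3986 (letters, digits, "-", ".", "_", "~"). Triplets for unreserved characters are decoded, and all remaining triplets get uppercase hex digits.

```python
import re
import string

# Assumption: use the "unreserved" set from RFC 3986 as the characters
# whose percent-encoded form may be decoded without changing meaning.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Matches one percent-encoded triplet, hex digits in any case.
_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_path(path: str) -> str:
    """Hypothetical helper: bring equivalent URL paths into one spelling."""

    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            # e.g. %61 becomes "a"
            return char
        # Keep the triplet, but with uppercase hex digits (%dD becomes %DD).
        return "%" + match.group(1).upper()

    return _TRIPLET.sub(fix, path)


plain = normalize_path("/wordpress/tag/%e4%b8%80%e6%a0%b7")
encoded = normalize_path("/wordpress/t%61g/%E4%b8%80%e6%A0%b7")
print(plain == encoded)
```

With such a step in front of the rewrite rules, both spellings would be matched against the same string. Reserved characters such as an encoded slash (%2F) are deliberately left encoded, since decoding them would change the structure of the path.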