Opened 18 years ago
Closed 16 years ago
#3727 closed defect (bug) (fixed)
WP->parse_request() won't replace $pathinfo when $req_uri contains any %## encoding character.
Reported by: | Kirin_Lin | Owned by: | ryan |
---|---|---|---|
Milestone: | 2.8 | Priority: | normal |
Severity: | normal | Version: | 2.2 |
Component: | I18N | Keywords: | |
Focuses: | Cc: |
Description
I write blog in Chinese. WordPress can't use index_permalinks which contains "%postname%". The problem is using slug not in English, which will be urlencoded to %## format.
In WP2.1, wp-includes/classes.php line 61 and in WP2.0, wp-includes/classes.php line 1528:
$req_uri = str_replace($pathinfo, '', $req_uri);
When I use Chinese slug, the $pathinfo will be like "2007/01/31/測試" and the $requri will be "/index.php/2007/01/31/%e6%b8%ac%e8%a9%a6/". That makes the function of above doesn't work.
It should be like this:
$req_uri = str_replace($pathinfo, '', urldecode($req_uri));
Attachments (2)
Change History (46)
#3
@
18 years ago
What http server are you using and what version of that server? We need to see if different servers and server versions behave consistently before putting in a fix.
#4
@
18 years ago
I tested it in two servers:
Apache/2.2.3 (Win32),PHP/5.2.0
Apache/2.0.58 (Unix), PHP/4.4.2
#5
@
18 years ago
I asked for others help. There is another server enviornment.
Result: ok - Apache/1.3.37 (Unix), PHP/4.4.4
#6
@
18 years ago
- Resolution set to duplicate
- Status changed from new to closed
I think this is duplicated with ticket:1144.
#8
@
18 years ago
- Keywords has-patch added
- Milestone set to 2.2
- Resolution duplicate deleted
- Status changed from closed to reopened
Circular duplicates ;-)
Reopening this with the more recent patch.
#10
@
18 years ago
- Milestone changed from 2.2 to 2.1.2
- Resolution fixed deleted
- Status changed from closed to reopened
Fixed for 2.2. Let's try it out there for awhile and then backport to 2.1 and 2.0. Setting the milestone to 2.1.2 for now.
#11
@
18 years ago
The recent addition of urldecode() also converts '+' to ' '. This breaks plugins that have added custom permalink handling, that use '+' to separate arguments. The specific case that broke for me is language appending in Gengo, which creates permalinks like site.com/en+de/ . This gets decoded to site.com/en de/ which fails to match the regular expressions that have been set up. I'm sure there must be other plugins out there which rely on this behaviour as well, as '+' is a very natural separator for specifying 'more than one'.
I would request that rawurldecode() be used instead, which is identical to urldecode(), except that it preserves '+'.
#13
@
18 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
Changed to rawurldecode in 2.2.
#15
@
18 years ago
- Resolution set to fixed
- Status changed from reopened to closed
Not going into 2.1 and already in 2.1. Resolving.
#17
@
17 years ago
- Resolution fixed deleted
- Severity changed from normal to critical
- Status changed from closed to reopened
- Version changed from 2.1 to 2.2
I reopened this, becasue this Fix creates new Defects in the Rewrite Rule parsing engine. It transposes "%26"s into "&"s which are then passed to query string parser and will create a new query string values.
Can confirm that this is in 2.2 now.
#18
@
17 years ago
In a deeper analysis I realized that Kirin_Lin simply "changed something that affected something the way she/he wanted" and was fine was that but without even thinking about where the defect is originated in. This leads to an enourmous space for side-effects and esoteric behaviour of the rewrite part. Rawurldecode should only be applied on the query-info part of a querystring. so this has to be done on another place in the code-logic not in the part where decoded url-information has to stay inside the URL as within that line of code.
You can compare the applied fix as if the apache developers would pass a general rawurldecode() over the complete request URI.
#19
@
17 years ago
And it's even pittier to see that this kind of problem has already been seen by majelbstoat but his assumption still has no proper analysis. it just changes from urldecode to rawurldecode for one little "+". What is about all the other chars that plugins and article-names are bound to that need to be urlencoded in the url and then be passed urlencoded to the query-info-part to no being splitted?
#22
@
17 years ago
Warning: This can be used as an attack vector on Wordpress Blogs having pretty URLs enabled.
#25
follow-up:
↓ 26
@
17 years ago
google indexed my wordpress so:
for example:
www.travel-carhire.com/online/golf/853.php/page/2/ ( its wrong url) or www.travel-carhire.com/de/2007/02/page/2/ ( its wrong url to)
you can see more url at www.google.de/search?hl=de&q=site%3Awww.travel-carhire.com+page%2F2&meta=
also seen: page/2/ is at end
also: */online/golf/853.php/* come between wwww.xxx.com/..AND../paged/2/
it is realy very carzzy....!!!! and not good for google search...
How can I delete */page/* from end of www.xxx.com/page/ like /page/2, page/3, page/4,.....
also i dont want to have */page/* at end of URL
I mean if I clik at* << Previous Entries .. Next Entries >>
It dont me help if I delete only << Previous Entries .. Next Entries >>
also I muss delete /page/ from URL. and google mussnt see/find */page/* at my Homepage....!!!
can anybody help me pls?? how can i delete */page/* from Url.
very thanks
#26
in reply to:
↑ 25
@
17 years ago
Replying to wordpressfun:
Could you please stop posting this all over trac?
#27
@
17 years ago
Replying to hakre:
I reopened this, becasue this Fix creates new Defects in the Rewrite Rule parsing engine. It transposes "%26"s into "&"s which are then passed to query string parser and will create a new query string values.
Can confirm that this is in 2.2 now.
I realized what hakre said, so I rethink the problem and how to solve it. When we use urldecode(or rawueldecode) function to convert URI, it may cause security problem. In WordPress, the post title will be sanitized. I notice normal utf8 characters will be convert by utf8_uri_encode function.
Since we have risk by using decoding function, why don't we convert $pathinfo by using encoding function?
Thus, I try to use these code to test:
if( seems_utf8($pathinfo) ) $pathinfo = utf8_uri_encode($pathinfo); $req_uri = str_replace($pathinfo, '', $req_uri);
and
$pathinfo = utf8_uri_encode($pathinfo);
Now, posts seem fine, but pages are broken. Because the rewrite rules are decoded utf8 from changeset:1841 when generating page index uri.
I undo the changeset and flush rules again, then pages come back.
Posts and Pages are fine now. I will do more test before i submit another patch.
#28
@
17 years ago
Hi Kirin_Lin, do you know why $pathinfo contains other characters then the allowed ones for an URI (http://gbiv.com/protocols/uri/rfc/rfc3986.html#characters) ?
#29
@
17 years ago
Because the $PATH_INFO is implemented by following CGI/1.1 specification(http://hoohoo.ncsa.uiuc.edu/cgi/env.html), that says:
The extra path information, as given by the client. In other words, scripts can be accessed by their virtual pathname, followed by extra information at the end of this path. The extra information is sent as PATH_INFO. This information should be decoded by the server if it comes from a URL before it is passed to the CGI script.
That's why...
#30
@
17 years ago
- Priority changed from normal to high
Well then the question is at this point of the script if $req_uri mimics the client (as of for Wordpress Rewrite Rule parsing engine) or if it mimics the server (as of CGI that should decode PATH_INFO).
according to the sideeffect you are reporting (changeset:1841) I would say that the core wordpress development team should make clear and document which behaviour has to be implemented at this point of sourcecode: client or server. this affects the rewrite part as well. Until this has not been made clear any made and further changes will produce side-effects and might even create crititcal attack-vectors on wordpress.
I would tend to say, that at this point req_uri should be encoded (not decoded) but I'm not a core wordpress developer so this is more a personal opinion. I raise the priority to high to gather some more awarness.
#34
@
16 years ago
Only authoritative from a core developer is missing here. I could write a patch afterwards. Can a Developer please make a significant, written statement on this function / usage?
#38
@
16 years ago
Still waiting for a Core Developer Statement here on the Usage of the Function. How can I help?
#39
@
16 years ago
- Keywords needs-patch added; rewrite permalink removed
- Milestone changed from 2.9 to 2.8
still current I take it?
#41
@
16 years ago
- Priority changed from highest omg bbq to normal
- Severity changed from blocker to normal
#42
in reply to:
↑ description
@
16 years ago
looks like "index_permalinks" ain't a function in wp 2.8 any longer. looks more and more bogus to me.
#43
@
16 years ago
Yeah, this has been moved in the parse_request() method in wp-includes/classes.php.
Might be bogus by now, I've essentially no means to constructively test any of it.
#44
@
16 years ago
- Keywords needs-patch dev-feedback removed
- Resolution set to fixed
- Status changed from reopened to closed
i've taken a look through WP::parse_request(); and I must assume that it should handle IRI-Encoded URLs now properly. I do not fully understand why this all did not work with %postname% that time but with the current 2.7 / 2.8 version, %postname% is properly used in URLs even with UTF-8. I checked that with the UTF-8 css classnames ticket #8446.
WP->parse_request() is currently using rawurldecode() and is using it only on the path of the URI, not the query-info-part. That looks like a safe way of doing the IRI-Decoding to me.
This is fixed, %postname% can be used with utf-8 characters beyond 7bit ascii.
diff file for 2.1