Opened 6 years ago

Closed 4 years ago

#3727 closed defect (bug) (fixed)

WP->parse_request() won't replace $pathinfo when $req_uri contains any %## encoding character.

Reported by: Kirin_Lin Owned by: ryan
Priority: normal Milestone: 2.8
Component: I18N Version: 2.2
Severity: normal Keywords:
Cc: markjaquith, hakre

Description

I write blog in Chinese. WordPress can't use index_permalinks which contains "%postname%". The problem is using slug not in English, which will be urlencoded to %## format.

In WP2.1, wp-includes/classes.php line 61 and in WP2.0, wp-includes/classes.php line 1528:

$req_uri = str_replace($pathinfo, '', $req_uri);

When I use Chinese slug, the $pathinfo will be like "2007/01/31/測試" and the $requri will be "/index.php/2007/01/31/%e6%b8%ac%e8%a9%a6/". That makes the function of above doesn't work.

It should be like this:

$req_uri = str_replace($pathinfo, '', urldecode($req_uri));

Attachments (2)

classes-2.1.diff (623 bytes) - added by Kirin_Lin 6 years ago.
diff file for 2.1
classes-2.0.diff (628 bytes) - added by Kirin_Lin 6 years ago.
diff file for 2.0

Download all attachments as: .zip

Change History (46)

diff file for 2.1

  • Keywords rewrite permalink added
  • Owner changed from anonymous to ryan

diff file for 2.0

  • Cc markjaquith added

comment:3   ryan6 years ago

What http server are you using and what version of that server? We need to see if different servers and server versions behave consistently before putting in a fix.

I tested it in two servers:
Apache/2.2.3 (Win32),PHP/5.2.0
Apache/2.0.58 (Unix), PHP/4.4.2

I asked for others help. There is another server enviornment.
Result: ok - Apache/1.3.37 (Unix), PHP/4.4.4

  • Resolution set to duplicate
  • Status changed from new to closed

I think this is duplicated with ticket:1144.

  • Milestone 2.1.1 deleted
  • Keywords has-patch added
  • Milestone set to 2.2
  • Resolution duplicate deleted
  • Status changed from closed to reopened

Circular duplicates ;-)

Reopening this with the more recent patch.

comment:9   ryan6 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [4893]) decode req uri when processing pathinfo permalinks. Props Kirin_Lin. fixes #3727

  • Milestone changed from 2.2 to 2.1.2
  • Resolution fixed deleted
  • Status changed from closed to reopened

Fixed for 2.2. Let's try it out there for awhile and then backport to 2.1 and 2.0. Setting the milestone to 2.1.2 for now.

The recent addition of urldecode() also converts '+' to ' '. This breaks plugins that have added custom permalink handling, that use '+' to separate arguments. The specific case that broke for me is language appending in Gengo, which creates permalinks like site.com/en+de/ . This gets decoded to site.com/en de/ which fails to match the regular expressions that have been set up. I'm sure there must be other plugins out there which rely on this behaviour as well, as '+' is a very natural separator for specifying 'more than one'.

I would request that rawurldecode() be used instead, which is identical to urldecode(), except that it preserves '+'.

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [4979]) Use rawurldecode. fixes #3727

  • Resolution fixed deleted
  • Status changed from closed to reopened

Changed to rawurldecode in 2.2.

  • Milestone changed from 2.1.4 to 2.2
  • Resolution set to fixed
  • Status changed from reopened to closed

Not going into 2.1 and already in 2.1. Resolving.

Already in 2.2, I believe Ryan means.

  • Resolution fixed deleted
  • Severity changed from normal to critical
  • Status changed from closed to reopened
  • Version changed from 2.1 to 2.2

I reopened this, becasue this Fix creates new Defects in the Rewrite Rule parsing engine. It transposes "%26"s into "&"s which are then passed to query string parser and will create a new query string values.

Can confirm that this is in 2.2 now.

In a deeper analysis I realized that Kirin_Lin simply "changed something that affected something the way she/he wanted" and was fine was that but without even thinking about where the defect is originated in. This leads to an enourmous space for side-effects and esoteric behaviour of the rewrite part. Rawurldecode should only be applied on the query-info part of a querystring. so this has to be done on another place in the code-logic not in the part where decoded url-information has to stay inside the URL as within that line of code.

You can compare the applied fix as if the apache developers would pass a general rawurldecode() over the complete request URI.

And it's even pittier to see that this kind of problem has already been seen by majelbstoat but his assumption still has no proper analysis. it just changes from urldecode to rawurldecode for one little "+". What is about all the other chars that plugins and article-names are bound to that need to be urlencoded in the url and then be passed urlencoded to the query-info-part to no being splitted?

  • Severity changed from critical to blocker
  • Milestone changed from 2.2 to 2.2.1

Warning: This can be used as an attack vector on Wordpress Blogs having pretty URLs enabled.

  • Milestone changed from 2.2.1 to 2.2.2
  • Keywords has-patch removed

comment:25 follow-up: ↓ 26   wordpressfun6 years ago

google indexed my wordpress so:

for example:

www.travel-carhire.com/online/golf/853.php/page/2/ ( its wrong url) or www.travel-carhire.com/de/2007/02/page/2/ ( its wrong url to)

you can see more url at www.google.de/search?hl=de&q=site%3Awww.travel-carhire.com+page%2F2&meta=

also seen: page/2/ is at end

also: */online/golf/853.php/* come between wwww.xxx.com/..AND../paged/2/

it is realy very carzzy....!!!! and not good for google search...

How can I delete */page/* from end of www.xxx.com/page/ like /page/2, page/3, page/4,.....

also i dont want to have */page/* at end of URL

I mean if I clik at* << Previous Entries .. Next Entries >>

It dont me help if I delete only << Previous Entries .. Next Entries >>

also I muss delete /page/ from URL. and google mussnt see/find */page/* at my Homepage....!!!

can anybody help me pls?? how can i delete */page/* from Url.

very thanks

comment:26 in reply to: ↑ 25   Nazgul6 years ago

Replying to wordpressfun:
Could you please stop posting this all over trac?

Replying to hakre:

I reopened this, becasue this Fix creates new Defects in the Rewrite Rule parsing engine. It transposes "%26"s into "&"s which are then passed to query string parser and will create a new query string values.

Can confirm that this is in 2.2 now.

I realized what hakre said, so I rethink the problem and how to solve it. When we use urldecode(or rawueldecode) function to convert URI, it may cause security problem. In WordPress, the post title will be sanitized. I notice normal utf8 characters will be convert by utf8_uri_encode function.
Since we have risk by using decoding function, why don't we convert $pathinfo by using encoding function?
Thus, I try to use these code to test:

	if( seems_utf8($pathinfo) )
		$pathinfo = utf8_uri_encode($pathinfo);
	$req_uri = str_replace($pathinfo, '', $req_uri);

and

	$pathinfo = utf8_uri_encode($pathinfo);

Now, posts seem fine, but pages are broken. Because the rewrite rules are decoded utf8 from changeset:1841 when generating page index uri.
I undo the changeset and flush rules again, then pages come back.
Posts and Pages are fine now. I will do more test before i submit another patch.

Hi Kirin_Lin, do you know why $pathinfo contains other characters then the allowed ones for an URI (http://gbiv.com/protocols/uri/rfc/rfc3986.html#characters) ?

Because the $PATH_INFO is implemented by following CGI/1.1 specification(http://hoohoo.ncsa.uiuc.edu/cgi/env.html), that says:

The extra path information, as given by the client.
In other words, scripts can be accessed by their virtual pathname,
followed by extra information at the end of this path.
The extra information is sent as PATH_INFO.
This information should be decoded by the server if it comes from 
a URL before it is passed to the CGI script.

That's why...

  • Priority changed from normal to high

Well then the question is at this point of the script if $req_uri mimics the client (as of for Wordpress Rewrite Rule parsing engine) or if it mimics the server (as of CGI that should decode PATH_INFO).

according to the sideeffect you are reporting (changeset:1841) I would say that the core wordpress development team should make clear and document which behaviour has to be implemented at this point of sourcecode: client or server. this affects the rewrite part as well. Until this has not been made clear any made and further changes will produce side-effects and might even create crititcal attack-vectors on wordpress.

I would tend to say, that at this point req_uri should be encoded (not decoded) but I'm not a core wordpress developer so this is more a personal opinion. I raise the priority to high to gather some more awarness.

  • Milestone changed from 2.2.2 to 2.2.3
  • Milestone changed from 2.2.3 to 2.4
  • Milestone changed from 2.5 to 2.6

Only authoritative from a core developer is missing here. I could write a patch afterwards. Can a Developer please make a significant, written statement on this function / usage?

  • Cc hakre added
  • Priority changed from high to highest omg bbq
  • Keywords dev-feedback added
  • Component changed from General to i18n

Still waiting for a Core Developer Statement here on the Usage of the Function. How can I help?

  • Keywords needs-patch added; rewrite permalink removed
  • Milestone changed from 2.9 to 2.8

still current I take it?

probably related: #8446

  • Priority changed from highest omg bbq to normal
  • Severity changed from blocker to normal

comment:42 in reply to: ↑ description   hakre4 years ago

looks like "index_permalinks" ain't a function in wp 2.8 any longer. looks more and more bogus to me.

Yeah, this has been moved in the parse_request() method in wp-includes/classes.php.

Might be bogus by now, I've essentially no means to constructively test any of it.

  • Keywords needs-patch dev-feedback removed
  • Resolution set to fixed
  • Status changed from reopened to closed

i've taken a look through WP::parse_request(); and I must assume that it should handle IRI-Encoded URLs now properly. I do not fully understand why this all did not work with %postname% that time but with the current 2.7 / 2.8 version, %postname% is properly used in URLs even with UTF-8. I checked that with the UTF-8 css classnames ticket #8446.

WP->parse_request() is currently using rawurldecode() and is using it only on the path of the URI, not the query-info-part. That looks like a safe way of doing the IRI-Decoding to me.

This is fixed, %postname% can be used with utf-8 characters beyond 7bit ascii.

Note: See TracTickets for help on using tickets.