Opened 10 years ago

Closed 10 years ago

#14292 closed defect (bug) (fixed)

loop in tags url to same url

Reported by: gilrabbi2
Owned by:
Milestone: 3.0.1
Priority: normal
Severity: critical
Version: 3.0
Component: General
Keywords: reporter-feedback
Focuses:
Cc:

Description

All blog tag URLs that contain Hebrew characters 301-redirect in a loop to the same URL.
Example:

http://www.site.com/tag/xxxx
redirects to http://www.site.com/tag/xxxx
or
http://ranh.co.il/tag/וורדפרס-בעברית/
redirects to http://ranh.co.il/tag/וורדפרס-בעברית/
Status: HTTP/1.1 301 Moved Permanently

Google Webmaster Tools reports a redirect-loop error for every tag.
This bug affects every WordPress site that uses Hebrew characters.

Attachments (8)

14292.diff (587 bytes) - added by ryan 10 years ago.
Ignore case
14292.patch (565 bytes) - added by hakre 10 years ago.
14292.2.diff (800 bytes) - added by ryan 10 years ago.
14292.2.patch (870 bytes) - added by hakre 10 years ago.
redirect_canonical() can be called multiple times
14292.3.patch (4.3 KB) - added by hakre 10 years ago.
Introducing url_normalize() and url_compare()
14292.4.patch (4.5 KB) - added by hakre 10 years ago.
Query normalized as well
14292.5.patch (4.5 KB) - added by hakre 10 years ago.
URLs are 7-bit ASCII (US-ASCII) encoded; the 8-bit upper half is invalid
14292.6.patch (4.5 KB) - added by hakre 10 years ago.
(minor) unset $unreserved as well


Change History (37)

#1 follow-up: @ryan
10 years ago

Perhaps related to #14201?

#2 in reply to: ↑ 1 @nacin
10 years ago

Replying to ryan:

Perhaps related to #14201?

That's what I was thinking too, though that appears to be related to the base, not the slug.

#3 @gilrabbi2
10 years ago

So how can I fix that? My category URLs are OK; only the tag URLs have this bug.
When I go back to WordPress 2.9, all the tag URLs return to normal.
It happens only in WP 3.

#4 @nacin
10 years ago

  • Milestone changed from Awaiting Review to 3.0.1

Closing #14313 as a duplicate.

#5 follow-up: @Lafirel
10 years ago

The patch in #14201 does not seem to work; Google Webmaster Tools also reports crawl errors.
I hope WordPress can pay more attention to users of non-ASCII languages.

@ryan
10 years ago

Ignore case

#7 in reply to: ↑ 5 @hakre
10 years ago

Replying to Lafirel:

The patch in #14201 does not seem to work; Google Webmaster Tools also reports crawl errors.
I hope WordPress can pay more attention to users of non-ASCII languages.

Support for UTF-8 has been getting better and better over the last few years.

@hakre
10 years ago

#8 @hakre
10 years ago

Simplified patch. It also fixes redirects with mixed-case character triplets, which should be transparent according to the RFC.

#9 @hakre
10 years ago

  • Keywords reporter-feedback added; tags.tag.tags url.tag url loop tag removed

The Better HTTP Redirect Plugin version 1.2-beta-2 is a proof of concept of that approach. Just install it and the redirect loop should be gone.

Please check if patch or plugin fixes your issue.

#10 @ryan
10 years ago

http://www.ietf.org/rfc/rfc2616.txt

See section 3.2.3

   When comparing two URIs to decide if they match or not, a client
   SHOULD use a case-sensitive octet-by-octet comparison of the entire
   URIs, with these exceptions:

      - A port that is empty or not given is equivalent to the default
        port for that URI-reference;

      - Comparisons of host names MUST be case-insensitive;

      - Comparisons of scheme names MUST be case-insensitive;

      - An empty abs_path is equivalent to an abs_path of "/".

   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

   For example, the following three URIs are equivalent:

      http://abc.com:80/~smith/home.html
      http://ABC.com/%7Esmith/home.html
      http://ABC.com:/%7esmith/home.html

#11 follow-up: @ryan
10 years ago

It seems that the hex encoding is the only part that should be case-insensitive.

#12 @westi
10 years ago

That makes sense as there is no difference in meaning between the hex codes.
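
For illustration only, the comparison rule discussed in the last two comments can be sketched as follows. This is not the code from any attached patch or from redirect_canonical(); the function names upper_triplets and same_except_triplet_case are hypothetical, and Python is used purely as a neutral notation. The sketch treats two URLs as equal when they differ only in the case of the hex digits inside percent-encoded triplets, and leaves the rest of the URL case-sensitive.

```python
import re

# Matches one percent-encoded triplet such as %e4 or %E4.
_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def upper_triplets(url: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet (hypothetical helper)."""
    return _TRIPLET.sub(lambda m: m.group(0).upper(), url)


def same_except_triplet_case(a: str, b: str) -> bool:
    """Return True when two URLs differ at most in the case of their hex digits."""
    return upper_triplets(a) == upper_triplets(b)
```

With a check like this, a request for /tag/a-%E4%B8%80%E6%A0%B7 and a canonical URL of /tag/a-%e4%b8%80%e6%a0%b7 would compare as equal, so no 301 would be needed, while /tag/A-... and /tag/a-... would still be treated as different.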

@ryan
10 years ago

#13 @ryan
10 years ago

Canonical doesn't seem to redirect posts that have the wrong case.

#14 in reply to: ↑ 11 @hakre
10 years ago

Replying to ryan:

It seems that the hex encoding is the only part that should be case-insensitive.

As for strictness, yes. What's the reason you place it after the first check (line 344)?

Can confirm that the regex patch works as well.

@hakre
10 years ago

redirect_canonical() can be called multiple times

#15 @ryan
10 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [15437]) Hex octets are case-insensitive. Don't 301 when only the octet case differs. Props hakre. fixes #14292 for 3.1

#16 @ryan
10 years ago

(In [15438]) Hex octets are case-insensitive. Don't 301 when only the octet case differs. Props hakre. fixes #14292 for 3.0.1

#17 @ryan
10 years ago

(In [15444]) Remove redundant code. see #14292

#18 @hakre
10 years ago

I ran into a related issue after getting a better grip on the RFC documentation about URL comparison: #14347

I have something like a patch, but I need to test it against your latest changes. It normalizes a URL. As in #14347, this could be useful throughout core, maybe directly by normalizing $_SERVER['REQUEST_URI'].

FYI: Those %[a-zA-Z0-9]{2} are called character triplets, by the way; an octet is an 8-bit entity.

#19 @hakre
10 years ago

Slightly extended scenario. I added a new tag called a-一样 on the very latest trunk, which includes the recent changesets from this ticket. So let's test:

# curl -I http://webroot.loc/wordpress/tag/a-%E4%B8%80%E6%A0%B7

HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 19:24:28 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8

Now the same request with an equivalent but differently encoded URL:

# curl -I http://webroot.loc/wordpress/tag/%41-%E4%B8%80%E6%A0%B7

HTTP/1.1 301 Moved Permanently
Date: Sun, 18 Jul 2010 19:25:32 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Location: http://webroot.loc/wordpress/tag/a-%e4%b8%80%e6%a0%b7
Content-Type: text/html; charset=UTF-8

The 301 is there again.

#20 @hakre
10 years ago

My fault, it must be %61, but the result is the same.

@hakre
10 years ago

Introducing url_normalize() and url_compare()

#21 @hakre
10 years ago

The patch introduces url_normalize(), which creates something that could be called a "wordpress-way" normalized URL. First of all, it normalizes a URL so that only those characters that need to be encoded are encoded, and everything else mentioned in section 3.2.x of RFC 2616 is compacted into "the one comparable" representation.

This follows the HTTP standard. The "wordpress-way" part of it is to use lowercase triplets. Both PHP and the RFC suggest uppercase as the default. I like lowercase as well, and it's compatible.

Defect: Arguments inside the query part of the URL are not sorted alphabetically. That's something which could be done additionally.

This patch also comes with another function called url_compare(), which I used before normalizing the URL at the entry point and left in just as a usage example.
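
As a rough sketch of the triplet handling described above (not the code of url_normalize() from the attached patch, whose details are not shown in this ticket), the idea can be illustrated as follows. Assumptions: the name normalize_triplets is hypothetical, the set of characters that never need encoding is taken from the "unreserved" set of RFC 3986, and Python is used only as a neutral notation.

```python
import re
import string

# Assumption: characters that never need percent-encoding (RFC 3986 "unreserved").
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# Captures the two hex digits of one percent-encoded triplet.
_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_triplet(match: "re.Match[str]") -> str:
    """Decode a triplet that encodes an unreserved character, else lowercase it."""
    char = chr(int(match.group(1), 16))
    if char in UNRESERVED:
        return char
    return match.group(0).lower()


def normalize_triplets(url: str) -> str:
    """Hypothetical helper: one pass over the URL, so nothing is decoded twice."""
    return _TRIPLET.sub(_normalize_triplet, url)
```

Under these assumptions, /tag/%61-%E4%B8%80%E6%A0%B7 becomes /tag/a-%e4%b8%80%e6%a0%b7, the same form that appears in the Location header of the 301 response in comment #19. Sorting query arguments, which the comment lists as a possible addition, is not covered by this sketch.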

#22 @hakre
10 years ago

Here an example request with the patch applied:

# curl -I http://webroot.loc/wordpress/tag/%61-%E4%B8%80%E6%A0%B7

HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 23:40:09 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8

No more 301.

@hakre
10 years ago

Query normalized as well

@hakre
10 years ago

URLs are 7-bit ASCII (US-ASCII) encoded; the 8-bit upper half is invalid

@hakre
10 years ago

(minor) unset $unreserved as well

#23 @hakre
10 years ago

For the log: the path could be normalized as well:

# http://webroot.loc/wordpress/tag/../tag/%61-%E4%B8%80%E6%A0%B7

HTTP/1.1 404 Not Found
Date: Mon, 19 Jul 2010 10:25:59 GMT
Server: Apache
Cache-Control: no-cache, must-revalidate, max-age=0
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Pragma: no-cache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Last-Modified: Mon, 19 Jul 2010 10:26:04 GMT
Content-Type: text/html; charset=UTF-8

#24 @hakre
10 years ago

For the log: normalizing the path is a bad idea.

#25 @hakre
10 years ago

Related: #7095

#26 @hakre
10 years ago

For the log: normalizing the path might not be a bad idea. Needs to be properly checked against RFCs first.

#27 @hakre
10 years ago

Ref: The Normalize URLs plugin fixes this issue by following the standards more closely than the changes in [15437] / [15438] / [15444].

#28 @hakre
10 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Added the removal of dot segments according to Path Segment Normalization (RFC 3986, section 6.2.2.3) in the latest development version of the Normalize URLs Plugin.

Example request with dot segments and percent-encoded unreserved characters:

# curl -I http://webroot.loc/wordpress/t%61g/././../tag/%61pple

HTTP/1.1 200 OK
Date: Wed, 21 Jul 2010 22:52:51 GMT
Server: Apache
Content-Type: text/html
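
For reference, dot-segment removal as specified in RFC 3986, section 5.2.4 (the routine that section 6.2.2.3 refers to) can be sketched as below. This is an illustration of the RFC algorithm, not the plugin's actual code; the function name remove_dot_segments is taken from the RFC's wording, and Python is used only as a neutral notation.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URL path (RFC 3986, section 5.2.4)."""
    output = []  # completed path segments, each with its leading "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]  # keep the second "/" as the new prefix
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]
            if output:
                output.pop()  # ".." cancels the previous segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

Applied to the path of the request above, /wordpress/t%61g/././../tag/%61pple, this yields /wordpress/tag/%61pple; decoding the remaining %61 is a separate normalization step.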

#29 @hakre
10 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Wrong control.
