Opened 5 years ago

Closed 5 years ago

#14292 closed defect (bug) (fixed)

loop in tags url to same url

Reported by: gilrabbi2
Owned by:
Milestone: 3.0.1
Priority: normal
Severity: critical
Version: 3.0
Component: General
Keywords: reporter-feedback
Focuses:
Cc:

Description

All of the blog's tag URLs that contain Hebrew characters redirect with a 301 to the same URL, in a loop.
Example:

http://www.site.com/tag/xxxx
loop to http://www.site.com/tag/xxxx
or
http://ranh.co.il/tag/וורדפרס-בעברית/
loop to http://ranh.co.il/tag/וורדפרס-בעברית/
Status: HTTP/1.1 301 Moved Permanently

Google Webmaster Tools shows redirect-loop errors for all tags.
This bug also affects every WordPress site that uses Hebrew characters.

Attachments (8)

14292.diff (587 bytes) - added by ryan 5 years ago.
Ignore case
14292.patch (565 bytes) - added by hakre 5 years ago.
14292.2.diff (800 bytes) - added by ryan 5 years ago.
14292.2.patch (870 bytes) - added by hakre 5 years ago.
redirect_canonical() can be called multiple times
14292.3.patch (4.3 KB) - added by hakre 5 years ago.
Introducing url_normalize() and url_compare()
14292.4.patch (4.5 KB) - added by hakre 5 years ago.
Query normalized as well
14292.5.patch (4.5 KB) - added by hakre 5 years ago.
URLs are ASCII 7bit (US-ASCII) encoded, 8bit-upper-half is invalid
14292.6.patch (4.5 KB) - added by hakre 5 years ago.
(minor) unset $unreserved as well


Change History (37)

comment:1 follow-up: ↓ 2 @ryan 5 years ago

Perhaps related to #14201?

comment:2 in reply to: ↑ 1 @nacin 5 years ago

Replying to ryan:

Perhaps related to #14201?

That's what I was thinking too, though that appears to be related to the base, not the slug.

comment:3 @gilrabbi2 5 years ago

So how can I fix that? My category URLs are fine; only the tags have this bug.
When I go back to WordPress 2.9, all the tag URLs are back to normal.
It happens only in WP 3.

comment:4 @nacin 5 years ago

  • Milestone changed from Awaiting Review to 3.0.1

Closing #14313 as a duplicate.

comment:5 follow-up: ↓ 7 @Lafirel 5 years ago

The patch in #14201 does not seem to work; Google Webmaster Tools also reports crawl errors.
I hope WordPress can pay more attention to users of non-ASCII languages.

@ryan 5 years ago

Ignore case

comment:7 in reply to: ↑ 5 @hakre 5 years ago

Replying to Lafirel:

The patch in #14201 does not seem to work; Google Webmaster Tools also reports crawl errors.
I hope WordPress can pay more attention to users of non-ASCII languages.

Support for UTF-8 has been getting better and better over the last few years.

@hakre 5 years ago

comment:8 @hakre 5 years ago

Simplified patch. It also fixes redirects with mixed-case character triplets, which should be transparent according to the RFC.

comment:9 @hakre 5 years ago

  • Keywords reporter-feedback added; tags.tag.tags url.tag url loop tag removed

The Better HTTP Redirect Plugin version 1.2-beta-2 is a proof of concept on that approach. Just install it and the redirect should be gone.

Please check if patch or plugin fixes your issue.

comment:10 @ryan 5 years ago

http://www.ietf.org/rfc/rfc2616.txt

See section 3.2.3

   When comparing two URIs to decide if they match or not, a client
   SHOULD use a case-sensitive octet-by-octet comparison of the entire
   URIs, with these exceptions:

      - A port that is empty or not given is equivalent to the default
        port for that URI-reference;

        - Comparisons of host names MUST be case-insensitive;

        - Comparisons of scheme names MUST be case-insensitive;

        - An empty abs_path is equivalent to an abs_path of "/".

   Characters other than those in the "reserved" and "unsafe" sets (see
   RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding.

   For example, the following three URIs are equivalent:

      http://abc.com:80/~smith/home.html
      http://ABC.com/%7Esmith/home.html
      http://ABC.com:/%7esmith/home.html

comment:11 follow-up: ↓ 14 @ryan 5 years ago

It seems that the hex encoding is the only part that should be case-insensitive.
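
As an illustration only, here is a minimal Python sketch of that idea. It is not the PHP change that was committed, and the function names are hypothetical. It lowercases only the percent-encoded triplets on both sides and then compares the strings exactly, so a difference in path case still counts as a difference.

```python
import re

# Matches one percent-encoded triplet such as %7E or %7e.
_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def lowercase_triplets(url: str) -> str:
    """Lowercase the hex digits of every %XX triplet; leave everything else untouched."""
    return _TRIPLET.sub(lambda match: match.group(0).lower(), url)


def same_except_triplet_case(requested: str, canonical: str) -> bool:
    """Return True when the two URLs differ at most in the case of their %XX triplets."""
    return lowercase_triplets(requested) == lowercase_triplets(canonical)


# Two spellings of the same triplet, taken from the RFC example quoted above.
print(same_except_triplet_case(
    "http://ABC.com/%7Esmith/home.html",
    "http://ABC.com/%7esmith/home.html",
))
```

With a check like this, a canonical-redirect routine would skip the 301 whenever the function returns True.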

comment:12 @westi 5 years ago

That makes sense as there is no difference in meaning between the hex codes.

@ryan 5 years ago

comment:13 @ryan 5 years ago

Canonical doesn't seem to redirect posts that have the wrong case.

comment:14 in reply to: ↑ 11 @hakre 5 years ago

Replying to ryan:

It seems that the hex encoding is the only part that should be case-insensitive.

As for strictness, yes. What's the reason you place it after the first check (line 344)?

Can confirm that the regex patch works as well.

@hakre 5 years ago

redirect_canonical() can be called multiple times

comment:15 @ryan 5 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [15437]) Hex octets are case-insensitive. Don't 301 when only the octet case differs. Props hakre. fixes #14292 for 3.1

comment:16 @ryan 5 years ago

(In [15438]) Hex octets are case-insensitive. Don't 301 when only the octet case differs. Props hakre. fixes #14292 for 3.0.1

comment:17 @ryan 5 years ago

(In [15444]) Remove redundant code. see #14292

comment:18 @hakre 5 years ago

I ran into a related issue after getting a better grip on the RFC documentation about URL comparison: #14347

I have something as a patch, but I need to test it against your latest changes. It normalizes a URL. As in #14347, this could be useful throughout core, maybe by directly normalizing $_SERVER['REQUEST_URI'].

FYI: Those %[a-zA-Z0-9]{2} sequences are called character triplets, by the way; an octet is an 8-bit entity.

comment:19 @hakre 5 years ago

Slightly extended scenario. I added a new tag called a-一样 on the very latest trunk, which includes the recent changesets from this ticket. So let's test:

# curl -I http://webroot.loc/wordpress/tag/a-%E4%B8%80%E6%A0%B7

HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 19:24:28 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8

Now the same request with a similar but differently encoded URL:

# curl -I http://webroot.loc/wordpress/tag/%41-%E4%B8%80%E6%A0%B7

HTTP/1.1 301 Moved Permanently
Date: Sun, 18 Jul 2010 19:25:32 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Location: http://webroot.loc/wordpress/tag/a-%e4%b8%80%e6%a0%b7
Content-Type: text/html; charset=UTF-8

301 is there again.

comment:20 @hakre 5 years ago

My mistake: it must be %61, but the result is the same.

@hakre 5 years ago

Introducing url_normalize() and url_compare()

comment:21 @hakre 5 years ago

Patch introduces url_normalize(), which creates what could be called a "wordpress-way" normalized URL. First of all, it normalizes a URL so that only those characters that need to be encoded are encoded, and everything else mentioned in section 3.2.x of RFC 2616 is compacted into "the one comparable" representation.

This is following the HTTP standard. The "wordpress-way" part of it is to use lowercase triplets. Both PHP and the RFC suggest uppercase as the default. I like lowercase as well, and it's compatible.

Defect: Arguments inside the query part of the URL are not sorted alphabetically. That is something that could be done additionally.
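
A rough Python sketch of what such sorting could look like follows. It is not part of the patch, and `sort_query_args` is a hypothetical name. It sorts the raw arguments by name without decoding them, so the existing percent-encoding is left alone, and the sort is stable, so repeated names keep their relative order.

```python
from urllib.parse import urlsplit, urlunsplit


def sort_query_args(url: str) -> str:
    """Return url with its query arguments sorted by argument name (stable sort)."""
    parts = urlsplit(url)
    if not parts.query:
        # Nothing to sort; hand the URL back unchanged.
        return url
    args = parts.query.split("&")
    # Sort on the raw name before "=", keeping the original order of equal names.
    args.sort(key=lambda arg: arg.partition("=")[0])
    return urlunsplit(parts._replace(query="&".join(args)))


print(sort_query_args("http://webroot.loc/wordpress/?tag=apple&cat=3"))
```

Keeping equal names in their original order matters for repeated arguments, where the order can carry meaning for the application.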

This patch also comes with another function called url_compare(), which I had used before normalizing the URL at the entry point; I left it in just as a usage example.
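
To make the description above concrete, here is a small Python sketch of the two ideas: triplets that encode an unreserved character are decoded, and all remaining triplets are lowercased. It is an illustration under assumptions, not the attached PHP patch. The names `normalize_triplets` and `urls_match` are hypothetical, and the unreserved set is taken from RFC 3986 section 2.3 rather than from the patch.

```python
import re
import string

# Unreserved characters per RFC 3986 section 2.3 (an assumption of this sketch).
_UNRESERVED = frozenset(string.ascii_letters + string.digits + "-._~")
_TRIPLET = re.compile(r"%([0-9A-Fa-f]{2})")


def _normalize_triplet(match: re.Match) -> str:
    """Decode a triplet for an unreserved character; otherwise lowercase it."""
    char = chr(int(match.group(1), 16))
    if char in _UNRESERVED:
        return char
    return match.group(0).lower()


def normalize_triplets(url: str) -> str:
    """Return url with one consistent spelling for every %XX triplet."""
    return _TRIPLET.sub(_normalize_triplet, url)


def urls_match(first: str, second: str) -> bool:
    """Compare two URLs after normalizing their triplets."""
    return normalize_triplets(first) == normalize_triplets(second)


print(normalize_triplets("http://webroot.loc/wordpress/tag/%61-%E4%B8%80%E6%A0%B7"))
```

Reserved characters such as `%2F` stay encoded in this sketch, because decoding them would change how the URL is split into segments.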

comment:22 @hakre 5 years ago

Here an example request with the patch applied:

# curl -I http://webroot.loc/wordpress/tag/%61-%E4%B8%80%E6%A0%B7

HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 23:40:09 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8

No 301 any longer.

@hakre 5 years ago

Query normalized as well

@hakre 5 years ago

URLs are ASCII 7bit (US-ASCII) encoded, 8bit-upper-half is invalid

@hakre 5 years ago

(minor) unset $unreserved as well

comment:23 @hakre 5 years ago

For the log: the path could be normalized as well:

# http://webroot.loc/wordpress/tag/../tag/%61-%E4%B8%80%E6%A0%B7

HTTP/1.1 404 Not Found
Date: Mon, 19 Jul 2010 10:25:59 GMT
Server: Apache
Cache-Control: no-cache, must-revalidate, max-age=0
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Pragma: no-cache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Last-Modified: Mon, 19 Jul 2010 10:26:04 GMT
Content-Type: text/html; charset=UTF-8

comment:24 @hakre 5 years ago

For the log: normalizing the path is a bad idea.

comment:25 @hakre 5 years ago

Related: #7095

comment:26 @hakre 5 years ago

For the log: normalizing the path might not be a bad idea. Needs to be properly checked against RFCs first.

comment:27 @hakre 5 years ago

Ref: The Normalize URLs plugin fixes this issue by following the standards more closely than the changes in [15437] / [15438] / [15444].

comment:28 @hakre 5 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Added the removal of dot segments according to Path Segment Normalization (RFC 3986 6.2.2.3.) in the latest development version of the Normalize URLs Plugin.

Example request with dot segments and percent-encoded unreserved characters:

# curl -I http://webroot.loc/wordpress/t%61g/././../tag/%61pple

HTTP/1.1 200 OK
Date: Wed, 21 Jul 2010 22:52:51 GMT
Server: Apache
Content-Type: text/html
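
For readers unfamiliar with dot-segment removal, the following Python sketch follows the remove_dot_segments routine from RFC 3986 section 5.2.4. It is an illustration, not the plugin's code, and it handles only the path: the percent-encoded unreserved characters in the example request would still need to be decoded in a separate step.

```python
def remove_dot_segments(path: str) -> str:
    """Remove "." and ".." segments from a URL path (RFC 3986 section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            # Step up one level: drop the last segment already emitted.
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with its leading "/", if any) to the output.
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments("/wordpress/t%61g/././../tag/%61pple"))
```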

comment:29 @hakre 5 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Wrong control.
