Make WordPress Core

Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#14313 closed defect (bug) (duplicate)

Tag permalinks with Chinese (Japanese) characters have an issue

Reported by: lafirel
Owned by:
Milestone:
Priority: normal
Severity: major
Version: 3.0
Component: General
Keywords: chinese, tag, permalink, crawl
Focuses:
Cc:

Description

If a tag permalink contains Chinese (Japanese) characters, the permalink of the tag looks like this: "/tag/%E4%B8%80%E6%A0%B7/"

Here "%E4%B8%80%E6%A0%B7" is the percent-encoded form of the Chinese (Japanese) characters. But when a browser or a search engine spider visits or crawls the tag page /tag/%E4%B8%80%E6%A0%B7/, WordPress 3.0 automatically redirects "/tag/%E4%B8%80%E6%A0%B7/" to "/tag/%e4%b8%80%e6%a0%b7/" (lowercase).

This is where the problem starts: many Chinese and Japanese users report that after upgrading to WordPress 3.0, the Google spider can no longer crawl their tag pages smoothly because it does not accept the 301. You can check the crawl errors and the server log to see how WordPress returns a 301 for the tag URL and how Google reports the errors.

Users of WordPress 2.7.1 or 2.9.2 do not have this problem, but WordPress 3.0 users see a lot of redirect errors in Google Webmaster Tools.

Attachments (2)

crawl error1.jpg (163.4 KB) - added by lafirel 14 years ago.
crawl error2.jpg (29.2 KB) - added by lafirel 14 years ago.

Change History (25)

@lafirel
14 years ago

#1 follow-ups: @nacin
14 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #14292 ?

#2 in reply to: ↑ 1 @Lafirel
14 years ago

Replying to nacin:

Duplicate of #14292 ?

Yes, sorry for asking again. I searched beforehand but did not find #14292.
I hope this gets fixed soon.

#3 in reply to: ↑ 1 @hakre
14 years ago

Replying to nacin:

Duplicate of #14292 ?

Both look related at first sight, but this is probably not a duplicate. I wonder why there is a redirect to a lowercase variant in this case. Do we lowercase percent-encoded character triplets (%[0-9A-F][0-9A-F]) somewhere while normalizing URLs?
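
For reference, the hex digits of a percent-encoded triplet are case-insensitive: RFC 3986 treats %e4 and %E4 as equivalent and recommends uppercase digits when normalizing. The following is a minimal Python sketch, not WordPress code, showing that both URLs from this ticket decode to the same tag slug; `normalize_triplets` is a hypothetical helper introduced only for illustration.

```python
import re
from urllib.parse import unquote

# Matches one percent-encoded triplet, in either case.
TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_triplets(url_path: str) -> str:
    """Uppercase the hex digits of every percent-encoded triplet (hypothetical helper)."""
    return TRIPLET.sub(lambda m: m.group(0).upper(), url_path)


upper = "/tag/%E4%B8%80%E6%A0%B7/"
lower = "/tag/%e4%b8%80%e6%a0%b7/"

# Both spellings decode to the same UTF-8 characters.
print(unquote(upper) == unquote(lower))    # True
# After normalization the two paths are identical strings.
print(normalize_triplets(lower) == upper)  # True
```

Under that reading, the uppercase and lowercase URLs name the same resource, so a redirect between them carries no new information for the client.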

#4 @hakre
14 years ago

I tried to reproduce this on current trunk but was not able to.

Instead of getting a redirect, I get a 404 Not Found, which is similar to #13413.

#5 @hakre
14 years ago

Slugs are stored in lowercase in the database (slug field of the terms table). That explains why a redirect to the lowercase variant is made. It does not explain the 404.

#6 @hakre
14 years ago

Interestingly, the public 'matched_query' property of WP is the string 'tag=%E4%B8%80%E6%A0%B7', which contains uppercase character triplets, even though the request's $_SERVER['REQUEST_URI'] is the string '/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/', which has lowercase characters.

This is in WP::main after $this->parse_request, first called in wp-blog-header.php.

#7 @hakre
14 years ago

Related: #11528 - getting closer to a fix.

#8 @hakre
14 years ago

I was able to write a patch that fixes a related issue: #13413

#9 @Lafirel
14 years ago

I gave it a try, but 13413.2.patch in #13413 does not fix this issue.

If you request "/tag/%E4%B8%80%E6%A0%B7", the server still responds with 301 Moved Permanently to "/tag/%e4%b8%80%e6%a0%b7".

#10 @Lafirel
14 years ago

  • Resolution duplicate deleted
  • Status changed from closed to reopened

#11 follow-up: @Lafirel
14 years ago

Hey, after running some server header checks, I found something interesting.
This is with no patch applied to my WordPress 3.0, that is, the original 3.0.

I used a tool called Check Server Header: http://www.seoconsultants.com/tools/headers/

When I fetch the headers as a browser such as IE7, IE6, Firefox or Opera, the server response is 200 OK.

When I fetch the headers as a bot such as Googlebot or MSNbot, the server response is 301 Moved Permanently.

I am wondering if 3.0 has a mechanism that sends different headers to different browsers or search engine spiders?

#12 @hakre
14 years ago

The patch in #13413 was not meant to fix this issue here, but I needed to do that fix before being able to reproduce this ticket.

So with that patch, I can reproduce the ticket here.

I have now tested the redirects with curl:

Redirect (uppercase character triplets):

# curl -I http://host/wordpress-trunk/tag/%E4%B8%80%E6%A0%B7/

HTTP/1.1 301 Moved Permanently
Date: Sat, 17 Jul 2010 10:25:40 GMT
Server: Apache
X-Pingback: http://host/wordpress-trunk/xmlrpc.php
Location: http://host/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/
Content-Type: text/html; charset=UTF-8

Redirect (mixed-case character triplets):

# curl -I http://host/wordpress-trunk/tag/%e4%B8%80%E6%A0%B7/

HTTP/1.1 301 Moved Permanently
Date: Sat, 17 Jul 2010 10:26:50 GMT
Server: Apache
X-Pingback: http://host/wordpress-trunk/xmlrpc.php
Location: http://host/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/
Content-Type: text/html; charset=UTF-8

No redirect (lowercase character triplets):

# curl -I http://host/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/

HTTP/1.1 200 OK
Date: Sat, 17 Jul 2010 10:24:31 GMT
Server: Apache
X-Pingback: http://host/wordpress-trunk/xmlrpc.php
Content-Type: text/html; charset=UTF-8
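
The three responses above are consistent with an exact, case-sensitive comparison between the requested path and a canonical path built from the lowercase slug. The Python sketch below only models that observed behavior; it is not WordPress's implementation, and `respond_exact` and `respond_triplet_insensitive` are hypothetical names. The second function shows one possible alternative: comparing the decoded paths, so that the case of the hex digits no longer matters.

```python
from urllib.parse import unquote

# Canonical path as seen in the Location header above (lowercase triplets).
CANONICAL = "/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/"


def respond_exact(requested: str, canonical: str = CANONICAL) -> int:
    """Model of the observed behavior: any string difference yields a 301."""
    return 200 if requested == canonical else 301


def respond_triplet_insensitive(requested: str, canonical: str = CANONICAL) -> int:
    """Assumed alternative: compare decoded paths, ignoring hex-digit case."""
    return 200 if unquote(requested) == unquote(canonical) else 301


cases = [
    "/wordpress-trunk/tag/%E4%B8%80%E6%A0%B7/",  # uppercase triplets
    "/wordpress-trunk/tag/%e4%B8%80%E6%A0%B7/",  # mixed-case triplets
    "/wordpress-trunk/tag/%e4%b8%80%e6%a0%b7/",  # lowercase triplets
]

for path in cases:
    print(path, respond_exact(path), respond_triplet_insensitive(path))
```

With the exact comparison, only the lowercase request returns 200, matching the curl output; with the decoded comparison, all three return 200. Decoding before comparing is a simplification: it would also treat an encoded slash (%2F) like a literal one, which a real fix would have to consider.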

#13 @hakre
14 years ago

As a fix for live sites, I've updated the Better HTTP Redirect Plugin to take care of this issue as well. It has been built in since version 1.2-beta-2: Redirect Loop Protection for Better HTTP Redirects Plugin. Just download the development version.

It works much like the new patch I've uploaded in the other ticket.

#14 @hakre
14 years ago

  • Resolution set to duplicate
  • Status changed from reopened to closed

From what I see now, this is a duplicate of #14292

#15 follow-up: @Lafirel
14 years ago

  • Resolution duplicate deleted
  • Status changed from closed to reopened

I downloaded version 1.2-beta-2 of the plugin, uploaded it and activated it.

Then I got this:

Results:

http://lafirel.com/tag/%E4%B8%89%E7%BA%A2

HTTP/1.1 301 Moved Permanently
Transfer-Encoding: chunked
Date: Sat, 17 Jul 2010 14:27:35 GMT
Server: LiteSpeed
Connection: close
X-Powered-By: PHP/5.2.12
Vary: Cookie
X-Pingback: http://lafirel.com/xmlrpc.php
Content-Type: text/html; charset=UTF-8
Location: http://lafirel.com/tag/%e4%b8%89%e7%ba%a2



Did you test it before saying that this plugin takes care of this issue as well?

#16 in reply to: ↑ 15 @hakre
14 years ago

Replying to Lafirel:

Did you test it before saying that this plugin takes care of this issue as well?

That's the exact issue, yes: lowercase and uppercase URL-encoding, the three cases I posted above. I need to check it a second time; maybe I made a mistake in the plugin by accident.

#17 @hakre
14 years ago

The plugin looks okay as far as I can see. I need to apply the other patch ( 13413.2.patch ) so that UTF-8 works in tag slugs for me.

Please try this patch: 14292.2.diff

#18 @hakre
14 years ago

For the record, so you can see my request:

# curl -I http://webroot.loc:80/wordpress/tag/%E4%B8%80%e6%A0%B7

HTTP/1.1 200 OK
Date: Sun, 18 Jul 2010 03:33:38 GMT
Server: Apache
X-Pingback: http://webroot.loc/wordpress/xmlrpc.php
Content-Type: text/html; charset=UTF-8

This is with 14292.2.diff applied. Please test against a wordpress trunk version with no plugins enabled.

#19 @Lafirel
14 years ago

I tested both 14292.2.diff and 14292.2.patch.

http://lafirel.com/tag/%E5%89%91%E8%B1%AA3

HTTP/1.1 200 OK
Transfer-Encoding: chunked
Date: Sun, 18 Jul 2010 05:46:37 GMT
Server: LiteSpeed
Connection: close
X-Powered-By: PHP/5.2.12
Vary: Cookie
X-Pingback: http://lafirel.com/xmlrpc.php
Content-Type: text/html; charset=UTF-8



Both seem OK! I chose 14292.2.patch.

Can I publish this patch on my blog, so that users who have the same issue can find it and fix it?

#20 @dd32
14 years ago

Can I publish this patch on my blog, so that users who have the same issue can find it and fix it?

You can do whatever you wish with it. It may go into a release as is or modified; it may not be the final solution here.

#21 in reply to: ↑ 11 @hakre
14 years ago

Replying to Lafirel:

I am wondering if 3.0 has a mechanism that sends different headers to different browsers or search engine spiders?

Robot, search, post, preview, trackback and comment popup requests, as well as requests made by an admin user, are never redirected.

You reported that requests from robots got a redirect, so this is out of sync, but ignoring that for now, this is the list of cases in which canonical redirects are not done.

#22 @hakre
14 years ago

  • Resolution set to duplicate
  • Status changed from reopened to closed

Duplicate of #14292

#23 @hakre
14 years ago

Related: #7095
