Make WordPress Core

Opened 13 years ago

Closed 12 years ago

#14069 closed defect (bug) (wontfix)

do_robots() ignores charset setting

Reported by: hakre            Owned by:
Milestone:                    Priority: normal
Severity: normal              Version:
Component: Charset            Keywords: has-patch
Focuses:                      Cc:

Description

The do_robots() function does not reflect the blog's charset setting, even though plugins hooking into it could be aware of that setting.

A possible fix is to add the charset setting (get_bloginfo( 'charset' )) in there.

The bug was "introduced" when a related feature was implemented in [5117]; it looks like the setting was simply forgotten at the time.
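
For illustration only, a rough sketch of the kind of change meant here, assuming a simplified do_robots() body (the real function in core contains more logic; see [5117] and the attached patches):

function do_robots() {
	// Announce the blog's configured charset instead of a hard-coded utf-8,
	// so the header matches whatever plugins hooked into 'do_robotstxt' emit.
	header( 'Content-Type: text/plain; charset=' . get_bloginfo( 'charset' ) );

	// Plugins can append their own rules here.
	do_action( 'do_robotstxt' );

	// Simplified output: block everything for non-public blogs, allow otherwise.
	if ( '0' == get_option( 'blog_public' ) ) {
		echo "User-agent: *\nDisallow: /\n";
	} else {
		echo "User-agent: *\nDisallow:\n";
	}
}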

Attachments (3)

14069.patch (497 bytes) - added by hakre 13 years ago.
robots.patch (823 bytes) - added by joostdevalk 13 years ago.
Patch 2nd go
14069.2.patch (978 bytes) - added by hakre 13 years ago.


Change History (12)

@hakre
13 years ago

#1 @hakre
13 years ago

Easy fix; I already gave it a test run with a blog having the "US-ASCII" charset:

Current:

HTTP/1.1 200 OK
Date: Thu, 24 Jun 2010 10:16:18 GMT
Server: Apache
X-Pingback: http://example.com/xmlrpc.php
Transfer-Encoding: chunked
Content-Type: text/plain; charset=utf-8

User-agent: *
Disallow: /

Patched:

HTTP/1.1 200 OK
Date: Thu, 24 Jun 2010 10:16:18 GMT
Server: Apache
X-Pingback: http://example.com/xmlrpc.php
Transfer-Encoding: chunked
Content-Type: text/plain; charset=US-ASCII

User-agent: *
Disallow: /

#2 @hakre
13 years ago

Related: #4037

#3 @ocean90
13 years ago

  • Milestone changed from Unassigned to Future Release

What happens when a user chooses a charset which isn't the right charset for a robots.txt?
For example, I found this:

To ensure that the search engine bots can read the directives
for blocking or allowing content to be indexed in robots.txt file
(not just with Bing, but all of them), save the file using one
of the following compatible encoding formats:
 ASCII
 ISO-8859-1
 UTF-8
 Windows-1252

I'm not the *charset king* like you, but I want to mention this here; maybe you can give me an answer.

#4 @hakre
13 years ago

I'm not the character queen either. Historically, the safest route has been US-ASCII (or 7-bit ASCII). If robots.txt supported escaping the way URLs do, the robots.txt file could be 100% US-ASCII encoded, and the content it transports could be a urlencoded representation of any other character set (which would not make much sense, because how should a robot determine the charset then?).

To make a long story short: the charset meta-information provided in the headers must match the body encoding of the robots.txt server response. The suggestion from that Bing website can be useful, but should not matter here. In the end, a blog's admin decides which charset the blog uses, and that is the charset robots.txt is encoded in as well. If it's incompatible with robots, then that's the admin's choice.

Blogs should be either US-ASCII or UTF-8, by the way. You can (but need not) use Latin-1 for historical or performance reasons. That is how I would formulate a best-practice suggestion.

Related: #14201

#5 @hakre
13 years ago

More important than the actual encoding of the file (okay, if you want every robot to read it, make it ASCII, period) is the encoding of the relative URLs used inside the file.

Those should be properly urlencoded.

I made a write-up here: Encoding of the robots.txt file, and the resource ocean90 linked has this useful information as well:

[M]ake sure the bots can properly read the file and directory path names, regardless of whether it adheres to ASCII standards. When writing directives that include characters unavailable in ASCII, you can "escape" (aka percent-encode) them, which enables the bot to read them.

I think this is mostly important for webmasters who really want to care about these issues. My suggestion is to deliver the file in US-ASCII. It can then even be mislabeled as UTF-8 or Latin-1 without running into any problems, as long as the rules created by other parts of the web application are correctly urlencoded.
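
As a hedged illustration of that last point (the hook is WordPress's do_robotstxt action; the path and function name are made up for the example), percent-encoding each path segment keeps the emitted rule 7-bit ASCII no matter which charset the blog itself uses:

// Hypothetical plugin callback: add a Disallow rule for a non-ASCII path.
function myplugin_robotstxt_rule() {
	$path = '/kategorie/übersicht/'; // UTF-8 slug, not ASCII-safe as-is

	// Percent-encode each segment so the rule itself stays pure ASCII.
	$segments = array_map( 'rawurlencode', explode( '/', $path ) );
	echo 'Disallow: ' . implode( '/', $segments ) . "\n";
	// Emits: Disallow: /kategorie/%C3%BCbersicht/
}
add_action( 'do_robotstxt', 'myplugin_robotstxt_rule' );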

@joostdevalk
13 years ago

Patch 2nd go

#6 @joostdevalk
13 years ago

  • Cc joost@… added

Agreed with hakre that it should be US-ASCII; also cleaned it up some more, as there was more code than needed.
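
A minimal sketch of what that could boil down to (not necessarily the exact attached patch): since everything core writes into robots.txt is plain ASCII anyway, the header can simply be hard-coded.

// Sketch: label the response as US-ASCII instead of echoing the blog charset.
header( 'Content-Type: text/plain; charset=US-ASCII' );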

@hakre
13 years ago

#7 @hakre
13 years ago

Thanks for looking into this.

Indeed, US-ASCII is the preferred MIME name (ref).

And indeed, the original code can be much simplified.

For documentation purposes: Best Practice robots.txt

Last edited 13 years ago by hakre

#8 @ryan
12 years ago

http://code.google.com/web/controlcrawlindex/docs/robots_txt.html

"The expected file format is plain text encoded in UTF-8."

And a small sampling of top sites suggests UTF-8 is the most commonly used.

#9 @ryan
12 years ago

  • Milestone Future Release deleted
  • Resolution set to wontfix
  • Status changed from new to closed