Opened 13 years ago
Closed 12 years ago
#14069 closed defect (bug) (wontfix)
do_robots() ignores charset setting
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | Charset | Keywords: | has-patch |
| Focuses: | | Cc: | |
Description
The do_robots() function does not reflect the blog's charset setting, even though it hooks into plugins that could be aware of it.
A possible fix would be to add the charset setting (get_bloginfo( 'charset' )) there.
The bug was "introduced" when a related feature was implemented in [5117]; it looks like the setting was simply forgotten at the time.
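A minimal sketch of what such a change could look like (this is not the attached patch, and the body of do_robots() is abbreviated here):

```php
<?php
// Sketch only: send the blog's configured charset with the robots.txt
// response so that plugins hooking into 'do_robotstxt' and 'robots_txt'
// can rely on it matching the rest of the site.
function do_robots() {
	header( 'Content-Type: text/plain; charset=' . get_bloginfo( 'charset' ) );

	do_action( 'do_robotstxt' );

	$output = "User-agent: *\n";
	$public = get_option( 'blog_public' );
	$output .= ( '0' == $public ) ? "Disallow: /\n" : "Disallow: /wp-admin/\n";

	echo apply_filters( 'robots_txt', $output, $public );
}
```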
Attachments (3)
Change History (12)
#3
@
13 years ago
- Milestone changed from Unassigned to Future Release
What happens when a user chooses a charset that isn't the right charset for a robots.txt file?
For example I found this:
To ensure that the search engine bots can read the directives for blocking or allowing content to be indexed in robots.txt file (not just with Bing, but all of them), save the file using one of the following compatible encoding formats: ASCII, ISO-8859-1, UTF-8, Windows-1252
I'm not the *charset king* like you, but I wanted to mention this here; maybe you can give me an answer.
#4
@
13 years ago
I'm not the character queen either. Historically the safest route has been US-ASCII (or 7-bit ASCII). If robots.txt supported encoding the way URLs do, the robots.txt file could be 100% US-ASCII encoded and the content it transports could be a urlencoded representation of any other character set (which would not make much sense, because how would a robot determine the charset then?).
To make a long story short, the charset meta-information provided by the headers must match the body encoding of the robots.txt server response. The suggestion from that Bing website can be useful, but it should not matter here. In the end a blog's admin decides which charset the blog uses, and that's the charset robots.txt is encoded in as well. If it's incompatible with robots, then that's the admin's choice.
Blogs should be either US-ASCII or UTF-8, by the way. You can (but don't have to) use Latin-1 for historical or performance reasons. This is how I would formulate a best-practice suggestion.
Related: #14201
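To illustrate the point about the header and the body agreeing, here is a sketch only (not part of any patch; the mbstring conversion step is my own assumption, not something do_robots() does):

```php
<?php
// Illustration: the charset declared in the Content-Type header must be the
// charset the response body is actually encoded in.
$charset = get_option( 'blog_charset' );   // e.g. "UTF-8" or "US-ASCII"
header( 'Content-Type: text/plain; charset=' . $charset );

$output = "User-agent: *\nDisallow: /wp-admin/\n";

// If the declared charset is not UTF-8, convert the body so the declaration
// and the bytes on the wire agree (assumes ext/mbstring is available).
if ( function_exists( 'mb_convert_encoding' ) && 'UTF-8' !== strtoupper( $charset ) ) {
	$output = mb_convert_encoding( $output, $charset, 'UTF-8' );
}

echo $output;
```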
#5
@
13 years ago
More important than the actual encoding of the file (okay, if you want every robot to read it, make it ASCII, period) is the encoding of the relative URLs used inside the file.
Those should be properly urlencoded.
I made a write-up here: Encoding of the robots.txt file, and the resource ocean90 linked has this useful information as well:
[M]ake sure the bots can properly read the file and directory path names, regardless of whether it adheres to ASCII standards. When writing directives that include characters unavailable in ASCII, you can "escape" (aka percent-encode) them, which enables the bot to read them.
I think this mostly matters for webmasters who really want to care about these issues. My suggestion is to deliver the file in US-ASCII. It can then even be mislabeled as UTF-8 or Latin-1 without running into any problems, as long as the rules created by other parts of the web application are correctly urlencoded.
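A small sketch of what "properly urlencoded" means for a rule with a non-ASCII path (the path is a made-up example):

```php
<?php
// Illustration: percent-encode each path segment so the robots.txt file
// itself stays pure US-ASCII even when the URL contains non-ASCII characters.
$path     = '/kategorie/bücher/';   // hypothetical non-ASCII path
$segments = array_map( 'rawurlencode', explode( '/', trim( $path, '/' ) ) );

echo 'Disallow: /' . implode( '/', $segments ) . "/\n";
// Prints: Disallow: /kategorie/b%C3%BCcher/
```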
#6
@
13 years ago
- Cc joost@… added
Agreed with Hakre that it should be US-ASCII; also cleaned it up some more, as there was more code than needed.
#7
@
13 years ago
Thanks for looking into this.
Indeed, US-ASCII is the preferred MIME name (ref).
And indeed, the original code can be much simplified.
For documentation purposes: Best Practice robots.txt
#8
@
12 years ago
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
"The expected file format is plain text encoded in UTF-8."
And a small sampling of top sites suggests UTF-8 is the most commonly used.
Easy fix; I already gave it a test run with a blog having the "US-ASCII" charset:
Current:
Patched:
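For reference, one way to check which Content-Type header a blog actually sends for its robots.txt (a hypothetical check using the WordPress HTTP API; the actual Current/Patched comparison is in the ticket's attachments):

```php
<?php
// Hypothetical check: fetch robots.txt from the test blog and print the
// Content-Type header it responds with. With the patch applied and the blog
// charset set to "US-ASCII", that charset would be expected to show up here.
$response = wp_remote_get( home_url( '/robots.txt' ) );

if ( ! is_wp_error( $response ) ) {
	echo wp_remote_retrieve_header( $response, 'content-type' ), "\n";
}
```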