#23070 closed enhancement (invalid)
Generated robots.txt file with privacy on should have Allow: /robots.txt
| Reported by: | iamfriendly | Owned by: | |
|---|---|---|---|
| Milestone: | | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | General | Keywords: | |
| Focuses: | | Cc: | |
Description
Scenario:
You're developing a site online (perhaps a new version of an existing site) and you have privacy settings disallowing (or, rather, discouraging) crawling. This generates a robots.txt file with:
User-agent: *
Disallow: /
Now, you've finished the new site and want Google to have a look, so you stop discouraging spiders. It can then take up to 48 hours for Google to respider your robots.txt file. There's a way 'around' that: use the 'Fetch as Google' tool (under 'Health') in Google Webmaster Tools.
If the generated 'private' robots.txt file had
Allow: /robots.txt
in it, then you would be able to force Google to respider your robots.txt file (and hence allow it to spider your sitemap.xml file) when you need it to, rather than having to wait for an indeterminate amount of time.
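For clarity, the generated 'private' file with that one line added would read in full:

```
User-agent: *
Disallow: /
Allow: /robots.txt
```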
Granted, there are ways in which you can avoid this scenario, but I also know there are circumstances where you can't, and adding that simple one-liner can help you out.
Am I missing a trick? Am I being stupid? If not, then I'll write the very quick patch.
Change History (20)
#2 @ 12 years ago
Sure, that's definitely an option. But I'm thinking more about people who don't really know how to do that. If someone has already enabled privacy and Google has crawled their site, then that generated robots.txt file will be cached for an indeterminate amount of time (up to 48 hours).
My thoughts were that adding this one line to the robots.txt file generated by WordPress could stop some folk from being out of the Google rankings for a couple of days. And forcing Webmaster Tools to recrawl when the line is in the robots.txt file is a relatively easy process (and something I'd add to the Codex).
Adding a 'real' robots.txt file after Google has already crawled doesn't do you any good; it would need to be something you did immediately after putting the WordPress instance online.
#3 in reply to: ↑ description @ 12 years ago
Replying to iamfriendly:
Am I missing a trick?
You are, in the sense that Google respects any Expires header that you set on your robots.txt file and will discard cached copies after that interval. Fetch as Google being unable to retrieve robots.txt is a Google bug, not a WordPress bug.
#4 follow-up: ↓ 5 @ 12 years ago
Does WP set any Expires header in the generated file it creates? I'm not saying this is a bug with WordPress in any sense of the word; I'm just trying to help with a relatively rare edge case. Adding that one line to the generated robots.txt file (rather than to a manually made 'real' robots.txt file) could potentially save on support requests or spurious bug reports.
#5 in reply to: ↑ 4 @ 12 years ago
Replying to iamfriendly:
Does WP set any expires header in the generated file it creates?
It does not. I believe it only sets X-Pingback and Content-Type.
IMO, you would get more traction for an Expires header than for the original request.
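A minimal sketch of what sending such an Expires header from a plugin might look like, assuming the do_robotstxt action that fires when WordPress serves its generated robots.txt; the one-hour lifetime and the myprefix_ name are illustrative assumptions only:

```php
<?php
/*
Plugin Name: Robots Expires Header (sketch)
Description: Hypothetical example; sends cache headers alongside the generated robots.txt.
*/

// Fires just before WordPress prints the generated robots.txt.
function myprefix_robots_expires_header() {
	$ttl = 3600; // Assumption: a one-hour lifetime; pick whatever suits the site.
	header( 'Cache-Control: max-age=' . $ttl );
	header( 'Expires: ' . gmdate( 'D, d M Y H:i:s', time() + $ttl ) . ' GMT' );
}
add_action( 'do_robotstxt', 'myprefix_robots_expires_header' );
```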
#6 @ 12 years ago
Setting an Expires header sounds like an excellent idea, but then you get into the realm of what value to set. Also, setting the Allow: /robots.txt line negates the need for the Expires header, and it also gives the user the option to refresh it whenever they need to.
Perhaps a combination of both is the way forward.
#7 @ 12 years ago
Both are also likely to be deemed plugin territory because they can be very easily implemented without hacking on core files. The explanation of the enhancement so far sounds like it wouldn't do anything except for people who are playing with the Fetch as Google application.
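A minimal sketch of that plugin-territory approach, assuming the robots_txt filter that WordPress applies to its generated output; the myprefix_ name is a placeholder, and the '0' check mirrors the blog_public option value when crawling is being discouraged:

```php
<?php
/*
Plugin Name: Allow Robots Fetch (sketch)
Description: Hypothetical example; appends the proposed Allow rule to the generated robots.txt.
*/

// $output is the generated robots.txt text; $public is the blog_public option ('0' = discourage crawlers).
function myprefix_allow_robots_line( $output, $public ) {
	if ( '0' == $public ) {
		$output .= "Allow: /robots.txt\n";
	}
	return $output;
}
add_filter( 'robots_txt', 'myprefix_allow_robots_line', 10, 2 );
```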
#8 @ 12 years ago
I think my main point has been missed (mainly by me not explaining properly).
'Playing' with the Fetch as Google tool in GWT is useless if Google has already crawled your site while privacy is on; adding a plugin to the site after Google has crawled it is likewise useless. You'd need to activate that plugin before activating privacy, and considering you're asked what you'd like to do about privacy during the install procedure, sometimes you don't have that option.
Adding this one line to the private robots.txt file could potentially save people two days' worth of headaches. I know this is an edge case, but if it's an easy fix and won't raise any security or performance concerns, then why not?
#9 @ 12 years ago
Fresh installs don't get crawled by Google, and certainly not before the small amount of time it would take to install a plugin. I think you're going to find that a plugin is the best solution and would resolve this completely.
#10 @ 12 years ago
With the greatest of respect, I disagree. I've already outlined my reasons why I think this should be adjusted in core.
It's a potential snag that people may come across and find they can do nothing about. If they discover this problem, it's too late to install a plugin. If they are thinking about this issue with forethought (and therefore install the plugin immediately), then they'll simply make a manual robots.txt file. This adjustment isn't for those folks who know about how to do that, it's for the general public.
I'll go ahead and write a plugin and make a patch.
#11 @ 12 years ago
The addition of Allow: /robots.txt shouldn't significantly improve this situation, but it does look like it'll work around a bug in the Google crawler.
Crawlers are supposed to ignore robots.txt rules when fetching robots.txt itself; however, they DO cache it for long periods of time (7 days is the recommendation, from memory), but they ARE supposed to verify the cached copy is still fresh before using its contents.
Since it doesn't look like we're sending any cache-control headers, it's up to the client (Google in this case) to manage the caching of the file. I'd seriously suggest that this is more of a Google bug that they should fix, rather than something we should change.
To refer to a draft specification:
3.4 Expiration
Robots should cache /robots.txt files, but if they do they must periodically verify the cached copy is fresh before using its contents. Standard HTTP cache-control mechanisms can be used by both origin server and robots to influence the caching of the /robots.txt file. Specifically robots should take note of Expires header set by the origin server. If no cache-control directives are present robots should default to an expiry of 7 days.
#12 @ 12 years ago
- Milestone Awaiting Review deleted
- Resolution set to wontfix
- Status changed from new to closed
I've emailed someone at Google; this is clearly a bug in the Fetch as Google functionality in GWT.
#13 @ 12 years ago
- Resolution wontfix deleted
- Status changed from closed to reopened
I disagree that we shouldn't add this. It's a simple fix, and we should fix it in 3.5.1. Right now we only seem to be looking at Google, where this bug has existed for a long time. We should also care about other search engines/platforms that have this issue.
#14 @ 12 years ago
- Resolution set to wontfix
- Status changed from reopened to closed
That's just nonsense, Marko. In fact, this line would be in direct violation of the RFC (see point 3.2.2):
The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.
If other search engines are doing things wrong, feel free to comment with what they're doing wrong; I'll either explain or contact the search engine.
#15 @ 12 years ago
It's been three weeks since you emailed them and they still haven't fixed it. If you can get them to fix it ASAP, then I'm fine, but we are talking about Google, so we'll probably have to wait a long time.
#16 @ 12 years ago
Regardless, we shouldn't add stuff simply because they have a bug. It's not our bug to fix.
#17 @ 12 years ago
It's not our bug to fix, but it is our responsibility to make the experience in WordPress as good as we can. That means we sometimes need to fix stuff that isn't our mistake.
#18 @ 12 years ago
Not against web standards that dozens, if not hundreds, of search engines outside of Google adhere to.
#19 @ 12 years ago
Last time I checked, there is a way to detect that Google's bot is visiting robots.txt, so it's only partly against web standards.
#20 @ 12 years ago
- Resolution changed from wontfix to invalid
The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.
That's more than enough for me. Sometimes we do need to fix stuff that isn't our mistake. But we're dealing with a hosted webmaster tool that Google could (and hopefully will) fix at any time, not a major issue that affects the masses via a poorly configured server or browser.
Why don't you just create your own robots.txt file? If one exists, WordPress' robots.txt handling will not be in effect, and you can easily handle such special cases that way.