Make WordPress Core

Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#23070 closed enhancement (invalid)

Generated robots.txt file with privacy on should have Allow: /robots.txt

Reported by: iamfriendly Owned by:
Milestone: Priority: normal
Severity: normal Version:
Component: General Keywords:
Focuses: Cc:

Description

Scenario:

You're developing a site online (perhaps a new version of an existing site) and you have privacy settings disallowing (or, rather, discouraging) crawling. This generates a robots.txt file with

User-agent: *
Disallow: /

Now, you've finished the new site and want Google to have a look, so you stop discouraging spiders. It can take up to 48 hours for Google to respider your robots.txt file. There's a way 'around' that: the 'Fetch as Google' tool in Google Webmaster Tools (under 'Health').

If the generated 'private' robots.txt file had

Allow: /robots.txt

in it, then you would be able to force Google to respider your robots.txt file (and hence allow it to spider your sitemap.xml file) when you need it to, rather than having to wait for an indeterminate amount of time.
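
For illustration, the whole generated 'private' file would then read:

User-agent: *
Disallow: /
Allow: /robots.txt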

Granted, there are ways in which you can avoid this scenario, but I also know there are circumstances where you can't, and adding that simple one-liner can help you out.

Am I missing a trick? Am I being stupid? If not, then I'll write the very quick patch.

Change History (20)

#1 @TobiasBg
12 years ago

Why don't you just create your own robots.txt file? If one exists, WordPress' robots.txt handling will not be in effect.

You can easily handle such special cases with that.
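
For anyone following along, a minimal hand-made robots.txt dropped into the site root that allows all crawling would be the two lines below (an empty Disallow rule disallows nothing):

User-agent: *
Disallow: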

#2 @iamfriendly
12 years ago

Sure, that's definitely an option. But I'm thinking more about people who don't really know how to do that. If someone has already enabled privacy and Google has crawled their site, then that generated robots.txt file will be cached for an indeterminate amount of time (up to 48 hours).

My thinking was that adding this one line to the robots.txt file generated by WordPress could stop some folks from being out of Google's rankings for a couple of days. And forcing Webmaster Tools to recrawl when the line is in the robots.txt file is a relatively easy process (and something I'd add to the Codex).

Adding a 'real' robots.txt file after Google has already crawled doesn't do you any good; it would need to be something you did immediately after putting the WordPress install online.

#3 in reply to: ↑ description @miqrogroove
12 years ago

Replying to iamfriendly:

Am I missing a trick?

You are, in the sense that Google respects any Expires header that you set on your robots.txt file and will discard cached copies after that interval. Fetch as Google being unable to retrieve robots.txt is a Google bug, not a WordPress bug.

#4 follow-up: @iamfriendly
12 years ago

Does WP set any Expires header in the generated file it creates? I'm not saying this is a bug with WordPress in any sense of the word; I'm just trying to help with regard to a relatively rare edge case. Adding that one line to the generated robots.txt file (rather than a manually made 'real' robots.txt file) could potentially save on support requests or spurious bug reports.

#5 in reply to: ↑ 4 @miqrogroove
12 years ago

Replying to iamfriendly:

Does WP set any Expires header in the generated file it creates?

It does not. I believe it only sets X-Pingback and Content-Type.

IMO, you would get more traction for an Expires header than for the original request.
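
Something along these lines would do it as a plugin. This is only a rough sketch that hooks the do_robotstxt action (which fires inside do_robots() before any output is sent); the one-day interval and the myprefix_ name are placeholders:

<?php
/*
Plugin Name: Robots Expires Header (sketch)
*/

// Send caching headers alongside the generated robots.txt.
// 'do_robotstxt' fires in do_robots() before the body is printed,
// so headers can still be added here.
function myprefix_robots_expires_header() {
    $max_age = 86400; // placeholder: cache for one day

    header( 'Expires: ' . gmdate( 'D, d M Y H:i:s', time() + $max_age ) . ' GMT' );
    header( 'Cache-Control: max-age=' . $max_age );
}
add_action( 'do_robotstxt', 'myprefix_robots_expires_header' );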

#6 @iamfriendly
12 years ago

Setting an Expires header sounds like an excellent idea, but then you get into the realm of what value to set. Also, setting the Allow: /robots.txt line negates the need for the Expires header, and it gives the user the option to refresh it whenever they need to.

Perhaps a combination of both is the way forward.

#7 @miqrogroove
12 years ago

Both are also likely to be deemed plugin territory because they can be very easily implemented without hacking on core files. The explanation of the enhancement so far sounds like it wouldn't do anything except for people who are playing with the Fetch as Google application.
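
For reference, a rough sketch of that plugin approach, assuming the robots_txt filter that do_robots() applies to its output; the function name is a placeholder and only the privacy-on case is touched:

<?php
/*
Plugin Name: Allow robots.txt While Private (sketch)
*/

// The 'robots_txt' filter passes the generated output and the blog_public
// setting; '0' means crawling is being discouraged (privacy on).
function myprefix_allow_robots_line( $output, $public ) {
    if ( '0' == $public ) {
        $output .= "Allow: /robots.txt\n";
    }
    return $output;
}
add_filter( 'robots_txt', 'myprefix_allow_robots_line', 10, 2 );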

#8 @iamfriendly
12 years ago

I think my main point has been missed (mainly by me not explaining properly).

'Playing' with the Fetch as Google tool in GWT is useless if Google has already crawled your site with privacy on. Adding a plugin to the site after Google has crawled it is likewise useless. You'd need to activate that plugin before enabling privacy, and considering you're asked what you'd like to do about privacy during the install procedure, sometimes you don't have that option.

Adding this one line to the private robots.txt file could potentially save people two days' worth of headaches. I know this is an edge case, but if it's an easy fix and won't have any security concerns or performance issues, then why not?

#9 @miqrogroove
12 years ago

Fresh installs don't get crawled by Google, and certainly not before the small amount of time it would take to install a plugin. I think you're going to find that a plugin is the best solution and would resolve this completely.

#10 @iamfriendly
12 years ago

With the greatest respect, I disagree. I've already outlined my reasons as to why I think this should be adjusted in core.

It's a potential snag that people may come across and find they can do nothing about. If they discover this problem, it's too late to install a plugin. If they are thinking about this issue with forethought (and therefore install the plugin immediately), then they'll simply make a manual robots.txt file instead. This adjustment isn't for the folks who know how to do that; it's for the general public.

I'll go ahead and write a plugin and make a patch.

#11 @dd32
12 years ago

The addition of Allow: /robots.txt shouldn't significantly improve this situation, but it does look like it'll work around a bug in the Google Crawler.

Crawlers are supposed to ignore robots.txt rules for access to robots.txt itself; however, they DO cache it for long periods of time (7 days is the recommendation, from memory), but they ARE supposed to verify the contents before using the cached copy.

Since it doesn't look like we're sending any cache-control headers, it's up to the client (Google in this case) to manage the caching of the file (a quick way to check is sketched after the quote below). I'd seriously suggest that this is more of a Google bug that they should fix, rather than something we should change.

To refer to a draft specification:

3.4 Expiration
   Robots should cache /robots.txt files, but if they do they must
   periodically verify the cached copy is fresh before using its
   contents.

   Standard HTTP cache-control mechanisms can be used by both origin
   server and robots to influence the caching of the /robots.txt file.
   Specifically robots should take note of Expires header set by the
   origin server.

   If no cache-control directives are present robots should default to
   an expiry of 7 days.
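
A quick way to confirm which caching headers a given install actually sends for its robots.txt is a plain HEAD request; the hostname below is a placeholder:

curl -sI http://example.com/robots.txt

If neither Expires nor Cache-Control shows up in the response, the 7-day default above is what a compliant crawler should fall back to.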

#12 @joostdevalk
12 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to wontfix
  • Status changed from new to closed

I've emailed someone at Google; this is clearly a bug in the Fetch as Google functionality in GWT.

#13 @markoheijnen
12 years ago

  • Resolution wontfix deleted
  • Status changed from closed to reopened

I disagree that we shouldn't add this. It's a simple fix, and we should have fixed this in 3.5.1. Right now we only seem to be looking at Google, where this bug has existed for a long time. We should also care about other search engines/platforms that have the same issue.

#14 @joostdevalk
12 years ago

  • Resolution set to wontfix
  • Status changed from reopened to closed

That's just nonsense, Marko. In fact, this line would be in direct violation of the RFC (see point 3.2.2):

The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.

If other search engines are doing things wrong, feel free to comment with what they're doing wrong, and I'll either explain or contact the search engine.

#15 @markoheijnen
12 years ago

It has been 3 weeks since you mailed them and they still haven't fixed it. If you can get them to fix it ASAP then I'm fine, but we are talking about Google, so we'll probably need to wait a long time.

#16 @joostdevalk
12 years ago

Regardless, we shouldn't add stuff simply because they have a bug. It's not our bug to fix.

#17 @markoheijnen
12 years ago

It's not our bug to fix, but it is our responsibility to make the experience in WordPress as good as we can. That means we sometimes need to fix things that aren't our mistake.

#18 @joostdevalk
12 years ago

Not by going against web standards that dozens, if not hundreds, of search engines outside of Google adhere to.

#19 @markoheijnen
12 years ago

Last time I checked, there is a way to detect when Google's bot is visiting robots.txt, so it's only partly against web standards.

#20 @nacin
12 years ago

  • Resolution changed from wontfix to invalid

The /robots.txt URL is always allowed, and must not appear in the Allow/Disallow rules.

That's more than enough for me. Sometimes we do need to fix stuff that isn't our mistake. But we're dealing with a hosted webmaster tool that Google could (and hopefully will) fix at any time, not a major issue that affects the masses via a poorly configured server or browser.
