Make WordPress Core

Opened 13 years ago

Closed 13 years ago

Last modified 12 years ago

#18465 closed enhancement (fixed)

Prevent search engines from indexing wp-admin and wp-includes

Reported by: Viper007Bond Owned by: ryan
Milestone: 3.4 Priority: lowest
Severity: trivial Version: 3.2.1
Component: General Keywords: has-patch
Focuses: Cc:

Description

Attachments (8)

18465.patch (650 bytes) - added by SergeyBiryukov 13 years ago.
18465.2.patch (770 bytes) - added by SergeyBiryukov 13 years ago.
18465.3.patch (768 bytes) - added by SergeyBiryukov 13 years ago.
18465.4.patch (770 bytes) - added by SergeyBiryukov 13 years ago.
18465.5.patch (771 bytes) - added by SergeyBiryukov 13 years ago.
Same as previous, but switched to !empty() again
noindex.patch (1.6 KB) - added by joostdevalk 13 years ago.
Noindex HTTP header patch for wp-admin
nofollow.patch (1.4 KB) - added by neoxx 13 years ago.
Reduce links to wp-login.php by rel="nofollow"
18465.diff (441 bytes) - added by ryan 13 years ago.


Change History (47)

#1 @Viper007Bond
13 years ago

One thing to note: I have a real robots.txt file (WordPress doesn't handle it), so while my site isn't a perfect example, it's close enough, since the same thing happens on a stock WordPress install.

#2 @SergeyBiryukov
13 years ago

  • Keywords has-patch added; needs-patch removed

#3 @Viper007Bond
13 years ago

  • Keywords needs-patch added; has-patch removed

I meant to mention this but forgot: the required patch is more complicated than it first seems because WordPress can be installed in a subdirectory. Your patch, Sergey, won't block it on my site, for example, because the path should be /wordpress/wp-admin/ for me.

I'm not sure of the best way to build the relative path.

#4 follow-up: @SergeyBiryukov
13 years ago

According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497
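
For reference, the condition there boils down to roughly this (a conceptual paraphrase, not the verbatim source):

$home_path = parse_url( home_url() );
$is_root   = empty( $home_path['path'] ) || '/' === $home_path['path'];
// The rewrite rule mapping robots.txt to index.php?robots=1 is only registered when $is_root is true.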

#5 in reply to: ↑ 4 @Viper007Bond
13 years ago

Replying to SergeyBiryukov:

According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497

Correct, but see http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory

This file is generated by WordPress: http://www.finalgear.com/robots.txt

However the path to my wp-admin folder is http://www.finalgear.com/wordpress/wp-admin/

#6 @Viper007Bond
13 years ago

Probably just parsing admin_url() to remove the domain is enough.
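
For example, something along these lines (a minimal sketch, not the attached patch):

$admin_path = parse_url( admin_url(), PHP_URL_PATH ); // e.g. "/wordpress/wp-admin/" on a subdirectory install
$output    .= "Disallow: $admin_path\n";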

#7 @SergeyBiryukov
13 years ago

Right, I misunderstood that comment in rewrite.php. Updated the patch.

#8 @SergeyBiryukov
13 years ago

  • Keywords has-patch added; needs-patch removed

#9 @scribu
13 years ago

"Ternary operators are fine, but always have them test if the statement is true, not false. Otherwise it just gets confusing."

http://codex.wordpress.org/WordPress_Coding_Standards#Ternary_Operator

#10 @justindgivens
13 years ago

Why not just add these lines in the robots.txt?

Disallow: */wp-admin
Disallow: */wp-includes
Disallow: */wp-content/

That way, no matter where wp-admin is, it won't be indexed.

#11 @scribu
13 years ago

You don't want to block wp-content, as that's where your uploads are (which most people do want indexed).

As for using */wp-admin/, it would work, but it's better to be exact.

#12 @SergeyBiryukov
13 years ago

Replying to scribu:

"Ternary operators are fine, but always have them test if the statement is true, not false."

In the case of empty(), the other way round seems more logical to me, but coding standards FTW :)

#13 follow-up: @dd32
13 years ago

Why not just add these lines in the robots.txt?

You can't use * in Disallow statements, as the standard doesn't support wildcards; matching starts from the beginning of the path.

Disallow: $path/wp-includes

This should probably be suffixed with / as well, so that pages/posts whose slugs start with wp-admin or wp-includes can still be indexed.
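
Purely for illustration (the post slug is hypothetical):

# Without the trailing slash, "Disallow: /wp-includes" would also match a post at /wp-includes-explained/
Disallow: /wp-includes/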

#14 @justindgivens
13 years ago

Well, I tested it in Google Webmaster Tools and it blocked crawler access.

But if you don't want to add the "*", then add <meta name="robots" content="noindex"> to the /wp-admin index file.

#15 in reply to: ↑ 13 @SergeyBiryukov
13 years ago

Replying to dd32:

This should probably be suffixed with / as well

Done in 18465.4.patch.

#16 @SergeyBiryukov
13 years ago

  • Milestone changed from Awaiting Review to 3.3

#17 @dd32
13 years ago

Well, I tested it in Google Webmaster Tools and it blocked crawler access.

Seems that Google and a few others support an "Extended" robots.txt standard, see the "June 2008 Agreement" here: http://www.searchtools.com/robots/robots-exclusion-protocol.html

Best to go with the explicit directory disallow for compatibility with other crawlers, since the wildcard doesn't bring any dramatic advantage over the original standard.

#18 @nacin
13 years ago

I've been meaning to add an exception to the coding standards for !empty(). That construct should be preferred as an exception.

@SergeyBiryukov
13 years ago

Same as previous, but switched to !empty() again

#19 @kurtpayne
13 years ago

  • Cc kpayne@… added

Should readme.html be blocked as well?

Googling for specific terms brings up a lot of WordPress sites.

#20 @SergeyBiryukov
13 years ago

The main concern in #17601 was the errors in server logs caused by indexing wp-admin and wp-includes. Indexing readme.html doesn't create such errors, but it probably doesn't make much sense either, so we could reduce unnecessary crawling even more.

That said, I'm not sure how readme.html ends up in search results, since we have index.php in the root directory. Perhaps there's a link to readme.html somewhere on those sites, or they're missing index.php in DirectoryIndex, or their hosts don't support PHP (so they can't run WordPress anyway).

For wp-includes, there are far more results (currently about 5,540,000).

#21 @dd32
13 years ago

  • Keywords commit added

All looks good to me from here based on testing.

#22 @ryan
13 years ago

  • Owner set to ryan
  • Resolution set to fixed
  • Status changed from new to closed

In [18822]:

Disallow indexing wp-admin and wp-includes in robots.txt. Props SergeyBiryukov. fixes #18465
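
The committed change boils down to roughly the following inside do_robots(), deriving the install path from site_url() as the final patch does (a paraphrase, not the verbatim diff):

$site_url = parse_url( site_url() );
$path     = ( ! empty( $site_url['path'] ) ) ? $site_url['path'] : '';
$output  .= "Disallow: $path/wp-admin/\n";
$output  .= "Disallow: $path/wp-includes/\n";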

#23 @nacin
13 years ago

How will this work in a network situation?

#24 @scribu
13 years ago

  • Keywords commit removed

I guess we should use a leading wildcard, as was suggested in comment:10.

#25 @SergeyBiryukov
13 years ago

Replying to nacin:

How will this work in a network situation?

On a sub-domain install, any site would return this for a robots.txt request:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

On a sub-directory install, robots.txt (with the same content) is only served for the main site.

Replying to scribu:

I guess we should use a leading wildcard, as was suggested in comment:10.

Since wildcards in Disallow statements are not officially supported by the protocol, I guess we should leave it as is for now.

#26 @neoxx
13 years ago

  • Cc neo@… added

Maybe we should also deny wp-login.php?

Disallow: /wp-login.php
Disallow: /wp-login.php?*

#27 @SergeyBiryukov
13 years ago

wp-login.php has <meta name='robots' content='noindex,nofollow' />.

#28 follow-ups: @joostdevalk
13 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from robots.txt should tell Google to not index wp-admin and wp-includes to Prevent search engines from indexing wp-admin and wp-includes

This is a valid problem, but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page. An example of this can be seen on my Dutch domain:

https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin

The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in wp-admin and for admin-ajax.php. I'll add a patch.
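
Conceptually, something like this (a sketch of the idea only, not the attached noindex.patch; the function name is made up for illustration):

function yst_noindex_admin() {
    // Send the HTTP equivalent of a robots meta tag for admin requests,
    // assuming no output has been sent yet.
    if ( ! headers_sent() ) {
        header( 'X-Robots-Tag: noindex' );
    }
}
add_action( 'admin_init', 'yst_noindex_admin' );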

#29 follow-up: @joostdevalk
13 years ago

(This fix, by the way, has the added benefit of working for people with static robots.txt files.)

#30 @joostdevalk
13 years ago

  • Cc joost@… added

@joostdevalk
13 years ago

Noindex HTTP header patch for wp-admin

#31 in reply to: ↑ 29 @SergeyBiryukov
13 years ago

  • Milestone changed from 3.3 to 3.4

Replying to joostdevalk:

(this fix btw has the added benefit of fixing it for people with static robots.txt files)

Would #18546 also make any sense?

#32 in reply to: ↑ 28 @neoxx
13 years ago

Replying to joostdevalk:

While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page.

Thus, in addition, I would suggest adding the attribute rel="nofollow" to all wp-login.php links, both to reduce the number of links pointing to the file and for performance/traffic reasons, until #14348 has been fixed. The attached patch modifies the functions wp_register() and wp_loginout(). A sketch of the same idea via a filter follows below.
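
For illustration, a similar effect is possible via the existing 'loginout' filter, without touching core (a sketch, not the attached nofollow.patch; the function name is made up):

function nofollow_loginout_link( $link ) {
    // Add rel="nofollow" to the login/logout anchor produced by wp_loginout().
    return str_replace( '<a ', '<a rel="nofollow" ', $link );
}
add_filter( 'loginout', 'nofollow_loginout_link' );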

@neoxx
13 years ago

Reduce links to wp-login.php by rel="nofollow"

#33 @Ipstenu
13 years ago

  • Cc ipstenu@… added

#35 @nacin
13 years ago

Let's issue X-Robots-Tag: noindex for wp-login.php and admin-ajax.php. Is there anything else we need to do here? auth_redirect() will send anyone away from anywhere else in wp-admin. This is pretty convincing: https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin.

#36 @ryan
13 years ago

nacin beat me to it by seconds :-), but here's my comment:

Sending the X-Robots-Tag header after auth_redirect() in admin.php seems useless, since logged-in pages shouldn't be crawled anyway. And there doesn't seem to be any value in sending the header before auth_redirect().

wp-login.php and wp-signup.php already use wp_no_robots(). That leaves admin-ajax.php. It doesn't have a head, so using the X-Robots-Tag header seems appropriate.
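
In other words, something like this near the top of admin-ajax.php (sketch of the idea):

@header( 'X-Robots-Tag: noindex' ); // admin-ajax.php has no <head>, so send the directive as an HTTP header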

@ryan
13 years ago

#37 @nacin
13 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In [20288]:

Send X-Robots-Tag: noindex in admin-ajax. props ryan, joostdevalk. fixes #18465.

#38 in reply to: ↑ 28 ; follow-up: @koebenhavn event
12 years ago

Just to clarify.

It is only partially true that robots.txt does not inhibit/request crawlers to avoid indexing. For instance, while the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.

/Event

Replying to joostdevalk:

This is a valid problem, but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page. An example of this can be seen on my Dutch domain:

https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin

The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in wp-admin and for admin-ajax.php. I'll add a patch.

Last edited 12 years ago by SergeyBiryukov

#39 in reply to: ↑ 38 @joostdevalk
12 years ago

Replying to koebenhavn event:

Just to clarify.

It is only partially true that robots.txt does not inhibit/request crawlers to avoid indexing. For instance, while the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.

/Event

Do a search for inurl:wp-admin/admin-ajax.php and you'll see loads and loads of sites that actually show up. That's why this is an issue.
