#18465 closed enhancement (fixed)
Prevent search engines from indexing wp-admin and wp-includes
| Reported by: | Viper007Bond | Owned by: | ryan |
|---|---|---|---|
| Milestone: | 3.4 | Priority: | lowest |
| Severity: | trivial | Version: | 3.2.1 |
| Component: | General | Keywords: | has-patch |
| Focuses: | | Cc: | |
Description
http://www.google.com/search?q=site:viper007bond.com+inurl:wp-admin
Both of these URLs show up in that search result:
http://www.viper007bond.com/wordpress/wp-admin/js/utils.js?ver=20101110
http://www.viper007bond.com/wordpress/wp-admin/admin-ajax.php
Some more at http://www.google.com/search?q=site:viper007bond.com+inurl:wp-includes
No reason for Google to index those, so I suggest we add them to an exclude list in the default robots.txt handler.
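The suggested exclude list could look like this in the generated robots.txt (a sketch only; the exact rules were settled later in the thread):

```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
```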
Attachments (8)
Change History (47)
#3 @ 13 years ago
- Keywords needs-patch added; has-patch removed
I meant to mention this but forgot: the required patch is more complicated than it first seems, because WordPress can be installed in a subdirectory. Your patch, Sergey, won't block it on my site, for example, because it would need to be /wordpress/wp-admin/ for me.
I'm not sure of the best way to build the relative path.
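Building a subdirectory-aware exclude list is straightforward to sketch. This is Python rather than WordPress's PHP, and `robots_disallow_lines` is a hypothetical helper, not core code:

```python
from urllib.parse import urlparse

def robots_disallow_lines(site_url):
    """Build Disallow lines relative to where WordPress is installed.

    For a root install the prefix is empty; for a subdirectory install
    (e.g. http://example.com/wordpress/) the prefix is the subdirectory path.
    """
    prefix = urlparse(site_url).path.rstrip("/")  # "" or "/wordpress"
    return [f"Disallow: {prefix}/{d}/" for d in ("wp-admin", "wp-includes")]

print(robots_disallow_lines("http://www.viper007bond.com/wordpress"))
# For a subdirectory install this yields /wordpress/wp-admin/ etc.
```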
#4 follow-up: ↓ 5 @ 13 years ago
According to rewrite.php, WordPress only handles robots.txt itself when installed in the root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497
#5 in reply to: ↑ 4 @ 13 years ago
Replying to SergeyBiryukov:
According to rewrite.php, WordPress only handles robots.txt itself when installed in the root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497
Correct, but see http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory
This file is generated by WordPress: http://www.finalgear.com/robots.txt
However, the path to my wp-admin folder is http://www.finalgear.com/wordpress/wp-admin/
#9 @ 13 years ago
"Ternary operators are fine, but always have them test if the statement is true, not false. Otherwise it just gets confusing."
http://codex.wordpress.org/WordPress_Coding_Standards#Ternary_Operator
#10 @ 13 years ago
Why not just add these lines in the robots.txt?
Disallow: */wp-admin
Disallow: */wp-includes
Disallow: */wp-content/
So then, no matter where wp-admin is, crawlers won't index it.
#11 @ 13 years ago
You don't want to block wp-content, as that's where your uploads are (which most people do want indexed).
As for using */wp-admin/, it would work, but it's better to be exact.
#12 @ 13 years ago
Replying to scribu:
"Ternary operators are fine, but always have them test if the statement is true, not false."
In case of empty(), the other way round seems more logical to me, but coding standards FTW :)
#13 follow-up: ↓ 15 @ 13 years ago
Why not just add these lines in the robots.txt?
You can't use * in Disallow statements: the original standard doesn't support wildcards, and matching is done from the start of the path.
Disallow: $path/wp-includes
This should probably be suffixed with / as well, to allow pages/posts containing wp-admin/wp-includes in their URLs to be indexed.
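The prefix-matching behaviour can be demonstrated with Python's standard-library robots.txt parser, which implements the original standard without Google's wildcard extension (the example.com URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: */wp-admin/",    # wildcard: NOT part of the original standard
    "Disallow: /wp-includes/",  # plain prefix: matched from the start of the path
])

# The literal "*/wp-admin/" rule does not match a subdirectory install...
print(rp.can_fetch("*", "http://example.com/wordpress/wp-admin/"))  # True (allowed)
# ...while the plain prefix rule blocks everything under /wp-includes/.
print(rp.can_fetch("*", "http://example.com/wp-includes/js/jquery.js"))  # False (blocked)
```

Crawlers that do honor Google's extended syntax would treat `*/wp-admin/` as a wildcard, which is exactly the compatibility question debated below.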
#14 @ 13 years ago
Well, I tested it in Google Webmaster Tools and it blocked crawler access.
But if you don't want to add the "*", then add <meta name="robots" content="noindex"> to the /wp-admin index file.
#17 @ 13 years ago
Well, I tested it in Google Webmaster Tools and it blocked crawler access.
Seems that Google and a few others support an "Extended" robots.txt standard, see the "June 2008 Agreement" here: http://www.searchtools.com/robots/robots-exclusion-protocol.html
Best to go for the explicit directory deny for compatibility with other crawlers, since the wildcard doesn't bring any dramatic advantage over the original standard.
#18 @ 13 years ago
I've been meaning to add an exception to the coding standards for !empty(). That construct should be preferred as an exception.
#19 @ 13 years ago
- Cc kpayne@… added
Should readme.html be blocked as well?
Googling for specific terms brings up a lot of WordPress sites.
#20 @ 13 years ago
The main concern in #17601 was the errors in server logs due to indexing wp-admin and wp-includes. Indexing readme.html doesn't create such errors, but probably doesn't make much sense either, so we could reduce unnecessary crawling even more.
That said, I'm not sure how readme.html ends up in search results, since we have index.php in the root directory. Perhaps there's a link to readme.html somewhere on those sites, or they're missing index.php in DirectoryIndex, or their hosts don't support PHP (so they can't run WordPress anyway).
For wp-includes, there are many more results (currently about 5,540,000).
#22 @ 13 years ago
- Owner set to ryan
- Resolution set to fixed
- Status changed from new to closed
In [18822]:
#24 @ 13 years ago
- Keywords commit removed
I guess we should use a leading wildcard, like it was suggested in #comment:10.
#25 @ 13 years ago
Replying to nacin:
How will this work in a network situation?
On a sub-domain install, any site would return this for a robots.txt request:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
On a sub-directory install, robots.txt (with the same content) is only served for the main site.
Replying to scribu:
I guess we should use a leading wildcard, like it was suggested in #comment:10.
Since wildcards in Disallow statements are not officially supported by the protocol, I guess we should leave it as is for now.
#26 @ 13 years ago
- Cc neo@… added
Maybe we should also deny wp-login.php?
Disallow: /wp-login.php
Disallow: /wp-login.php?*
#28 follow-ups: ↓ 32, ↓ 38 @ 13 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
- Summary changed from robots.txt should tell Google to not index wp-admin and wp-includes to Prevent search engines from indexing wp-admin and wp-includes
This is a valid problem but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page. Example of this can be seen on my Dutch domain:
https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin
The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php. I'll add a patch.
#29 follow-up: ↓ 31 @ 13 years ago
(this fix btw has the added benefit of fixing it for people with static robots.txt files)
#31 in reply to: ↑ 29 @ 13 years ago
- Milestone changed from 3.3 to 3.4
Replying to joostdevalk:
(this fix btw has the added benefit of fixing it for people with static robots.txt files)
Would #18546 also make any sense?
#32 in reply to: ↑ 28 @ 13 years ago
Replying to joostdevalk:
While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page.
Thus, in addition, I would suggest adding the attribute rel="nofollow" to all wp-login.php links, both to reduce the number of links pointing to the file and for performance/traffic reasons, until #14348 has been fixed. The attached patch adjusts the functions wp_register and wp_loginout.
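For illustration, a login link carrying the suggested attribute would look like this (hypothetical markup, not the literal output of the attached patch):

```html
<a href="http://example.com/wp-login.php" rel="nofollow">Log in</a>
```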
#34 @ 13 years ago
Just for reference: http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
#35 @ 13 years ago
Let's issue X-Robots-Tag: noindex for wp-login.php and admin-ajax.php. Is there anything else we need to do here? auth_redirect() will send anyone away from anywhere else in wp-admin. This is pretty convincing: https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin.
#36 @ 13 years ago
nacin beat me to it by seconds :-), but here's my comment:
Sending the X-Robots-Tag header after auth_redirect() in admin.php seems useless, since logged-in pages shouldn't be crawled anyway, and there doesn't seem to be any value in sending the header before auth_redirect().
wp-login.php and wp-signup.php already use wp_no_robots(). That leaves admin-ajax.php. It doesn't output a <head>, so using the X-Robots-Tag header seems appropriate.
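In WordPress itself this amounts to a header('X-Robots-Tag: noindex') call near the top of admin-ajax.php; as a language-neutral sketch, here is a hypothetical Python WSGI handler (not core code) showing what the endpoint's response would carry:

```python
def admin_ajax_app(environ, start_response):
    """Minimal endpoint that tells crawlers not to index its response."""
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        # HTTP equivalent of <meta name="robots" content="noindex">,
        # usable for responses that have no HTML <head>
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [b"0"]  # admin-ajax.php replies "0" when no action matches

# Invoke the app directly to inspect the headers it would send.
sent = {}
def fake_start_response(status, headers):
    sent["status"], sent["headers"] = status, headers

body = admin_ajax_app({}, fake_start_response)
print(sent["headers"])
```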
#38 in reply to: ↑ 28; follow-up: ↓ 39 @ 12 years ago
Just to clarify:
It is only partially true that robots.txt does not stop crawlers from indexing. While the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.
/Event
Replying to joostdevalk:
This is a valid problem but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page. An example of this can be seen on my Dutch domain:
https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin
The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php. I'll add a patch.
#39 in reply to: ↑ 38 @ 12 years ago
Replying to koebenhavn event:
Just to clarify:
It is only partially true that robots.txt does not stop crawlers from indexing. While the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.
/Event
Do a search for inurl:wp-admin/admin-ajax.php and you'll see there are loads and loads of sites that actually show up. That's why this is an issue.
One thing to note: I have a real robots.txt file (WordPress doesn't handle it), so while my site isn't a perfect example, it's close enough, because the same thing is experienced on a stock WordPress install.