#18465 closed enhancement (fixed)
Prevent search engines from indexing wp-admin and wp-includes
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | lowest | Milestone: | 3.4 |
| Component: | General | Version: | 3.2.1 |
| Severity: | trivial | Keywords: | has-patch |
| Cc: | kpayne@…, neo@…, joost@…, ipstenu@… |
Description
http://www.google.com/search?q=site:viper007bond.com+inurl:wp-admin
Both of these URLs show up in that search result:
http://www.viper007bond.com/wordpress/wp-admin/js/utils.js?ver=20101110
http://www.viper007bond.com/wordpress/wp-admin/admin-ajax.php
Some more at http://www.google.com/search?q=site:viper007bond.com+inurl:wp-includes
No reason for Google to index those, so I suggest we add them to an exclude list in the default robots.txt handler.
Attachments (8)
Change History (47)
comment:1
Viper007Bond — 21 months ago
SergeyBiryukov — 21 months ago
comment:2
SergeyBiryukov — 21 months ago
- Keywords has-patch added; needs-patch removed
comment:3
Viper007Bond — 21 months ago
- Keywords needs-patch added; has-patch removed
I meant to mention this but forgot: the required patch is more complicated than it first seems, because WordPress can be installed in a subdirectory. Sergey, your patch won't block it on my site, for example, because the path should be /wordpress/wp-admin/ for me.
I'm not sure of the best way to build out the relative path.
comment:4
follow-up:
↓ 5
SergeyBiryukov — 21 months ago
According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497
comment:5
in reply to:
↑ 4
Viper007Bond — 21 months ago
Replying to SergeyBiryukov:
According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497
Correct, but see http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory
This file is generated by WordPress: http://www.finalgear.com/robots.txt
However the path to my wp-admin folder is http://www.finalgear.com/wordpress/wp-admin/
comment:6
Viper007Bond — 21 months ago
Probably just parsing admin_url() to remove the domain is enough.
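To illustrate the idea in this comment, here is a minimal Python sketch (not WordPress code; admin_url() itself is a PHP function) of stripping the scheme and domain from an absolute URL so that the remaining path can be used in a Disallow rule. The helper names `url_to_path` and `build_robots_txt` are hypothetical, and the wp-includes URL is assumed by analogy with the wp-admin URL quoted above.

```python
from urllib.parse import urlparse


def url_to_path(url):
    """Drop the scheme and domain, keeping only the path component."""
    return urlparse(url).path


def build_robots_txt(urls):
    """Build a robots.txt body with one Disallow rule per URL path."""
    lines = ["User-agent: *"]
    lines += ["Disallow: " + url_to_path(u) for u in urls]
    return "\n".join(lines) + "\n"


# The wp-admin URL is from the thread; the wp-includes URL is an assumption.
robots = build_robots_txt([
    "http://www.finalgear.com/wordpress/wp-admin/",
    "http://www.finalgear.com/wordpress/wp-includes/",
])
print(robots)
```

With a subdirectory install the rules come out as /wordpress/wp-admin/ and /wordpress/wp-includes/, which is the case the root-only patch missed.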
SergeyBiryukov — 21 months ago
comment:7
SergeyBiryukov — 21 months ago
Right, I misunderstood that comment in rewrite.php. Updated the patch.
comment:8
SergeyBiryukov — 21 months ago
- Keywords has-patch added; needs-patch removed
"Ternary operators are fine, but always have them test if the statement is true, not false. Otherwise it just gets confusing."
http://codex.wordpress.org/WordPress_Coding_Standards#Ternary_Operator
SergeyBiryukov — 21 months ago
comment:10
justindgivens — 21 months ago
Why not just add these lines in the robots.txt?
Disallow: */wp-admin
Disallow: */wp-includes
Disallow: */wp-content/
So then, no matter where wp-admin is, it won't index it.
comment:11
scribu — 21 months ago
You don't want to block wp-content, as that's where your uploads are (which most people do want indexed).
As for using */wp-admin/, it would work, but it's better to be exact.
Replying to scribu:
"Ternary operators are fine, but always have them test if the statement is true, not false."
In case of empty(), the other way round seems more logical to me, but coding standards FTW :)
comment:13
follow-up:
↓ 15
dd32 — 21 months ago
Why not just add these lines in the robots.txt?
You can't use * in Disallow statements, as the standard doesn't support wildcards; a rule matches from the start of the path.
Disallow: $path/wp-includes
This should probably be suffixed with / as well, so that pages/posts whose slugs start with wp-admin or wp-includes can still be indexed.
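A small Python sketch of the prefix matching described here may make the trailing-slash point concrete. It assumes the original-standard behavior stated in this comment (a rule blocks every path that starts with it, with no wildcard support); the function name `is_disallowed` and the example post permalink are hypothetical.

```python
def is_disallowed(path, disallow_rules):
    """Original-standard matching: a rule blocks any path that starts with it.

    Empty rules are skipped; "*" has no special meaning here.
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


without_slash = ["/wp-admin", "/wp-includes"]
with_slash = ["/wp-admin/", "/wp-includes/"]

# Hypothetical permalink of an ordinary post whose slug starts with "wp-admin".
post = "/wp-admin-tips-for-beginners/"

print(is_disallowed(post, without_slash))  # blocked by the bare prefix
print(is_disallowed(post, with_slash))     # allowed once the rule ends in "/"
print(is_disallowed("/wp-admin/admin-ajax.php", with_slash))
print(is_disallowed("/wordpress/wp-admin/", ["*/wp-admin"]))
```

Under this matching, a literal "*" at the start of a rule never matches a real path, which is why crawlers that follow only the original standard would ignore the wildcard form.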
SergeyBiryukov — 21 months ago
comment:14
justindgivens — 21 months ago
Well I tested it in Google Webmaster tool and it blocked it from the crawler access.
But if you don't want to add the "*", then add <meta name="robots" content="noindex"> to the /wp-admin index file.
comment:15
in reply to:
↑ 13
SergeyBiryukov — 21 months ago
- Milestone changed from Awaiting Review to 3.3
comment:17
dd32 — 21 months ago
Well I tested it in Google Webmaster tool and it blocked it from the crawler access.
Seems that Google and a few others support an "Extended" robots.txt standard, see the "June 2008 Agreement" here: http://www.searchtools.com/robots/robots-exclusion-protocol.html
It's best to go for the explicit directory deny for compatibility with other crawlers, since the wildcard doesn't bring any dramatic advances over the original standard.
comment:18
nacin — 21 months ago
I've been meaning to add an exception to the coding standards for !empty(). That construct should be preferred as an exception.
comment:19
kurtpayne — 21 months ago
- Cc kpayne@… added
Should readme.html be blocked as well?
Googling for specific terms brings up a lot of WordPress sites.
The main concern in #17601 was the errors in server logs caused by indexing wp-admin and wp-includes. Indexing readme.html doesn't create such errors, but it probably doesn't make much sense either, so we could reduce unnecessary crawling even more.
That said, I'm not sure readme.html ends up in search results, since we have index.php in the root directory. Perhaps there's a link to readme.html somewhere on those sites, or they're missing index.php in DirectoryIndex, or their hosts don't support PHP (so they can't run WordPress anyway).
For wp-includes, there are many more results (currently about 5,540,000).
comment:22
ryan — 20 months ago
- Owner set to ryan
- Resolution set to fixed
- Status changed from new to closed
In [18822]:
comment:23
nacin — 20 months ago
How will this work in a network situation?
comment:24
scribu — 20 months ago
- Keywords commit removed
I guess we should use a leading wildcard, like it was suggested in #comment:10.
Replying to nacin:
How will this work in a network situation?
On a sub-domain install, any site would return this for a robots.txt request:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
On a sub-directory install, robots.txt (with the same content) is only served for the main site.
Replying to scribu:
I guess we should use a leading wildcard, like it was suggested in #comment:10.
Since wildcards in Disallow statements are not officially supported by the protocol, I guess we should leave it as is for now.
comment:26
neoxx — 20 months ago
- Cc neo@… added
Maybe we should also deny wp-login.php?
Disallow: /wp-login.php
Disallow: /wp-login.php?*
wp-login.php has <meta name='robots' content='noindex,nofollow' />.
comment:28
follow-ups:
↓ 32
↓ 38
joostdevalk — 17 months ago
- Resolution fixed deleted
- Status changed from closed to reopened
- Summary changed from robots.txt should tell Google to not index wp-admin and wp-includes to Prevent search engines from indexing wp-admin and wp-includes
This is a valid problem but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page. Example of this can be seen on my Dutch domain:
https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin
The solution is to not exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header of value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php, will add a patch.
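As a rough illustration of the header-based approach proposed here, the following Python sketch adds X-Robots-Tag: noindex to responses for admin paths. It uses a bare WSGI callable as a stand-in for the web application, which is an assumption for illustration only; the names `NOINDEX_PREFIXES`, `extra_headers`, and `app` are hypothetical and this is not the patch attached to the ticket.

```python
# Hypothetical path prefixes; a real site would derive these from its own layout.
NOINDEX_PREFIXES = ("/wp-admin/",)


def extra_headers(path):
    """Return the extra response headers for a request path."""
    if path.startswith(NOINDEX_PREFIXES):
        # HTTP equivalent of <meta name="robots" content="noindex">.
        return [("X-Robots-Tag", "noindex")]
    return []


def app(environ, start_response):
    """Minimal WSGI app that applies the header to admin paths only."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    headers += extra_headers(path)
    start_response("200 OK", headers)
    return [b"ok\n"]
```

A crawler can only see this header if it is allowed to fetch the URL, which is why the comment pairs the header with not excluding the directory in robots.txt.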
comment:29
follow-up:
↓ 31
joostdevalk — 17 months ago
(this fix btw has the added benefit of fixing it for people with static robots.txt files)
comment:30
joostdevalk — 17 months ago
- Cc joost@… added
comment:31
in reply to:
↑ 29
SergeyBiryukov — 17 months ago
- Milestone changed from 3.3 to 3.4
Replying to joostdevalk:
(this fix btw has the added benefit of fixing it for people with static robots.txt files)
Would #18546 also make any sense?
comment:32
in reply to:
↑ 28
neoxx — 17 months ago
Replying to joostdevalk:
While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page.
In addition, I would suggest adding the attribute rel="nofollow" to all wp-login.php links, both to reduce the number of links pointing to the file and for performance/traffic reasons, until #14348 has been fixed. The attached patch adapts the functions wp_register and wp_loginout.
comment:33
Ipstenu — 16 months ago
- Cc ipstenu@… added
comment:34
ryan — 15 months ago
Just for reference: http://code.google.com/web/controlcrawlindex/docs/robots_meta_tag.html
comment:35
nacin — 15 months ago
Let's issue X-Robots-Tag: noindex for wp-login.php and admin-ajax.php. Is there anything else we need to do here? auth_redirect() will send anyone away from anywhere else in wp-admin. This is pretty convincing: https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin.
comment:36
ryan — 15 months ago
nacin beat me to it by seconds :-), but here's my comment:
Sending the X-Robots-Tag header after auth_redirect() in admin.php seems useless, since logged-in pages shouldn't be crawled. And it doesn't seem like there'd be any value in sending the header before auth_redirect().
wp-login.php and wp-signup.php already use wp_no_robots(). That leaves admin-ajax.php. It doesn't have a head so using the X-Robots-Tag header seems appropriate.
comment:37
nacin — 14 months ago
- Resolution set to fixed
- Status changed from reopened to closed
In [20288]:
comment:38
in reply to:
↑ 28
follow-up:
↓ 39
koebenhavn event — 12 months ago
Just to clarify.
It is only partially true that robots.txt does not inhibit crawlers from indexing or request that they avoid it. For instance, while the actual URL and anchor text might be found in the Google index when searching specifically for it, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.
/Event
Replying to joostdevalk:
This is a valid problem but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page. Example of this can be seen on my Dutch domain:
https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin
The solution is to not exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header of value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php, will add a patch.
comment:39
in reply to:
↑ 38
joostdevalk — 12 months ago
Replying to koebenhavn event:
Just to clarify.
It is only partially true that robots.txt does not inhibit crawlers from indexing or request that they avoid it. For instance, while the actual URL and anchor text might be found in the Google index when searching specifically for it, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.
/Event
Do a search for inurl:wp-admin/admin-ajax.php and you'll see there are loads and loads of sites there that actually show up. That's why this is an issue.

One thing to note: I have a real robots.txt file (WordPress doesn't handle it), so while my site isn't a perfect example it's close enough because the same thing is experienced on a stock WordPress install.