Opened 3 years ago

Closed 2 years ago

Last modified 2 years ago

#18465 closed enhancement (fixed)

Prevent search engines from indexing wp-admin and wp-includes

Reported by: Viper007Bond Owned by: ryan
Milestone: 3.4 Priority: lowest
Severity: trivial Version: 3.2.1
Component: General Keywords: has-patch
Focuses: Cc:

Description

Attachments (8)

18465.patch (650 bytes) - added by SergeyBiryukov 3 years ago.
18465.2.patch (770 bytes) - added by SergeyBiryukov 3 years ago.
18465.3.patch (768 bytes) - added by SergeyBiryukov 3 years ago.
18465.4.patch (770 bytes) - added by SergeyBiryukov 3 years ago.
18465.5.patch (771 bytes) - added by SergeyBiryukov 3 years ago.
Same as previous, but switched to !empty() again
noindex.patch (1.6 KB) - added by joostdevalk 2 years ago.
Noindex HTTP header patch for wp-admin
nofollow.patch (1.4 KB) - added by neoxx 2 years ago.
Reduce links to wp-login.php by rel="nofollow"
18465.diff (441 bytes) - added by ryan 2 years ago.


Change History (47)

comment:1 Viper007Bond3 years ago

One thing to note: I have a real robots.txt file (WordPress doesn't handle it), so while my site isn't a perfect example it's close enough because the same thing is experienced on a stock WordPress install.

SergeyBiryukov3 years ago

comment:2 SergeyBiryukov3 years ago

  • Keywords has-patch added; needs-patch removed

comment:3 Viper007Bond3 years ago

  • Keywords needs-patch added; has-patch removed

I meant to mention this but forgot: the required patch is more complicated than it originally seems because WordPress can be installed in a subdirectory. Your patch, Sergey, won't block it on my site, for example, because the path should be /wordpress/wp-admin/ for me.

I'm not sure of the best way to build out the relative path.

comment:4 follow-up: SergeyBiryukov3 years ago

According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497

comment:5 in reply to: ↑ 4 Viper007Bond3 years ago

Replying to SergeyBiryukov:

According to rewrite.php, WordPress only handles robots.txt itself when installed in root:
http://core.trac.wordpress.org/browser/tags/3.2.1/wp-includes/rewrite.php#L1497

Correct, but see http://codex.wordpress.org/Giving_WordPress_Its_Own_Directory

This file is generated by WordPress: http://www.finalgear.com/robots.txt

However the path to my wp-admin folder is http://www.finalgear.com/wordpress/wp-admin/

comment:6 Viper007Bond3 years ago

Probably just parsing admin_url() to remove the domain is enough.
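
A minimal sketch of that idea, written in Python rather than WordPress's PHP: drop the scheme and host from an absolute admin URL and keep only the path. The helper name `path_from_url` is hypothetical and is not part of any attached patch; the URL is the one quoted in comment:5.

```python
from urllib.parse import urlparse


def path_from_url(url: str) -> str:
    """Return only the path component of an absolute URL (hypothetical helper)."""
    return urlparse(url).path


# Admin URL of a site that keeps WordPress in its own directory.
admin_path = path_from_url("http://www.finalgear.com/wordpress/wp-admin/")
print(admin_path)  # /wordpress/wp-admin/
```

The resulting path is what a Disallow line would need, instead of a hard-coded /wp-admin/.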

SergeyBiryukov3 years ago

comment:7 SergeyBiryukov3 years ago

Right, I misunderstood that comment in rewrite.php. Updated the patch.

comment:8 SergeyBiryukov3 years ago

  • Keywords has-patch added; needs-patch removed

comment:9 scribu3 years ago

"Ternary operators are fine, but always have them test if the statement is true, not false. Otherwise it just gets confusing."

http://codex.wordpress.org/WordPress_Coding_Standards#Ternary_Operator

SergeyBiryukov3 years ago

comment:10 justindgivens3 years ago

Why not just add these lines in the robots.txt?

Disallow: */wp-admin
Disallow: */wp-includes
Disallow: */wp-content/

So then, no matter where wp-admin is, it won't index it.

comment:11 scribu3 years ago

You don't want to block wp-content, as that's where your uploads are (which most people do want indexed).

As for using */wp-admin/, it would work, but it's better to be exact.

comment:12 SergeyBiryukov3 years ago

Replying to scribu:

"Ternary operators are fine, but always have them test if the statement is true, not false."

In case of empty(), the other way round seems more logical to me, but coding standards FTW :)

comment:13 follow-up: dd323 years ago

Why not just add these lines in the robots.txt?

You can't use * in the Disallow statements, as the original standard doesn't support wildcards. A rule matches from the start of the path.

Disallow: $path/wp-includes

This should probably be suffixed with / as well, so that pages/posts whose URLs merely start with wp-admin or wp-includes can still be indexed.
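
The effect of the trailing slash can be checked with a prefix-matching parser. The sketch below uses Python's standard `urllib.robotparser` purely as an illustration of start-of-path matching; the `allowed` helper and the post URL `/wp-admin-tips/` are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule: str, path: str) -> bool:
    """Return True if `path` may be fetched under a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return parser.can_fetch("*", path)


# Without the trailing slash, a post whose URL starts with "/wp-admin" is blocked too.
post_blocked_without_slash = not allowed("/wp-admin", "/wp-admin-tips/")

# With the trailing slash, the post stays crawlable and only the directory is blocked.
post_allowed_with_slash = allowed("/wp-admin/", "/wp-admin-tips/")
directory_blocked = not allowed("/wp-admin/", "/wp-admin/index.php")

print(post_blocked_without_slash, post_allowed_with_slash, directory_blocked)
```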

SergeyBiryukov3 years ago

comment:14 justindgivens3 years ago

Well I tested it in Google Webmaster tool and it blocked it from the crawler access.

But if you want to not add the "*" then add <meta name="robots" content="noindex"> for the /wp-admin index file.

comment:15 in reply to: ↑ 13 SergeyBiryukov3 years ago

Replying to dd32:

This should probably be suffixed with / as well

Done in 18465.4.patch.

comment:16 SergeyBiryukov3 years ago

  • Milestone changed from Awaiting Review to 3.3

comment:17 dd323 years ago

Well I tested it in Google Webmaster tool and it blocked it from the crawler access.

Seems that Google and a few others support an "Extended" robots.txt standard, see the "June 2008 Agreement" here: http://www.searchtools.com/robots/robots-exclusion-protocol.html

Best to go with the explicit directory deny for compatibility with other crawlers, since the wildcard doesn't bring any dramatic advances over the original standard.
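
To make the compatibility concern concrete, here is a rough Python sketch comparing the two matching styles. Both functions are simplified assumptions: `prefix_match` models the original start-of-path rule, and `wildcard_match` models only the `*` part of the extended behaviour (it ignores other extensions such as end-of-path anchors). Neither is taken from any crawler's source.

```python
import re


def prefix_match(rule: str, path: str) -> bool:
    """Original standard: a rule matches when the path starts with it."""
    return path.startswith(rule)


def wildcard_match(rule: str, path: str) -> bool:
    """Extended behaviour (assumed): '*' matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None


path = "/wordpress/wp-admin/index.php"

# A crawler that only knows the original standard ignores the wildcard rule...
print(prefix_match("*/wp-admin", path))            # False
# ...while a wildcard-aware crawler honours it.
print(wildcard_match("*/wp-admin", path))          # True
# The explicit directory rule is understood by both.
print(prefix_match("/wordpress/wp-admin/", path))  # True
print(wildcard_match("/wordpress/wp-admin/", path))  # True
```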

comment:18 nacin3 years ago

I've been meaning to add an exception to the coding standards for !empty(). That construct should be preferred as an exception.

SergeyBiryukov3 years ago

Same as previous, but switched to !empty() again

comment:19 kurtpayne3 years ago

  • Cc kpayne@… added

Should readme.html be blocked as well?

Googling for specific terms brings up a lot of WordPress sites.

comment:20 SergeyBiryukov3 years ago

The main concern in #17601 was the errors in server logs due to indexing wp-admin and wp-includes. Indexing readme.html doesn't create such errors, but probably doesn't make much sense either, so we could reduce unnecessary crawling even more.

That said, I'm not sure readme.html ends up in search results, since we have index.php in the root directory. Perhaps there's a link to readme.html somewhere on those sites, or they're missing index.php in DirectoryIndex, or their hosts don't support PHP (so they can't run WordPress anyway).

For wp-includes, there are many more results (currently about 5,540,000).

comment:21 dd323 years ago

  • Keywords commit added

All looks good to me from here on testing.

comment:22 ryan3 years ago

  • Owner set to ryan
  • Resolution set to fixed
  • Status changed from new to closed

In [18822]:

Disallow indexing wp-admin and wp-includes in robots.txt. Props SergeyBiryukov. fixes #18465

comment:23 nacin3 years ago

How will this work in a network situation?

comment:24 scribu3 years ago

  • Keywords commit removed

I guess we should use a leading wildcard, like it was suggested in #comment:10.

comment:25 SergeyBiryukov3 years ago

Replying to nacin:

How will this work in a network situation?

On a sub-domain install, any site would return this for a robots.txt request:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

On a sub-directory install, robots.txt (with the same content) is only served for the main site.

Replying to scribu:

I guess we should use a leading wildcard, like it was suggested in #comment:10.

Since wildcards in Disallow statements are not officially supported by the protocol, I guess we should leave it as is for now.
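
For illustration, the output quoted above, generalised to the subdirectory case raised in comment:3, could be produced along these lines. This is a Python sketch, not the committed PHP; `build_robots_txt` and its `install_path` parameter are hypothetical names.

```python
def build_robots_txt(install_path: str = "") -> str:
    """Build robots.txt content with explicit, slash-terminated Disallow rules.

    `install_path` is the directory WordPress lives in, e.g. "" for the
    site root or "/wordpress" for a subdirectory install (assumed input).
    """
    prefix = install_path.rstrip("/")
    lines = [
        "User-agent: *",
        f"Disallow: {prefix}/wp-admin/",
        f"Disallow: {prefix}/wp-includes/",
    ]
    return "\n".join(lines) + "\n"


print(build_robots_txt())              # root install
print(build_robots_txt("/wordpress"))  # WordPress in its own directory
```

Keeping the rules explicit avoids relying on wildcard support in the crawler.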

comment:26 neoxx3 years ago

  • Cc neo@… added

Maybe we should also deny wp-login.php?

Disallow: /wp-login.php
Disallow: /wp-login.php?*

comment:27 SergeyBiryukov3 years ago

wp-login.php has <meta name='robots' content='noindex,nofollow' />.

comment:28 follow-ups: joostdevalk2 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Summary changed from robots.txt should tell Google to not index wp-admin and wp-includes to Prevent search engines from indexing wp-admin and wp-includes

This is a valid problem, but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page. An example of this can be seen on my Dutch domain:

https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin

The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php. I will add a patch.

comment:29 follow-up: joostdevalk2 years ago

(this fix btw has the added benefit of fixing it for people with static robots.txt files)

comment:30 joostdevalk2 years ago

  • Cc joost@… added

joostdevalk2 years ago

Noindex HTTP header patch for wp-admin

comment:31 in reply to: ↑ 29 SergeyBiryukov2 years ago

  • Milestone changed from 3.3 to 3.4

Replying to joostdevalk:

(this fix btw has the added benefit of fixing it for people with static robots.txt files)

Would #18546 also make any sense?

comment:32 in reply to: ↑ 28 neoxx2 years ago

Replying to joostdevalk:

While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it, see the note on this Google help page.

Thus, in addition, I would suggest adding the attribute rel="nofollow" to all wp-login.php links to reduce the number of links pointing to the file, as well as for performance/traffic reasons, until #14348 has been fixed. The attached patch adapts the functions wp_register and wp_loginout.

neoxx2 years ago

Reduce links to wp-login.php by rel="nofollow"

comment:33 Ipstenu2 years ago

  • Cc ipstenu@… added

comment:35 nacin2 years ago

Let's issue X-Robots-Tag: noindex for wp-login.php and admin-ajax.php. Is there anything else we need to do here? auth_redirect() will send anyone away from anywhere else in wp-admin. This is pretty convincing: https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin.

comment:36 ryan2 years ago

nacin beat me to it by seconds :-), but here's my comment:

Sending the X-Robots-Tag header after auth_redirect() in admin.php seems useless since logged-in pages shouldn't be crawled. And there doesn't seem to be any value in sending the header before auth_redirect().

wp-login.php and wp-signup.php already use wp_no_robots(). That leaves admin-ajax.php. It doesn't have a head so using the X-Robots-Tag header seems appropriate.
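
As a language-neutral illustration of why a header fits an endpoint with no HTML head, here is a small Python WSGI sketch that attaches `X-Robots-Tag: noindex` to a JSON response. The handler name `ajax_app` and the response body are invented for the example; this is not the WordPress change itself.

```python
def ajax_app(environ, start_response):
    """Minimal WSGI handler for a response that has no <head> to hold a meta tag."""
    body = b'{"ok": true}'
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # Ask crawlers not to index this response.
        ("X-Robots-Tag", "noindex"),
    ]
    start_response("200 OK", headers)
    return [body]


# Call the handler directly and capture what it would send.
captured = {}


def capture(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


response_body = ajax_app({}, capture)
print(captured["headers"]["X-Robots-Tag"])  # noindex
```

Unlike a robots.txt rule, the header still lets the crawler fetch the URL, which is how it gets to see the noindex instruction.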

ryan2 years ago

comment:37 nacin2 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In [20288]:

Send X-Robots-Tag: noindex in admin-ajax. props ryan, joostdevalk. fixes #18465.

comment:38 in reply to: ↑ 28 ; follow-up: koebenhavn event2 years ago

Just to clarify.

It is only partially true that robots.txt does not stop crawlers from indexing. For instance, while the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.

/Event

Replying to joostdevalk:

This is a valid problem, but the "fix" doesn't actually fix it. While the addition to robots.txt blocks the crawler from opening the URL, a URL that cannot be opened CAN still be listed in the index if Google finds enough links pointing to it; see the note on this Google help page. An example of this can be seen on my Dutch domain:

https://www.google.com/search?q=site%3Ayoast.nl++inurl%3Awp-admin

The solution is not to exclude the admin directory in robots.txt, but to send an X-Robots-Tag HTTP header with the value noindex (the HTTP version of a robots meta tag) for the files in admin and for admin-ajax.php. I will add a patch.


comment:39 in reply to: ↑ 38 joostdevalk2 years ago

Replying to koebenhavn event:

Just to clarify.

It is only partially true that robots.txt does not stop crawlers from indexing. For instance, while the actual URL and anchor text might be found in the Google index if you search specifically for them, the crawler will not index the actual content of the page. What the Google help page says is that you would have to know and search for the specific URL or anchor text to find it in the Google index, and you would not see the actual content.

/Event

Do a search for inurl:wp-admin/admin-ajax.php and you'll see there are loads and loads of sites there that actually show up. That's why this is an issue.
