Make WordPress Core

Opened 9 years ago

Closed 9 years ago

Last modified 3 years ago

#33156 closed enhancement (fixed)

Allow admin-ajax crawling

Reported by: joostdevalk
Owned by: SergeyBiryukov
Milestone: 4.4
Priority: normal
Severity: normal
Version:
Component: General
Keywords: 2nd-opinion has-patch
Focuses:
Cc:

Description

As plugins are using admin-ajax.php on the frontend, we should add

Allow: /wp-admin/admin-ajax.php

to the default robots.txt to prevent Google from sending out millions of emails; see this article: https://www.seroundtable.com/google-warning-googlebot-css-js-20665.html

Attachments (2)

33156.patch (544 bytes) - added by dmchale 9 years ago.
Add "Allow" for /wp-admin/admin-ajax.php to the end of the default generated robots.txt file
33156.diff (436 bytes) - added by markjaquith 9 years ago.

Change History (28)

This ticket was mentioned in Slack in #core by ocean90. View the logs.


9 years ago

@dmchale
9 years ago

Add "Allow" for /wp-admin/admin-ajax.php to the end of the default generated robots.txt file

#2 follow-up: @dmchale
9 years ago

Joost, is there value in removing the Disallow of /wp-admin entirely? I know you've recommended that in the past - but do you think that would be preferable behavior for Core, or no? Just leave it like this with an exception in place for admin-ajax?

#3 in reply to: ↑ 2 ; follow-up: @peterwilsoncc
9 years ago

For what it's worth, I'd rather allow everything in robots.txt and use noindex,nofollow meta tags for private sites & preventing indexing of wp-admin. Google recommends it as a more effective method for preventing indexing.

Replying to dmchale:
The WordPress coding standards are to use tabs, not spaces, for indentation. Would you mind refreshing the patch?

@markjaquith
9 years ago

#4 in reply to: ↑ 3 @dmchale
9 years ago

Replying to peterwilsoncc:

For what it's worth, I'd rather allow everything in robots.txt and use noindex,nofollow meta tags for private sites & preventing indexing of wp-admin. Google recommends it as a more effective method for preventing indexing.

That was the solution I was alluding to in my comment to Joost above. Back in February, he recommended getting rid of the /wp-admin block entirely. But I didn't want to create that patch without a conversation happening first, either, since that wasn't his suggestion as the OP on this ticket. Would be very easy though, we'd just have to remove everything in the "else" side of the $public check. A default file would still be returned, albeit nearly blank, and we still have the ability to write the Disallow / if the site isn't in public mode.

Replying to peterwilsoncc:

The WordPress coding standards are to use tabs, not spaces, for indentation. Would you mind refreshing the patch?

Thanks for the heads up. New install of PHPStorm on this pc, and I forgot to turn my whitespace highlighting on. Fixed now, shouldn't happen again. :) Since Mark already submitted one with tabs, I won't clutter things up with another copy.

#5 follow-up: @pavelevap
9 years ago

admin-ajax.php is a PHP file and Google notified about CSS and JS?

#6 in reply to: ↑ 5 @dmchale
9 years ago

Replying to pavelevap:

admin-ajax.php is a PHP file and Google notified about CSS and JS?

Google has a problem with it when theme or plugin authors do something like this... :)

<link rel='stylesheet' id='style-css' href='http://mydomain.com/wp-admin/admin-ajax.php?action=style' type='text/css' media='all' />

I'm sure there's other use cases where it's causing problems as well, but this one in particular has hit a number of my client sites who are using purchased themes.

Last edited 9 years ago by dmchale

#7 follow-up: @knutsp
9 years ago

-1

I don't think this should be in core. Themes should not depend on, or access, /wp-admin. If they do, they should fix the "crawlability" of it through hooks. Core may offer an ajax endpoint outside /wp-admin, if necessary.

One day, for some, it should be possible to delete /wp-admin and install or use an alternative admin through WP REST API. In the meantime, find another solution to this problem.

#8 in reply to: ↑ 7 @dmchale
9 years ago

Replying to knutsp:

-1

I don't think this should be in core. Themes should not depend on, or access, /wp-admin. If they do, they should fix the "crawlability" of it through hooks. Core may offer an ajax endpoint outside /wp-admin, if necessary.

One day, for some, it should be possible to delete /wp-admin and install or use an alternative admin through WP REST API. In the meantime, find another solution to this problem.

Right now Core only offers ajax functionality through /wp-admin. Your proposal to CHANGE that fact is a much different discussion, IMO. https://codex.wordpress.org/AJAX_in_Plugins "Note 2: Both front-end and back-end Ajax requests use admin-ajax.php [...]"

#9 follow-up: @johnbillion
9 years ago

  • Keywords 2nd-opinion has-patch added; needs-patch removed

AJAX needs to go via wp-admin for authenticated requests. A front-end AJAX handler was attempted in #12400 but pulled out.

What might be the downside of allowing admin-ajax.php to be crawled? Any chance of unwanted content appearing in SERPs?

#10 in reply to: ↑ 9 @dmchale
9 years ago

Replying to johnbillion:
Any chance of unwanted content appearing in SERPs?

admin-ajax has @header( 'X-Robots-Tag: noindex' ); already, so no content found there should appear in any SERPs.

This ticket was mentioned in Slack in #core by dmchale. View the logs.


9 years ago

#12 @SergeyBiryukov
9 years ago

  • Milestone changed from Awaiting Review to 4.4

This ticket was mentioned in Slack in #core by sergey. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by sergey. View the logs.


9 years ago

#15 @SergeyBiryukov
9 years ago

  • Owner set to SergeyBiryukov
  • Status changed from new to assigned

#16 @SergeyBiryukov
9 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 34985:

In do_robots(), allow crawling for admin-ajax.php, since it's often used on front-end.

Props dmchale, joostdevalk.
Fixes #33156.
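
As a rough illustration of what this change amounts to, here is a simplified sketch. It assumes the default output branches on the site's public setting, as described in comment 4 above. The function name and its parameters are hypothetical, and this is not the committed code.

```php
<?php
/**
 * Simplified model of the default robots.txt body (hypothetical function,
 * not the core implementation).
 *
 * @param bool   $is_public Whether the site is visible to search engines.
 * @param string $path      Subdirectory the site lives in, e.g. '' or '/blog'.
 * @return string robots.txt content.
 */
function example_default_robots_txt( bool $is_public, string $path = '' ): string {
	$output = "User-agent: *\n";

	if ( ! $is_public ) {
		// Private site: ask crawlers to stay away from everything.
		$output .= "Disallow: /\n";
	} else {
		// Public site: keep the admin area blocked, except the AJAX endpoint.
		$output .= "Disallow: $path/wp-admin/\n";
		$output .= "Allow: $path/wp-admin/admin-ajax.php\n";
	}

	return $output;
}

echo example_default_robots_txt( true );
```

For a public site at the web root, this prints the User-agent line followed by the Disallow and Allow lines discussed in this ticket.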

#17 follow-up: @Hube2
9 years ago

Didn't know if I should start a new ticket or not and couldn't find one that covers it. The order that WP is outputting the allow/disallow rules

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

may not be compatible with all crawlers. To be compatible with all crawlers, the order of these rules needs to be reversed.

See: https://en.wikipedia.org/wiki/Robots_exclusion_standard#Allow_directive.

I can't find any information that contradicts what is presented in the wiki article.
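
To make the ordering concern concrete, here is a minimal sketch of two ways a crawler might evaluate the rules. Some crawlers apply the first matching rule in file order, while others (Google documents this for Googlebot) apply the most specific, longest matching path. The helper names are hypothetical, and the model is deliberately simplified: it ignores wildcards and user-agent groups.

```php
<?php
// Hypothetical helpers; a simplified model, not any crawler's real code.

/** First-match evaluation: the first rule whose path prefixes the URL wins. */
function first_match_allows( array $rules, string $path ): bool {
	foreach ( $rules as [ $type, $prefix ] ) {
		if ( $prefix !== '' && strpos( $path, $prefix ) === 0 ) {
			return $type === 'allow';
		}
	}
	return true; // No rule matched: crawling is allowed by default.
}

/** Longest-match evaluation: the most specific (longest) matching prefix wins. */
function longest_match_allows( array $rules, string $path ): bool {
	$best_length = -1;
	$allowed     = true;
	foreach ( $rules as [ $type, $prefix ] ) {
		if ( $prefix === '' || strpos( $path, $prefix ) !== 0 ) {
			continue;
		}
		$length = strlen( $prefix );
		// On a tie in length, prefer the Allow rule.
		if ( $length > $best_length || ( $length === $best_length && $type === 'allow' ) ) {
			$best_length = $length;
			$allowed     = ( $type === 'allow' );
		}
	}
	return $allowed;
}

$core_order = [
	[ 'disallow', '/wp-admin/' ],
	[ 'allow', '/wp-admin/admin-ajax.php' ],
];
$reversed_order = array_reverse( $core_order );
$path           = '/wp-admin/admin-ajax.php';

var_dump( first_match_allows( $core_order, $path ) );     // bool(false)
var_dump( first_match_allows( $reversed_order, $path ) ); // bool(true)
var_dump( longest_match_allows( $core_order, $path ) );   // bool(true)
```

Under this model, a first-match crawler only reaches admin-ajax.php when the Allow line comes first, whereas a longest-match crawler gives the same answer in either order.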

#18 in reply to: ↑ 17 @rdela
9 years ago

Opened a new ticket about order.

Replying to Hube2:

Didn't know if I should start a new ticket or not and couldn't find one that covers it. The order that WP is outputting the allow/disallow rules

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

may not be compatible with all crawlers. To be compatible with all crawlers, the order of these rules needs to be reversed.

See: https://en.wikipedia.org/wiki/Robots_exclusion_standard#Allow_directive.

I can't find any information that contradicts what is presented in the wiki article.

#19 @reidbusi
7 years ago

Google is getting a 400 error when it crawls wp-admin/admin-ajax.php, because the link is defined in JavaScript in the Avada theme:

				jQuery( document ).ready( function() {
					var ajaxurl = 'https://example.com/wp-admin/admin-ajax.php';
					if ( 0 < jQuery( '.fusion-login-nonce' ).length ) {
						jQuery.get( ajaxurl, { 'action': 'fusion_login_nonce' }, function( response ) {
							jQuery( '.fusion-login-nonce' ).html( response );
						});
					}
				});

So, maybe this was not a good idea?

#20 @kcaluwae
6 years ago

I just received a 400-error in my google search console on the /wp-admin/admin-ajax.php URL.
Have things changed since the start of this thread?

This ticket was mentioned in Slack in #core by kevin940726. View the logs.


4 years ago

#22 follow-ups: @KnowingArt_com
3 years ago

It seems like all the comments of concern were ignored.

WordPress does not need permission from robots.txt to access itself. Take a step back and ask yourself why robots are "allowed" access to wp-admin/admin-ajax.php. It makes no sense, as if people forgot the role of robots.txt.

Robots.txt does not exist to manage all conceivable non-human interactions. It was originally created to save bandwidth, because one gig of transfer could cost over $10. If you simply crawled somebody's website back in 1999, they might threaten to sue you. Ask me how I know. Frankly, bandwidth and CPU are cheap enough now that robots.txt should be obsolete.

After some decades and bazillions of pageviews, I have never used a robots.txt, because I don't want to discourage robots from enjoying my content. I don't know the exact version of WordPress in which this changed, but it seems WordPress decided to make robots.txt mandatory. In my expert opinion, that was a foolish decision.

Respectful bots will avoid /wp-admin/ without being told. Disrespectful bots will do whatever they want. This auto-generated robots.txt is unnecessary, it just creates confusion and solves nothing.

I'm told the WordPress philosophy is to make decisions instead of offering options. You don't make certain decisions without asking my permission. I'm drawing the line at robots.txt, I see this as a violation where WordPress is claiming ownership of something that does not belong to WordPress. (My server, my choice.) If I "allow" a robot onto my server, that is my decision to make as a system admin. Just because the average WordPress user is not very sophisticated with technology, that doesn't mean you can just take control of whatever you want, just because I gave you permission to auto-install upgrades.

What's next? Are you going to try creeping into php.ini? Seriously, this should be a concern as larger companies are allowed to submit code to WordPress. If you're going to draw a line somewhere, might as well draw the line with robots.txt

As for admin-ajax.php, whoever added that line should at least include a robots.readme to explain why robots.txt is mandatory, why it makes no sense, the relationship to wp-sitemap.xml, and include a link back to this URL, because it took me two days to follow the breadcrumbs back to this ticket. To say the least, this is not how I wanted to spend my week. And after digging into pages and pages of explanations, I'm still wondering why anyone would add admin-ajax.php to robots.txt

"Since it's often used on front-end."

Like I said, the front-end has nothing to do with robots.txt

"What might be the downside of allowing admin-ajax.php to be crawled? Any chance of unwanted content appearing in SERPs?"

BINGO! I'm having a problem with DuckDuckGo right now, it's listing /wp-admin/ as the #1 search result for my domain.

#23 in reply to: ↑ 22 @nickdageekuk
3 years ago

Replying to KnowingArt_com:

It seems like all the comments of concern were ignored.

I would tend to agree considering that AJAX itself is a problem

Known SQL injection exploit in AJAX see https://www.exploit-db.com/exploits/48475

# Exploit Title: WordPress Plugin Ajax Load More 5.3.1 - '#1' Authenticated SQL Injection
# Exploit Author: SunCSR (Sun* Cyber Security Research) - Nguyen Khang
# Google Dork: N/A
# Date: 2020-05-18
# Vendor Homepage: https://connekthq.com/plugins/ajax-load-more/
# Software Link: https://vi.wordpress.org/plugins/ajax-load-more/
# Version: <= 5.3.1
# Tested on: Ubuntu 18.04

Description:
A blind SQL injection vulnerability is present in Ajax load more.
$wpdb->get_var("SELECT repeaterDefault FROM " . $table_name . " WHERE name = '$n'");

#24 in reply to: ↑ 22 ; follow-up: @dmchale
3 years ago

Replying to KnowingArt_com:

I'm told the WordPress philosophy is to make decisions instead of offering options. You don't make certain decisions without asking my permission.

In fairness, that's exactly what it means to make a decision and not offer an option.

That said, this ticket was discussing the default behavior of the robots.txt file. You can short-circuit what WordPress does out of the box in one of two ways.

  1. Create a physical robots.txt file on your server. If WordPress detects a physical file at the web root, it will not add to / remove from / modify that file in any way (this includes any plugins that dynamically modify the robots.txt file as well).
  2. Use the robots_txt filter to modify the contents of the WordPress defaults: https://developer.wordpress.org/reference/hooks/robots_txt/
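
For the filter option, a minimal sketch might look like the following. The callback name is hypothetical, and the code assumes the generated output contains the exact line "Allow: /wp-admin/admin-ajax.php" (a site installed in a subdirectory would have a path prefix and would need a looser match). The registration is guarded so the file can also be loaded outside WordPress.

```php
<?php
/**
 * Hypothetical callback: removes the admin-ajax.php Allow line from the
 * generated robots.txt output and leaves every other line untouched.
 *
 * @param string $output Generated robots.txt content.
 * @param bool   $public Whether the site is visible to search engines.
 * @return string Filtered robots.txt content.
 */
function example_remove_admin_ajax_allow( $output, $public ) {
	// Split on any newline style, then drop the one line we do not want.
	$lines = preg_split( '/\R/', $output );
	$kept  = array_filter(
		$lines,
		function ( $line ) {
			return trim( $line ) !== 'Allow: /wp-admin/admin-ajax.php';
		}
	);

	return implode( "\n", $kept );
}

// Only register the callback when WordPress is loaded.
if ( function_exists( 'add_filter' ) ) {
	add_filter( 'robots_txt', 'example_remove_admin_ajax_allow', 10, 2 );
}
```

Placed in a plugin rather than in a theme's functions.php, a callback like this keeps applying when the active theme changes.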

#25 in reply to: ↑ 22 @KnowingArt_com
3 years ago

Replying to KnowingArt_com:

It seems like all the comments of concern were ignored.

I'm starting to get a better understanding of the issue(s) that triggered this desire to sort of whitelist this ajax script for Googlebot. However, we should not pollute robots.txt to fix poorly-conceived AJAX themes that lazyload content without non-AJAX placeholder content, do not fail gracefully, or whatever happened to these themes.

Likewise, robots.txt is not the best place for Googlebot-specific problems either. I think WordPress is big enough to work that out with Google directly. And for really specific problems, just use your web server config.

I'm also thinking, if WordPress decides to back out of robots.txt, will that break stuff? I doubt it, but I don't know for sure. I think the alternative is worse. Because now you have wp-sitemap.xml in there, how deep will this rabbit hole go before all the custom robots.txt files out there need to be rewritten to "catch up" with the auto-generated WordPress robots.txt?

Also, I created this last night...

https://wordpress.stackexchange.com/questions/403753/why-does-do-robots-allow-wp-admin-admin-ajax-php-by-default

#26 in reply to: ↑ 24 @KnowingArt_com
3 years ago

Replying to dmchale:

  1. Create a physical robots.txt file on your server. If WordPress detects a physical file at the web root, it will not add to / remove from / modify that file in any way (this includes any plugins that dynamically modify the robots.txt file as well).
  2. Use the robots_txt filter to modify the contents of the WordPress defaults: https://developer.wordpress.org/reference/hooks/robots_txt/

1. Nice, but without comments in robots.txt, how will the casual user know? My first impression was to 'touch robots.txt' but...

1a. How will anyone know this won't break wp-sitemap.xml? That's how I ended up here. Upon further investigation, it seems *Google* contributed the code that adds wp-sitemap.xml to robots.txt. If a Google employee adds some code to the WordPress robots.txt, that tells me Google wants that code to be there, and I am going to think twice about removing it.

1b. Also casual user: If I remove this weird ajax thing, am I going to break something? Maybe I should investigate further. And down the rabbit hole we go into ajax themes with broken Googlebot renderings :-(

2. That was my first attempt, but a) I crashed my site by trying some random Stack Exchange solution, b) I have many blogs to manage on several servers, and what if the blog changes themes? Is there a "functions.php" that affects all themes?, c) I don't really want a filter, I want to completely disable the creation of robots.txt, which is probably harder than it sounds.