Make WordPress Core

Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#43590 closed defect (bug) (fixed)

Search Engine Visibility option does not work as intended

Reported by: mamaedler
Owned by: donmhico
Milestone: 5.3
Priority: normal
Severity: normal
Version: 2.1
Component: General
Keywords: has-patch has-unit-tests has-dev-note
Focuses: administration
Cc:

Description

In Settings -> Reading there is an option called "Discourage search engines from indexing this site".

Unfortunately, it does not work as intended.

Current behavior
Enabling the option results in a robots.txt with the following contents:

User-agent: *
Disallow: /

This is a problem, because the page can still appear in search results in some circumstances with the text "No information is available for this page." (see attached screenshot). It's because the site's contents are not crawled, but the link itself is indexed nevertheless.

Expected behavior
The page shouldn't be listed in search engines at all.

Google has a help page with the topic Block search indexing with 'noindex'.

It states:
"Important! For the noindex directive to be effective, the page must not be blocked by a robots.txt file. If the page is blocked by a robots.txt file, the crawler will never see the noindex directive, and the page can still appear in search results, for example if other pages link to it."

In essence, WordPress should return a robots meta tag like this:

<meta name="robots" content="noindex">

and/or return an X-Robots-Tag in the HTTP header like this:

HTTP/1.1 200 OK
(…)
X-Robots-Tag: noindex
(…)

But it should not block access in the first place via robots.txt.
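The interaction between robots.txt and noindex described above can be checked with a minimal sketch. It uses Python's standard `urllib.robotparser`; the page URL and the helper name `crawler_may_fetch` are hypothetical and only serve as an illustration. A crawler that may not fetch a page never sees that page's noindex directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical page URL, used only for illustration.
PAGE_URL = "https://example.com/sample-page/"


def crawler_may_fetch(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if a crawler obeying robots_txt is allowed to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# robots.txt as shown under "Current behavior": everything is disallowed.
blocking = "User-agent: *\nDisallow: /\n"
# A robots.txt that blocks nothing (an empty Disallow allows everything).
permissive = "User-agent: *\nDisallow:\n"

for label, body in (("blocking", blocking), ("permissive", permissive)):
    allowed = crawler_may_fetch(body, PAGE_URL)
    # A noindex directive can only take effect if the page itself is fetched.
    print(f"{label}: crawler may fetch page -> {allowed}")
```

With the blocking rules the page is never fetched, so a robots meta tag or X-Robots-Tag header on it cannot be read.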

Attachments (4)

noinformation.png (17.7 KB) - added by mamaedler 6 years ago.
Google search result
43590.diff (1.1 KB) - added by donmhico 5 years ago.
43590.2.diff (2.7 KB) - added by donmhico 5 years ago.
Additional changes + unit test.
43590.3.diff (3.0 KB) - added by peterwilsoncc 5 years ago.


Change History (22)

@mamaedler
6 years ago

Google search result

#1 follow-up: @donmhico
5 years ago

  • Keywords dev-feedback added

Hello @mamaedler,

Welcome to our Trac and thank you for your report. I think the link to Google you gave makes this a valid issue. I'm interested in making a patch for this issue; however, let's ask for insights from other devs first.

This ticket was mentioned in Slack in #core by donmhico.


5 years ago

#3 in reply to: ↑ 1 @mamaedler
5 years ago

Hi @donmhico, thanks for your answer, even after about 17 months. I appreciate it.

I tried to make this as clear as possible and I think waiting for other dev reports can take another 17 months, at least.

Registering and posting to this "trac" ticket system is kind of a hurdle.

For other dev input see this StackExchange post as an example.

I mean, usually you end up far back in the search results anyway. But when I created this issue, I had a real site that showed up on the first page of Google search results (like in the example picture attached above). It was (I think) due to the original name, and searching for that exact name led to this situation. It is difficult to reproduce this scenario. That's why I substantiated my allegation with the Google Webmasters knowledge base.

@donmhico
5 years ago

#4 @donmhico
5 years ago

  • Keywords has-patch added; dev-feedback removed

The meta tag, <meta name='robots' content='noindex,follow' />, is already rendered if "Discourage search engines from indexing this site" is checked. See [3548].

The attached patch, 43590.diff, removes Disallow: / in robots.txt.

This should address this ticket, @mamaedler.

This ticket was mentioned in Slack in #core by donmhico.


5 years ago

#6 @mamaedler
5 years ago

@donmhico Thanks, LGTM! Perfect.

#7 @peterwilsoncc
5 years ago

@donmhico Thanks for the patch, I've double checked and the unit tests are passing.

I do think it would be handy to get some feedback from experienced SEO people in the WP community just to make sure making this change would not be a mistake with regards to other search engines.

#8 @donmhico
5 years ago

@peterwilsoncc Thanks for the double check. Of course, we have time before 5.3 so we can have this ticket sit here and get more feedback from others (especially from SEO people).

#9 @jonoaldersonwp
5 years ago

Resurrecting this, as there's some nuance here.

1) As pointed out above, the Reading setting implies that it's intended to prevent search engines from indexing the content, rather than from crawling it. However, the presence of the robots disallow rule prevents search engines from ever discovering the noindex directive, and thus they may index 'fragments' (where the page is indexed without content).

2) Google recently announced that they're making efforts to prevent fragment indexing. However, until this exists (and I'm not sure it will; it's still a necessary/correct solution sometimes), we should solve for current behaviours. Let's remove the robots.txt disallow rule, and allow Google (and others) to crawl the site.

3) The output of the meta robots currently isn't in line with the documentation. It outputs a value of noindex,follow, which should be altered to noindex,nofollow in line with the documentation.

Now, here's a challenge...
Removing the robots.txt disallow rule opens up the whole site, including images, files, bits of plugin folders, and other files which don't use x-robots-tag headers (or server indexing options) to manage/prevent exposure or indexation. That might result in, e.g., an assets folder in a plugin being crawled and indexed. E.g., anything like this: https://www.google.com/search?q=inurl%3Awp-content%2Fplugins&oq=inurl%3Awp-content%2Fplugins

This is already the case on many(!!) live sites, so, this isn't a new problem, but we'd be newly exposing it on sites which currently think that they're blocking search engines from indexing their content.

Are we comfortable with that impact?

An alternative would be to attempt to implement x-robots-tag HTTP headers site-wide, but, that might not be effective (due to static page caching systems, server setups which don't route certain resource types through WP, etc).

If so, let's:

  • Remove the robots.txt disallow rule
  • Fix the meta robots tag
  • Update the documentation

Version 1, edited 5 years ago by jonoaldersonwp

#10 @mamaedler
5 years ago

@jonoaldersonwp Thanks for your insights.

I have an idea. How about adding robots.txt rules back in and changing the ruleset so that it allows crawlers to read the "home page" with our "noindex meta tag" and disallows everything else, including all common subfolders (and assets)?

I think the best solution would be an appropriate X-Robots-Tag in the HTTP response header, like you already mentioned. Every resource (HTML, images, videos, even JavaScript and style sheets) would have the noindex directive in its HTTP header if we applied this via Apache directives.

Is it safe to assume WordPress runs on Apache? Then this could be done via an .htaccess file that sits in the root directory and has the following content.

# Send X-Robots-Tag on every response, but only if mod_headers is loaded.
<IfModule mod_headers.c>
    Header set X-Robots-Tag "noindex, nofollow"
</IfModule>

The problem is that this might interfere with an existing .htaccess file. In that case, the snippet could possibly be appended to it.

Another problem is that some hosts disallow header manipulation via .htaccess (and a lot more).

So maybe back to the first idea.

Lastly, how is it not in line with the documentation? As per [3548], it outputs noindex, nofollow as an HTML meta tag (see function noindex()). Isn't this correct? Did I miss something?
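For hosts where .htaccess is not an option, the same site-wide X-Robots-Tag idea could in principle be applied at the application level. The following is a minimal sketch using a plain Python WSGI wrapper, not WordPress or Apache code; `add_x_robots_tag`, `demo_app`, and `fake_start_response` are hypothetical names. It only covers responses that actually pass through the application, so statically served or cached files would not get the header.

```python
from typing import Callable, Iterable, List, Tuple

Headers = List[Tuple[str, str]]


def add_x_robots_tag(app: Callable, value: str = "noindex, nofollow") -> Callable:
    """Wrap a WSGI app so every response it produces carries X-Robots-Tag."""

    def wrapped(environ: dict, start_response: Callable) -> Iterable[bytes]:
        def patched_start_response(status: str, headers: Headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is set exactly once.
            kept = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
            kept.append(("X-Robots-Tag", value))
            return start_response(status, kept, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ: dict, start_response: Callable) -> Iterable[bytes]:
    """Hypothetical application standing in for the site."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<html><body>Hello</body></html>"]


# Call the wrapped app directly and capture what it would send to the server.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


body = b"".join(add_x_robots_tag(demo_app)({}, fake_start_response))
print(captured["headers"])
```

Replacing an existing header rather than adding a second one mirrors what `Header set` does in the Apache snippet.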

#11 follow-up: @peterwilsoncc
5 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from Awaiting Review to 5.3
  • Owner set to donmhico
  • Status changed from new to assigned
  • Version changed from 4.9.4 to 2.1

Thanks @jonoaldersonwp, following your feedback I've put this on the 5.3 milestone.

"That might result in, e.g., an assets folder in a plugin being crawled and indexed."

  • Do you think it's worth retaining the disallow rule for wp-content on private sites?
  • For Core's silence is golden files, such as wp-content/index.php, we can add X-Robots-Tag headers on another ticket.

@donmhico I've assigned this to you to follow up on modifying the meta tag to:

  • retain follow on public/indexed sites (so we don't break SEO and other plugins calling the function expecting pages to be followed)
  • use nofollow only on sites set to be private, per the recommendations in the comment above.

Unit tests for the above change would be most helpful, but I realise you're a new contributor so may need some assistance with these.

If you don't have time to work on this, feel free to unassign yourself.

@donmhico
5 years ago

Additional changes + unit test.

#12 in reply to: ↑ 11 ; follow-up: @jonoaldersonwp
5 years ago

Replying to peterwilsoncc:

  • "Do you think it's worth retaining the disallow rule for wp-content on private sites?"

That'd get complex and messy; I think we leave it.

"retain follow on public/indexed sites (so we don't break SEO and other plugins calling the function expecting pages to be followed)"

Can you clarify the expected behaviours here? I'm confused!

#13 in reply to: ↑ 12 @peterwilsoncc
5 years ago

Replying to jonoaldersonwp:

"Can you clarify the expected behaviours here? I'm confused!"

Sorry for the confusion.

Sites allowing indexing may have an SEO plugin that allows them to prevent indexing on individual posts or pages.

If the plugin calls wp_no_robots() on these pages, the expected output is noindex, follow. For backward compatibility this needs to be maintained.

On sites discouraging indexing, the output ought to be noindex, nofollow as discussed above.
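The two cases above can be modelled with a minimal sketch. This is an illustrative Python model, not the WordPress implementation: the function `robots_meta_tag` and its parameter `site_discourages_indexing` are hypothetical, and only the tag format is taken from the output quoted earlier in this ticket.

```python
def robots_meta_tag(site_discourages_indexing: bool) -> str:
    """Return the robots meta tag for a page that asks not to be indexed.

    Illustrative model only: this function and its parameter are
    hypothetical and are not WordPress APIs.
    """
    if site_discourages_indexing:
        # Site-wide "discourage search engines" setting: no indexing, no following.
        return "<meta name='robots' content='noindex,nofollow' />"
    # Public site where a plugin opts a single page out: links are still followed.
    return "<meta name='robots' content='noindex,follow' />"


print(robots_meta_tag(True))
print(robots_meta_tag(False))
```

Returning early for the private-site case keeps the backward-compatible output as the default path.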

#14 @jonoaldersonwp
5 years ago

Gotcha, makes sense. Didn't know that was an existing core function; assumed all the SEO plugins were generating/outputting their own meta robots tags. Awesome!

#15 @peterwilsoncc
5 years ago

  • Keywords has-unit-tests commit added; needs-unit-tests removed

Minor changes in 43590.3.diff:

  • Fixed typo in docblock for wp_no_robots()
  • Expanded @since description in do_robots() to include reference to meta tag
  • Minor CS fixes
  • Reformatted test slightly
  • Added a link to the wp_head hook while we were changing the wp_no_robots() function (not directly related)
  • Used the return-early pattern for wp_no_robots() rather than an else.

All very minor to get the ticket in shape for commit.

Tests are still passing https://travis-ci.com/WordPress/wordpress-develop/builds/125369116

#16 @peterwilsoncc
5 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 45928:

#43590: Use robots meta tag to better discourage search engines.

This changes the "discourage search engines" option to output a noindex, nofollow robots meta tag. Disallow: / is removed from the robots.txt to allow search engines to discover they are requested not to index the site.

Disallowing search engines from accessing a site in the robots.txt file can result in search engines listing a site with a fragment (a listing without content).

Props donmhico, jonoaldersonwp.
Fixes #43590.

This ticket was mentioned in Slack in #core by louisf.


5 years ago
