Make WordPress Core

Opened 3 years ago

Last modified 6 months ago

#52536 new enhancement

Add "X-Robots-Tag: noindex" to feeds by default

Reported by: pikamander2 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: major Version:
Component: Feeds Keywords:
Focuses: Cc:

Description

We’ve noticed that spammy websites are linking to RSS search results on our site that include their domain in the search terms.

For example, if our site is wordpress.org and their site is example.com, they might link to wordpress.org/search/+example.com+best+pharmacy+pills+online/feed/rss2/

For normal WordPress searches, this isn’t a problem because the search results are set to “noindex”. However, these RSS2 pages are output as XML and don’t include any kind of “noindex” tag, so Google treats them as indexable pages.

Looking around Google, it seems like this type of blackhat SEO technique is fairly common and most likely done in bulk by bots.

SEO plugins like Yoast and AIOSEOP appear to add a "noindex" tag to the search result pages, but neither seems to add an equivalent "X-Robots-Tag: noindex" response header to the feeds, so most WordPress sites are vulnerable to this tactic.

Since feeds and search pages are built into WordPress's core and most people wouldn't want their results to be indexed anyway, can we add noindex to those two types of pages by default?

Attachments (1)

52536-poc.diff (1.9 KB) - added by peterwilsoncc 6 months ago.


Change History (11)

#1 @sabernhardt
3 years ago

  • Component changed from General to Feeds
  • Summary changed from Add "noindex" to search results by default and "X-Robots-Tag: noindex" to feeds by default to Add "X-Robots-Tag: noindex" to feeds by default

Ticket #52457 is already opened for the search pages, and this ticket could address the feeds.

#2 @jonoaldersonwp
3 years ago

  • We can't blanket noindex all feed views, because they need to be indexable/indexed for various external services (Google podcasts, some Facebook stuff, etc).
  • However, the recent ticket to noindex search results (#52457) could indeed be extended to noindex all formats of search results, including their RSS feeds, via an HTTP header.
Last edited 3 years ago by jonoaldersonwp

#3 @peterwilsoncc
3 years ago

To account for sites using full-page caching that doesn't preserve HTTP headers, could the robots API be used to add a meta tag to RSS feeds on the rss2_head action?

<channel>
  <xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
  ...
</channel>

I'm unable to find the correct tag for atom feeds but hopefully such a thing exists.
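
For anyone wanting to experiment with the markup above outside WordPress, here is a minimal Python sketch that injects the same namespaced meta element as the first child of an RSS 2.0 channel. The function name and the sample feed are made up for illustration, and whether a given crawler honours this element in a feed is not something this sketch establishes.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize the namespace with the "xhtml" prefix, as in the snippet above.
ET.register_namespace("xhtml", XHTML_NS)


def add_noindex_meta(rss_xml: str) -> str:
    """Insert <xhtml:meta name="robots" content="noindex"/> into <channel>.

    Hypothetical helper for illustration; not part of WordPress.
    """
    root = ET.fromstring(rss_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 document: no <channel> element")
    meta = ET.Element(
        f"{{{XHTML_NS}}}meta", {"name": "robots", "content": "noindex"}
    )
    channel.insert(0, meta)  # place it before <title>, <item>, etc.
    return ET.tostring(root, encoding="unicode")


# Illustrative sample feed, not real WordPress output.
sample = '<rss version="2.0"><channel><title>Search results</title></channel></rss>'
print(add_noindex_meta(sample))
```

Note that ElementTree declares the xhtml namespace on the root element rather than on the meta element itself; the resulting document is equivalent XML.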

#4 @jonoaldersonwp
3 years ago

Ooh, didn't know that existed. Yes.

#5 follow-up: @tteggel
3 years ago

Just wanted to add that we're seeing Google indexing some pages like this, and we're getting quite a bit of traffic too.

For others that stumble on this before the fix is ready a workaround is to add a redirect rule to your site (I used the WPEngine dashboard) to redirect from these search feed pages to your home page. The regex below might be useful.

^/search/[^/]*/feed/[^/]*/?$
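
Before deploying a redirect rule like this, it may be worth checking which paths the pattern actually catches. The following is a small Python sketch for that; the sample paths are illustrative, and your host's redirect engine may use a slightly different regex dialect.

```python
import re

# Pattern quoted from the comment above.
SEARCH_FEED_PATH = re.compile(r"^/search/[^/]*/feed/[^/]*/?$")


def is_search_feed(path: str) -> bool:
    """Return True when the URL path looks like a search-results feed."""
    return SEARCH_FEED_PATH.match(path) is not None


# Hypothetical sample paths for illustration.
samples = [
    "/search/+example.com+best+pharmacy+pills+online/feed/rss2/",
    "/search/hello/feed/atom",
    "/search/hello/feed",         # no slash after "feed": not matched
    "/search/hello/",             # ordinary search page: not matched
    "/category/news/feed/rss2/",  # non-search feed: not matched
]
for path in samples:
    print(f"{path} -> {is_search_feed(path)}")
```

The pattern only inspects the path, so a search feed requested through query parameters (for example ?s=term&feed=rss2) would need a separate rule.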

#6 @Mte90
6 months ago

Ideally the comments feed should get the noindex tag as well, since indexing it isn't useful but Google still wants to crawl it.
A filter covering these cases, for anyone who doesn't want a given feed crawled at all, would also be preferable to relying on a plugin that does this magically.

#7 in reply to: ↑ 5 @jonoaldersonwp
6 months ago

Replying to tteggel:

Just wanted to add that we're seeing Google indexing some pages like this, and we're getting quite a bit of traffic too.

For others that stumble on this before the fix is ready a workaround is to add a redirect rule to your site (I used the WPEngine dashboard) to redirect from these search feed pages to your home page. The regex below might be useful.

^/search/[^/]*/feed/[^/]*/?$

Likely an instance of this: https://yoast.com/internal-site-search-spam/

#8 @peterwilsoncc
6 months ago

In 52536-poc.diff I've created a very rough POC that sets the X-Robots-Tag header according to the values in the default meta tag.

By default this adds the noindex header to search feeds via the existing wp_robots_noindex_search function included on the wp_robots filter.

On the downside, the HTTP headers are output much earlier than the meta tag so there is some risk of contradictory settings between the HTTP header and meta tag if a plugin alters the value after the HTTP headers are sent.

#9 follow-up: @joostdevalk
6 months ago

I feel this might be going a bit too far: it'd mean we'd output X-Robots-Tag on login pages, for instance, alongside a robots meta tag, which feels like too much (one or the other is fine, not both). Also, the inconsistent timings make it feel scary to me.

My preference would be to restrict this code to just feeds.

#10 in reply to: ↑ 9 @peterwilsoncc
6 months ago

Replying to joostdevalk:

... the inconsistent timings make it feel scary to me.

My preference would be to restrict this code to just feeds.

That makes sense to me.

While I've been able to find a meta tag equivalent for RSS feeds (per comment #3), I haven't been able to find one for atom feeds so it seems there's no choice but to use HTTP headers in feeds.
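
To make the feeds-only idea concrete, here is a language-neutral sketch in Python of the logic being discussed: derive an X-Robots-Tag value from a set of robots directives, and emit it only for feed responses. This is not the contents of 52536-poc.diff; the function names, the shape of the directives mapping, and the sample values are all assumptions for illustration.

```python
def robots_header_value(directives: dict) -> str:
    """Flatten robots directives into a comma-separated header value.

    Assumed shape: boolean values toggle a bare directive, and string
    values become "name:value" pairs.
    """
    parts = []
    for name, value in directives.items():
        if isinstance(value, str):
            parts.append(f"{name}:{value}")
        elif value:
            parts.append(name)
    return ", ".join(parts)


def feed_response_headers(is_feed: bool, directives: dict) -> dict:
    """Return extra HTTP headers, restricted to feed responses only."""
    if not is_feed:
        return {}  # HTML pages keep relying on the robots meta tag
    value = robots_header_value(directives)
    return {"X-Robots-Tag": value} if value else {}


# Illustrative directives for a search-results feed.
print(feed_response_headers(True, {"noindex": True, "follow": True}))
# The same directives on a non-feed request add no header.
print(feed_response_headers(False, {"noindex": True, "follow": True}))
```

Keeping the header to feeds avoids sending both a header and a meta tag for the same HTML page, which was the concern raised in comment #9.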
