Opened 4 years ago
Last modified 9 months ago
#52536 new enhancement
Add "X-Robots-Tag: noindex" to feeds by default
Reported by: | pikamander2 | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | major | Version: | |
Component: | Feeds | Keywords: | |
Focuses: | | Cc: | |
Description
We’ve noticed that spammy websites are linking to RSS search results on our site that include their domain in the search terms.
For example, if our site is wordpress.org and their site is example.com, they might link to wordpress.org/search/+example.com+best+pharmacy+pills+online/feed/rss2/
For normal WordPress searches, this isn’t a problem because the search results are set to “noindex”. However, these RSS2 pages are output as XML and don’t include any kind of “noindex” tag, so Google treats them as indexable pages.
Looking around Google, it seems like this type of blackhat SEO technique is fairly common and most likely done in bulk by bots.
SEO plugins like Yoast and AIOSEOP appear to add a "noindex" tag to the search result pages, but neither of them seems to add an "X-Robots-Tag: noindex" response header to the feeds, which means that most WordPress sites are vulnerable to that tactic.
Since feeds and search pages are built into WordPress's core and most people wouldn't want their results to be indexed anyway, can we add noindex to those two types of pages by default?
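For anyone who wants to check whether a given feed response already carries the header, here is a minimal Python sketch. It is not part of WordPress; the function name has_noindex is made up for illustration, and the parsing is deliberately simplified (it assumes comma-separated directives with an optional "botname:" prefix, and treats "none" as equivalent to "noindex, nofollow").

```python
def has_noindex(headers):
    """Return True if any X-Robots-Tag header asks crawlers not to index.

    `headers` is a mapping of header name to value; names are compared
    case-insensitively. This is a simplified parser, not a full
    implementation of the directive syntax.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for token in value.lower().split(","):
            # Drop an optional user-agent prefix such as "googlebot:".
            directive = token.split(":", 1)[-1].strip()
            if directive in ("noindex", "none"):
                return True
    return False


if __name__ == "__main__":
    # Hypothetical response headers for a search feed without the fix.
    print(has_noindex({"Content-Type": "application/rss+xml; charset=UTF-8"}))
    # Hypothetical response headers with the proposed header present.
    print(has_noindex({"X-Robots-Tag": "noindex, follow"}))
```

The mapping could come from any HTTP client; for example, the headers attribute of the response returned by urllib.request.urlopen offers an items() method that should work here.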
Attachments (1)
Change History (11)
#1
@
4 years ago
- Component changed from General to Feeds
- Summary changed from “Add "noindex" to search results by default and "X-Robots-Tag: noindex" to feeds by default” to “Add "X-Robots-Tag: noindex" to feeds by default”
#2
@
4 years ago
- We can't blanket noindex all feed views, because they need to be indexable/indexed for various external services (Google Podcasts, some Facebook stuff, etc).
- However, the recent ticket to noindex search results (#52457) could, indeed, be extended to noindex all formats of search results, including their RSS feeds, via an HTTP header.
#3
@
4 years ago
To account for sites using full page caching that doesn't include the HTTP headers, could the robots API be used to add meta tags to RSS feeds on the rss2_head hook?
<channel>
  <xhtml:meta xmlns:xhtml="http://www.w3.org/1999/xhtml" name="robots" content="noindex" />
  ...
</channel>
I'm unable to find the equivalent tag for Atom feeds, but hopefully such a thing exists.
#5
follow-up:
↓ 7
@tteggel
3 years ago
Just wanted to add that we're seeing Google indexing some pages like this, and we're getting quite a bit of traffic too.
For others who stumble on this before the fix is ready, a workaround is to add a redirect rule to your site (I used the WPEngine dashboard) that redirects these search feed pages to your home page. The regex below might be useful.
^/search/[^/]*/feed/[^/]*/?$
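To see which request paths that pattern would catch before putting it into a redirect rule, it can be tried against a few sample paths. The Python sketch below is only an illustration: the helper name is_search_feed and the sample paths are assumptions, and a host's redirect engine may use a slightly different regex flavour.

```python
import re

# The pattern from the comment above, unchanged.
SEARCH_FEED = re.compile(r"^/search/[^/]*/feed/[^/]*/?$")


def is_search_feed(path):
    """Return True if the URL path matches the search-feed pattern."""
    return SEARCH_FEED.match(path) is not None


if __name__ == "__main__":
    samples = [
        "/search/+example.com+best+pharmacy+pills+online/feed/rss2/",
        "/search/foo/feed/",
        "/search/foo/feed",
        "/search/foo/",
        "/feed/rss2/",
    ]
    for path in samples:
        print(path, is_search_feed(path))
```

Note that the pattern requires a slash after "feed", so a path ending in "/feed" with no trailing slash is not matched, and it only looks at the path, so a search feed requested through query parameters would presumably need a separate rule.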
#6
@
10 months ago
Ideally the comments feed should also get that noindex tag, as it isn't useful in search results but Google still wants to scrape it.
Of course, a filter to address these cases would also be interesting, so that someone who doesn't want a feed scraped at all can say so without resorting to a plugin that magically does it.
#7
in reply to:
↑ 5
@
10 months ago
Replying to tteggel:
Just wanted to add that we're seeing Google indexing some pages like this, and we're getting quite a bit of traffic too.
For others that stumble on this before the fix is ready a workaround is to add a redirect rule to your site (I used the WPEngine dashboard) to redirect from these search feed pages to your home page. The regex below might be useful.
^/search/[^/]*/feed/[^/]*/?$
Likely an instance of this: https://yoast.com/internal-site-search-spam/
#8
@
9 months ago
In 52536-poc.diff I've created a very rough POC that sets the X-Robots-Tag header according to the values in the default meta tag.
By default this adds the noindex header to search feeds via the existing wp_robots_noindex_search function, which is hooked to the wp_robots filter.
On the downside, the HTTP headers are output much earlier than the meta tag so there is some risk of contradictory settings between the HTTP header and meta tag if a plugin alters the value after the HTTP headers are sent.
#9
follow-up:
↓ 10
@joostdevalk
9 months ago
I feel this might be going a bit too far: it'd mean we'd output an X-Robots-Tag header on login pages, for instance, next to a robots meta tag, which feels like too much (one or the other is fine, not both). Also, the inconsistent timings make it feel scary to me.
My preference would be to restrict this code to just feeds.
#10
in reply to:
↑ 9
@
9 months ago
Replying to joostdevalk:
... the inconsistent timings make it feel scary to me.
My preference would be to restrict this code to just feeds.
That makes sense to me.
While I've been able to find a meta tag equivalent for RSS feeds (per comment #3), I haven't been able to find one for Atom feeds, so it seems there's no choice but to use HTTP headers in feeds.
Ticket #52457 is already open for the search pages, and this ticket could address the feeds.