WordPress.org

Make WordPress Core

Opened 9 months ago

Closed 8 months ago

Last modified 5 months ago

#52457 closed task (blessed) (fixed)

WordPress vulnerable to search-reflected webspam

Reported by: abagtcs Owned by: peterwilsoncc
Milestone: 5.7 Priority: normal
Severity: normal Version:
Component: General Keywords: has-patch has-unit-tests
Focuses: template Cc:

Description

WordPress echoes back searched-for terms on its search results page. For example, a search on an installation on www.foo.edu for "scholarship programs" would have the URL of:

https://www.foo.edu/?s=scholorship+programs

The resulting page would include the text:

Search results for: scholarship programs


Web spammers have started to abuse search features of those sites by passing in spam terms and hostnames in hopes of boosting the search rankings of the spammers’ sites. For example, www.foo.edu might be being abused by URLs that look like:

https://www.foo.edu/?s=Buy%20cheap%20viagra%20without%20prescrition%20-%20www.getcheapdrugs.com

and that produce a page that includes:

Search Results for: Buy cheap viagra without prescrition - www.getcheapdrugs.com


The spammers place these links in open wikis, blog comments, forums, and other link farms, relying upon search engines crawling their links, and then visiting and indexing the resulting search results pages and included spammy content.

This attack is surprisingly quite widespread, affecting many websites around the world. Though some CMS’s and sites powered by custom-written code may be vulnerable to this technique, (based on preliminary investigation) it appears that --at least in the .edu space-- the most targeted web platform by far is WordPress. For example, to see many examples of U.S. educational websites targeted by the attack, you can do a Google search for:

site:edu inurl:s “buy”


There are several possible ways to prevent a website from being abused by this method, but adding the appropriate header or meta tag (see https://developers.google.com/search/docs/advanced/crawling/block-indexing ) appears to be the most appropriate and effective, especially with respect to getting the spam URLs removed from search engine indexes.

I have submitted this problem as a security concern for WordPress on HackerOne, but it was closed as not a vulnerability. So I'm suggesting, then, that core be modified to either always, or by default (with the ability to disable), add the appropriate meta tag into search result page headers:

<meta name=’robots’ content=’noindex,follow’ />


This will indicate to crawlers that the content is not to be indexed and prevent the site from being abused by web spammers

Attachments (2)

52457.patch (500 bytes) - added by ayeshrajans 9 months ago.
52457.diff (2.5 KB) - added by peterwilsoncc 8 months ago.

Download all attachments as: .zip

Change History (26)

#1 @audrasjb
9 months ago

  • Milestone changed from Awaiting Review to 5.7

Hi and welcome to WordPress Trac @abagtcs,

Milestoning this issue to WordPress 5.7.

#2 @xkon
9 months ago

This is a problem indeed. According to an older Google q/a stream, if I recall correctly, it was mentioned that crawling search results add little to no value in regards to ranking and also obviously slow down crawling itself.

So there's nothing wrong at calling a noindex, nofollow in ?s=, it's basically better doing it even to avoid the spam in parallel. Feel free to correct me if I'm wrong though as I'm not an expert on the matter :).

That being said, I agree that maybe we should "protect" the search results by default with an extra option along the Search engine visibility setting.

Note that currently this is something that various SEO plugins do as well :).

#3 @jonoaldersonwp
9 months ago

Correct; this is a huge problem when it comes to spam, security, crawl budget, and more.
All internal search results should output a meta robots tag with a noindex, follow value.

#4 @burtrw
9 months ago

I've seen this a lot recently too and had a few customer complaints who thought they had been hacked. I'd be all for core behavior to not index ?s= search result pages. Would the additional option under Settings > Reading be necessary? Maybe if you really wanted search pages to be indexed, that could be left up to plugins?

#5 @jonoaldersonwp
9 months ago

Now that we have a rudimentary robots tag API in core, this should definitely be considered.

I agree on not needing a UI control, and, having this noindex by default.

There are very few valid use-cases for enabling indexing of search results, outside of having a very non-standard theme. Even Yoast SEO doesn't allow users to disable our nonindex'ing of search results via the UI (though we do have filters).

#6 @ayeshrajans
9 months ago

  • Focuses template added
  • Keywords has-patch needs-testing needs-unit-tests added

It's a pretty clever way to spam. The esc_html of course helps with XSS, but the text itself is part of the page.

Suggesting a patch with add_filter( 'wp_robots', 'wp_robots_no_robots' ); when the search query is fetched. I think it wouldn't be possible to track all use cases unless we grep-and-fix $_REQUEST['s'] for user-search related pages, but I hope the patch is a start.

@ayeshrajans
9 months ago

#7 @peterwilsoncc
8 months ago

  • Keywords needs-refresh added

Thanks for the patch @ayeshrajans

While the add_action() such as you've got is a good approach, adding it inside get_search_query() will cause problems for themes or plugins wishing to handle robots tags in there own way.

Each time get_search_query() is called, they'd need to reinitialize their own handling.

There are two possible approaches I can think of (others may have another suggestion or two):

  • in the noindex function, call wp_no_robots() if the blog is not private, or if is_search() is true
  • As the templates are loaded, add the code from your original patch once the search template is chosen.

This ticket was mentioned in PR #996 on WordPress/wordpress-develop by maniu.


8 months ago

  • Keywords needs-refresh removed

#9 @poena
8 months ago

I tested this by applying PR 996, performing a search, and viewing the source of the search result page.
<meta name='robots' content='noindex, follow, max-image-preview:large' /> is output in the head.

I then followed the testing instructions from 51511#comment:20. Removing the filter(s) works correctly. I did not test on multisite.

@peterwilsoncc
8 months ago

#10 @peterwilsoncc
8 months ago

  • Keywords has-unit-tests added; needs-testing needs-unit-tests removed

In 52457.diff:

  • no change from pull request
  • unit test to ensure noindex displays on search
  • unit test to ensure noindex does not display on other pages

Revised approach looks good to me.

@jonoaldersonwp I noticed sensitive pages include a noindex, noarchive directive. Is the latter required for search too?

If not, I've also tested this and think it's good for commit if noarchive isn't needed.

#11 @jonoaldersonwp
8 months ago

Super! Yeah, we don't need the noarchive.

#12 @sabernhardt
8 months ago

The ticket's type is "enhancement" after the 5.7 cutoff date. Could we reclassify this as an important task for follow-up to #51511?

Also, #52536 could extend this to feeds (in a future release).

#13 @johnbillion
8 months ago

  • Type changed from enhancement to task (blessed)

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


8 months ago

#15 @peterwilsoncc
8 months ago

  • Owner set to peterwilsoncc
  • Resolution set to fixed
  • Status changed from new to closed

In 50370:

General: Add noindex robots meta tag to search results.

Prevent search engines indexing internal search results to protect against reflected web spam attacks.

Props abagtcs, audrasjb, ayeshrajans, burtrw, johnbillion, jonoaldersonwp, peterwilsoncc, poena, sabernhardt, xkon.
Fixes #52457

#16 @SergeyBiryukov
8 months ago

In 50371:

Robots: Rename wp_embed_no_robots to wp_robots_noindex_embeds().

This brings the naming in line with wp_robots_noindex_search().

Follow-up to [49992], [50370].

See #51511, #52457.

This ticket was mentioned in Slack in #core by sergey. View the logs.


8 months ago

This ticket was mentioned in Slack in #forums by vladytimy. View the logs.


7 months ago

This ticket was mentioned in Slack in #core by poena. View the logs.


6 months ago

This ticket was mentioned in Slack in #core by sergey. View the logs.


6 months ago

#22 follow-up: @matt
6 months ago

Should we noindex on all search results, or just ones that return zero posts?

#23 in reply to: ↑ 22 @jonoaldersonwp
6 months ago

Replying to matt:

Should we noindex on all search results, or just ones that return zero posts?

All. Search results are almost always low value/utility from an SEO (and user) perspective, and often heavily duplicate existing taxonomy pages.

This ticket was mentioned in Slack in #meta by sergey. View the logs.


5 months ago

Note: See TracTickets for help on using tickets.