garbage query strings on URLs are not sanitized or removed
Reported by: rawalex | Owned by:
Here is an interesting problem I ran into, a bug / feature that appears to be used by malicious people to cause Google to see your site as full of duplicate content.
If you visit a WordPress site and append a garbage query string to the URL, that garbage gets carried forward. Example:
When you scroll down, the "previous" and "next" links will automatically carry that query string forward.
Normally, this would not be a big issue. However, some people appear intent on deliberately creating these sorts of links to WordPress sites, and Googlebot is finding those links on remote sites. Googlebot follows them, and the "previous - next" links then perpetuate the junk query string through every page on the site. If you have 1000 posts, at 10 per page, Google has just indexed 100 pages of duplicate content.
So the bug is the following:
Passed query strings need to be sanitized and the junk removed - there is no reason to pass it on. When a junk query string is passed, the server should send an HTTP 301 or 302 reply and redirect the user / bot to the proper page without the query string.
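To make the proposed behaviour concrete, here is a rough sketch in Python (not WordPress code) of the check described above. The function name canonical_redirect and the ALLOWED_PARAMS allowlist are made up for illustration; a real fix would have to build the allowlist from whatever query variables the site actually recognizes.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative allowlist only (an assumption, not an authoritative list).
ALLOWED_PARAMS = {"s", "p", "paged", "cat"}


def canonical_redirect(url, allowed=ALLOWED_PARAMS):
    """Return (301, clean_url) if the URL carries unknown query
    parameters, or None if the URL is already clean."""
    parts = urlsplit(url)
    # keep_blank_values=True so that a bare "?garbage" is still seen.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in allowed]
    if kept == pairs:
        return None  # nothing to strip, serve the page normally
    clean = urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
    return 301, clean
```

With this sketch, a request for http://example.com/?foo=bar would be answered with a 301 to http://example.com/, while http://example.com/?s=hello would be served as-is.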
Further, query strings should not be perpetuated forward through the "previous - next" links on the pages unless they are relevant to that page change. As an example, a valid search string might be worth moving forward with. Other passed items may not be worth carrying forward.
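As a rough illustration of the "previous - next" point, the sketch below (Python, not WordPress code) builds a pagination link that carries forward only the parameters considered relevant. The function name pagination_link, the CARRY_FORWARD set, and the use of a "paged" parameter for the page number are assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumption for illustration: only the search term is worth carrying forward.
CARRY_FORWARD = {"s"}


def pagination_link(current_url, page, carry=CARRY_FORWARD):
    """Build the link to another page, dropping every query parameter
    that is not explicitly listed in `carry`."""
    parts = urlsplit(current_url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key in carry]
    # The target page number is set by us, never copied from the request.
    kept.append(("paged", str(page)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Called with http://example.com/?s=cats&junk=xyz and page 2, this produces http://example.com/?s=cats&paged=2: the search term survives and the junk does not.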
Potentially, any unsanitized input accepted in a query is a vector for other attacks. Having that query carry forward is a real issue. As an example, a full "select * from" query passed in the query string is not rejected or stripped; it is simply carried forward. No, these strings are not currently causing anything to happen, but the failure to sanitize these inputs suggests a vector for a future attack, such as an input overflow or similar.