Opened 11 months ago
Last modified 4 weeks ago
#60805 new feature request
Reading Settings: add option to discourage AI services from crawling the site
Reported by: | | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | |
Component: | Privacy | Keywords: | |
Focuses: | privacy | Cc: | |
Description
I'd like to suggest a new addition to the bottom of the Reading Settings screen in the dashboard:
This new section would help site owners indicate whether or not they would like their content to be indexed by AI services and used to train future AI models.
There have been a lot of discussions about this in the past 2 years: content creators and site owners have asked whether their work could and should be used to train AI. Opinions vary, but at the end of the day I believe most would agree that as a site owner, it would be nice if I could choose for myself, for my own site.
In practice, I imagine the feature would work just like the Search Engines setting just above: when toggled, it would edit the site's `robots.txt` file and disallow a specific list of AI services from crawling the site.
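For context, WordPress already serves a virtual `robots.txt` through `do_robots()` and passes its output through the `robots_txt` filter, so the toggle could hook into the same mechanism the Search Engines setting uses. Below is a minimal sketch, assuming a hypothetical `block_ai_crawlers` option and a shortened example agent list:

```php
<?php
// Minimal sketch: the 'block_ai_crawlers' option name and the agent list
// below are hypothetical examples, not an agreed-upon implementation.
add_filter( 'robots_txt', function ( $output, $public ) {
	if ( ! get_option( 'block_ai_crawlers' ) ) {
		return $output;
	}

	$ai_user_agents = array( 'GPTBot', 'CCBot', 'Google-Extended' );

	foreach ( $ai_user_agents as $agent ) {
		$output .= "\nUser-agent: {$agent}\nDisallow: /\n";
	}

	return $output;
}, 10, 2 );
```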
There are typically 4 main approaches to discouraging AI services from crawling your site:
- You can add `robots.txt` entries matching the different User Agents used by AI services, asking them not to index content via a `Disallow: /` rule.
  - This seems to be the cleanest approach, and the one that AI services are the most likely to respect.
  - It also has an important limitation: it relies on a list of AI User Agents that would have to be kept up to date. It would obviously be hard for that list to ever be fully exhaustive. See an example of the User Agents we would have to support below.
- You can add an `ai.txt` file to your site, as suggested by Spawning AI here.
  - However, we have no guarantee AI services currently recognize and respect this file.
- You can add a meta tag to your site's `head`: `<meta name="robots" content="noai, noimageai" />`. This is something that was apparently first implemented by DeviantArt. (See the `wp_robots` sketch after this list.)
  - I do not know if this is actually respected by AI services. It is not an HTML standard today. In fact, discussions for a new HTML standard are still in progress, and suggest a different tag (reference).
  - If a standard like that were to be accepted, and if AI services agreed to use it, it may be the best implementation in the future, since we would not have to define a list of AI services.
- You can completely block specific User Agents from accessing the site.
  - I believe we may not want to implement something that drastic, which could potentially block real visitors, in WordPress Core. This is something that is better left to plugins.
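Since WordPress 5.7, the robots meta tag is assembled through the `wp_robots` filter, so the meta-tag approach (the third option above) could be sketched like this; note that `noai` and `noimageai` are the non-standard directives mentioned above, not part of any HTML standard:

```php
<?php
// Sketch for the meta-tag approach; "noai" / "noimageai" are non-standard
// directives, and there is no guarantee AI services respect them.
add_filter( 'wp_robots', function ( $robots ) {
	$robots['noai']      = true;
	$robots['noimageai'] = true;
	return $robots;
} );
```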
Some plugins already exist that implement some of the approaches above. This shows there may be interest in including such a feature in Core.
- ChatBot Blocker
- Simple NoAI and NoImageAI
- Block AI Crawlers
- Block Chat GPT via robots.txt
- Block Common Crawl via robots.txt
- WordPress Block AI Scrapers
If we were to go with the first option, here are some examples of the User Agents we would have to support:
- Amazonbot -- https://developer.amazon.com/support/amazonbot
- anthropic-ai -- https://www.anthropic.com/
- Bytespider -- https://www.bytedance.com/
- CCBot -- https://commoncrawl.org/ccbot
- ClaudeBot -- https://claude.ai/
- cohere-ai -- https://cohere.com/
- FacebookBot -- https://developers.facebook.com/docs/sharing/bot
- Google-Extended -- https://blog.google/technology/ai/an-update-on-web-publisher-controls/
- GPTBot -- https://platform.openai.com/docs/gptbot
- omgili -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- omgilibot -- https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
- SentiBot -- https://sentione.com/
- sentibot -- https://sentione.com/
This list could be made filterable so folks can extend or modify it as they see fit.
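As a rough illustration, here is a minimal sketch of such a filterable list; the `ai_crawler_user_agents` filter name is hypothetical and only used for this example:

```php
<?php
// Hypothetical filter name, shown only to illustrate how the default list
// could be extended or trimmed by plugins and themes.
$ai_user_agents = apply_filters(
	'ai_crawler_user_agents',
	array(
		'Amazonbot', 'anthropic-ai', 'Bytespider', 'CCBot', 'ClaudeBot',
		'cohere-ai', 'FacebookBot', 'Google-Extended', 'GPTBot',
		'omgili', 'omgilibot', 'SentiBot', 'sentibot',
	)
);

// A plugin could then adjust the defaults, for example:
add_filter( 'ai_crawler_user_agents', function ( $agents ) {
	$agents[] = 'PerplexityBot'; // add an agent missing from the default list
	return array_diff( $agents, array( 'FacebookBot' ) ); // allow one agent again
} );
```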
Attachments (1)
Change History (12)
This ticket was mentioned in PR #7590 on WordPress/wordpress-develop by @rickcurran.
4 months ago
#2
- Keywords has-patch added
Enhancement: Add known AI Crawler bots to disallow list in robots.txt to prevent crawling without specific user consent
This change / enhancement is intended to add known AI crawler bots as disallow entries to WordPress' virtual `robots.txt` file, to prevent AI bots from crawling site content without specific user consent.
This is done by changing the `do_robots()` function in `wp-includes/functions.php`: the updated code loads a list of known AI bots from a JSON file, `ai-bots-for-robots-txt.json`, and creates a `User-agent:` entry for each one that disallows its access.
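The PR code itself isn't reproduced in this ticket, so the following is only a rough sketch of the described change, assuming `ai-bots-for-robots-txt.json` contains a flat JSON array of user-agent strings and using a hypothetical helper name; the result would be appended to the `$output` built in `do_robots()` before the `robots_txt` filter runs:

```php
<?php
// Rough sketch only, not the code from PR #7590. Assumes the JSON file is a
// flat array of user-agent strings, e.g. ["GPTBot", "CCBot", ...].
function wp_ai_bot_robots_txt_entries() { // hypothetical helper name
	$json = file_get_contents( ABSPATH . WPINC . '/ai-bots-for-robots-txt.json' );
	$bots = json_decode( $json, true );

	if ( ! is_array( $bots ) ) {
		return '';
	}

	$entries = '';
	foreach ( $bots as $bot ) {
		$entries .= "\nUser-agent: {$bot}\nDisallow: /\n";
	}

	return $entries;
}
```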
### Why is this needed?
My perspective is that having AI bots blocked by default in WordPress is a strong stance against the mass scraping of people’s content by companies like OpenAI, Perplexity, Google and Apple for use in AI training without their consent.
Microsoft’s AI CEO Mustafa Suleyman was quoted recently saying:
“With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding,”
This statement seems to say the quiet part out loud: many AI companies clearly believe that because content has been shared publicly on the web, it is available to be used for AI training by default, and that unless the publisher specifically says it should not be used, there is no problem with it being crawled and absorbed into their AI models.
I am aware that plugins already exist for people who wish to block these bots, but that is only useful for people who are aware of the issue and choose to act on it. I believe consent should be requested by these companies and given, rather than the default being that companies can just presume it’s OK and scrape any website that doesn’t specifically say “no”.
Having 43%+ of websites on the internet suddenly say “no” by default seems like a strong message to send out. I realise that robots.txt blocking isn’t going to stop the anonymous bots that scrape regardless, but at least the legitimate companies who intend to honour it will take notice.
With the news that OpenAI is switching from being a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content that is published using WordPress. So whilst the default would be to block the AI bots, there would be a way for people / publishers to allow access to their content using the same methods currently available to modify `robots.txt` in WordPress: plugins, custom code, etc.
I have updated the Trac link to point to an earlier ticket; this PR has been marked as a duplicate in Trac:
Trac ticket https://core.trac.wordpress.org/ticket/60805
#3
@
4 months ago
- Keywords has-patch removed
Hi, I recently posted a ticket and PR related to the same issue (https://core.trac.wordpress.org/ticket/62257), and I would like to encourage some focus on this issue if possible. It's great to see a prior ticket exists for this, but I think the need for a core feature to block AI bots has increased in the months since this original ticket was created; I make a case for this in my PR / ticket, which you can see linked above now.
I think @jeherve's idea of a checkbox to toggle this on or off makes sense, along with the ability to filter the list of User Agents. I do think this should default to being checked, as the issue is important enough to warrant a strong stance, but perhaps this status could be brought to the WP admin's attention in some way, such as a post-upgrade message, so that they can then choose how they would like it set. When "Discourage search engines from indexing this site" is checked, a "Search engines discouraged" message appears in the "At a Glance" widget on the Dashboard to indicate this state, so a similar message there would make sense too.
(Side issue / thought: Should there be a regular, more prominent reminder to check these settings? There is a periodic reminder post-login that asks Admins to confirm whether the admin email address is still correct, so there is a precedent for reminding users about certain configuration settings; perhaps this falls into the "Site Health" checks as well?)
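For the "At a Glance" indicator mentioned above, a plugin-level sketch could use the existing `dashboard_glance_items` filter; the `block_ai_crawlers` option name is hypothetical:

```php
<?php
// Sketch: surfaces an "AI crawlers discouraged" note in the "At a Glance"
// dashboard widget. The 'block_ai_crawlers' option name is hypothetical.
add_filter( 'dashboard_glance_items', function ( $items ) {
	if ( get_option( 'block_ai_crawlers' ) ) {
		$items[] = esc_html__( 'AI crawlers discouraged' );
	}
	return $items;
} );
```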
Thanks!
#4
follow-up:
↓ 5
@
4 months ago
Thanks for the ticket, @jeherve, and for continuing the discussion, @rickcurran 🙌🏻
Search visibility status
Regarding precedent for informing users of search engine visibility, yes, both the "At a Glance" dashboard widget and Site Health > Info > WordPress section include notices to this effect:
I agree that both would be helpful indicators/reminders to couple with this feature.
AI crawler visibility
Personally, I'd prefer a default blanket option that forgoes the need to maintain an agent list, allowing extenders to limit/allow on a per-agent basis as needed. From what I've observed in the media, the concern voiced around AI companies scraping content seems quite separate from the ability to show up in search results, which would rule out a blanket AI "disallow" in `robots.txt`. A blocklist versioned to a WordPress release or served by the WordPress.org API would require regular maintenance, so might not be a great fit for Core.
A separate `ai.txt` file modeled after `robots.txt` would keep these concerns separate, but will anybody honor it? As mentioned by @rickcurran, could WordPress lead by example here, by establishing a standard to be used by 43% of sites?
With regard to a default of allowing or blocking AI crawlers, while it would indeed send a powerful message, I don't know if all WordPress users would necessarily agree to block on Day One when this feature shipped. However, a one-time admin notice after update, and a persistent AI crawler status on "At a Glance" could serve as reminders of this option.
AI worker agents
This is another wrinkle to consider: If these controls were implemented, how should WordPress deal with AI-based agents that access sites to perform tasks, such as Anthropic's "Computer use" or Open Interpreter? This use case could ostensibly be a legit automation by a site visitor (or member/customer). Would WordPress differentiate between these types of tasks? A commerce site might be fine with an automation to re-order toilet paper, but a ticket site might not want bots gobbling up seats to an event.
#5
in reply to:
↑ 4
@
4 months ago
Replying to ironprogrammer:
Thanks for your comments / thoughts.
AI crawler visibility
Personally, I'd prefer a default blanket option that forgoes the need to maintain an agent list, allowing extenders to limit/allow on a per-agent basis as needed.
I don’t think there is any blanket method that can be used to target just AI bots; each one needs to be specified by its User Agent.
A blocklist versioned to a WordPress release or served by the WordPress.org API would require regular maintenance, so might not be a great fit for Core.
I do have the same concern; however, I don’t think new AI bots come online so frequently that waiting for WP core point releases to update the blocklist would take too long. If it were urgent to add one in between those releases, then this could be the role of a plugin which allows you to filter the list and add / remove User Agents.
A separate `ai.txt` file modeled after `robots.txt` would keep these concerns separate, but will anybody honor it?
The `ai.txt` option seems like a good idea, but I don’t know if any bots willingly use it. So I think `robots.txt` is the best option, as it definitely works.
AI worker agents
This is another wrinkle to consider: If these controls were implemented, how should WordPress deal with AI-based agents that access sites to perform tasks, such as Anthropic's "Computer use" or Open Interpreter? This use case could ostensibly be a legit automation by a site visitor (or member/customer). Would WordPress differentiate between these types of tasks?
There are different User Agents for the different types of bots, so in theory these could be split into separate lists, blocking the training bots while still allowing task bots to access the site. I can see that people may wish to allow one group and block the other.
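As a sketch of how that split might look, with both filter names purely hypothetical and the agents only examples:

```php
<?php
// Both filter names are hypothetical; the default agents are only examples.
$training_bots = apply_filters( 'ai_training_crawler_user_agents', array( 'GPTBot', 'CCBot' ) );
$task_bots     = apply_filters( 'ai_task_agent_user_agents', array( 'ChatGPT-User' ) );

// Disallow rules could then be emitted for $training_bots only, leaving
// $task_bots free to access the site (or the other way around).
```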
#6
follow-up:
↓ 7
@
3 months ago
Thanks for opening this Trac ticket, @jeherve.
I like this idea - it gives users greater control over their content.
I can see how some users might not want their blog or site to be indexed by both search engines and AI bots - such as for staging sites and non-public content.
What do you think about adding a single checkbox to discourage both search engines and AI bots?
If left unchecked, users could be presented with separate checkboxes for more granular control over discouraging indexing by search engines and AI bots individually (similar to how it works now).
#7
in reply to:
↑ 6
;
follow-up:
↓ 8
@
3 months ago
Replying to antonvlasenko:
What do you think about adding a single checkbox to discourage both search engines and AI bots?
If left unchecked, users could be presented with separate checkboxes for more granular control over discouraging indexing by search engines and AI bots individually (similar to how it works now).
I'm not sure I understand. Do you mean to have two checkboxes, or one?
#9
@
2 months ago
My inclination is that this is best left to plugins for the time being.
- AI companies and crawlers are increasing at a rapid rate, so it's unlikely Core contributors would be able to keep up with the known bots. That puts WP in a position in which it is, arguably, favoring some companies over others.
- It's pretty well known that some AI crawlers respect robots.txt, while others ignore it. The most effective way of blocking them is via their IP address, which isn't something that can be done in Core due to full page caching and the use of CDNs. In effect, robots.txt in Core would only affect the more ethical providers.
On my personal site, I'd probably install such a plugin, but that doesn't necessarily mean it's appropriate for Core.
#11
@
4 weeks ago
I appreciate that there is a ticket in the works to help prevent generative AI from scraping websites for content.
Ideally this would be something WordPress had already added as an update to block any sort of content scraping, especially given that content scraping for generative AI could be considered theft and thus subject to criminal charges in federal court, but I understand if this ends up being resolved with plugins.
One of my major concerns is that WordPress could abuse the platform as a whole for content scraping to train generative AI, in the same way that Meta and Twitter have abused their platforms through their Terms of Service and Privacy Policy. Tech companies seem so eager to push generative AI tech, and "AI" tech in general, that it's putting too many risks on users and businesses.
That said, for the original idea @jeherve posted, this shouldn't be an "opt-in" feature, but something that is always on unless the owner of a WordPress site turns it off. This ensures that all WordPress sites are automatically protected from AI theft.
Mockup of how such an option could look in the WordPress dashboard