#62257 closed enhancement (duplicate)
Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent
Reported by: | rickcurran | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | normal | Version: | 6.7 |
Component: | Privacy | Keywords: | has-patch |
Focuses: | | Cc: | |
Description
This change / enhancement is intended to add known AI Crawler bots as disallow entries to WordPress' virtual robots.txt file to prevent AI bots from crawling site content without specific user consent.
This is done by changing the `do_robots` function in `wp-includes/functions.php`. The updated code loads a list of known AI bots from a JSON file, `ai-bots-for-robots-txt.json`, and creates a `User-agent:` entry for each one that disallows its access.
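As a rough illustration of the approach described above, here is a minimal standalone sketch. It is not the code from the patch: the helper name `build_ai_bot_rules` is hypothetical, and it assumes the JSON is a flat array of user-agent strings, which may not match the real structure of `ai-bots-for-robots-txt.json`. The bot names are sample values only.

```php
<?php
/**
 * Build robots.txt rules that disallow each listed crawler.
 *
 * Assumption: the JSON is a flat array of user-agent strings. The real
 * ai-bots-for-robots-txt.json in the patch may be structured differently.
 */
function build_ai_bot_rules( string $json ): string {
	$bots = json_decode( $json, true );
	if ( ! is_array( $bots ) ) {
		return ''; // Invalid or empty JSON: emit no extra rules.
	}

	$output = '';
	foreach ( $bots as $bot ) {
		if ( ! is_string( $bot ) || '' === trim( $bot ) ) {
			continue; // Skip anything that is not a usable user-agent name.
		}
		$output .= 'User-agent: ' . trim( $bot ) . "\n";
		$output .= "Disallow: /\n\n";
	}

	return $output;
}

// Sample input; GPTBot and CCBot are used purely as example entries.
echo build_ai_bot_rules( '["GPTBot", "CCBot"]' );
```

In the actual change, text like this would presumably be appended to the output that `do_robots` already generates, rather than echoed directly.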
Why is this needed?
My perspective is that having AI bots blocked by default in WordPress is a strong stance against the mass scraping of people’s content for use in AI training without their consent by companies like OpenAI, Perplexity, Google and Apple.
Microsoft’s AI CEO Mustafa Suleyman was quoted recently saying:
“With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding,”
This statement seems to say the quiet part out loud: many AI companies clearly believe that content shared publicly on the web is available for AI training by default, so unless the publisher specifically says otherwise, there is no problem with it being crawled and absorbed into their AI models.
I am aware that plugins already exist for people who wish to block these bots, but they only help people who are aware of the issue and choose to act. I believe these companies should request consent and be given it, rather than presuming by default that it’s OK to scrape any website that doesn’t specifically say “no”.
Having 43%+ of websites on the internet suddenly say “no” by default seems like a strong message to send out. I realise that robots.txt blocking won’t stop anonymous bots that scrape regardless, but at least the legitimate companies that intend to honour it will take notice.
With the news that OpenAI is switching from being a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content that is published using WordPress. So whilst the default would be to block the AI bots, people / publishers could still allow access to their content by using the same methods currently available to modify ‘robots.txt’ in WordPress (plugins, custom code, etc.).
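To illustrate the opt-in route mentioned above, here is a hedged sketch of how a site owner might re-allow one crawler with custom code. The helper `allow_ai_bot` is hypothetical and assumes each blocked bot appears as a `User-agent:` line followed directly by `Disallow: /`, which may not match the patch's actual output. The `robots_txt` filter used at the end is WordPress's existing hook for modifying the virtual robots.txt, and the guard keeps the snippet runnable outside WordPress.

```php
<?php
/**
 * Remove the "User-agent: <bot>" / "Disallow: /" pair for one crawler.
 *
 * Hypothetical helper; assumes each blocked bot occupies exactly that
 * two-line group in the generated robots.txt text.
 */
function allow_ai_bot( string $output, string $bot ): string {
	$pattern = '/^User-agent: ' . preg_quote( $bot, '/' )
		. '[ \t]*\R+Disallow: \/[ \t]*$\R*/mi';

	// preg_replace() returns null on error; fall back to the input then.
	return preg_replace( $pattern, '', $output ) ?? $output;
}

// Inside WordPress, this could be attached to the existing robots_txt filter.
if ( function_exists( 'add_filter' ) ) {
	add_filter(
		'robots_txt',
		function ( $output, $public ) {
			// GPTBot is only a sample name for the bot being re-allowed.
			return allow_ai_bot( $output, 'GPTBot' );
		},
		10,
		2
	);
}
```

Rules that disallow only specific paths for the same bot are left untouched, because the pattern requires `Disallow: /` to be the whole line.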
(Apologies if I am missing information here, this is my first time pushing code via Trac / GitHub so I am still finding my feet with the process!)
Change History (3)
#1 3 months ago
This ticket was mentioned in PR #7590 on WordPress/wordpress-develop by @rickcurran.
- Keywords has-patch added
#2 follow-up: ↓ 3 @peterwilsoncc 3 months ago
- Milestone Awaiting Review deleted
- Resolution set to duplicate
- Status changed from new to closed
Hi @rickcurran This has been proposed earlier in ticket #60805. I've closed this as a duplicate to keep discussion in a single location. I think you can update the ticket number in your pull request to link it to the earlier ticket.
#3 in reply to: ↑ 2, 3 months ago
Replying to peterwilsoncc:
> Hi @rickcurran This has been proposed earlier in ticket #60805. I've closed this as a duplicate to keep discussion in a single location. I think you can update the ticket number in your pull request to link it to the earlier ticket.
Ok, thanks, I have updated my PR to link to the earlier ticket now.