Make WordPress Core

Opened 3 months ago

Closed 3 months ago

Last modified 3 months ago

#62257 closed enhancement (duplicate)

Enhancement: Add known AI Crawler bots to robots.txt to prevent crawling content without specific consent

Reported by: rickcurran
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 6.7
Component: Privacy
Keywords: has-patch
Focuses:
Cc:

Description

This enhancement adds known AI crawler bots as disallow entries in WordPress' virtual robots.txt file, to prevent AI bots from crawling site content without specific user consent.

This is done by changing the do_robots function in wp-includes/functions.php: the updated code loads a list of known AI bots from a JSON file, ai-bots-for-robots-txt.json, creates a User-agent: entry for each one, and disallows its access.
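
The PR itself changes do_robots() in core, but a functionally similar sketch can be written as a plugin against the public robots_txt filter, which do_robots() already applies to its output. The JSON path and the bot names in the comments below are illustrative assumptions, not the PR's actual list:

```php
<?php
// Sketch only: append disallow rules for known AI crawlers to WordPress'
// virtual robots.txt. Assumes ai-bots-for-robots-txt.json contains a flat
// array of user-agent strings, e.g. ["GPTBot", "CCBot", "PerplexityBot"].
add_filter( 'robots_txt', function ( $output ) {
	$file = __DIR__ . '/ai-bots-for-robots-txt.json';

	if ( ! is_readable( $file ) ) {
		return $output; // Fail open: leave robots.txt unchanged.
	}

	$bots = json_decode( file_get_contents( $file ), true );

	if ( ! is_array( $bots ) ) {
		return $output; // Malformed JSON: leave robots.txt unchanged.
	}

	// Emit one User-agent group per bot, disallowing the entire site.
	foreach ( $bots as $bot ) {
		$output .= "\nUser-agent: {$bot}\nDisallow: /\n";
	}

	return $output;
} );
```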

Why is this needed?
My perspective is that blocking AI bots by default in WordPress takes a strong stance against the mass scraping of people’s content, without their consent, for use in AI training by companies like OpenAI, Perplexity, Google, and Apple.

Microsoft AI CEO Mustafa Suleyman was recently quoted as saying:

“With respect to content that is already on the open web, the social contract of that content since the 90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding,”

This statement says the quiet part out loud: many AI companies clearly believe that content shared publicly on the web is available for AI training by default, and that unless the publisher specifically says it should not be used, there is no problem with crawling it and absorbing it into their AI models.

I am aware that plugins already exist for people who wish to block these bots, but they only help those who are aware of the issue and choose to act. I believe consent should be requested by these companies and granted by publishers, rather than the default being that companies can simply presume it’s OK and scrape any website that doesn’t specifically say “no”.

Having the 43%+ of websites that run WordPress suddenly say “no” by default seems like a strong message to send. I realise that robots.txt blocking isn’t going to stop anonymous bots that ignore it, but at least the legitimate companies who intend to honour it will take notice.

With the news that OpenAI is switching from being a non-profit organisation to a for-profit company, I think a stronger stance is needed on the default permissions for content published using WordPress. So whilst the default would be to block the AI bots, people / publishers would still be able to allow access to their content using the same methods currently available for modifying robots.txt in WordPress: plugins, custom code, etc.
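
For example, a publisher who wants to grant a single crawler access could hook the same robots_txt filter at a later priority. This is a rough sketch assuming the disallow rules were emitted in the exact User-agent / Disallow form shown earlier; GPTBot is only an example:

```php
// Hypothetical opt-in: strip the block for one crawler after the default
// rules have been generated (priority 20 runs after the default 10).
add_filter( 'robots_txt', function ( $output ) {
	return str_replace( "\nUser-agent: GPTBot\nDisallow: /\n", '', $output );
}, 20 );
```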

(Apologies if I am missing information here; this is my first time pushing code via Trac / GitHub so I am still finding my feet with the process!)

Change History (3)

This ticket was mentioned in PR #7590 on WordPress/wordpress-develop by @rickcurran.


#1
3 months ago

  • Keywords has-patch added

Enhancement: Add known AI Crawler bots to disallow list in robots.txt to prevent crawling without specific user consent

#2 follow-up: @peterwilsoncc
3 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Hi @rickcurran. This has been proposed earlier in ticket #60805. I've closed this as a duplicate to keep discussion in a single location. I think you can update the ticket number in your pull request to link it to the earlier ticket.

#3 in reply to: ↑ 2 @rickcurran
3 months ago

Replying to peterwilsoncc:

Hi @rickcurran. This has been proposed earlier in ticket #60805. I've closed this as a duplicate to keep discussion in a single location. I think you can update the ticket number in your pull request to link it to the earlier ticket.

OK, thanks. I have updated my PR to link to the earlier ticket now.

Note: See TracTickets for help on using tickets.