Make WordPress Core

Opened 3 years ago

Last modified 5 weeks ago

#52900 reopened feature request

Instantly index WordPress web sites content in Search Engines

Reported by: fabricecanel
Owned by:
Milestone: Future Release
Priority: normal
Severity: normal
Version:
Component: General
Keywords: has-patch has-unit-tests
Focuses:
Cc:

Description

Every day, Search Engines crawl billions of WordPress URLs to keep their search index fresh. They crawl to get the latest content, discover new outlinks, or verify that URLs already indexed are still valid and not dead links. Unfortunately, Search Engines are generally inefficient at crawling, as they don’t know whether the content has changed, and most web pages do not change often. Search engine crawling activity adds to bandwidth and CPU consumption.

At Microsoft Bing, we believe in a fundamental shift in how Search Engines learn about new, updated, and deleted content across the web. Instead of crawling often to detect whether content has changed, Content Management Systems should notify Search Engines of content changes, to limit crawling and keep the Search Index fresh. To support this transformation, since February 2019 we have offered a URL and Content Submission API that allows web site owners to publish thousands to millions of URLs or content items per day to Bing; and since July 2020 we have offered a WordPress plugin to submit content immediately to Bing’s search index, no code required.

Today, we propose integrating in WordPress Core the ability to notify not only Bing, but also all participating Search Engines, of any WordPress URL or Content change. Microsoft would develop and maintain the open-source code in close collaboration with WordPress; WordPress would approve, validate and include the code.

Behind the scenes, WordPress will automatically submit URLs or Content, ensuring that WordPress content is always fresh in the Search Engines; in exchange, Search Engines will limit or stop crawling WordPress sites. Site owners will be able to opt out or select the content they don’t want to submit to search engines.

Change History (53)

This ticket was mentioned in Slack in #forums by sergey. View the logs.


3 years ago

#2 @desrosj
3 years ago

  • Keywords reporter-feedback added

Hi @fabricecanel,

Welcome to Trac!

Can you provide more information around how this API works? Is there documentation you can point to?

Also, is this a standardized approach across various search engine companies? Or will the implementation differ between providers?

#3 @Clorith
3 years ago

Hiya @fabricecanel, and welcome to the WordPress trac!

I'd also just like to add in, just to clear out any potential confusion, that your initial ticket was flagged and removed as it read like many other spam posts we receive. On closer inspection we could see that it was not, and have restored it, but as you can see now with a new ticket ID.

For reference, the plugin mentioned is https://wordpress.org/plugins/bing-webmaster-tools/

Last edited 3 years ago by Clorith

#4 @joostdevalk
3 years ago

I was reading this (and have spoken to Fabrice a couple of times before about this topic) and only just this afternoon did I realize something and I'm facepalming myself for not thinking about this sooner. Under Settings --> Writing in the WordPress admin, we have "Update Services". The functionality we have there pings the sites in that box, by default rpc.pingomatic.com, with the RSS feed URL of the site. I think we can all agree that's a bit archaic.

Maybe we can add a new version of the pingomatic API to receive the post/page URL instead of the RSS feed and allow search engines to subscribe to that feed and/or to get a similar ping themselves?

I'm not sure who maintains Pingomatic, will ask around.
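For context, the Update Services ping mentioned above is a small XML-RPC call. A minimal sketch in Python of what such a payload could look like, assuming the conventional `weblogUpdates.extendedPing` method with site name, site URL and feed URL as parameters. The site values are placeholders, the helper name is hypothetical, and nothing is sent over the network:

```python
import xmlrpc.client


def build_ping_payload(site_name: str, site_url: str, feed_url: str) -> str:
    """Serialize an extendedPing call to XML without sending it anywhere."""
    params = (site_name, site_url, feed_url)
    return xmlrpc.client.dumps(params, methodname="weblogUpdates.extendedPing")


# Placeholder values; a real site would use its own name, home URL and feed URL.
payload = build_ping_payload(
    "My Site", "https://example.org/", "https://example.org/feed/"
)
```

Note that the payload carries only the site and its feed, not the post or page that changed, which is the gap this comment points at.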

This ticket was mentioned in Slack in #meta by carike. View the logs.


3 years ago

#6 @carike
3 years ago

As far as I remember additional services can be added in wp-admin. I believe it is a comma-separated list.
I do not believe it is strictly necessary to use Pingomatic as a "middle-man" so to speak. :)

#7 @carike
3 years ago

P.S. I believe the pings only go out when posts are created.
As far as I know they do not go out when posts are updated, or when pages are created or updated.

Also, the "Hello World" post is not excluded from the pings. I do not believe that those posts would be valuable to search engines and they are created before website owners have had a meaningful opportunity to make privacy choices with regards to their site.
That should be easy to fix though.

#8 @fabricecanel
3 years ago

Thanks @Clorith for restoring this feature request.

Links: Today, we offer an open source WordPress plugin connected to the Bing Webmaster Tools API, which notifies Bing about content changes for sites adopting this API.

@joostdevalk: We are proposing a shift:

  • From pulling to pushing: it’s not about pulling (RSS feeds or similar), it’s about pushing. Each change is published to the set of search engines that have adopted this design, which is open to all search engines listening for changes. Some throttling logic, as already done in the Bing Webmaster Tools WordPress plugin, avoids a notification on every keystroke or every 5-second save. Pulling requires crawling again and again to check whether the content has changed (most of the time it has not), and it also requires discovering the site and its feeds in the first place. Pushing enables search engines to be seconds to minutes behind a content change, guarantees that they are aware of the change, and minimizes the need to crawl to discover whether something has changed. In case of downtime, search engines will still rely on sitemaps and links to discover new URLs.
  • Open to all search engines: Search Engines having an API can be added to the notification.
  • Enabled by default: We want to lower the complexity for WordPress users to be found and indexed by search engines. If you are a newbie, your new site should be immediately found and indexed, your latest content and your latest typo fixed should be indexed in minutes… not in weeks.
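
The throttling idea in the first bullet can be sketched as follows. This is a hedged illustration in Python, not the logic of the Bing Webmaster Tools plugin: the class name, the interval and the injectable clock are all assumptions made for the example.

```python
import time


class ChangeNotifier:
    """Collapse rapid saves of the same URL into a single notification."""

    def __init__(self, min_interval_seconds: float, clock=time.monotonic):
        self.min_interval = min_interval_seconds
        self.clock = clock
        self.last_sent = {}   # url -> time of the last notification
        self.pending = set()  # urls that changed while throttled

    def record_change(self, url: str) -> bool:
        """Return True if a notification should go out right now."""
        now = self.clock()
        last = self.last_sent.get(url)
        if last is None or now - last >= self.min_interval:
            self.last_sent[url] = now
            self.pending.discard(url)
            return True
        # Too soon after the previous notification: remember it for later.
        self.pending.add(url)
        return False

    def flush_due(self) -> list:
        """Return throttled URLs whose waiting period has elapsed."""
        now = self.clock()
        due = sorted(
            u for u in self.pending
            if now - self.last_sent[u] >= self.min_interval
        )
        for url in due:
            self.last_sent[url] = now
            self.pending.discard(url)
        return due
```

With a 60-second interval, the first save of a post notifies immediately, autosaves during the next minute are only recorded, and one follow-up notification goes out once the interval has passed.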

Benefits: new sites indexed fast, latest content indexed fast, removed content drained fast from search engines... then far less crawling done by search engines to discover whether web sites have changed… which is good for site owners, WordPress platform owners, global warming and polar bears.

#9 follow-up: @carike
3 years ago

Pings are not a pulling technology like RSS feeds. They already push :)
Thing is, pings are made via XML-RPC. Some security plugins, as well as some specialty plugins, turn off XML-RPC, as new technologies like the REST API have largely replaced its original functions and XML-RPC was known to be exploited throughout the years.

From a privacy perspective, I'm going to have to urge opt-in rather than enabled-by-default :)

#10 @Clorith
3 years ago

@fabricecanel Totally agree, this should not be something a user needs to enable on their own if it gets added. Some things that need clarifying still remain though:

While pingomatic is a service that just tells indexers "there's new content, go crawl that site", you are proposing a more direct interaction where you tell the indexers "_here_ is a piece of new content to look at". There are some concerns that need addressed in regards to this.

As @desrosj mentioned, is this an open standard that all indexers have freely available to them, and does not require any additional code to function? This is important, because core should not have to provide special handling depending on which search engine its user favors.

Is this a centralized thing? If it should just work out of the box, there are multiple search engines out there. There's also the concern of legislators who watch for monopoly-benefiting actions and might view WordPress as catering to the big search engines while the small ones aren't hard-coded in because core doesn't know of them, denying them a fair chance (I am not a lawyer, so this bit is just a potential concern I think we need to be wary of in cases like these).

#11 @desrosj
3 years ago

Based on scanning some of the documentation, it seems that in order to submit URLs someone would need to register and receive an API key. If this is a requirement, then I'm not sure that this would be a good fit for WordPress Core at this time.

It also appears that sites are limited to 10,000 submissions per day. This is quite a bit and a large majority of sites will never hit this, but what about the ones that will? How do they decide which pages are prioritized? What happens after the limit is hit and more updates are made?

Last edited 3 years ago by desrosj

#12 follow-up: @fabricecanel
3 years ago

@desrosj, @Clorith : For this suggested feature, we do not plan to leverage the existing proprietary solution we have in place on the Bing server side; please note that our WordPress plugin is already Open Source.

For Oath: we recommend an open approach, open to all search engines having an API to receive notification of change. We still need to have a verification mechanism to verify that a Site A does not publish URLs for a Site B as Site B may not be happy to see search engines crawler visiting URLs that it does not have or didn’t change; that’s where the crawler will help verifying ownership with some key solution design to establish.

For Count of URLs per day: We do plan having a limit, search engines will consider, and throttle as needed on their side per site: at least Search Engines will be aware of the URLs (added, updated, deleted).

#13 follow-up: @carike
3 years ago

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

#14 @fabricecanel
3 years ago

Yes @carike, we plan to offer the ability to disallow URL notifications for the whole site (opt out), for part of the site (somewhat like robots.txt disallow rules), per URL (somewhat like the NOINDEX robots meta tag) or per content type. At the end, this is webmasters and site owners responsibilities to define what should not be indexed. Likewise, even if a URL is not published to search engines, they are good at finding URLs from other pages and may still visit and index it, so it is preferable to add a robots NOINDEX meta tag to content you don't want indexed.
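
The four levels of exclusion listed above (whole site, path rules, single URL, content type) could be combined along these lines. This is only a sketch in Python under stated assumptions: the class, its field names and the prefix-matching rule are hypothetical and are not taken from the proposed patch.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class NotificationRules:
    """Hypothetical per-site settings deciding which changes get pushed."""

    site_opt_out: bool = False
    disallowed_path_prefixes: tuple = ()       # similar in spirit to robots.txt Disallow
    excluded_urls: frozenset = frozenset()     # similar in spirit to a per-page NOINDEX
    excluded_post_types: frozenset = frozenset()

    def should_notify(self, url: str, post_type: str) -> bool:
        if self.site_opt_out:
            return False
        if post_type in self.excluded_post_types:
            return False
        if url in self.excluded_urls:
            return False
        path = urlparse(url).path or "/"
        return not any(path.startswith(p) for p in self.disallowed_path_prefixes)


rules = NotificationRules(
    disallowed_path_prefixes=("/private/",),
    excluded_urls=frozenset({"https://example.org/draft-page/"}),
    excluded_post_types=frozenset({"invoice"}),
)
```

Suppressing a notification only keeps the URL out of the push; as the comment says, a crawler may still discover it through links, so NOINDEX remains the way to keep content out of an index.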

#15 follow-up: @carike
3 years ago

At the end, this is webmasters and site owners responsibilities to define what should not be indexed.

So, we're saying that it is too difficult a decision / too complicated for site owners to opt-in, but we expect them to know exactly how each plugin treats what data and what data might be sensitive and how to exclude it?

Look, I support SEO wholeheartedly. But if site owners are going to have that level of responsibility, it cannot be on-by-default. You have to give them a meaningful choice / actually make sure that they know they have a choice.

#16 in reply to: ↑ 13 @knutsp
3 years ago

Replying to carike:

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

A post type is either public or it's not. Public ones have a View link in posts.php. We had a similar discussion for core xml sitemaps. And the same goes for the REST API, feeds and html pages. If content can be viewed publicly, it may, and probably will, be indexed after some time.

This ticket is about switching from pull to push. Push has been available almost all along through "Submit your URL" forms, but their use has probably approximated zero. This time there is an automated way to do it, through standardized APIs, if I understand it correctly.

Since WordPress core knows what content is to be presumed public and indexed, this can safely be implemented after other issues are solved.

#17 in reply to: ↑ 12 ; follow-up: @desrosj
3 years ago

Replying to fabricecanel:

we recommend an open approach, open to all search engines having an API to receive notification of change.

If I'm interpreting this correctly, Bing has created and implemented an internal API for this, and is recommending other search engines follow their lead. But there is no agreed upon, standardized, and widely adopted way to do this. This would make implementing difficult as each provider could have their own requirements.

We still need to have a verification mechanism to verify that a Site A does not publish URLs for a Site B as Site B may not be happy to see search engines crawler visiting URLs that it does not have or didn’t change; that’s where the crawler will help verifying ownership with some key solution design to establish.

Unless this can happen seamlessly without user interaction, I'm still thinking this is not the best feature to include in WordPress Core directly. I'm not at all against the concept, it just seems this is too early for Core to support it industry wide. If there were an industry wide specification that everyone was adopting and using, this would be much easier.

That said, I'm not against exploring ways to make this easier to accomplish. For example, maybe expanding how update services work.

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

#18 @fabricecanel
3 years ago

Replying to @desrosj

Yes, today the Bing API is Bing specific. This API helped us learn at scale: many large sites have adopted it, and more sites use it directly or via our WordPress plugin. Our feature proposal is different: we are proposing an open standard URL ping mechanism, with an oath token mechanism to verify URL ownership, to notify any Search Engines that want to listen to the data. I am reaching out to major Search Engines to collaborate with them on standardizing the whole story. We should have a common approach to making crawling more efficient and reducing global warming, and Content Management Systems such as WordPress know better than Search Engines what must be crawled and when it must be crawled.

I also agree with you on making it a search-industry-wide approach: it will not be Bing specific. I expect other major Search Engines to contribute, help design it with the WordPress community, and leverage notifications of URLs added, updated, or deleted from day one.

#19 @fabricecanel
3 years ago

On this Earth Day 🌎, I want to share an update related to this feature proposal. It's early, but we should be able to share a proposal/Request For Comments supported by key actors in the industry in the following weeks. The endgame is not only to streamline indexing for all search engines; it is also to remove a useless crawl footprint at the scale of trillions of URLs each year.

This ticket was mentioned in PR #1712 on WordPress/wordpress-develop by pingal.


3 years ago
#20

  • Keywords has-patch has-unit-tests added

This pull request provides implementation for feature request https://core.trac.wordpress.org/ticket/52900 .

Trac ticket: https://core.trac.wordpress.org/ticket/52900

#21 follow-up: @fabricecanel
3 years ago

We did our first pull request https://github.com/WordPress/wordpress-develop/pull/1712 to start integrating our industry wide solution to instantly index WordPress web sites content in all major Search Engines. Looking forward to your feedback and your guidance.

#22 in reply to: ↑ 21 ; follow-ups: @dd32
3 years ago

Replying to fabricecanel:

We did our first pull request

Hi @fabricecanel,

To follow up on some earlier comments here - have you looked into integrating with either http://pingomatic.com/ or http://blo.gs/cloud.php ?

They're admittedly not very modern API's, but benefit from millions of existing sites already making use of them, combined with existing standards such as Sitemaps it can provide what's needed without additional code on the clients side.

There might also be room in the middle to act as a middleman - consuming those API's and relaying it onto Bing and others using the API, or having Pingomatic or blo.gs to relay it onwards to those too.

Before this proposal is really viable to consider for WordPress inclusion (IMHO) there needs to be industry support on it being a generalised system that allows for all players (small and large) to be supported without additional need from site authors or software vendors. A standard is only truly open if multiple vendors support it, otherwise it's just a proprietary format that so happens to be documented publicly.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

This seems like one of the major benefits of centralised open relay services like those mentioned above.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through a HTTP callback, is that the existing methods include a lot of spam and lack of any way to verify that whom sent the request is actually the author of it. Monitoring the Blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying it is who they say they are, it won't prevent spam being pushed into the system.
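
The per-site key check described here could look roughly like the following. This is a hedged Python sketch: the function name, the key file location and the injected `fetch` callable are assumptions for illustration, not the verification code of any search engine.

```python
from urllib.parse import urlparse


def verify_submission(submitted_url, key, key_location, fetch):
    """Accept a submission only if the key file is hosted on the same host
    as the submitted URL and its content matches the submitted key.

    `fetch` is any callable mapping a URL to its response body, or None
    when the file cannot be retrieved (for example a thin HTTP GET wrapper).
    """
    url_host = urlparse(submitted_url).hostname
    key_host = urlparse(key_location).hostname
    if not url_host or url_host != key_host:
        return False  # Site A cannot vouch for URLs on Site B.
    body = fetch(key_location)
    return body is not None and body.strip() == key


# Stand-in for the web: one site publishing its key as a text file.
hosted_files = {"https://example.org/abc123.txt": "abc123\n"}
ok = verify_submission(
    "https://example.org/post/", "abc123",
    "https://example.org/abc123.txt", hosted_files.get,
)
```

The check establishes who controls the host, nothing more: a verified owner can still push low-quality URLs, which is the limitation raised in the paragraph above.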


To throw some ideas in here:

  • What would need to be done to improve the existing pingback services in place?
  • Do they need to be replaced?
  • Do they need to supply extra details to clients to improve the service?

Looking at the output from blo.gs feed:

<weblog name="My Site" url="https://example.org/" service="ping" ts="20210928T08:00:00Z" />

That's not super useful as-is, it doesn't say what changed, but the addition of a link to a) The sitemap and b) the page changed would benefit greatly and provide a lot of what this proposal adds.
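
To make the suggestion concrete, here is a small Python sketch that reads the entry shown above and also accepts two extra attributes. The `sitemap` and `changed` attribute names are hypothetical, invented for this example; the blo.gs feed does not provide them today.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SAMPLE = (
    '<weblog name="My Site" url="https://example.org/" '
    'service="ping" ts="20210928T08:00:00Z" />'
)


def parse_weblog(xml_text: str) -> dict:
    """Extract what a consumer learns from a single feed entry."""
    el = ET.fromstring(xml_text)
    updated = datetime.strptime(el.attrib["ts"], "%Y%m%dT%H:%M:%SZ")
    return {
        "name": el.attrib["name"],
        "url": el.attrib["url"],
        "updated": updated.replace(tzinfo=timezone.utc),
        # Hypothetical additions: absent from the current feed, so None here.
        "sitemap": el.attrib.get("sitemap"),
        "changed": el.attrib.get("changed"),
    }


entry = parse_weblog(SAMPLE)
```

With the current format a consumer only learns that something on the site changed at a given time; the two optional fields would tell it where to look.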

#23 in reply to: ↑ 22 @fabricecanel
3 years ago

Replying to dd32:

Before this proposal is really viable to consider for WordPress inclusion (IMHO) there needs to be industry support on it being a generalised system that allows for all players (small and large) to be supported without additional need from site authors or software vendors.

Next month, we will disclose far more on wide industry support.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

WordPress and other CMS admins will be able to select which actors in the industry they want to notify, including everybody via centralized services if they want to notify all.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through a HTTP callback, is that the existing methods include a lot of spam and lack of any way to verify that whom sent the request is actually the author of it. Monitoring the Blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying it is who they say they are, it won't prevent spam being pushed into the system.

Right... we are fighting spam day and night, and we cannot trust, and often cannot use, pings without verification. It is too easy to spam by pinging random URLs on plenty of websites you don't own.

#24 @fabricecanel
3 years ago

Today I am pleased to report that the latest pull request supports API integration not only for the Bing API but also for the Yandex API and the Baidu API, along with various other improvements, including a README.md! Looking forward to your reviews: https://github.com/WordPress/wordpress-develop/pull/1712.

#25 @fabricecanel
3 years ago

Today, Microsoft Bing and Yandex announced IndexNow https://www.indexnow.org/, the search engine industry's open protocol behind this request. This protocol is already supported by Microsoft Bing and Yandex. With IndexNow, website owners can quickly reflect website changes in search engine results and drive more customers to their websites. With IndexNow, website owners can now provide a clear signal to search engines about their content changes, thus prioritizing crawl for these URLs (see the Microsoft Bing blog).
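
For readers who want to see the shape of a submission, here is a hedged Python sketch that assembles, but does not send, a bulk IndexNow request. The endpoint and the JSON field names (`host`, `key`, `keyLocation`, `urlList`) follow my reading of the public documentation at indexnow.org and should be treated as assumptions to check there; the helper name and the key value are hypothetical.

```python
import json
from urllib.parse import urlparse

# Assumed shared endpoint; participating search engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(urls, key):
    """Return (endpoint, headers, body) for one bulk submission."""
    hosts = {urlparse(u).hostname for u in urls}
    if len(hosts) != 1 or None in hosts:
        raise ValueError("all URLs must belong to a single host")
    host = hosts.pop()
    body = {
        "host": host,
        "key": key,
        # The key is published as a text file on the site for verification.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    headers = {"Content-Type": "application/json; charset=utf-8"}
    return INDEXNOW_ENDPOINT, headers, json.dumps(body)


endpoint, headers, body = build_indexnow_request(
    ["https://example.org/new-post/", "https://example.org/updated-page/"],
    "0123456789abcdef",  # placeholder key
)
```

The resulting body would be sent with an HTTP POST; added, updated and deleted URLs are all reported the same way, and the search engine learns the new state when it fetches the URL.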

#26 in reply to: ↑ 17 @fabricecanel
3 years ago

Replying to desrosj:

Replying to fabricecanel:

we recommend an open approach, open to all search engines having an API to receive notification of change.

If I'm interpreting this correctly, Bing has created and implemented an internal API for this, and is recommending other search engines follow their lead. But there is no agreed upon, standardized, and widely adopted way to do this. This would make implementing difficult as each provider could have their own requirements.

We still need to have a verification mechanism to verify that a Site A does not publish URLs for a Site B as Site B may not be happy to see search engines crawler visiting URLs that it does not have or didn’t change; that’s where the crawler will help verifying ownership with some key solution design to establish.

Unless this can happen seamlessly without user interaction, I'm still thinking this is not the best feature to include in WordPress Core directly. I'm not at all against the concept, it just seems this is too early for Core to support it industry wide. If there were an industry wide specification that everyone was adopting and using, this would be much easier.

Replying to desrosj: Today Microsoft Bing and Yandex came up with this search-industry-wide specification https://www.indexnow.org/, open to all major search engines; it is already supported by Microsoft Bing, Yandex and a few actors in the industry such as Cloudflare, with many more listed as adopting soon (see the blog post).

That said, I'm not against exploring ways to make this easier to accomplish. For example, maybe expanding how update services work.

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

Last edited 3 years ago by dd32

#27 in reply to: ↑ 22 @fabricecanel
3 years ago

Replying to dd32:

Replying to fabricecanel:

We did our first pull request

Hi @fabricecanel,

To follow up on some earlier comments here - have you looked into integrating with either http://pingomatic.com/ or http://blo.gs/cloud.php ?

They're admittedly not very modern API's, but benefit from millions of existing sites already making use of them, combined with existing standards such as Sitemaps it can provide what's needed without additional code on the clients side.

Replying to dd32: As shared in this feature request, today Microsoft Bing and Yandex came up with this search-industry-wide specification https://www.indexnow.org/, open to all major search engines and already supported by Microsoft Bing, Yandex and a few actors in the industry. We need a service that is secure (the key is provided by the site), easy to integrate, scales to the whole industry and all scenarios (web site, CMS, CDN, SEO companies), is targeted at search engines so as to support add, update and delete, and helps search engines minimize crawl load. So, a broader scope. One key scenario for WordPress sites is that most site owners expect to see their content quickly indexed (except in case of a noindex tag) without having to do anything; fast indexing by search engines should be built in. Not all webmasters want to adopt a ping service only to see their content stolen and duplicated all over the internet.

There might also be room in the middle to act as a middleman - consuming those API's and relaying it onto Bing and others using the API, or having Pingomatic or blo.gs to relay it onwards to those too.

Before this proposal is really viable to consider for WordPress inclusion (IMHO) there needs to be industry support on it being a generalised system that allows for all players (small and large) to be supported without additional need from site authors or software vendors. A standard is only truly open if multiple vendors support it, otherwise it's just a proprietary format that so happens to be documented publicly.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

This seems like one of the major benefits of centralised open relay services like those mentioned above.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through a HTTP callback, is that the existing methods include a lot of spam and lack of any way to verify that whom sent the request is actually the author of it. Monitoring the Blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying it is who they say they are, it won't prevent spam being pushed into the system.


To throw some ideas in here:

  • What would need to be done to improve the existing pingback services in place?
  • Do they need to be replaced?
  • Do they need to supply extra details to clients to improve the service?

Looking at the output from blo.gs feed:

<weblog name="My Site" url="https://example.org/" service="ping" ts="20210928T08:00:00Z" />

Replying to dd32: Existing ping services are not open. Users of these ping systems generally ping only a few dominant players. https://www.indexnow.org/ is open: it shares submitted URLs between all search engines that have adopted it. You ping one, you in fact ping all.

That's not super useful as-is, it doesn't say what changed, but the addition of a link to a) The sitemap and b) the page changed would benefit greatly and provide a lot of what this proposal adds.

Replying to dd32: a) Sitemaps are a great way to tell search engines about all the relevant URLs on your site. Search Engines attempt to look at sitemaps once a day. Would you like to wait 1+ days to see your content indexed? IndexNow https://www.indexnow.org/ allows you to have your content indexed now, not in a few days. b) Page changes are not a great solution: we would often have to pull millions of sites to discover whether the content has changed. The right model is IndexNow + Sitemaps: IndexNow to get indexing done fast, and sitemaps to catch up if a ping is missed.
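
The "IndexNow + Sitemaps" model can be illustrated with a short Python sketch: every change is recorded for the sitemap first, and the ping is attempted on a best-effort basis. The function names and the `send_ping` callable are hypothetical, and timestamps are assumed to be in UTC.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def publish_change(url, when, last_modified, send_ping):
    """Record the change for the sitemap, then try to ping.

    Returns True if the ping went out, False if it failed; either way the
    sitemap will carry the new <lastmod>, so crawlers can catch up later.
    """
    last_modified[url] = when
    try:
        send_ping(url)
        return True
    except OSError:  # covers connection errors and timeouts
        return False


def sitemap_xml(last_modified: dict) -> str:
    """Render a minimal sitemap with one <lastmod> per URL (UTC assumed)."""
    rows = [
        f"  <url><loc>{escape(url)}</loc>"
        f"<lastmod>{ts.strftime('%Y-%m-%dT%H:%M:%SZ')}</lastmod></url>"
        for url, ts in sorted(last_modified.items())
    ]
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(rows)
        + "\n</urlset>"
    )
```

If the ping is lost, the change is delayed until the next sitemap fetch rather than missed altogether.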

#29 in reply to: ↑ 15 @fabricecanel
3 years ago

Replying to carike:

At the end, this is webmasters and site owners responsibilities to define what should not be indexed.

So, we're saying that it is too difficult a decision / too complicated for site owners to opt-in, but we expect them to know exactly how each plugin treats what data and what data might be sensitive and how to exclude it?

WordPress users have the ability to exclude content from search engines by setting a NOINDEX meta tag on pages they don't want indexed. This feature is just about getting the content quickly indexed in all search engines adopting https://www.indexnow.org/. Instead of having Search Engine crawlers crawl billions of WordPress pages every day to discover whether the content has changed, this is about guiding search engines to the content which has changed, to speed up indexing. WordPress users are still in control, and more in control, as they don't have to wait for search engines to discover the latest content and reflect it in search results.

Look, I support SEO wholeheartedly. But if site owners are going to have that level of responsibility, it cannot be on-by-default. You have to give them a meaningful choice / actually make sure that they know they have a choice.

#30 in reply to: ↑ 9 @fabricecanel
3 years ago

Replying to carike:

Pings are not a pulling technology like RSS feeds. They already push :)
Thing is, pings are made via XML-RPC. Some security plugins, as well as some specialty plugins, turn off XML-RPC, as new technologies like the REST API have largely replaced its original functions and XML-RPC was known to be exploited throughout the years.

From a privacy perspective, I'm going to have to urge opt-in rather than enabled-by-default :)

Remember that search engines are "sneaky": they can find content by following links, and they apply heuristics on URLs to auto-discover them. Search engines offer an easy way to exclude content, the NOINDEX tag; plus, this feature supports, like the Google sitemaps, the ability to exclude specific paths from the notifications (at least, that is what we propose in the code we suggested for feature consideration; it's not about checking in this code as-is).

#31 follow-up: @dd32
3 years ago

@fabricecanel Can you please edit/delete any comments above that were made in error? It looks like you're quoting entire comments from above without adding any extra context or answers - or you might've, but it's inline with another comment? I'm not sure I can't tell :)

Next month, we will disclose far more on wide industry support.

Thank you! That makes a huge difference, and helps this proposal as it's taken it from being a niche single-supporter use-case (which would not be welcome in WordPress IMHO) to an industry-led proposal which has a much better chance of support.

To provide some kind of code review on the approach taken:

  • I'm still not 100% convinced that having WordPress ping each of the engines individually is ideal, however, it's not the worst.
  • I'm still not 100% convinced that having an API key / verification callback should be allowed.
  • All supported providers would need to be defaulted in core, so as not to preference any given engine
  • No code-based configuration should be required for an end-user, at the most, a textarea with a list of search engine endpoints
  • No "exclude paths" functionality would be supported in a UI, filters should be exposed to add that
  • The WP_IndexNow_Provider class consts should probably be removed and put inline, other than perhaps WP_IndexNow_Provider::SUBMIT_API_PATH.
  • Same for the WP_IndexNow class consts - consting them doesn't help readability here at all, and only makes it harder to parse what the methods do.
  • WP_IndexNow::check_for_indexnow_page() should probably work the same way that the robots.txt loading works, through a rewrite rule. Speaking of, this also shows that indexnow doesn't work for WordPress sites which don't use URL rewrites. Potentially this is a shortcoming in the indexnow standard and it should be providing a unique call-back URI or /.well-known/ url as part of the ping payload, rather than just assuming https://example.com/base64-api-key-here.txt.
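To make the key-file point in the last bullet concrete, here is a minimal Python sketch (not the patch's PHP code) of the two URLs involved: the root-level key file that is assumed by default, and a ping URL that passes an explicit key location instead. The `keyLocation` query parameter is taken from the published IndexNow documentation as I understand it; the endpoint host, the key value and both function names are placeholders for illustration only.

```python
from typing import Optional
from urllib.parse import urlencode, urlsplit


def default_key_url(site_url: str, key: str) -> str:
    """Root-level key file URL, i.e. the https://example.com/<key>.txt form."""
    parts = urlsplit(site_url)
    return f"{parts.scheme}://{parts.netloc}/{key}.txt"


def build_ping_url(endpoint: str, changed_url: str, key: str,
                   key_location: Optional[str] = None) -> str:
    """Build a GET ping URL; key_location is only added when it is given."""
    params = {"url": changed_url, "key": key}
    if key_location is not None:
        # Assumption: the receiving engine accepts a keyLocation parameter.
        params["keyLocation"] = key_location
    return f"{endpoint}?{urlencode(params)}"
```

A site without URL rewrites could, under these assumptions, point `key_location` at a query-string URL that serves the key rather than relying on the root-level `.txt` path.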

Finally:

  • I think this should be developed as a plugin first, and then proposed to WordPress core as a feature plugin, to allow development of it to occur separately and then a suggestion to add it to core once feature complete. That would also allow site owners to opt-in to using this prior to WordPress fully implementing it (Which would be in WordPress 6.0 at the absolute earliest, Q2ish 2022 at a guess)

#32 in reply to: ↑ 31 @fabricecanel
3 years ago

Replying to dd32:

@fabricecanel Can you please edit/delete any comments above that were made in error? It looks like you're quoting entire comments from above without adding any extra context or answers - or you might've, but it's inline with another comment? I'm not sure I can't tell :)

thanks for guiding me in WordPress world :)

  • I'm still not 100% convinced that having WordPress ping each of the engines individually is ideal, however, it's not the worst.

Good news: we listened to feedback such as yours, and we agree to move away from pinging all engines. Starting next month (maybe sooner), you will have to ping only one engine and we will ping the others (a requirement of the protocol to make it open, allowing new and small players to participate).

  • I'm still not 100% convinced that having an API key / verification callback should be allowed.

It allows webmasters/CMSs to control what is crawled on the site. Bad actors should not be able to play at scale.

  • All supported providers would need to be defaulted in core, so as not to preference any given engine

Agree this is why we open the protocol, you ping one, you ping all: no preference. Protocol is open to all search engines having a presence in a market.

  • No "exclude paths" functionality would be supported in a UI, filters should be exposed to add that

This feature is submitted for consideration for adoption. Although we wrote some code, the WordPress experts should decide the best path to support IndexNow.

Finally:

  • I think this should be developed as a plugin first, and then proposed to WordPress core as a feature plugin, to allow development of it to occur separately and then a suggestion to add it to core once feature complete. That would also allow site owners to opt-in to using this prior to WordPress fully implementing it (Which would be in WordPress 6.0 at the absolute earliest, Q2ish 2022 at a guess)

IndexNow is an open protocol. @joostdevalk, feel free to adopt it in Yoast and other tools, learn from it, and suggest improvements for core.

#33 @TweetyThierry
3 years ago

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

No Search Index API without a site being verified which requires a Google account and oauth which requires a GCP project.

There are a few points about what should or should not be sent for indexing; how about relying on what is available in WordPress XML sitemaps? I imagine that developers who are purposefully including/excluding URLs from their sitemaps would want that to be applicable to an indexing push API too (after all, it serves the same purpose, unlike the WP REST API).
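A rough sketch of that idea, in Python rather than WordPress PHP: only URLs that already appear in the site's sitemap are eligible for a push notification. The function name and the plain lists of URLs are hypothetical; a real implementation would read the sitemap entries from WordPress itself.

```python
from typing import Iterable, List


def urls_to_submit(changed_urls: Iterable[str],
                   sitemap_urls: Iterable[str]) -> List[str]:
    """Keep changed URLs that are listed in the sitemap, in order, without duplicates."""
    allowed = set(sitemap_urls)
    seen = set()
    result = []
    for url in changed_urls:
        if url in allowed and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

With this approach, anything a developer has filtered out of the sitemap is never pushed, so there is a single place to control both.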

#34 @joostdevalk
3 years ago

We've not implemented this in Yoast SEO yet for the same reason that @dd32 is resisting implementing it in WordPress core: so far it's just Yandex and Bing, which for most sites is a negligible part of their traffic.

As much as I appreciate @fabricecanel and his team at Microsoft, I want to see proof of this actually improving crawling for sites. I want to be sure that it actually improves either their search engine traffic or their bandwidth bills, before we add this to Yoast SEO. The cost of pinging multiple search engines when you publish a post or page is small on a per blog basis but it's non-zero if you consider doing it for millions of sites.

As soon as I see actual data on this making crawling better for sites (so Bing would crawl it less, not more) or it actually improving traffic for sites (which seems like that should not be possible) I'm happy to implement this in Yoast SEO. I like the standard, especially as it's not based on oAuth for a change.

#35 @fabricecanel
3 years ago

Time to share an update with all, and with @dd32 and @joostdevalk. First, today Google announced that it will be testing the IndexNow protocol. Second, more search engines are planning and working to adopt this open protocol. As mentioned before, starting later this month a website will have to ping only one engine, as each engine will re-ping the others. Third, as suggested by @dd32, we plan to offer an IndexNow WordPress plugin this month. I have heard that some WordPress SEO solutions are also planning to integrate it. This will let us learn and collect feedback over the following months and, once ready, maybe enable it in core.

#36 @fabricecanel
3 years ago

Time to share a new update with all, and with @dd32 and @joostdevalk: today we released the WordPress IndexNow plugin https://wordpress.org/plugins/indexnow/, which enables automated submission of URLs from WordPress sites using this plugin to multiple search engines, without the need to register and verify your site with them. Once installed, the plugin automatically generates and hosts the API key on your site. It detects page creation/update/deletion in WordPress and automatically submits the URLs in the background.
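For readers unfamiliar with what such a background submission carries, the sketch below builds a bulk JSON body from a batch of changed URLs. It is written in Python for illustration and is not the plugin's code; the `host`, `key` and `urlList` field names follow the published IndexNow documentation as I understand it, and the function name is made up.

```python
import json
from typing import Iterable
from urllib.parse import urlsplit


def build_bulk_payload(key: str, changed_urls: Iterable[str]) -> str:
    """Return a JSON body listing the changed URLs of a single host."""
    unique_urls = list(dict.fromkeys(changed_urls))  # drop duplicates, keep order
    if not unique_urls:
        raise ValueError("no URLs to submit")
    host = urlsplit(unique_urls[0]).netloc
    # Only URLs on the same host as the first one are kept in this sketch.
    url_list = [u for u in unique_urls if urlsplit(u).netloc == host]
    return json.dumps({"host": host, "key": key, "urlList": url_list})
```

The resulting string would be sent as the body of a single POST request, so a burst of edits costs one outgoing call instead of one per URL.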

By releasing this plugin, we aim not only to benefit the WordPress websites adopting it right away, but also to learn and tweak as needed, in order to someday release IndexNow in WordPress core to benefit all websites and all existing and upcoming search engines adopting IndexNow.

SEO companies can reuse this IndexNow plugin code and logic in their SEO plugins/solutions.

#37 @fabricecanel
17 months ago

It's time for an update to share with everyone, especially @dd32 and @joostdevalk. This week, I had the opportunity to give a keynote speech at the https://www.pubcon.com/ SEO conference. During my presentation, I shared a slide https://twitter.com/metrony/status/1630585086998396929 that showcased the success of IndexNow, which has been adopted by 20 million websites. Duda, a Content Management System, has already integrated IndexNow natively, and other CMSs are currently working to do the same. For example, Wix is actively working on integrating it this semester. In addition, I revealed that at least two major search engines are also planning to adopt IndexNow this semester. I encourage you to watch this two-minute video, which highlights the benefits of IndexNow for the entire ecosystem: https://blogs.bing.com/BingBlogs/media/WebmasterBlog/2022/IndexNow.mp4. We have the code ready and would like to discuss and integrate into WordPress Core, which will benefit the entire web ecosystem by improving crawl efficiency, reducing costs, and decreasing CO2 emissions.

#38 @fabricecanel
12 months ago

@dd32 and @joostdevalk : I am pleased to relay that Naver, the top South Korea search engine has started supporting IndexNow https://searchengineland.com/naver-korean-search-engine-now-supports-indexnow-429880. We kindly request your assistance in reviewing our code and collaborating to have WordPress supporting IndexNow, leading to faster indexing and reduced crawl loads.

#39 @fabricecanel
9 months ago

@dd32 and @joostdevalk: I am pleased to relay that Wix, the third CMS after WordPress and Shopify per https://w3techs.com/technologies/overview/content_management, has adopted IndexNow https://www.wix.com/seo/learn/resource/wix-joins-bing-indexnow. I again kindly request your assistance in reviewing our code and collaborating to have WordPress support IndexNow, leading the Web to a more open Internet (by notifying one IndexNow API you notify all IndexNow search engines), lower indexing latency and reduced crawl loads.

#40 @fabricecanel
7 months ago

I wish you happy Holidays and hope that WordPress Core will join the growing list of top websites, CMS, CDN, web hosting companies and search engines using already or planning to adopt IndexNow in 2024. By crawling more effectively, Search Engines can significantly reduce at scale the costs associated with crawling websites and reduce the environmental impact.

@dd32: I heard in a recent conversation that you may have some concerns about the IndexNow key. Let me share some thoughts. First, the IndexNow key is required to verify that the submitting site owns the URLs, which limits noise. Second, the IndexNow key can be unique per website and can change quite often over time, limiting the risk of it being stolen. Third, the IndexNow key can be secured by limiting its visibility to the crawler of the IndexNow API you call (we will document best practices in the protocol). We are always listening and are considering extending the standard to support private keys (we already support that to share data between search engines).

#41 @desrosj
7 months ago

Thanks for the update @fabricecanel.

For me, I still have a few concerns.

One blocker for inclusion, in my opinion, is not the security, flexibility, or the ability to rotate the API key (those are all important and still factors in the equation). It's the fact that the user or site owner needs to be concerned with the API key at all. If it can work without a site owner needing to know what IndexNow is, I would be open to discussing further.

Based on the photos of your slides you linked on Twitter/X, it appears that most of the major WordPress SEO plugins have already adopted this. In my testing, it actually works quite nicely. I'm not sure how the underlying code works yet, but it does seem like it's implemented in a "set and forget" way.

The second item is that I'd still like to see whether Google will implement IndexNow or not. While they're not the only search engine, my understanding is that they still have 80-90%+ of the market share, depending on which data set you use. If they choose not to implement, then I'm a bit concerned about the long-term stability of this initiative.

#42 @fabricecanel
7 months ago

Thanks for the feedback @desrosj.

Indeed, Site owners should not need to worry about IndexNow, just as they do not need to worry about sitemaps today. Technical SEO should be handled by the Content Management System, so that site owners can focus on creating high-quality content. For IndexNow, the system can generate the key for them, and they do not have to do it themselves. We have proposed a pull request https://github.com/WordPress/wordpress-develop/pull/1712 that does this, and also provides an API to change the key on demand or periodically, but this can be done by the system, not necessarily by the site owner.

Thank you for testing IndexNow. Adding relevant URLs does indeed work nicely; recrawling known URLs is something we are targeting to optimize in early 2024. The code of our open-source WordPress plugin is available here: https://plugins.trac.wordpress.org/browser/indexnow/

Regarding adoption by Google and other major players, we plan to strengthen the case that the IndexNow signal can really help all actors by sharing, in 2024, data on the crawl efficiency improvements we observe. IndexNow already contributes to 12% of new URLs clicked at Bing.

#43 @jorbin
6 months ago

  • Keywords close added

For as long as this requires an API key and the current plugin lacks any serious adoption, this feels like it's clearly plugin territory. I suggest it be closed as maybelater.

#44 @johnbillion
6 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to maybelater
  • Status changed from new to closed

I concur. In addition, IndexNow is supported by most of the major SEO plugins now which means those users who want to take advantage of it have several options.

In its current state this doesn't feel like it benefits the majority of WordPress users, so let's close this as maybelater. It can be revisited and discussion can continue while the ticket is closed, but this isn't on the roadmap for the foreseeable future.

#45 @fabricecanel
6 months ago

It's good to hear that you're open to looking into IndexNow again later on. But to make sure you make the right decision, let me share more information.

@jorbin: can you please share your concerns about the API key, so that we can record them and think about them? To be clear: the key comes from the website, which in this case is WordPress, not from the search engines. The key helps a search engine verify that the URLs it receives are from your website. The other CMSs and big websites I work with don't have concerns about the key.

Secondly, @johnbillion: yes, IndexNow is supported by most popular SEO plugins like Yoast, AIOSEO, Rank Math and SEOPress, and we also offer an open-source IndexNow plugin for the community. However, IndexNow should not be offered only as a plugin: most website owners have very little knowledge of the system, and the CMS should handle the crawling/indexing part of SEO by default. Consider that Google helped add sitemaps by default to WordPress Core; IndexNow should also be part of core to guarantee data freshness. CMSs like Wix and Duda have already integrated IndexNow in their core and enabled it by default, and other content management systems are working to enable IndexNow too.

#46 @peterwilsoncc
6 months ago

I've been considering this over the summer break and come to the same conclusion as @jorbin and @johnbillion that this is not ready for inclusion in WordPress Core.

For me, the big problem is that Google does not yet support IndexNow. Given they have over 90% of the market share for search according to Statcounter the benefit to site owners is outweighed by the maintenance burden for core contributors.

Google have gone their own way with the Google Indexing API but it does require an API key. The docs indicate this endpoint only works with limited structured data.

#47 @joostdevalk
6 months ago

I disagree with the decisions being proposed here and I think we should reverse them.

I think this is a very good thing to support because it's a much more efficient way of dealing with search engine indexation. Right now, search engines literally check pages every day to see whether they've changed. This could change that behavior to something more sensible. We should all want this process to become more efficient. If you don't agree with me, or have doubts about what I'm saying, I'd urge you to look at a site's server logs for a few hours, filtered by bots, and you'll know why I think this is needed.

I think there have been a couple of miscommunications here:

  1. This doesn't require an API key. It requires you to generate a key for a site, but you can generate that key yourself, so you can generate a key and use that to sign your requests, then have the key in a file on your domain to prove you own the domain. See the documentation, it's not an API key in the classic sense.
  2. This API is not comparable to Google's Indexing API. If Google is ever going to use an API for this sort of thing, I'd expect it to be IndexNow.
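The self-generated key described in point 1 can be sketched as follows. This is an illustrative Python outline, not the proposed core code: it assumes the key is a random string the site creates on its own, published in a text file on the same domain, and checked by the receiving engine by fetching that file. The `fetch` callable and all function names are hypothetical, and the fetch is injected so the sketch makes no network calls.

```python
import secrets
from typing import Callable, Optional


def generate_site_key() -> str:
    """Create a random 32-character hexadecimal key; no third party issues it."""
    return secrets.token_hex(16)


def verify_ownership(host: str, key: str,
                     fetch: Callable[[str], Optional[str]]) -> bool:
    """What a receiving engine might do: fetch the key file and compare its content."""
    body = fetch(f"https://{host}/{key}.txt")
    return body is not None and body.strip() == key
```

Because the site mints the key itself, rotating it amounts to generating a new value and publishing a new file, with no account or registration step.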

The fact that Google doesn't use IndexNow is a pity. I would like to see that change. I think the chance of that changing if/when WordPress includes IndexNow becomes much bigger. But even when it doesn't, this already improves the efficiency of crawling for many other search engines; their market share doesn't matter: they already all crawl your site. If we can improve their efficiency, that's good for the entire ecosystem.

That's also why I disagree with @johnbillion; I think this does benefit the majority of WordPress users.

So, I would say: reopen and include on the roadmap ASAP.

#48 @matt
6 months ago

I know it's not part of WP, but is part of Foundation, but let's just decide how to extend this into Pingomatic and take the opportunity to spruce that up a bit too, then there's just one external HTTP call from WP to something we run, and it can fan out from there.

Ideally the ping includes the needed context and says send this to the recommended defaults, so if new challenger engines (Bing, Perplexity) want access to this they can start getting everyone immediately.
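The fan-out idea can be sketched as below. Everything here is hypothetical: nothing in this ticket specifies how Pingomatic would implement it, and the `send` callable stands in for whatever HTTP call the relay would make to each engine.

```python
from typing import Callable, Dict, Iterable


def fan_out(ping: dict, endpoints: Iterable[str],
            send: Callable[[str, dict], bool]) -> Dict[str, bool]:
    """Forward one incoming ping to every endpoint and record which calls succeeded."""
    results = {}
    for endpoint in endpoints:
        try:
            results[endpoint] = bool(send(endpoint, ping))
        except Exception:
            # One failing engine should not stop delivery to the others.
            results[endpoint] = False
    return results
```

In this arrangement WordPress makes a single outgoing call, and adding a new engine only means extending the relay's endpoint list.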

#49 @matt
6 months ago

It's fine that we use Trac for tracking if there's not a separate tracker for Pingomatic.

#50 @johnbillion
6 months ago

  • Keywords reporter-feedback close removed
  • Milestone set to Future Release
  • Resolution maybelater deleted
  • Status changed from closed to reopened

Reopening per the above. If this is implemented via Pingomatic then perhaps there will be nothing to do in core, but we'll track it via this ticket anyway.

#51 @fabricecanel
5 months ago

How can we get this request for implementation in Pingomatic reviewed?

#52 @fabricecanel
2 months ago

Yesterday, Google announced that, working with the WordPress developer community, WordPress 6.5 has gained lastmod dates in sitemap files. This is great news, and I thanked them; it will help crawlers back off and focus their crawling. But, as highlighted in my answer, sitemaps are a daily process. The right setup is IndexNow (for real-time updates) plus sitemaps with lastmod (daily, catch-up mode) to ensure comprehensive and fresh coverage.
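As an illustration of the catch-up role of lastmod, the sketch below compares sitemap lastmod values with the time each URL was last successfully notified, and returns the URLs whose notification was missed. The two dictionaries and the function name are assumptions made for this example; they do not describe how any search engine or WordPress actually stores this data.

```python
from datetime import datetime
from typing import Dict, List


def catch_up(sitemap_lastmod: Dict[str, datetime],
             last_notified: Dict[str, datetime]) -> List[str]:
    """URLs modified after their last notification, or never notified at all."""
    return sorted(
        url
        for url, modified in sitemap_lastmod.items()
        if url not in last_notified or last_notified[url] < modified
    )
```

Real-time pings handle the common case, and a daily pass like this one picks up whatever slipped through.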

Owners of WordPress sites should expect their most recent updates to be reflected in search results without a 24-hour delay. WordPress/Pingomatic should support IndexNow to take control of crawl and enable real-time indexing.

In the past months Yep/Ahrefs, xenforo and small and leading websites have adopted IndexNow.

@matt, @johnbillion, @joostdevalk: How can we get Pingomatic to support IndexNow, to help the entire industry get content indexed faster while using fewer resources? We would be pleased to deliver the code and more.

#53 @stox
5 weeks ago

Hey everyone, Patrick Stox here from ahrefs.com / Yep.com. I just wanted to say that we're also interested in making this happen. It would lead to a substantial amount of resources saved for us and hosts. If I can do anything to help move this along, please let me know.
