WordPress.org

Make WordPress Core

Opened 7 months ago

Last modified 3 days ago

#52900 new feature request

Instantly index WordPress web sites content in Search Engines

Reported by: fabricecanel Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: General Keywords: reporter-feedback has-patch has-unit-tests
Focuses: Cc:

Description

Every day, search engines crawl billions of WordPress URLs to keep their search indexes fresh. They crawl to get the latest content, discover new outlinks, or verify that URLs already indexed are still valid and not dead links. Unfortunately, search engines are generally inefficient at crawling, as they don't know whether content has changed, and most web pages do not change often. Search engine crawling activity adds to bandwidth and CPU consumption.

At Microsoft Bing, we believe in a fundamental shift in how search engines learn about new, updated, and deleted content across the web. Instead of crawling often to detect whether content has changed, content management systems should notify search engines of content changes to limit crawling and keep the search index fresh. To support this transformation, since February 2019 we have offered a URL and Content Submission API allowing web site owners to publish thousands to millions of URLs or pieces of content per day to Bing, and since July 2020 we have offered a WordPress plugin that submits content immediately to Bing's search index, no code required.

Today, we propose integrating into WordPress Core the ability to notify not only Bing, but all participating search engines, of any WordPress URL or content change. Microsoft would develop and maintain the open-source code in close collaboration with WordPress; WordPress would approve, validate, and include the code.

Behind the scenes, WordPress will automatically submit URLs or content, ensuring that WordPress content is always fresh in the search engines; in exchange, search engines will limit or stop crawling WordPress sites. Site owners will have the ability to opt out or to select the content they don't want to submit to search engines.

Change History (34)

This ticket was mentioned in Slack in #forums by sergey. View the logs.


7 months ago

#2 @desrosj
7 months ago

  • Keywords reporter-feedback added

Hi @fabricecanel,

Welcome to Trac!

Can you provide more information around how this API works? Is there documentation you can point to?

Also, is this a standardized approach across the various search engine companies? Or will the implementation differ between providers?

#3 @Clorith
7 months ago

Hiya @fabricecanel, and welcome to the WordPress trac!

I'd also just like to add, to clear up any potential confusion, that your initial ticket was flagged and removed because it read like many other spam posts we receive. On closer inspection we could see that it was not, and we have restored it, but as you can see it now has a new ticket ID.

For reference, the plugin mentioned is https://wordpress.org/plugins/bing-webmaster-tools/

Last edited 7 months ago by Clorith (previous) (diff)

#4 @joostdevalk
7 months ago

I was reading this (and have spoken to Fabrice a couple of times before about this topic) and only just this afternoon did I realize something and I'm facepalming myself for not thinking about this sooner. Under Settings --> Writing in the WordPress admin, we have "Update Services". The functionality we have there pings the sites in that box, by default rpc.pingomatic.com, with the RSS feed URL of the site. I think we can all agree that's a bit archaic.

Maybe we can add a new version of the pingomatic API to receive the post/page URL instead of the RSS feed and allow search engines to subscribe to that feed and/or to get a similar ping themselves?

I'm not sure who maintains Pingomatic, will ask around.

This ticket was mentioned in Slack in #meta by carike. View the logs.


7 months ago

#6 @carike
7 months ago

As far as I remember additional services can be added in wp-admin. I believe it is a comma-separated list.
I do not believe it is strictly necessary to use Pingomatic as a "middle-man" so to speak. :)

#7 @carike
7 months ago

P.S. I believe the pings only go out when posts are created.
As far as I know they do not go out when posts are updated, or when pages are created or updated.

Also, the "Hello World" post is not excluded from the pings. I do not believe that those posts would be valuable to search engines and they are created before website owners have had a meaningful opportunity to make privacy choices with regards to their site.
That should be easy to fix though.

#8 @fabricecanel
7 months ago

Thanks @Clorith for restoring this feature request.

Links: Today, we offer an open-source WordPress plugin connected to the Bing Webmaster Tools API to notify Bing about content changes for sites adopting this API.

@joostdevalk: We are proposing a shift:

  • From pulling to pushing: it's not about pulling (RSS feeds or similar), it's about pushing, publishing each change, with some throttling logic as already done in the Bing Webmaster Tools WordPress plugin, to avoid a notification on every keystroke or every five-second save, to the set of search engines that have adopted this design, which is open to all search engines listening for changes. Pulling requires crawling, crawling, and crawling again to check whether the content has changed (most of the time it hasn't); pulling also requires discovering the site and its feeds in the first place. Pushing enables search engines to be seconds to minutes behind a content change, guarantees that search engines are aware of the change, and minimizes the crawling needed to discover whether something has changed. In case of downtime, search engines will still rely on sitemaps and links to discover new URLs.
  • Open to all search engines: any search engine that has an API to receive notifications can be added.
  • Enabled by default: we want to lower the complexity for WordPress users to be found and indexed by search engines. If you are a newbie, your new site should be immediately found and indexed; your latest content and your latest typo fix should be indexed in minutes… not in weeks.
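To make the push model above concrete, the throttling logic mentioned (avoiding a ping on every keystroke or autosave) could be sketched roughly like this. This is an illustrative Python sketch, not code from the actual plugin; the class and parameter names are hypothetical:

```python
import time

class PushNotifier:
    """Collects changed URLs and flushes them in batches, so rapid
    successive edits (e.g. an autosave every few seconds) produce one
    notification rather than one ping per keystroke."""

    def __init__(self, flush_interval=300):
        self.flush_interval = flush_interval  # seconds between pushes
        self.pending = set()                  # de-duplicated changed URLs
        self.last_flush = 0.0

    def record_change(self, url):
        self.pending.add(url)

    def maybe_flush(self, now=None):
        """Return the batch to submit if the interval has elapsed, else None."""
        now = time.time() if now is None else now
        if self.pending and now - self.last_flush >= self.flush_interval:
            batch, self.pending = sorted(self.pending), set()
            self.last_flush = now
            return batch
        return None
```

The set handles de-duplication, so repeated saves of the same post collapse into a single URL per flush window.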

Benefits: new sites indexed fast, the latest content indexed fast, removed content drained fast from search engines... and then far less crawling done by search engines to discover whether web sites have changed… which is good for site owners, WordPress platform owners, global warming, and polar bears.

#9 follow-up: @carike
7 months ago

Pings are not a pulling technology like RSS feeds. They already push :)
The thing is, pings are made via XML-RPC. Some security plugins, as well as some specialty plugins, turn off XML-RPC, as newer technologies like the REST API have largely replaced its original functions and XML-RPC has been exploited over the years.

From a privacy perspective, I'm going to have to urge opt-in rather than enabled-by-default :)

#10 @Clorith
7 months ago

@fabricecanel Totally agree, this should not be something a user needs to enable on their own if it gets added. Some things that need clarifying still remain though:

While Pingomatic is a service that just tells indexers "there's new content, go crawl that site", you are proposing a more direct interaction where you tell the indexers "_here_ is a piece of new content to look at". There are some concerns that need to be addressed in this regard.

As @desrosj mentioned, is this an open standard that all indexers have freely available to them, and does not require any additional code to function? This is important, because core should not have to provide special handling depending on which search engine its user favors.

Is this a centralized thing? If it should just work out of the box, then there are multiple search engines out there. There's also the concern of legislatures that watch for monopoly-benefiting actions, and that might view WordPress catering to the big search engines, while the small ones aren't hard-coded in because core doesn't know of them, as denying them a fair chance (I am not a lawyer, so this is just a potential concern I think we need to be wary of in cases like these).

#11 @desrosj
7 months ago

Based on scanning some of the documentation, it seems that in order to submit URLs someone would need to register and receive an API key. If this is a requirement, then I'm not sure that this would be a good fit for WordPress Core at this time.

It also appears that sites are limited to 10,000 submissions per day. This is quite a lot, and a large majority of sites will never hit it, but what about the ones that will? How are pages prioritized? What happens after the limit is hit and more updates are made?

Last edited 7 months ago by desrosj (previous) (diff)

#12 follow-up: @fabricecanel
7 months ago

@desrosj, @Clorith : For this suggested feature, we do not plan to leverage the existing proprietary solution we have in place on the Bing server side; please note that our WordPress plugin is already Open Source.

For OAuth: we recommend an open approach, open to all search engines that have an API to receive notifications of change. We still need a verification mechanism to ensure that Site A does not publish URLs for Site B, as Site B may not be happy to see search engine crawlers visiting URLs that it does not have or that didn't change; that's where the crawler will help verify ownership, with a key-based solution design still to be established.

For the count of URLs per day: we do plan to have a limit; search engines will consider it and throttle as needed on their side, per site. At the very least, search engines will be aware of the URLs (added, updated, deleted).

#13 follow-up: @carike
7 months ago

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

#14 @fabricecanel
7 months ago

Yes @carike, we plan to offer the ability to disallow URL notifications for the whole site (opt out), for part of the site (somewhat like robots.txt disallow rules), per URL (somewhat like the NOINDEX robots meta tag), or per content type. In the end, it is the webmasters' and site owners' responsibility to define what should not be indexed. Likewise, even if a URL is not published to search engines, search engines are good at finding URLs from other pages and may visit and index it, so it is preferable to put a NOINDEX robots meta tag on any content you don't want indexed.
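As an illustration of the opt-out mechanics described above, the decision of whether to notify search engines about a change could be sketched like this. This is a hedged Python sketch; the setting names (`DISALLOWED_PREFIXES`, `DISALLOWED_POST_TYPES`) are hypothetical, not actual WordPress options:

```python
from urllib.parse import urlparse

# Hypothetical site-level settings: opt out entirely, exclude by
# path prefix (robots.txt-style), or exclude by post type.
DISALLOWED_PREFIXES = ["/private/", "/invoices/"]
DISALLOWED_POST_TYPES = {"invoice", "draft"}

def should_notify(url, post_type="post", opt_out=False):
    """Decide whether a changed URL should be pushed to search engines."""
    if opt_out or post_type in DISALLOWED_POST_TYPES:
        return False
    path = urlparse(url).path
    # Suppress notifications for any URL under a disallowed prefix.
    return not any(path.startswith(p) for p in DISALLOWED_PREFIXES)
```

Note that, as the comment above stresses, suppressing the notification does not prevent indexing; content that must stay out of search results still needs a NOINDEX tag.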

#15 follow-up: @carike
7 months ago

At the end, this is webmasters and site owners responsibilities to define what should not be indexed.

So, we're saying that it is too difficult a decision / too complicated for site owners to opt-in, but we expect them to know exactly how each plugin treats what data and what data might be sensitive and how to exclude it?

Look, I support SEO wholeheartedly. But if site owners are going to have that level of responsibility, it cannot be on-by-default. You have to give them a meaningful choice / actually make sure that they know they have a choice.

#16 in reply to: ↑ 13 @knutsp
7 months ago

Replying to carike:

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

A post type is either public or it's not. Public ones have a View link in posts.php. We had a similar discussion for core xml sitemaps. And same goes for REST API, feeds and html pages. If they can be viewed publicly it may, and will probably, be indexed after some time.

This ticket is about switching from pull to push. Push has been available almost all along through "Submit your URL" forms, but their use has probably approximated zero. This time there is an automated way to do it, through standardized APIs, if I understand it correctly.

Since WordPress core knows what content is to be presumed public and indexed, this can safely be implemented after other issues are solved.

#17 in reply to: ↑ 12 ; follow-up: @desrosj
7 months ago

Replying to fabricecanel:

we recommend an open approach, open to all search engines having an API to receive notification of change.

If I'm interpreting this correctly, Bing has created and implemented an internal API for this, and is recommending other search engines follow their lead. But there is no agreed upon, standardized, and widely adopted way to do this. This would make implementing difficult as each provider could have their own requirements.

We still need to have a verification mechanism to verify that a Site A does not publish URLs for a Site B as Site B may not be happy to see search engines crawler visiting URLs that it does not have or didn’t change; that’s where the crawler will help verifying ownership with some key solution design to establish.

Unless this can happen seamlessly without user interaction, I'm still thinking this is not the best feature to include in WordPress Core directly. I'm not at all against the concept, it just seems this is too early for Core to support it industry wide. If there were an industry wide specification that everyone was adopting and using, this would be much easier.

That said, I'm not against exploring ways to make this easier to accomplish. For example, maybe expanding how update services work.

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

#18 @fabricecanel
7 months ago

Replying to @desrosj

Yes, today the Bing API is Bing-specific. This API helped us learn at scale: many large sites have adopted it, and more sites use it directly or via our WordPress plugin. Our feature proposal is different: we are proposing an open, standard URL ping mechanism, with an OAuth token mechanism to verify URL ownership, to notify any search engines that want to listen to the data. I am reaching out to the major search engines to collaborate with them on standardizing the whole story. We should have a common approach to making crawl more efficient to reduce global warming, and content management systems such as WordPress know better than search engines what must be crawled and when.

I also agree with you on making it a search-industry-wide approach. It will not be Bing-specific; I expect other major search engines to contribute, help design it with the WordPress community, and leverage notifications of URLs added, updated, and deleted on day one.

#19 @fabricecanel
6 months ago

On this Earth Day 🌎, I want to share an update related to this feature proposal. It's early, but we should be able to share a proposal/Request For Comments supported by key actors in the industry in the coming weeks. The endgame is not only to streamline indexing for all search engines; it is also to remove useless crawl footprint at the scale of trillions of URLs each year.

This ticket was mentioned in PR #1712 on WordPress/wordpress-develop by pingal.


4 weeks ago

  • Keywords has-patch has-unit-tests added

This pull request provides an implementation of feature request https://core.trac.wordpress.org/ticket/52900.

Trac ticket: https://core.trac.wordpress.org/ticket/52900

#21 follow-up: @fabricecanel
3 weeks ago

We made our first pull request https://github.com/WordPress/wordpress-develop/pull/1712 to start integrating our industry-wide solution for instantly indexing WordPress web site content in all major search engines. Looking forward to your feedback and your guidance.

#22 in reply to: ↑ 21 ; follow-ups: @dd32
3 weeks ago

Replying to fabricecanel:

We did our first pull request

Hi @fabricecanel,

To follow up on some earlier comments here - have you looked into integrating with either http://pingomatic.com/ or http://blo.gs/cloud.php ?

They're admittedly not very modern APIs, but they benefit from millions of existing sites already making use of them; combined with existing standards such as sitemaps, they can provide what's needed without additional code on the client side.

There might also be room in the middle for a middleman - consuming those APIs and relaying the data on to Bing and others using the API, or having Pingomatic or blo.gs relay it onwards to those too.

Before this proposal is really viable to consider for WordPress inclusion (IMHO), there needs to be industry support for it being a generalised system that allows all players (small and large) to be supported without additional effort from site authors or software vendors. A standard is only truly open if multiple vendors support it; otherwise it's just a proprietary format that happens to be documented publicly.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

This seems like one of the major benefits of centralised open relay services like those mentioned above.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through an HTTP callback, is that the existing methods include a lot of spam and lack any way to verify that whoever sent the request is actually the author of it. Monitoring the blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying that senders are who they say they are, it won't prevent spam being pushed into the system.
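For reference, the per-site key verification scheme discussed here (a key hosted by the site, fetched back by the engine over HTTP) can be sketched like this. This is an illustrative Python sketch, and the key-file location (a `{key}.txt` file at the site root) is an assumption based on the discussion, not confirmed implementation detail:

```python
from urllib.parse import urlparse

def key_file_location(url, key):
    """Where the engine would expect the key file to live: at the host
    root, named after the key itself (assumed scheme)."""
    host = urlparse(url).netloc
    return f"https://{host}/{key}.txt"

def verify_ownership(url, key, fetch):
    """A receiving engine checks that the submitting site really hosts
    the key, by fetching the key file and comparing its contents.
    `fetch` is injected so the sketch stays network-free."""
    try:
        return fetch(key_file_location(url, key)).strip() == key
    except Exception:
        return False
```

As dd32 notes, this proves the ping came from the site owner, but it cannot by itself stop a spammer from verifying a key for their own spam site and pushing junk URLs.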


To throw some ideas in here:

  • What would need to be done to improve the existing pingback services in place?
  • Do they need to be replaced?
  • Do they need to supply extra details to clients to improve the service?

Looking at the output from blo.gs feed:

<weblog name="My Site" url="https://example.org/" service="ping" ts="20210928T08:00:00Z" />

That's not super useful as-is; it doesn't say what changed, but adding a link to a) the sitemap and b) the changed page would be of great benefit and provide a lot of what this proposal adds.

#23 in reply to: ↑ 22 @fabricecanel
3 weeks ago

Replying to dd32:

Before this proposal is really viable to consider for WordPress inclusion (IMHO) there needs to be industry support on it being a generalised system that allows for all players (small and large) to be supported without additional need from site authors or software vendors.

Next month, we will disclose far more on wide industry support.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

WordPress and other CMS admins will be able to select which actors in the industry they want to notify, including everybody via centralized services if they want to notify all.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through a HTTP callback, is that the existing methods include a lot of spam and lack of any way to verify that whom sent the request is actually the author of it. Monitoring the Blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying it is who they say they are, it won't prevent spam being pushed into the system.

Right... we are fighting spam day and night, and we cannot trust these pings, and often cannot use them, without verification. It is too easy to spam by pinging random URLs on plenty of websites you don't own.

#24 @fabricecanel
2 weeks ago

Today I am pleased to report that the latest pull request supports API integration not only with the Bing API but also with the Yandex API and the Baidu API, along with various other improvements, including a README.md! Looking forward to your reviews: https://github.com/WordPress/wordpress-develop/pull/1712.

#25 @fabricecanel
3 days ago

Today, Microsoft Bing and Yandex announced IndexNow https://www.indexnow.org/, the search industry's open protocol behind this request. The protocol is already supported by Microsoft Bing and Yandex. With IndexNow, website owners can quickly reflect website changes in search engine results and drive more customers to their websites. With IndexNow, website owners can now provide a clear signal to search engines about their content changes, thus prioritizing crawl for those URLs (Microsoft Bing blog).
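For those unfamiliar with the protocol, IndexNow submissions take two shapes per the public spec at indexnow.org: a simple GET for a single changed URL, and a JSON POST for a batch. A minimal sketch of how a client would build them (the endpoint shown is illustrative; participating engines expose compatible `/indexnow` endpoints):

```python
import json
from urllib.parse import urlencode

# Illustrative endpoint; each participating engine hosts its own.
ENDPOINT = "https://www.indexnow.org/indexnow"

def single_ping_url(url, key):
    """One changed URL can be submitted with a simple GET request."""
    return f"{ENDPOINT}?{urlencode({'url': url, 'key': key})}"

def batch_payload(host, key, urls):
    """Several URLs can be submitted at once as a JSON POST body."""
    return json.dumps({"host": host, "key": key, "urlList": list(urls)})
```

A key property of the protocol, mentioned later in this thread, is that a URL submitted to one adopting engine is shared with all the others, so one ping reaches the whole group.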

#26 in reply to: ↑ 17 @fabricecanel
3 days ago

Replying to desrosj:

Replying to fabricecanel:

we recommend an open approach, open to all search engines having an API to receive notification of change.

If I'm interpreting this correctly, Bing has created and implemented an internal API for this, and is recommending other search engines follow their lead. But there is no agreed upon, standardized, and widely adopted way to do this. This would make implementing difficult as each provider could have their own requirements.

We still need to have a verification mechanism to verify that a Site A does not publish URLs for a Site B as Site B may not be happy to see search engines crawler visiting URLs that it does not have or didn’t change; that’s where the crawler will help verifying ownership with some key solution design to establish.

Unless this can happen seamlessly without user interaction, I'm still thinking this is not the best feature to include in WordPress Core directly. I'm not at all against the concept, it just seems this is too early for Core to support it industry wide. If there were an industry wide specification that everyone was adopting and using, this would be much easier.

Replying to desrosj: Today Microsoft Bing and Yandex came up with this search-industry-wide specification https://www.indexnow.org/, open to all major search engines; it is already supported by Microsoft Bing, Yandex, and a few actors in the industry such as Cloudflare, with many more listed as adopting soon (blog post).

That said, I'm not against exploring ways to make this easier to accomplish. For example, maybe expanding how update services work.

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

Last edited 3 days ago by dd32 (previous) (diff)

#27 in reply to: ↑ 22 @fabricecanel
3 days ago

Replying to dd32:

Replying to fabricecanel:

We did our first pull request

Hi @fabricecanel,

To follow up on some earlier comments here - have you looked into integrating with either http://pingomatic.com/ or http://blo.gs/cloud.php ?

They're admittedly not very modern API's, but benefit from millions of existing sites already making use of them, combined with existing standards such as Sitemaps it can provide what's needed without additional code on the clients side.

Replying to dd32: As shared in this feature request, today Microsoft Bing and Yandex came up with this search-industry-wide specification https://www.indexnow.org/, open to all major search engines; it is already supported by Microsoft Bing, Yandex, and a few actors in the industry. We need a service that is secure (the key is provided by the site), easy to integrate, able to scale to the whole industry and all scenarios (web sites, CMSs, CDNs, SEO companies), targeted at search engines so as to support add, update, and delete, and helping search engines minimize crawl load. So, a broader scope. One key scenario for WordPress sites is that most site owners expect to see their content quickly indexed (except in the case of a noindex tag) without having to do anything; the ability to be indexed fast should be built into the search engines. Also, not all webmasters want to adopt a ping service and see their content stolen and duplicated all over the internet.

There might also be room in the middle to act as a middleman - consuming those API's and relaying it onto Bing and others using the API, or having Pingomattic or blo.gs to relay it onwards to those too.

Before this proposal is really viable to consider for WordPress inclusion (IMHO) there needs to be industry support on it being a generalised system that allows for all players (small and large) to be supported without additional need from site authors or software vendors. A standard is only truely open if multiple vendors support it, otherwise it's just an proprietary format that so happens to be documented publicly.

To me, it seems that having client websites actively "pinging" select search engines added in WordPress core is not exactly open, I would want anyone interested in the data being able to access a stream of the changes - and having them get their crawler added to WordPress seems like a high barrier to entry.

This seems like one of the major benefits of centralised open relay services like those mentioned above.

I'm assuming that one of the reasons for this approach, based on the inclusion of a per-site key that can be validated through a HTTP callback, is that the existing methods include a lot of spam and lack of any way to verify that whom sent the request is actually the author of it. Monitoring the Blo.gs feed definitely shows a LOT of spam. While the key verification will allow verifying it is who they say they are, it won't prevent spam being pushed into the system.


To throw some ideas in here:

  • What would need to be done to improve the existing pingback services in place?
  • Do they need to be replaced?
  • Do they need to supply extra details to clients to improve the service?

Looking at the output from blo.gs feed:

<weblog name="My Site" url="https://example.org/" service="ping" ts="20210928T08:00:00Z" />

Replying to dd32: Existing ping services are not open. Users of these ping systems generally ping only a few dominant players. https://www.indexnow.org/ is open: it shares submitted URLs between all search engines that have adopted it. You ping one, and you in fact ping them all.

That's not super useful as-is, it doesn't say what changed, but the addition of a link to a) The sitemap and b) the page changed would benefit greatly and provide a lot of what this proposal adds.

Replying to dd32: a) Sitemaps are a great way to tell search engines all the relevant URLs on your site, but search engines typically look at sitemaps about once a day. Do you want to wait a day or more to see your content indexed? IndexNow https://www.indexnow.org/ allows you to have your content indexed now, not in a few days. b) Linking to the changed page alone is not a great solution either; search engines would still have to pull millions of sites often to discover whether content has changed. The right model is IndexNow + sitemaps: IndexNow to get indexing done fast, and sitemaps to catch up if a ping is missed.

#29 in reply to: ↑ 15 @fabricecanel
3 days ago

Replying to carike:

At the end, this is webmasters and site owners responsibilities to define what should not be indexed.

So, we're saying that it is too difficult a decision / too complicated for site owners to opt-in, but we expect them to know exactly how each plugin treats what data and what data might be sensitive and how to exclude it?

WordPress users have the ability to exclude content from search engines by setting the NOINDEX meta tag on pages they don't want indexed. This feature is just about getting content quickly indexed in all search engines adopting https://www.indexnow.org/. Instead of having search engine crawlers crawl billions of WordPress pages every day to discover whether content has changed, this is about guiding search engines to the content that has changed, to speed up indexing and get the content indexed. WordPress users are still in control, and in fact more in control, as they don't have to wait for search engines to discover their latest content and have it reflected in search results.

Look, I support SEO wholeheartedly. But if site owners are going to have that level of responsibility, it cannot be on-by-default. You have to give them a meaningful choice / actually make sure that they know they have a choice.

#30 in reply to: ↑ 9 @fabricecanel
3 days ago

Replying to carike:

Pings are not a pulling technology like RSS feeds. They already push :)
Thing is, pings are made via XML-RPC. Some security plugins, as well as some specialty plugins, turn off XML-RPC, as new technologies like the REST API have largely replaced its original functions and XML-RPC was known to be exploited throughout the years.

From a privacy perspective, I'm going to have to urge opt-in rather than enabled-by-default :)

Remember that search engines are "sneaky": they can find content by following links, and they apply heuristics to URLs to auto-discover them. Search engines offer an easy way to exclude content, the NOINDEX tag. In addition, this feature supports, like Google sitemaps, the ability to exclude specific paths from the notifications (at least, that is what we propose in the code we suggested for consideration; this is not about checking in that exact code).

#31 follow-up: @dd32
3 days ago

@fabricecanel Can you please edit/delete any comments above that were made in error? It looks like you're quoting entire comments from above without adding any extra context or answers - or you might've, but it's inline with another comment? I'm not sure, I can't tell :)

Next month, we will disclose far more on wide industry support.

Thank you! That makes a huge difference, and helps this proposal, as it's taken it from being a niche single-supporter use-case (which would not be welcome in WordPress IMHO) to an industry-led proposal which has a much better chance of support.

To provide some kind of code review on the approach taken:

  • I'm still not 100% convinced that having WordPress ping each of the engines individually is ideal, however, it's not the worst.
  • I'm still not 100% convinced that having an API key / verification callback should be allowed.
  • All supported providers would need to be defaulted in core, so as not to preference any given engine
  • No code-based configuration should be required for an end-user, at the most, a textarea with a list of search engine endpoints
  • No "exclude paths" functionality would be supported in a UI, filters should be exposed to add that
  • The WP_IndexNow_Provider class consts should probably be removed and put inline, other than perhaps WP_IndexNow_Provider::SUBMIT_API_PATH.
  • Same for the WP_IndexNow class consts - consting them doesn't help readability here at all, and only makes it harder to parse what the methods do.
  • WP_IndexNow::check_for_indexnow_page() should probably work the same way that the robots.txt loading works, through a rewrite rule. Speaking of, this also shows that IndexNow doesn't work for WordPress sites which don't use URL rewrites. Potentially this is a shortcoming in the IndexNow standard: it should be providing a unique callback URI or /.well-known/ URL as part of the ping payload, rather than just assuming https://example.com/base64-api-key-here.txt.

Finally:

  • I think this should be developed as a plugin first, and then proposed to WordPress core as a feature plugin, to allow development of it to occur separately and then a suggestion to add it to core once feature complete. That would also allow site owners to opt-in to using this prior to WordPress fully implementing it (Which would be in WordPress 6.0 at the absolute earliest, Q2ish 2022 at a guess)

#32 in reply to: ↑ 31 @fabricecanel
3 days ago

Replying to dd32:

@fabricecanel Can you please edit/delete any comments above that were made in error? It looks like you're quoting entire comments from above without adding any extra context or answers - or you might've, but it's inline with another comment? I'm not sure, I can't tell :)

Thanks for guiding me in the WordPress world :)

  • I'm still not 100% convinced that having WordPress ping each of the engines individually is ideal, however, it's not the worst.

Good news: we are listening to feedback like yours, and we agree to move away from pinging all engines. Starting next month (maybe sooner), you will have to ping only one engine and we will ping the others (a requirement of the protocol to keep it open, allowing new and small players to participate).

  • I'm still not 100% convinced that having an API key / verification callback should be allowed.

It allows webmasters/CMSs to control what is crawled on the site. Bad actors should not be able to abuse submissions at scale.

  • All supported providers would need to be defaulted in core, so as not to preference any given engine

Agreed, and this is why we opened the protocol: you ping one, you ping all, with no preference. The protocol is open to all search engines with a presence in a market.
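Besides the single-URL ping, the IndexNow protocol also defines a bulk form: a JSON POST body listing several changed URLs at once, which a CMS can use to batch notifications. The sketch below only builds that payload; the host, key, and URLs are placeholder values.

```python
import json

def build_bulk_payload(host: str, key: str, urls: list[str]) -> str:
    """Build the JSON body for a bulk IndexNow submission.

    The body names the site host, the ownership key, and the
    list of changed URLs, and is POSTed to a participating
    engine's /indexnow endpoint with a JSON content type.
    """
    return json.dumps({"host": host, "key": key, "urlList": list(urls)})

# Placeholder host, key, and URLs for illustration only.
payload = build_bulk_payload(
    "example.com",
    "0123456789abcdef0123456789abcdef",
    ["https://example.com/post-1/", "https://example.com/post-2/"],
)
```

Under the shared-ping model described in this comment, a site would send one such request to a single participating engine, which then propagates the URLs to the others.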

  • No "exclude paths" functionality would be supported in a UI, filters should be exposed to add that

This feature is proposed for consideration. While we did write some code, the WordPress experts should decide the best path to support IndexNow.

Finally:

  • I think this should be developed as a plugin first, and then proposed to WordPress core as a feature plugin, to allow development of it to occur separately and then a suggestion to add it to core once feature complete. That would also allow site owners to opt-in to using this prior to WordPress fully implementing it (Which would be in WordPress 6.0 at the absolute earliest, Q2ish 2022 at a guess)

IndexNow is an open protocol. @joostdevalk, feel free to adopt it in Yoast and other tools, learn from it, and suggest improvements for core.

#33 @TweetyThierry
3 days ago

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

There is no search indexing API without the site being verified, which requires a Google account and OAuth, which in turn requires a GCP project.

There are a few points about what should or should not be sent for indexing; how about relying on what is available in the WordPress XML sitemaps? I imagine that developers who purposefully include/exclude URLs from their sitemaps would want that to apply to an indexing push API too (after all, it serves the same purpose, unlike the WP REST API).

#34 @joostdevalk
3 days ago

We've not implemented this in Yoast SEO yet for the same reason that @dd32 is resisting implementing it in WordPress core: so far it's just Yandex and Bing, which for most sites is a negligible part of their traffic.

As much as I appreciate @fabricecanel and his team at Microsoft, I want to see proof of this actually improving crawling for sites. I want to be sure that it actually improves either their search engine traffic or their bandwidth bills, before we add this to Yoast SEO. The cost of pinging multiple search engines when you publish a post or page is small on a per blog basis but it's non-zero if you consider doing it for millions of sites.

As soon as I see actual data on this making crawling better for sites (so Bing would crawl them less, not more) or actually improving their traffic (though it seems like that should not be possible), I'm happy to implement this in Yoast SEO. I like the standard, especially as it's not based on OAuth for a change.

Note: See TracTickets for help on using tickets.