WordPress.org

Make WordPress Core

Opened 3 weeks ago

Last modified 3 weeks ago

#52900 new feature request

Instantly index WordPress web sites content in Search Engines

Reported by: fabricecanel Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: General Keywords: reporter-feedback
Focuses: Cc:

Description

Every day, Search Engines crawl billions of WordPress URLs to keep their search indexes fresh. They crawl to get the latest content, discover new outlinks, or verify that URLs already indexed are still valid and not dead links. Unfortunately, Search Engines are generally inefficient at crawling: they don’t know whether content has changed, and most web pages do not change often. This crawling activity adds to bandwidth and CPU consumption.

At Microsoft Bing, we believe in a fundamental shift in how Search Engines learn about new, updated, and deleted content across the web. Instead of Search Engines crawling often to detect whether content has changed, Content Management Systems should notify Search Engines of content changes, which limits crawling and keeps the Search Index fresh. To support this transformation, since February 2019 we have offered a URL and Content Submission API that allows website owners to publish thousands to millions of URLs or pieces of content to Bing per day; and since July 2020 we have offered a WordPress plugin that submits content immediately to Bing’s search index, no code required.

Today, we propose integrating into WordPress Core the ability to notify not only Bing, but all participating Search Engines, of any WordPress URL or content change. Microsoft would develop and maintain the open-source code in close collaboration with WordPress; WordPress would approve, validate, and include the code.

Behind the scenes, WordPress will automatically submit URLs or content, ensuring that WordPress content is always fresh in the Search Engines; in exchange, Search Engines will limit crawling of WordPress sites or stop crawling them altogether. Site owners will be able to opt out or to select the content they don’t want to submit to search engines.

Change History (18)

This ticket was mentioned in Slack in #forums by sergey.


3 weeks ago

#2 @desrosj
3 weeks ago

  • Keywords reporter-feedback added

Hi @fabricecanel,

Welcome to Trac!

Can you provide more information around how this API works? Is there documentation you can point to?

Also, is this a standardized approach across the various search engine companies? Or will the implementation differ between providers?

#3 @Clorith
3 weeks ago

Hiya @fabricecanel, and welcome to the WordPress trac!

I'd also like to add, to clear up any potential confusion, that your initial ticket was flagged and removed because it read like many of the spam posts we receive. On closer inspection we could see that it was not spam and have restored it, though, as you can see, it now has a new ticket ID.

For reference, the plugin mentioned is https://wordpress.org/plugins/bing-webmaster-tools/

Last edited 3 weeks ago by Clorith

#4 @joostdevalk
3 weeks ago

I was reading this (and have spoken to Fabrice a couple of times before about this topic) and only just this afternoon did I realize something and I'm facepalming myself for not thinking about this sooner. Under Settings --> Writing in the WordPress admin, we have "Update Services". The functionality we have there pings the sites in that box, by default rpc.pingomatic.com, with the RSS feed URL of the site. I think we can all agree that's a bit archaic.

Maybe we can add a new version of the pingomatic API to receive the post/page URL instead of the RSS feed and allow search engines to subscribe to that feed and/or to get a similar ping themselves?

I'm not sure who maintains Pingomatic, will ask around.
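
To make the idea above concrete, here is a minimal sketch of what a per-URL ping payload could look like. It only uses Python's standard `xmlrpc.client` to serialize the call. `weblogUpdates.ping` is the method name ping services conventionally accept; passing the changed post URL as the second argument is the hypothetical part and is not something Pingomatic is known to support today. The `build_ping` helper and the example URL are illustrative names only.

```python
import xmlrpc.client


def build_ping(site_name: str, changed_url: str,
               method: str = "weblogUpdates.ping") -> str:
    """Serialize an XML-RPC ping that carries the changed URL.

    Assumption: a future ping endpoint would accept the post/page URL
    in the slot where a site URL is normally sent.
    """
    return xmlrpc.client.dumps((site_name, changed_url), methodname=method)


# Build the request body and parse it back to show what a receiver would see.
payload = build_ping("Example Blog", "https://example.com/2021/03/new-post/")
params, method = xmlrpc.client.loads(payload)
```

A subscribing search engine would only need to parse the same payload, which is why reusing the existing ping format could keep the receiving side simple.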

This ticket was mentioned in Slack in #meta by carike.


3 weeks ago

#6 @carike
3 weeks ago

As far as I remember additional services can be added in wp-admin. I believe it is a comma-separated list.
I do not believe it is strictly necessary to use Pingomatic as a "middle-man" so to speak. :)

#7 @carike
3 weeks ago

P.S. I believe the pings only go out when posts are created.
As far as I know they do not go out when posts are updated, or when pages are created or updated.

Also, the "Hello World" post is not excluded from the pings. I do not believe that those posts would be valuable to search engines, and they are created before website owners have had a meaningful opportunity to make privacy choices with regard to their site.
That should be easy to fix though.

#8 @fabricecanel
3 weeks ago

Thanks @Clorith for restoring this feature request.

Links: Today, we offer an open-source WordPress plugin that connects to the Bing Webmaster Tools API to notify Bing of content changes on sites that adopt this API.

@joostdevalk: We are proposing a shift:

  • From pulling to pushing: it’s not about pulling (RSS feeds or similar), it’s about pushing: publishing each change to the set of search engines that have adopted this design, which is open to all search engines listening for changes. Some throttling logic, as already done in the Bing Webmaster Tools WordPress plugin, avoids a notification on every keystroke or every 5-second save. Pulling requires crawling again and again to check whether the content has changed (most of the time it has not), and it also requires discovering the site and its feeds in the first place. Pushing lets search engines stay seconds to minutes behind a content change, guarantees that they are aware of the change, and minimizes the need to crawl just to discover whether something has changed. In case of downtime, search engines will still rely on sitemaps and links to discover new URLs.
  • Open to all search engines: any Search Engine that has an API can be added to the notification.
  • Enabled by default: We want to lower the complexity for WordPress users to be found and indexed by search engines. If you are a newbie, your new site should be found and indexed immediately, and your latest content and your latest typo fix should be indexed in minutes, not weeks.
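
The throttling mentioned in the first bullet can be sketched roughly as follows. This is not the logic of the Bing Webmaster Tools plugin, whose internals are not described here; it is a minimal illustration of coalescing repeated saves of one URL into a single notification. The class name, the 300-second interval, and the use of caller-supplied timestamps are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeThrottle:
    """Coalesce repeated saves of the same URL into one notification."""

    min_interval: float = 300.0  # seconds between pushes per URL (assumed value)
    _last_sent: dict = field(default_factory=dict)
    _pending: set = field(default_factory=set)

    def record_change(self, url: str, now: float) -> bool:
        """Return True if this change should be pushed immediately."""
        last = self._last_sent.get(url)
        if last is None or now - last >= self.min_interval:
            self._last_sent[url] = now
            self._pending.discard(url)
            return True
        # Too soon: remember the URL so its final state is still pushed later.
        self._pending.add(url)
        return False

    def flush_due(self, now: float) -> list:
        """Return deferred URLs whose quiet period has elapsed."""
        due = sorted(
            u for u in self._pending
            if now - self._last_sent[u] >= self.min_interval
        )
        for url in due:
            self._last_sent[url] = now
            self._pending.discard(url)
        return due
```

With this shape, an autosave every few seconds produces one push when editing starts and one more after the interval passes, rather than one push per save.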

Benefits: new sites are indexed fast, the latest content is indexed fast, and removed content is drained fast from search engines. Search engines then need far less crawling to discover whether websites have changed, which is good for site owners, WordPress platform owners, global warming and polar bears.

#9 @carike
3 weeks ago

Pings are not a pulling technology like RSS feeds. They already push :)
Thing is, pings are made via XML-RPC. Some security plugins, as well as some specialty plugins, turn off XML-RPC, as new technologies like the REST API have largely replaced its original functions and XML-RPC was known to be exploited throughout the years.

From a privacy perspective, I'm going to have to urge opt-in rather than enabled-by-default :)

#10 @Clorith
3 weeks ago

@fabricecanel Totally agree, this should not be something a user needs to enable on their own if it gets added. Some things that need clarifying still remain though:

While pingomatic is a service that just tells indexers "there's new content, go crawl that site", you are proposing a more direct interaction where you tell the indexers "_here_ is a piece of new content to look at". There are some concerns that need to be addressed in regard to this.

As @desrosj mentioned, is this an open standard that all indexers have freely available to them, and does not require any additional code to function? This is important, because core should not have to provide special handling depending on which search engine its user favors.

Is this a centralized thing? If it should just work out of the box, there are multiple search engines out there to account for. There's also the concern of legislatures that watch for monopoly-benefiting actions: they might view WordPress catering to the big search engines, while the small ones aren't hard-coded in because core doesn't know of them, as denying the small ones a fair chance (I am not a lawyer, so this is just a potential concern I think we need to be wary of in cases like these).

#11 @desrosj
3 weeks ago

Based on scanning some of the documentation, it seems that in order to submit URLs someone would need to register and receive an API key. If this is a requirement, then I'm not sure that this would be a good fit for WordPress Core at this time.

It also appears that sites are limited to 10,000 submissions per day. That is quite a lot, and the large majority of sites will never hit it, but what about the ones that will? How would they decide which pages are prioritized? What happens after the limit is hit and more updates are made?
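
One possible answer to the last question, purely as an illustration: the sender could track its own daily count and defer anything over the limit until the next day. The sketch below assumes the 10,000 figure read from the documentation, oldest-first retry, and day labels supplied by the caller; none of this reflects how Bing or any plugin actually behaves.

```python
from collections import deque


class DailyQuota:
    """Track submissions per day and defer anything over the limit."""

    def __init__(self, limit: int = 10_000):  # limit taken from the comment above
        self.limit = limit
        self.day = None
        self.used = 0
        self.deferred = deque()

    def _roll(self, day: str) -> None:
        # Reset the counter when the day label changes.
        if day != self.day:
            self.day, self.used = day, 0

    def submit(self, url: str, day: str) -> bool:
        """Return True if the URL fits in today's quota, else defer it."""
        self._roll(day)
        if self.used < self.limit:
            self.used += 1
            return True
        self.deferred.append(url)
        return False

    def retry_deferred(self, day: str) -> list:
        """Resubmit deferred URLs oldest-first while quota remains."""
        self._roll(day)
        sent = []
        while self.deferred and self.used < self.limit:
            sent.append(self.deferred.popleft())
            self.used += 1
        return sent
```

Oldest-first is only one choice; a site could just as well prioritize by post type or recency, which is exactly the open question raised above.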

Last edited 3 weeks ago by desrosj

#12 follow-up: @fabricecanel
3 weeks ago

@desrosj, @Clorith : For this suggested feature, we do not plan to leverage the existing proprietary solution we have in place on the Bing server side; please note that our WordPress plugin is already Open Source.

For OAuth: we recommend an open approach, open to all search engines having an API to receive notification of change. We still need a verification mechanism to ensure that Site A cannot publish URLs for Site B, as Site B may not be happy to see search engine crawlers visiting URLs that it does not have or did not change; that is where the crawler will help verify ownership, using a key-based design that is still to be established.

For the count of URLs per day: we do plan to have a limit. Search engines will consider the submissions and throttle as needed on their side, per site; at the very least, Search Engines will be aware of the URLs (added, updated, deleted).
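
Since the key design is still to be established, the following is only a hypothetical sketch of the ownership check described above, from the receiving search engine's side. It assumes the site publishes a key that the crawler can fetch, and that submitted URLs must share the submitting site's host. `fetch_key` stands in for that crawler fetch, and every name here is invented for illustration.

```python
import hmac
from urllib.parse import urlsplit


def same_host(site_url: str, changed_url: str) -> bool:
    """True if both URLs are http(s) and point at the same host."""
    site, changed = urlsplit(site_url), urlsplit(changed_url)
    return (
        site.scheme in ("http", "https")
        and changed.scheme in ("http", "https")
        and site.hostname is not None
        and site.hostname == changed.hostname
    )


def verify_submission(site_url, urls, key, fetch_key):
    """Return only the URLs the submitting site is allowed to publish.

    fetch_key(site_url) is a stand-in for the crawler retrieving the key
    that the site itself published; it returns None if no key is found.
    """
    published = fetch_key(site_url)
    if published is None:
        return []
    # Constant-time comparison so the key is not leaked via timing.
    if not hmac.compare_digest(published.encode(), key.encode()):
        return []
    return [u for u in urls if same_host(site_url, u)]
```

Under these assumptions, Site A can present its own key but still cannot get Site B's URLs accepted, because the host check drops them.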

#13 follow-up: @carike
3 weeks ago

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

#14 @fabricecanel
3 weeks ago

Yes @carike, we plan to offer the ability to disallow URL notifications for the whole site (opt out), for part of the site (somewhat like robots.txt disallow rules), per URL (somewhat like the NOINDEX robots meta tag), or per content type. In the end, it is the responsibility of webmasters and site owners to define what should not be indexed. Likewise, even if a URL is not published to search engines, search engines are good at finding URLs from other pages and may still visit and index it, so it is preferable to add a NOINDEX robots meta tag to content you don't want indexed.
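
The four levels of exclusion listed above could be combined into a single check before any notification is sent. The sketch below is an assumption about how such a policy might be modeled, not a description of a planned implementation; the field names and the simple path-prefix matching (a simplification of robots.txt rules) are illustrative.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class NotifyPolicy:
    """Hypothetical per-site rules deciding which changes are pushed."""

    site_opt_out: bool = False            # whole-site opt out
    disallow_prefixes: tuple = ()         # path prefixes, robots.txt-like
    noindex_urls: frozenset = frozenset()    # individual URLs marked NOINDEX
    excluded_types: frozenset = frozenset()  # content (post) types to skip

    def allows(self, url: str, post_type: str) -> bool:
        """Return True if a change to this URL may be sent to search engines."""
        if self.site_opt_out:
            return False
        if post_type in self.excluded_types or url in self.noindex_urls:
            return False
        path = urlsplit(url).path or "/"
        return not any(path.startswith(p) for p in self.disallow_prefixes)
```

For example, a policy with `excluded_types=frozenset({"invoice"})` would suppress notifications for invoice posts while still allowing products and events.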

#15 @carike
3 weeks ago

In the end, it is the responsibility of webmasters and site owners to define what should not be indexed.

So, we're saying that it is too difficult a decision / too complicated for site owners to opt-in, but we expect them to know exactly how each plugin treats what data and what data might be sensitive and how to exclude it?

Look, I support SEO wholeheartedly. But if site owners are going to have that level of responsibility, it cannot be on-by-default. You have to give them a meaningful choice / actually make sure that they know they have a choice.

#16 in reply to: ↑ 13 @knutsp
3 weeks ago

Replying to carike:

How will custom post types be handled?
If an invoice is saved as a custom post type, these should obviously not be indexed by search engines.
However, some common custom post types include products, or events. Those by their very nature should almost always be indexed.

A post type is either public or it's not. Public ones have a View link in posts.php. We had a similar discussion for core XML sitemaps, and the same goes for the REST API, feeds, and HTML pages. If content can be viewed publicly, it may, and probably will, be indexed after some time.

This ticket is about switching from pull to push. Push has been available almost all along through "Submit your URL" forms, but their use has probably approached zero. This time there is an automated way to do it, through standardized APIs, if I understand it correctly.

Since WordPress core knows what content is to be presumed public and indexed, this can safely be implemented after other issues are solved.

#17 in reply to: ↑ 12 @desrosj
3 weeks ago

Replying to fabricecanel:

we recommend an open approach, open to all search engines having an API to receive notification of change.

If I'm interpreting this correctly, Bing has created and implemented an internal API for this, and is recommending other search engines follow their lead. But there is no agreed upon, standardized, and widely adopted way to do this. This would make implementing difficult as each provider could have their own requirements.

We still need a verification mechanism to ensure that Site A cannot publish URLs for Site B, as Site B may not be happy to see search engine crawlers visiting URLs that it does not have or did not change; that is where the crawler will help verify ownership, using a key-based design that is still to be established.

Unless this can happen seamlessly without user interaction, I'm still thinking this is not the best feature to include in WordPress Core directly. I'm not at all against the concept; it just seems too early for Core to support it industry-wide. If there were an industry-wide specification that everyone was adopting and using, this would be much easier.

That said, I'm not against exploring ways to make this easier to accomplish. For example, maybe expanding how update services work.

I'm wondering if @flixos90, @adamsilverstein, @westonruter, @swissspidy, or @tweetythierry can provide any insight into how and if this approach is being implemented on the Google side.

#18 @fabricecanel
3 weeks ago

Replying to @desrosj

Yes, today the Bing API is Bing-specific. This API helped us learn at scale: many large sites have adopted it, and more sites use it directly or via our WordPress plugin. Our feature proposal is different: we are proposing an open-standard URL ping mechanism, with an OAuth token mechanism to verify URL ownership, that notifies any Search Engine that wants to listen to the data. I am reaching out to major Search Engines to collaborate with them on standardizing the whole story. We should have a common approach to making crawling more efficient to reduce global warming, and Content Management Systems such as WordPress know better than Search Engines what must be crawled and when it must be crawled.

I also agree with you on making this an industry-wide approach. It will not be Bing-specific: I expect other major Search Engines to contribute, to help design it with the WordPress community, and to leverage notifications of URLs added, updated, and deleted from day one.
