WordPress.org

Make WordPress Core

Opened 9 months ago

Closed 6 months ago

#50456 closed defect (bug) (wontfix)

Multisite robots.txt files should reference all network XML sitemaps

Reported by: jonoaldersonwp
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Sitemaps
Keywords: seo close
Focuses: multisite
Cc:

Description (last modified by SergeyBiryukov)

[48072] adds XML sitemaps to core, with the objective of making public URLs more 'discoverable'.

Part of this discoverability relies on alterations to the site's robots.txt file, to add a reference to the URL of the sitemap index.

On multisite setups where sites run in subfolders, this mechanism breaks; a domain can only have one robots.txt file at the domain root, which means that sub-sites don't expose the location of their sitemap.

To address this, we should, in all viable cases, add the sitemap URL(s) for every site in a network to the top-level robots.txt file.

For the sake of completeness, robustness and utility, this should be extended to also include multi-site setups on multiple domains/subdomains (or in fact, on any setup).

NB, most consumers support cross-domain XML sitemap references in robots.txt files, so this isn't a concern.

E.g.,

On a theoretical multi-site setup running across multiple hostnames and folders, I'd expect https://www.example.com/robots.txt to contain something like the following:

Sitemap: https://www.example.com/wp-sitemap.xml
Sitemap: https://www.example.com/sub-site/wp-sitemap.xml
Sitemap: https://other.example.com/wp-sitemap.xml

Change History (17)

#1 @jonoaldersonwp
9 months ago

  • Description modified (diff)

#2 follow-up: @knutsp
9 months ago

Only for public sites (and not deleted/spam/archived/mature?).

#3 in reply to: ↑ 2 @jonoaldersonwp
9 months ago

Replying to knutsp:

Only for public sites (and not deleted/spam/archived/mature?).

Ah, good thought. We should probably echo the behaviour of the XML sitemap filtering; public posts only == public network sites only.

#4 follow-up: @pbiron
9 months ago

Just to confirm:

  1. the additional Sitemap: statements should be added only on the main site for the network
  2. the additional Sitemap: statements should be added regardless of whether is_subdomain_install() returns true or false

#5 in reply to: ↑ 4 @jonoaldersonwp
9 months ago

Replying to pbiron:

Just to confirm:

  1. the additional Sitemap: statements should be added only on the main site for the network
  2. the additional Sitemap: statements should be added regardless of whether is_subdomain_install() returns true or false

Correct! :)

#6 @SergeyBiryukov
9 months ago

  • Description modified (diff)
  • Focuses multisite added

#7 @pbiron
9 months ago

I've got a working patch, but this might be a problem for large networks.

The largest network I have access to has a little more than 2000 sites. With the patch applied, it takes an extra 14 seconds (on my local machine) to generate robots.txt (because of the calls to switch_to_blog()/restore_current_blog(), necessary to get the URLs for the sitemap index for each sub-site). I can only imagine how much extra time it would take for a truly large network (e.g., 10,000+ sites).

Does anyone have an idea whether that extra time would cause problems for consumers of robots.txt? That is, would they time out trying to retrieve it?

#8 @pbiron
9 months ago

Here's a couple of other complications (i.e., when I said I have a "working patch", it isn't 100% working):

  1. sitemaps have to be enabled on the main site, and that may not be the case even when they are enabled on sub-sites
    • we might be able to get around that problem by hooking WP_Sitemaps::add_robots() in multisite even when sitemaps are disabled on the main site (haven't tried that yet, but I think it should work)
  2. even when a sub-site is public (and not archived, deleted or spam) sitemaps may be disabled on that sub-site by a plugin...and there is no way to (reliably) know that from the main site
    • switch_to_blog() does not load plugins for that specific sub-site. Even if it did, they would remain active for the remainder of the sites being iterated through; e.g., if sub-site 8 had a plugin that disabled sitemaps and it was loaded to test whether to output the sitemap reference in robots.txt, then sitemaps would "appear" to be disabled for all sub-sites after that, even if no other sub-site had a plugin that disabled them :-(

#9 @jonoaldersonwp
9 months ago

Hah, challenges galore!

In order:

  • That long response time on very large networks is definitely going to cause issues. I figure we can either cache the list of sitemaps to include in a non-expiring transient, or set a sensible cutoff (100 sites?) at which point we don't try to run this process at all (on the premise that such a large/complex network should really be running an SEO plugin and/or caching plugin to manage this, and that this working on some small-to-mid-sized networks is better than none at all).
  • When sitemaps are disabled on the main site but enabled on child sites, that feels like an edge case with intentional non-default behaviour; as such they should be using an SEO plugin and/or custom robots.txt file.
  • Disabled sitemaps on child sites is problematic, but not the end of the world; a reference in a robots.txt file to a sitemap which doesn't exist shouldn't cause too much harm, beyond some potential periodic querying of that URL. Same as before, too - this is behaviour which overrides the default, in which case, folks should be managing that with a plugin or custom robots.txt file.
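
The cutoff idea above, combined with the "public sites only" rule from earlier in the thread, could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the array shape used for each site, and the default of 100 are hypothetical and not part of WordPress core.

```php
<?php
/**
 * Hypothetical helper (not a core function): builds "Sitemap:" lines for a
 * small network, skipping non-public and flagged sites.
 *
 * Each $site is assumed to be an array with the keys
 * 'home', 'public', 'archived', 'spam' and 'deleted'.
 */
function sketch_network_sitemap_lines( array $sites, int $max_sites = 100 ): array {
	// Above the cutoff, do nothing and leave the job to a plugin.
	if ( count( $sites ) > $max_sites ) {
		return array();
	}

	$lines = array();
	foreach ( $sites as $site ) {
		$flagged = $site['archived'] || $site['spam'] || $site['deleted'];
		if ( ! $site['public'] || $flagged ) {
			continue;
		}
		$lines[] = 'Sitemap: ' . rtrim( $site['home'], '/' ) . '/wp-sitemap.xml';
	}

	return $lines;
}

// Example data only; these sites are made up for illustration.
$sites = array(
	array( 'home' => 'https://www.example.com', 'public' => true, 'archived' => false, 'spam' => false, 'deleted' => false ),
	array( 'home' => 'https://www.example.com/sub-site/', 'public' => true, 'archived' => false, 'spam' => false, 'deleted' => false ),
	array( 'home' => 'https://other.example.com', 'public' => true, 'archived' => false, 'spam' => true, 'deleted' => false ),
);

$lines = sketch_network_sitemap_lines( $sites );
echo implode( "\n", $lines ) . "\n";
```

With the example data, the spam-flagged site is skipped and the other two are listed. Note that this does not address sitemaps disabled by a plugin on a child site; as discussed, that case cannot be detected reliably from the main site.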

#10 @pbiron
9 months ago

Thanx...those responses are roughly what I was thinking as well.

Let me give this a little more thought and see what I can come up with.

#11 @swissspidy
8 months ago

If it helps, there's wp_is_large_network() to determine whether a network is large or not.

#12 @pbiron
8 months ago

After some investigation, even the "simple" approach mentioned by @swissspidy in Slack is unacceptably slow, even for moderately sized networks (e.g., ~2000 sites):

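if ( ! wp_is_large_network() ) {
        // Request IDs: get_sites() returns WP_Site objects by default, but get_home_url() expects a site ID.
        foreach ( get_sites( array( 'fields' => 'ids' ) ) as $site_id ) {
                echo 'Sitemap: ' . get_home_url( $site_id, '/wp-sitemap.xml' ) . "\n";
        }
}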

since get_home_url() does a switch_to_blog()...and it is the blog switching which slows things down.

I've experimented with a couple of different approaches to use WP-Cron to write to the blogmeta table (or a transient on the main site), which will allow WP_Sitemaps::add_robots() to efficiently get the entries to be added to robots.txt without having to do switch_to_blog() during the generation of robots.txt.
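
A rough sketch of the WP-Cron-plus-transient variant is below. It is not the patch described in this comment. The transient name, the cron hook name, the `sketch_` function and the hourly schedule are all hypothetical, and passing `'number' => 0` to lift the default limit of 100 sites in get_sites() is an assumption worth verifying. The WordPress-specific part only runs when WordPress is loaded.

```php
<?php
// Hypothetical names; neither exists in WordPress core.
const SKETCH_TRANSIENT = 'sketch_network_sitemap_urls';
const SKETCH_CRON_HOOK = 'sketch_refresh_network_sitemap_urls';

/**
 * Fast, pure step: append one "Sitemap:" line per cached URL.
 */
function sketch_append_sitemap_lines( string $output, array $urls ): string {
	foreach ( $urls as $url ) {
		$output .= 'Sitemap: ' . $url . "\n";
	}
	return $output;
}

if ( function_exists( 'add_action' ) ) {
	// Slow step, run by WP-Cron: the blog switching happens here,
	// not while robots.txt is being generated.
	add_action( SKETCH_CRON_HOOK, function () {
		$urls = array();
		// Assumption: 'number' => 0 removes the default limit of 100 sites.
		$site_ids = get_sites( array( 'fields' => 'ids', 'public' => 1, 'number' => 0 ) );
		foreach ( $site_ids as $site_id ) {
			$urls[] = get_home_url( $site_id, '/wp-sitemap.xml' );
		}
		set_transient( SKETCH_TRANSIENT, $urls ); // No expiration.
	} );

	// Schedule the refresh from the main site only.
	add_action( 'init', function () {
		if ( is_main_site() && ! wp_next_scheduled( SKETCH_CRON_HOOK ) ) {
			wp_schedule_event( time(), 'hourly', SKETCH_CRON_HOOK );
		}
	} );

	// Request-time step: read the cached list, never switch blogs.
	add_filter( 'robots_txt', function ( $output, $public ) {
		$urls = get_transient( SKETCH_TRANSIENT );
		if ( ! $public || ! is_main_site() || ! is_array( $urls ) ) {
			return $output;
		}
		return sketch_append_sitemap_lines( $output, $urls );
	}, 10, 2 );
}
```

The point of the split is that the robots.txt request only reads a precomputed list, so its cost no longer grows with the number of switch_to_blog() calls.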

I think something like that is ultimately the way to go, but there isn't time before 5.5 Beta 1 ships next week to get that right.

#13 @desrosj
7 months ago

  • Milestone changed from Awaiting Review to 5.5.1

I'm wondering if there are any scenarios where putting all sitemaps in the main site's robots.txt file is not desirable as this would expose that the sites are connected.

For example, if a network is mapping each site to its own unique domain name, anyone who views the robots.txt file would know that the sites are under the same umbrella. This could very well be a consideration for domain mapping plugins instead, but I wanted to make sure this was considered.

I'm placing this in the 5.5.1 milestone for further discussion. If that's unrealistic, let's move it to 5.6.

#14 @johnbillion
7 months ago

  • Keywords close added

It's definitely not desirable to list all sites on the network in robots.txt by default. This will expose connections between sites which AFAIK are not exposed anywhere else to unauthenticated visitors.

One example off the top of my head is a small agency that hosts several separate and unrelated client sites on one network. They shouldn't do this of course, but they do.

Another example is a platform that manages a white-labelled service and hosts many different brands on the same network.

Another is a domain name reseller who hosts domains for sale on the network and wouldn't wish for anybody to know about their complete portfolio.

I've seen all these types of networks.

Recommending wontfix.

#15 @jonoaldersonwp
7 months ago

Yeah, this opens up a lot of mess. Lots of unpleasant edge-cases, and no easy/generic solutions which don't require a ton of UI controls and micromanaging scenarios.

So, the conclusion looks like:

  • Multisite setups should behave as per single site setups (which means that any 'child' sitemap URLs won't be listed in the robots.txt file(s)).
  • If you're running a multisite setup, and you care about sitemaps/SEO etc, you should probably use an SEO plugin with granular controls.

This ticket was mentioned in Slack in #core by audrasjb.

6 months ago

#17 @johnbillion
6 months ago

  • Milestone 5.5.1 deleted
  • Resolution set to wontfix
  • Status changed from new to closed

I'm going to close this off as per the discussion above. This would be a useful feature for some Multisites but definitely isn't something that can be enabled by default.

If anyone solves this in an SEO plugin, feel free to link it here.
