Make WordPress Core

Opened 2 years ago

Last modified 18 months ago

#57483 new defect (bug)

Improve optimizations for WP_Site::get_instance and WP_Network::get_instance

Reported by: kovshenin's profile kovshenin Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 5.3
Component: Networks and Sites Keywords: has-patch
Focuses: multisite, performance Cc:

Description

Following up on #42251 and r45910.

I recently ran into a problem with this specific optimization. In one case it resulted in an existing site being redirected to the main site as if it did not exist. In a second case it went all the way to dead_db(). In both instances the existing site in question had a -1 stored in the object cache. Both problems persisted until the object cache was flushed.

I think the main problem is that in WP_Site::get_instance (and Network) we do not account for wpdb::get_row failing, we don't check last_error. This means that a brief database outage can cause a permanent -1 be saved in object cache for a site that actually does exist.

It's not easy to reproduce, as the existing/valid sites will be in cache most of the time, so that portion of the codebase is rarely executed. However, there is one core function that helps with this tremendously. It's wpmu_update_blogs_date and it invalidates the site cache every time it needs to update the timestamp (when a post is published or update for example).

I wrote this little mu-plugin to help reproduce the problem:

add_action( 'muplugins_loaded', function() {
    switch_to_blog( 3 );
    wpmu_update_blogs_date();
    restore_current_blog();
    die();
} );

I fire some traffic at the site and observed the MySQL query log to confirm a significant amount of UPDATE queries for the site timestamp, as well as SELECT * FROM wp_blogs ... queries which came from WP_Site::get_instance().

Next, I simulated some turbulence in the network, by killing all established connections on the MySQL port 3306 (using tcpkill), resulting in a few "MySQL has gone away" errors in the PHP error log. I did this a couple of times until the site with the ID of 3 disappeared from My Sites in the Network Admin.

I confirmed the object cache state in Redis:

127.0.0.1:6379> get sites:3
"-1"

I was able to reproduce this quite reliably with about 50 concurrent requests to the site, and 2-3 rounds of tcpkill. Of course it doesn't happen all that often under normal conditions, though I was unlucky enough to see it happen twice in the last 6 months (on very high traffic sites with plenty of updates).

Personally I feel like we should revert this entire optimization. As much as I appreciate all the hard work, I don't see much benefit of caching IDs that don't exist, much like we don't cache posts that do not exist.

At the very least though, we should check the last_error before caching a -1 in both cases.

Change History (4)

#1 @SergeyBiryukov
2 years ago

  • Component changed from General to Networks and Sites
  • Focuses multisite added

#2 @spacedmonkey
2 years ago

Looping in @flixos90 as he worked on the original patch.

This ticket was mentioned in Slack in #core by arunas. View the logs.


18 months ago

This ticket was mentioned in PR #4574 on WordPress/wordpress-develop by @ideag.


18 months ago
#4

  • Keywords has-patch added

This PR utilises the fact that $wpdb->get_row() will return a null when no entries were found vs false when it experienced an error. A null result is still cached (site/network does not exist), while a false one (query failed) is not.

Trac ticket: https://core.trac.wordpress.org/ticket/57483

Note: See TracTickets for help on using tickets.