Opened 2 years ago
Last modified 18 months ago
#57483 new defect (bug)
Improve optimizations for WP_Site::get_instance and WP_Network::get_instance
Reported by: | kovshenin | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | 5.3 |
Component: | Networks and Sites | Keywords: | has-patch |
Focuses: | multisite, performance | Cc: |
Description
Following up on #42251 and r45910.
I recently ran into a problem with this specific optimization. In one case it resulted in an existing site being redirected to the main site as if it did not exist. In a second case it went all the way to dead_db()
. In both instances the existing site in question had a -1
stored in the object cache. Both problems persisted until the object cache was flushed.
I think the main problem is that in WP_Site::get_instance
(and Network) we do not account for wpdb::get_row
failing, we don't check last_error
. This means that a brief database outage can cause a permanent -1
be saved in object cache for a site that actually does exist.
It's not easy to reproduce, as the existing/valid sites will be in cache most of the time, so that portion of the codebase is rarely executed. However, there is one core function that helps with this tremendously. It's wpmu_update_blogs_date
and it invalidates the site cache every time it needs to update the timestamp (when a post is published or update for example).
I wrote this little mu-plugin to help reproduce the problem:
add_action( 'muplugins_loaded', function() { switch_to_blog( 3 ); wpmu_update_blogs_date(); restore_current_blog(); die(); } );
I fire some traffic at the site and observed the MySQL query log to confirm a significant amount of UPDATE
queries for the site timestamp, as well as SELECT * FROM wp_blogs ...
queries which came from WP_Site::get_instance()
.
Next, I simulated some turbulence in the network, by killing all established connections on the MySQL port 3306 (using tcpkill), resulting in a few "MySQL has gone away" errors in the PHP error log. I did this a couple of times until the site with the ID of 3 disappeared from My Sites in the Network Admin.
I confirmed the object cache state in Redis:
127.0.0.1:6379> get sites:3 "-1"
I was able to reproduce this quite reliably with about 50 concurrent requests to the site, and 2-3 rounds of tcpkill. Of course it doesn't happen all that often under normal conditions, though I was unlucky enough to see it happen twice in the last 6 months (on very high traffic sites with plenty of updates).
Personally I feel like we should revert this entire optimization. As much as I appreciate all the hard work, I don't see much benefit of caching IDs that don't exist, much like we don't cache posts that do not exist.
At the very least though, we should check the last_error
before caching a -1
in both cases.
Change History (4)
This ticket was mentioned in Slack in #core by arunas. View the logs.
18 months ago
This ticket was mentioned in PR #4574 on WordPress/wordpress-develop by @ideag.
18 months ago
#4
- Keywords has-patch added
This PR utilises the fact that $wpdb->get_row()
will return a null
when no entries were found vs false
when it experienced an error. A null
result is still cached (site/network does not exist), while a false
one (query failed) is not.
Trac ticket: https://core.trac.wordpress.org/ticket/57483
Looping in @flixos90 as he worked on the original patch.