#62692 closed defect (bug) (fixed)
Performance regression for get_option in 6.4
Reported by: | rmccue | Owned by: | joemcgill |
---|---|---|---|
Milestone: | 6.8 | Priority: | normal |
Severity: | major | Version: | 6.4 |
Component: | Options, Meta APIs | Keywords: | has-patch has-unit-tests commit |
Focuses: | performance | Cc: |
Description
In 6.4, get_option()
was changed to pull individual items from the object cache before checking notoptions
as a performance optimisation (#58277).
Before this change, calling get_option( 'notexist' )
would do the following:
- On the first call, it would check
notoptions
(and load this from the external cache if not set). - If not set in
notoptions
, it would check the individual cache key for the option (wp_cache_get( 'notexist', 'options' )
), and add it tonotoptions
- On the second call, it would check
notoptions
and return early.
This meant that subsequent calls would only check the static notoptions
which would likely be loaded into memory.
After this change, calling get_option( 'notexist' )
would do the following:
- On every call, check the individual cache key for the option (
wp_cache_get( 'notexist', 'options' )
) - If not set, then check
notoptions
- On the second call, check the individual cache key again
This would seem like an optimization - in the original, it was noted that the common case is a cache hit, so this should be better. However, this misses two important pieces of information.
Firstly, notoptions
is a single entry for the entire page, and always exists - this means caches will store this in memory and generate at most 1 external request to the cache.
Secondly, not all object caches implement an internal cache of misses. Notably, the popular wp-redis drop-in does not. Even if they did, this would still be worse for performance, since notoptions
is a single get across the entire request.
We experienced this with an enterprise customer who upgraded recently, and this caused a massive increase in load on their caching servers (saturating the network links), as they have a large number of cache misses. (They have sparse options stored across a multisite network.)
This original change should be reverted.
To replicate this, call get_option( 'notexist' )
in a loop. Prior to 6.4, this would generate 2 total cache gets; after 6.4, it generates an infinite number, as each iteration of the cycle triggers a cache get which misses.
Attachments (7)
Change History (41)
This ticket was mentioned in PR #8004 on WordPress/wordpress-develop by @joemcgill.
5 weeks ago
#2
- Keywords has-patch has-unit-tests added
This reverts https://core.trac.wordpress.org/changeset/56595 from trunk
.
Trac ticket: https://core.trac.wordpress.org/ticket/62692
#3
follow-up:
↓ 4
@
5 weeks ago
Thanks @rmccue. I've prepared a revert PR in PR #8004 for testing purposes.
Curious, since you've already debugged this issue if you found any workarounds in the mean time?
#4
in reply to:
↑ 3
@
5 weeks ago
Replying to joemcgill:
Curious, since you've already debugged this issue if you found any workarounds in the mean time?
For the customer this affects, we worked around it at a higher level - i.e. we avoided calling get_option()
for the missing options in places where we have a high volume of calls.
The overall increase in cache traffic is observable across multiple customers who upgraded, but we also have this mitigated to some extent by caching misses in a different caching drop-in that we use on other customers. (That alternative caching drop-in knows when a cache miss is invalidated by a cache set elsewhere, so it's safer to do so.)
#5
@
5 weeks ago
(Note: the memcache and memcached drop-ins both cache misses internally so are less affected by this, but it's still less efficient than previously. Other drop-ins may or may not, it seems inconsistent.)
#6
@
5 weeks ago
Screenshot 2024-12-13 at 21.45.52.png shows the GET
-type commands being sent to our Redis cluster before and after the upgrade to 6.4+, as a point of reference for the magnitude in change caused by this.
#7
@
5 weeks ago
I have run some testing on my local and compare the code before and after the revert patch.
Before the patch:
>> wp cache flush Success: The cache was flushed. >> wp shell --url=<site-url> > global $wp_object_cache > $wp_object_cache->redis_calls; = [ "get" => 182, "set" => 198, ] > get_option('home') = "<site-url>" > $wp_object_cache->redis_calls; = [ "get" => 182, "set" => 198, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 183, "set" => 199, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 184, "set" => 199, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 185, "set" => 199, ]
After the revert patch:
>> wp cache flush Success: The cache was flushed. >> wp shell --url=<site-url> > global $wp_object_cache > $wp_object_cache->redis_calls; = [ "get" => 99, "set" => 198, ] > get_option('home') = "<site-url>" > $wp_object_cache->redis_calls; = [ "get" => 99, "set" => 198, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 100, "set" => 199, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 100, "set" => 199, ] > get_option('something_not_exists') = false > $wp_object_cache->redis_calls; = [ "get" => 100, "set" => 199, ]
The above is the experiment to proof the concept Ryan's said here:
To replicate this, call get_option( 'notexist' ) in a loop. Prior to 6.4, this would generate 2 total cache gets; after 6.4, it generates an infinite number, as each iteration of the cycle triggers a cache get which misses.
#8
follow-up:
↓ 9
@
5 weeks ago
Pinging @spacedmonkey since you worked on the original commit.
I'm not opposed to a revert given the performance feedback, but we should still weigh off whether that's the right decision. Given this feedback comes ~1 year after the 6.4 release, it also feels excessive to revert, as nobody reported this before.
Obviously that doesn't make the issue any less valid, but I think based on this being used in production for so long without known issues, we should first verify that reverting this wouldn't have adverse consequences - as in: Was the original fix just a very minor improvement that is okay to revert?
#9
in reply to:
↑ 8
@
5 weeks ago
Replying to flixos90:
I'm not opposed to a revert given the performance feedback, but we should still weigh off whether that's the right decision. Given this feedback comes ~1 year after the 6.4 release, it also feels excessive to revert, as nobody reported this before.
Such is the pace of enterprise customers :)
FWIW, we have observed a definite increase on other customers, but not to the same extent.
I've attached data from 6 other customers who all have the same pattern of an increase in cache traffic after upgrading to WP 6.4. Data is all the same window, 15 months, but they upgraded at different times.
In these cases, we didn't notice it as our systems scaled up to cope with the additional traffic, but it is clear in hindsight. In Screenshot 2024-12-17 at 13.54.37@2x.png, I have also noted when the customer switched object cache implementation to our alternative one (Afterburner).
Clearly, it's cache-dependent, but it's also not a clear bug for the cache implementation to work the way it does. In our alternative implementation, we have authoritative knowledge of which items are stored and so can determine misses much more effectively in the cache itself. (The traffic would have also reduced in general just based on the way that cache works, so I would not read too much into that.)
Obviously that doesn't make the issue any less valid, but I think based on this being used in production for so long without known issues, we should first verify that reverting this wouldn't have adverse consequences - as in: Was the original fix just a very minor improvement that is okay to revert?
My larger point here is that I actually do not think the original change would have improved performance. The metric used to justify that change was the number of calls to the cache, which is not a useful figure to judge based off.
Calling wp_cache_get( 'notoptions', 'options' )
repeatedly does not cause remote calls, because it's part of the cache API that a successfully fetched value is stored in memory in PHP (formalised in #55080); in other words, calling this repeatedly is effectively guaranteed to be an O(1) lookup from memory.
It was noted in the original ticket that any speedup was minimal (and potentially within the margin of error) - I think effectively, it is optimising for the wrong metric.
#10
@
5 weeks ago
From diving into implementations:
- Automattic wp-memcached: Not affected
- Automattic wp-cache-memcached: Not affected
- Pantheon wp-redis: Affected
- Litespeed Cache: Not affected (available on .org, 6M installs)
- W3 Total Cache: Affected (available on .org, 1M installs)
- Rhubarb redis-cache: Affected (available on .org, 200k installs)
- Docket Cache: Affected (available on .org, 20k installs)
This seems common enough that it's not just a wp-redis problem.
#11
@
5 weeks ago
Thanks for the added context, @rmccue. I tend to agree that the original commit was likely not resulting in the improvement that it was meant to produce. Given that a revert of this change will not be backported to prior versions of WordPress, it may be worthwhile for effected caching plugins to put in place a mechanism to mitigate cache gets against known cache misses.
#12
@
5 weeks ago
I am personally against reverting this before we can explore other options. This has lived in core 15 months now. When making this change, profiles were run against object caching and none object caching alike. Sites with cache missing and none object cache site has notable benefits. Revert this was effect those users to super serve a number of users using redis implementations.
I guess I don't understand why the mountain is moving here, when this is for the object cache plugin maintainer to pick to fix this. It is also worth noting, this change was well sign posted, with a dev note - https://make.wordpress.org/core/2023/10/17/improvements-to-object-caching-in-wordpress-6-4/.
#13
follow-up:
↓ 14
@
5 weeks ago
Instead of reverting the change entirely, it is it possible to fix it so that it no longer has the issue?
Looking at the code, it looks like the old behavior before [56595] was (in very high-level pseudocode):
- Check
notoptions
- Check
alloptions
- Check the cache for the specific option
- Check the database for the specific option
The current behavior (since [56595] was committed) is:
- Check
alloptions
- Check the cache for the specific option
- Check
notoptions
- Check the database for the specific option
It seems like what you really want is this:
- Check
alloptions
- Check
notoptions
- Check the cache for the specific option
- Check the database for the specific option
Does that make sense?
#14
in reply to:
↑ 13
@
5 weeks ago
Replying to siliconforks:
It seems like what you really want is this:
- Check
alloptions
- Check
notoptions
- Check the cache for the specific option
- Check the database for the specific option
Does that make sense?
If we can, I can see that making sense; seems a good split between the original intention (lowering calls for the common path) and improving the worst-case scenario. There's a slight possibility of cache confusion if a value is set but isn't deleted from notoptions
due to a race condition or similar - I think the likelihood of that is low (not zero; see #31245 for prior alloptions bugs along those lines), and no worse than before from what I can see.
Replying to spacedmonkey:
Sites with cache missing and none object cache site has notable benefits. Revert this was effect those users to super serve a number of users using redis implementations.
I understand this perspective, but I'm also comparing a slight performance decrease for many users against a catastrophic hit to performance at scale which caused a multiple day intermittent outage and massively increased cache server traffic. We can ignore the singular incident if you'd like for this conversation, but even just from a purely theoretical standpoint the original patch was problematic in causing excess reads.
Also from the original ticket, I don't see that there were notable benefits, but maybe I'm misreading those results.
I take your point on the delay in feedback - we didn't notice for other customers as our systems automatically scaled up for the additional load, and we're using alternative object caches which don't have this problem.
I guess I don't understand why the mountain is moving here, when this is for the object cache plugin maintainer to pick to fix this.
I think we should likely do some advocacy to have these caches act consistently, but as noted in my investigation above, it's inconsistent at best; it affects most implementations that I checked.
This ticket was mentioned in PR #8030 on WordPress/wordpress-develop by @joemcgill.
4 weeks ago
#15
Trac ticket: https://core.trac.wordpress.org/ticket/62692
This is an alternate approach to #8004 which moves the check of 'notoptions' before checking the 'options' cache while leaving the 'alloptions' check first.
#16
@
4 weeks ago
For the sake of comparison, I've opened PR #8030 to implement @siliconforks suggestion. This will still avoid calls to the notoptions cache if the value was already autoloaded, but will load check for notoptions prior to checking the object cache.
@peterwilsoncc commented on PR #8030:
4 weeks ago
#17
I wrote some rough tests for this and while it fixes the problem for a non-existent option, it appears to introduce the same problem to an option that exists but is not autoloaded.
These tests pass with this patch but as you can see, the cache hits for exists, not autoloaded
goes up from three to twelve.
The test I was running follows. It will almost certainly break on some of the third party tests so is not ready for commit.
/**
* Test that get_option() does not hit the external cache multiple times for the same option.
*
* @ticket 62692
*
* @dataProvider data_get_option_does_not_hit_the_external_cache_multiple_times_for_the_same_option
*
* @param int $expected_connections Expected number of connections to the memcached server.
* @param bool $option_exists Whether the option should be set. Default true.
* @param string $autoload Whether the option should be auto loaded. Default true.
*/
public function test_get_option_does_not_hit_the_external_cache_multiple_times_for_the_same_option( $expected_connections, $option_exists = true, $autoload = true ) {
if ( ! wp_using_ext_object_cache() ) {
$this->markTestSkipped( 'This test requires an external object cache.' );
}
if ( ! function_exists( 'wp_cache_get_stats' ) ) {
$this->markTestSkipped( 'This test requires the Memcached PECL extension.' );
}
if ( $option_exists ) {
add_option( 'ticket-62692', 'value', '', $autoload );
}
wp_cache_delete_multiple( array( 'ticket-62692', 'notoptions', 'alloptions' ), 'options' );
$stats = wp_cache_get_stats();
$connections_start = array_shift( $stats )['cmd_get'];
$call_getter = 10;
while ( $call_getter-- ) {
get_option( 'ticket-62692' );
}
$stats = wp_cache_get_stats();
$connections_end = array_shift( $stats )['cmd_get'];
$this->assertSame( $expected_connections, $connections_end - $connections_start );
}
/**
* Data provider.
*
* @return array[]
*/
public function data_get_option_does_not_hit_the_external_cache_multiple_times_for_the_same_option() {
return array(
'exists, autoload' => array( 1, true, true ), // 1 on trunk.
'exists, not autoloaded' => array( 12, true, false ), // 3 on trunk.
'does not exist' => array( 3, false ), // 12 on trunk.
);
}
#18
follow-up:
↓ 19
@
4 weeks ago
I've done some testing with an object cache in a pull request. The tests are hacked up messes, but the additional test is the important bit, determining how often the external cache is hit when calling get_option()
10 times.
This is the number of calls to the external cache I am seeing in each case (testing for get
)
Option | Trunk | PR#8030 (redo) | Original reverted |
---|---|---|---|
Exists, autoloaded | 1 | 1 | 2 |
Exists, not autoloaded | 3 | 12 | 3 |
Non-existent | 12 | 3 | 3 |
The redo removes the excess calls for non-existent options, but appears to move the problem to options that are not autoloaded.
Given this, I think the best approach initially is to
- revert the original
- add tests for external cache hits (they need work so as not to break host tests using a different persistent cache implementation)
- continue work throughout the release to see if we can get the cache hits down.
#19
in reply to:
↑ 18
@
4 weeks ago
Replying to peterwilsoncc:
Option Trunk PR#8030 (redo) Original reverted Exists, autoloaded 1 1 2 Exists, not autoloaded 3 12 3 Non-existent 12 3 3 The redo removes the excess calls for non-existent options, but appears to move the problem to options that are not autoloaded.
I think that might be because PR #8030 appears to be missing a line?
Line 178 in the current trunk seems to have gotten lost in the PR.
@joemcgill commented on PR #8030:
4 weeks ago
#20
As @siliconforks noticed, I had failed to cache the empty 'notoptions' array, which was causing extra hits to external cache when the option existed and was not autoloaded, rather than pulling the 'notoptions' value from memory. This has been addressed in e3c900e.
@joemcgill commented on PR #8030:
4 weeks ago
#21
@peterwilsoncc I went ahead and committed your tests to this PR but noticed you mention this in Trac:
...they need work so as not to break host tests using a different persistent cache implementation
Can you share more details about what needs to be updated?
@peterwilsoncc commented on PR #8030:
4 weeks ago
#22
I ran out of time to test yesterday but I think the test will need to ensure cmd_get exists as it’s only reported on memcached.
Other caches use a different structure for the stats, Redis reports it as ‘get’.
All this is from memory while checking my email over lunch.
@peterwilsoncc commented on PR #8030:
3 weeks ago
#23
@joemcgill I've taken the liberty of pushing an update to my earlier tests to your branch. edae41577d7bad750ddf57d235091dfdfdd4ea2b introduces a helper to get the number of get calls to the object cache.
Your changes in e3c900ed807c6cc92a848f31a0e4d4709bf97885 cause Tests_Option_Option::test_get_option_notoptions_do_not_load_cache()
to fail, as the notoptions cache is now expected to exist whenever get_option()
is called, regardless of whether the option exists or not. The test was introduced in Core-58277 so it might be a simple case of removing it.
@joemcgill commented on PR #8030:
2 weeks ago
#24
Thanks @peterwilsoncc. I think the new unit test added in c3c6b56 is sufficient so that we can just remove the failing test that is no longer accurately representing the expected behavior. I've removed that test and added made a few small tweaks to the existing ones. I think this change is pretty much ready for final sign-off.
#25
@
2 weeks ago
- Milestone changed from Awaiting Review to 6.8
- Owner set to joemcgill
- Status changed from new to assigned
Even once a fix for this is committed, there will still be many sites affected by this issue due to object cache implementations not handling cache misses when the site is on an affected version of WP (i.e. 6.4–6.7)
I've opened issues on each of the affected implementations listed in Ryan's comment) so sites using any of those plugins can receive a fix prior to updating to the first major release where this is addressed.
We could consider backporting this to 6.7 since the release date for 6.8 is still quite a ways off, but do not backport bugfixes to any older branches.
@joemcgill commented on PR #8004:
2 weeks ago
#26
Closing in favor of #8030.
#27
@
2 weeks ago
I've approved PR#8030 with the updated approach.
I took a quick look at a plugin running on pre_option
to see if I could create a plugin to work around the issue but was unable to do so as the get_option()
continues running if the return value is false
.
This ticket was mentioned in Slack in #core-performance by joemcgill. View the logs.
13 days ago
#29
@
5 days ago
Based on the recent activity here, I assume one of you @joemcgill @peterwilsoncc will handle this one? It's probably good to commit.
And I believe nothing is in the way of backporting to the 6.7 branch either.
#30
follow-up:
↓ 33
@
5 days ago
- Component changed from Cache API to Options, Meta APIs
- Keywords commit added
I will commit shortly, I've changed the component to reflect that the error was how the options API was using caching rather than a bug with the cache per se.
I'm hesitant to backport the fix to the 6.7 branch as the issue appears to be solvable in persistent cache drop-ins, so a workaround is available. Please correct me if I am incorrect.
@peterwilsoncc commented on PR #8030:
5 days ago
#32
#33
in reply to:
↑ 30
@
4 days ago
Replying to peterwilsoncc:
I'm hesitant to backport the fix to the 6.7 branch as the issue appears to be solvable in persistent cache drop-ins, so a workaround is available. Please correct me if I am incorrect.
I agree with this logic along with the fact that this bug has only resulted in this significant report over the past year. Thanks all!
cc: @spacedmonkey who worked on the original ticket.