Make WordPress Core

Opened 22 months ago

Closed 21 months ago

Last modified 9 months ago

#57625 closed enhancement (fixed)

WP_Query cache memory leak

Reported by: owi Owned by: spacedmonkey
Milestone: 6.3 Priority: normal
Severity: normal Version:
Component: Cache API Keywords: has-patch early commit has-unit-tests has-dev-note
Focuses: performance Cc:

Description (last modified by peterwilsoncc)

Hey,

I am not sure whether this qualifies as a bug or should be considered a possible configuration enhancement.

When queries are cached, the generated cache keys are suffixed with the
wp_cache_get_last_changed() value for the posts group (and, if the query contains a tax query, additionally with the last_changed value from the terms group).

Now this creates a problem with most persistent cache backends, which store keys with an infinite TTL (because why not?).
As soon as the last_changed key changes its value, it immediately invalidates all the previous keys, but the obsolete keys obviously remain in the storage forever.

That may lead to memory problems as the number of keys grows over time (quite quickly on high-traffic sites), and the only way to avoid OOM issues is to set the expiry of the wp_query cache keys to some arbitrary number.
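For context, the key construction described above looks roughly like this (a simplified sketch of core's approach, not the literal implementation; variable names are illustrative):

```php
<?php
// Simplified sketch of how WP_Query derives its cache key. The
// `$last_changed` suffix is what invalidates every previously stored
// key as soon as the timestamp is bumped.
$key          = md5( serialize( $query_vars ) . $sql );
$last_changed = wp_cache_get_last_changed( 'posts' );

if ( ! empty( $tax_query ) ) {
	// Queries with a tax_query also append the terms group timestamp.
	$last_changed .= wp_cache_get_last_changed( 'terms' );
}

$cache_key = "wp_query:$key:$last_changed";
$cached    = wp_cache_get( $cache_key, 'posts' );
```

Every bump of last_changed silently orphans all keys built with the previous suffix; nothing ever deletes them.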

The problem is easily reproducible with a bare WordPress 6.1 install and a persistent object cache like Redis.

  1. Start with empty cache storage and fill it up a bit by refreshing a few pages through the admin or front site
  2. You should notice that multiple refreshes of the same URL do not increase the number of query cache keys
  3. Open the new post screen (it's enough to just refresh it) or perform any action against posts.
  4. Navigate back to the previously opened post/screen and just refresh it
  5. Notice how new keys pop up in the storage, as last_changed was bumped, thus invalidating the previous keys.

There is an even shorter way to reproduce it.

  1. Bare WP 6.1 install with an object cache
  2. Flush the cache
  3. Navigate to the new post screen and just refresh it a few times (no need to create a post at all)
  4. Each refresh creates a new set of cache keys for queries such as revisions, as every refresh touches the last_changed timestamp.

I hope that my explanation is clear; if not, I am more than happy to clarify further.

---

Edited per comment 2 below -- peterwilsoncc

Attachments (1)

redis.png (72.9 KB) - added by skithund 22 months ago.
Redis graphs


Change History (52)

#2 in reply to: ↑ description @owi
22 months ago

That part

  1. Refresh previously open post/screen and just refresh it

Should be something like:

  1. Navigate back to previously open...

I would appreciate it if anyone can fix that in the original description ;)

#3 follow-up: @peterwilsoncc
22 months ago

I think this is a little of two things:

  • a configuration error on the persistent cache: I think setting a TTL or maxmemory and an eviction policy would fix OOM errors
  • using the last_changed times in WordPress isn't optimal

The latter is difficult to resolve as not all persistent caches allow for items to be deleted by group or cache prefix and WordPress needs to account for that situation.

For systems that do allow for bulk deletion of cache keys via group or prefix, I suspect there are some ways WordPress could optimise for this but will need to think it through a little and discuss with some experts.

#4 @peterwilsoncc
22 months ago

  • Description modified (diff)

#5 in reply to: ↑ 3 @owi
22 months ago

Thank you for your insight @peterwilsoncc !

  • a configuration error on the persistent cache: I think setting a TTL or maxmemory and an eviction policy would fix OOM errors

I completely agree on the need for an eviction policy for edge cases where the server is reaching max memory - and that's a simple config setting. Redis (or any other provider) should not choke, and should act before it hits OOM.

But that is just a cure for the symptom and not the cause. Why should we be evicting potentially valid keys just because one of our cache groups is misbehaving and causing the storage to operate constantly at high (near-max) memory usage?

On high-traffic sites the ratio of wp_query cache keys to other keys is around 10:1 after just 3-5 days. That's definitely an unhealthy situation which should not happen. Turning on shorter TTLs or eviction policies for the whole server is IMO counterproductive for caching in general.

I worked around it by developing a small solution that sets a short TTL for keys with the wp_query (posts group) and get_term (terms group) prefixes. Nevertheless, as written in the ticket description, I believe setting an arbitrary number is still sub-optimal, though it works in the short term.

Alternatively, I was thinking about writing a cron job to check the last_changed key and remove the keys which are invalid (and that's probably the way to go)

The latter is difficult to resolve as not all persistent caches allow for items to be deleted by group or cache prefix and WordPress needs to account for that situation.

That was my thinking - is there a possibility that query caches could belong to their own cache group which we flush on certain events? Or, even better, via a cron event, as doing that during a regular request might be tricky/counterproductive? Is moving keys between cache groups backwards compatible in the WordPress context?

While not all providers support flushing by prefix or by group, I believe we should still handle it in some way for the ones that do, as all providers suffer from this issue and we should reduce its impact.

I am in a good position where I have full control over my Redis servers and a fairly advanced infrastructure. Many people might get a managed Memcached or Redis instance and run into problems with their provider complaining about storage/memory use.

Let me know if I can be of any help!

#6 @peterwilsoncc
22 months ago

@owi I'm considering whether an action can be created to allow persistent caches to detect when the last changed value has been updated.

Something like this but it would also need to consider the creation of values in wp_cache_get_last_changed().

<?php
/**
 * Sets the last changed time for the 'posts' cache group.
 *
 * @since 5.0.0
 */
function wp_cache_set_posts_last_changed() {
        /*
         * This value is obtained via the cache key directly to avoid the
         * generation of a new key by `wp_cache_get_last_changed()`. This
         * allows the action to indicate whether the value was set prior
         * to this function call.
         */
        $previous_value = wp_cache_get( 'last_changed', 'posts' );

        $value = microtime();
        wp_cache_set( 'last_changed', $value, 'posts' );

        /**
         * Fires after the 'posts' cache group last changed time is updated.
         *
         * @since x.x.x
         *
         * @param string       $value          The new last changed time.
         * @param string|false $previous_value The previous last changed time.
         *                                     False indicates no value was set.
         */
        do_action( 'wp_cache_set_posts_last_changed', $value, $previous_value );
}

Persistent cache developers could use the action to clear cache keys via wildcard.

#7 @owi
22 months ago

@peterwilsoncc I am not a maintainer of any of the persistent storage plugins, so I can't speak in their name, but I have a few notes which may give some insight from my perspective.

Notes are in no particular order :) :

  • AFAIK selective key removal (in Redis) is an atomic operation of O(n) complexity and might take a while to execute - do we want that to happen on an action? What are the possible side effects if it gets stuck? The action can most probably be made async (depending on the plugin and engine), but that could still pose problems, especially when actions pile up.
  • As the same issue happens with the get_terms cache, we should plan to build the same action for both cases.
  • On one hand it is a proper approach to shift that work to plugin developers so they can use native methods for flushing (who knows their storage engine better?). On the other hand - do we want such critical functionality to be implemented in 50 different ways?
  • And lastly, I am not a maintainer, but I can imagine that some of them may say that it is a Core issue (because, frankly, it is) and ask Core devs to fix it.

I think that such a mass operation should most probably be detached from a regular WP request and happen on cron or similar.

I am going to open a ticket within my favourite plugin - redis-cache - and ask what their opinion is.

#8 @tillkruess
22 months ago

Having a filter like wp_cache_set_posts_last_changed would allow expired cache keys to be invalidated atomically (or be queued for deletion). A bit annoying to implement, but easy to add in core.

Something less annoying would be a better API.

<?php
$key          = md5( $query );
$last_changed = wp_cache_get_last_changed( 'posts' );

// current implementation
$cache_key    = "find_post_by_old_slug:$key:$last_changed";
$cache        = wp_cache_get( $cache_key, 'posts' );

// better approach
$cache_key    = "find_post_by_old_slug:$key";
$cache        = wp_cache_get_latest( $cache_key, 'posts', $last_changed );

That would allow plugins like Redis Object Cache or Object Cache Pro to invalidate more efficiently and on the fly.

The "find_post_by_old_slug:$key" could be a hash map, so the entire cache's keyspace doesn't need to be scanned, which is always slow.

#9 @tillkruess
22 months ago

#56135 was marked as a duplicate.

#10 follow-up: @spacedmonkey
22 months ago

How about changing the cache group that queries are stored in?

$cached_results = wp_cache_get( $cache_key, 'posts', false, $cache_found );

Becomes

$cached_results = wp_cache_get( $cache_key, 'post-query', false, $cache_found );

Then we could do this

function clean_post_cache( $post ) {
        // ...
        if ( wp_cache_supports( 'flush_group' ) ) {
                wp_cache_flush_group( 'post-query' );
        }
}

If we make this change, it should be applied to all query caches, so WP_Comment_Query, WP_Term_Query, WP_Site_Query and WP_Network_Query, as well as the caches found in get_pages and get_page_by_path.

There are other query caches that have been in core for years, so why is this such a problem now?

#11 @owi
22 months ago

@spacedmonkey it seems that could be a good separation (maybe?) but that circles back to my original question

That was my thinking - is there a possibility that query caches belong to its own cache group and we flush it on certain events? Or even better by setting an event in cron as it might be tricky/countrproductive doing that during regular operation? Is moving keys between cache groups backwards compatible in the WordPress context?

Is moving keys to its own cache group backwards compatible? Can we do it at all considering how WordPress dev tries to not break backwards compatibility?

Regarding your last question:

There are other query caches that been in core for years, so is this such a problem now?

My perspective may be anecdotal, but I have been managing quite high-traffic WP installations for years (hundreds of thousands of visits daily), all built on Composer, Redis, CI/CD etc., and all previous query caches behaved quite well; if there was a memory leak, it was usually a developer's fault (mostly due to working around WP built-in functions).

Since 6.0 with the terms query and 6.1 with the posts query, Redis can fill up within a day or two... For example, it can generate 1.2M keys, of which 1.1M are wp_query keys, of which 95-99%+ are already invalid ;) And once an editor just opens the Gutenberg screen, last_changed is bumped and the story starts again :)

The best illustration of how problematic this is comes from reproducing the steps in my original post. Fresh WP, fresh DB: opening the new post screen, depending on conditions, spawns a few wp_query keys, and only the last ones are valid. Refresh it a few times and you have 20 invalid keys. Multiply that by 10k or 100k+ posts and 20+ users and it's a recipe for disaster.

@skithund
22 months ago

Redis graphs

#12 follow-up: @skithund
22 months ago

There are other query caches that been in core for years, so is this such a problem now?

You can see pretty clearly in my attached graph when we upgraded to WP 6.0 on a busyish multisite WooCommerce installation. After a couple of OOM hiccups on this site we added WP_REDIS_CACHE_TTL to WordPress and maxmemory to Redis as a quick-fix "eviction policy" on all our WP installations (200+).
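The quick-fix mentioned above amounts to something like this (WP_REDIS_CACHE_TTL is a constant read by the Redis Object Cache drop-in; the value here is illustrative):

```php
<?php
// wp-config.php: cap every key written through the Redis Object Cache
// drop-in at one day, so orphaned query keys eventually expire.
define( 'WP_REDIS_CACHE_TTL', DAY_IN_SECONDS );
```

together with a maxmemory limit in redis.conf as a backstop.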

I still haven't found time to debug further what's the root cause for terms invalidating pretty much on every timed cron run. Possibly a badly behaving "Something for WooCommerce Pro" plugin.

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

#13 @spacedmonkey
22 months ago

@owi It is worth noting that caching can be disabled for WP_Query if you so desire. See the original blog post: https://make.wordpress.org/core/2022/10/07/improvements-to-wp_query-performance-in-6-1/. Until the fix is merged, this might be the best path forward.

As for this issue: the point of an object cache is that you should allow it to fill up, and the object cache is in charge of deciding how to get rid of objects. The idea is to fill up the cache and let objects fall out of it. Not all objects in the cache should live forever; there should be eviction done at the object cache level (Memcached or Redis). As noted in the description, a TTL of forever does not seem right. Posts and terms may not change, but other caches will only live until the last_changed value is updated.
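Per the linked dev note, query caching can be opted out of per query via the cache_results argument:

```php
<?php
// Disable result caching for a single query (argument documented in
// the 6.1 dev note linked above).
$query = new WP_Query(
	array(
		'post_type'     => 'post',
		'cache_results' => false,
	)
);
```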

#14 @owi
22 months ago

@spacedmonkey I have addressed all of your concerns in my previous messages so I am not sure if I should repeat myself in here. If you do not mind please reread them to get a better understanding of my position.

But a short TLDR:

  • eviction is not a solution, because we would also be evicting a lot of valid keys. And when operating around max capacity we are constantly evicting "good" keys as well, regardless of the policy we use. It's just a choice between lesser evils.
  • why is an infinite or very long TTL wrong? Assuming no parameters change, the cache key is just as valid a minute and a month after creation.
  • a short-lived TTL is just a magic number trying to paper over a design issue - there is no good number: make it too small and your cache is useless, make it too high and you need to evict keys as you max out. And the right TTL fluctuates over time, depending on traffic and other factors.
  • as WP users we do not store only posts but also, for example, transients. Transients can be leveraged to store things which are expensive to compute. A short TTL or eviction makes that less efficient.

Regarding turning it off:
For myself, I have developed a solution where I filter the cache keys, and when I stumble upon wp_query or get_terms I just explicitly store them with a 1-2 day TTL. That is something of a rotten compromise between having your cake and eating it too.
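A sketch of the TTL workaround described here, assuming a custom object-cache drop-in: core exposes no hook for intercepting cache writes, so the helper below would have to be called from the drop-in's set() method, and the prefixes and TTL are illustrative:

```php
<?php
// Return a shorter TTL for query-cache keys, leaving everything else
// on the default. Hypothetical helper, to be called from a custom
// object-cache drop-in's set() method.
function my_query_cache_ttl( $key, $default_ttl ) {
	foreach ( array( 'wp_query', 'get_terms' ) as $prefix ) {
		if ( 0 === strpos( $key, $prefix ) ) {
			return 2 * DAY_IN_SECONDS; // Let stale query keys expire.
		}
	}

	return $default_ttl;
}
```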

I am sorry that this sounds harsh, but there is a clear flaw in how we handle wp_query keys in Core, so we need a better solution and not questionable workarounds. I understand that this is not something which can be quick or easy, but we can definitely do better ;)

#15 in reply to: ↑ 10 @tillkruess
22 months ago

Replying to spacedmonkey:

How about changing the cache group that query are stored in?

$cached_results = wp_cache_get( $cache_key, 'post-query', false, $cache_found );

The issue with grouping the comment-queries and post-queries would be that we'd invalidate other queries that don't need to be invalidated.

<?php
// `comment-queries` group
"get_comments:$key:$last_changed"
"get_comment_child_ids:$parent_id:$key:$last_changed"

// `post-queries` group
"get_page_by_path:$hash:$last_changed"
"adjacent_post:$key:$last_changed"
"wp_get_archives:$key:$last_changed"

I wonder if it would be good to have a dedicated helper for "query" caching, since it's quite repetitive. We do have ~20 occurrences.

<?php
function wp_cache_query( $type, $key, $last_changed, $group );

wp_cache_query( 'get_objects_in_term', md5( $sql ), $last_changed,  'terms' );

That way object cache implementation knows:

  1. it's a query
  2. the query's group
  3. the group's last-changed timestamp

And they can handle expiration, invalidation and storage however they like.

Last edited 22 months ago by tillkruess (previous) (diff)

#16 follow-up: @owi
22 months ago

This conversation has been a bit stuck for a while. Is there anything I can do to help move it forward?

#17 in reply to: ↑ 16 @tillkruess
22 months ago

Replying to owi:

This conversation has been a bit stuck for a while. Is there anything I can do to help move it forward?

Let's hope it's not 10+ years, like other tickets. You could open a PR to core with my proposed wp_cache_query() method, I'm happy to review it.

#18 @spacedmonkey
22 months ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release
  • Version set to 6.1

Currently there is no patch or PR to fix this issue. A good starting place would be to get one ready so we can start the process.

To be realistic, the fix will not go into 6.2; this is too big a change to get in, so the earliest it will land in core is 6.3, which is months away. We have time here to work out a solution. If this is affecting your site on a day-to-day basis, then I would recommend turning the caching off until a solution can be found in core. This just returns you to how core behaved before 6.1, so it should not be too bad. You can also implement your own solution using existing filters.

To be clear, WordPress has to support a lot of different types of object caching, including Memcached, SQLite and Redis. Not all of these support flushing cache groups, so any solution will have to be a progressive enhancement. It will likely require object cache drop-in creators to do something. So again, this is not going to be a quick fix.

I am sorry that this sounds harsh but there is a clear flaw in how we handle wp_query keys in Core so we need a better solution and not questionable workarounds.

I am not sure this kind of discourse is helpful to the discussion. It is not clear to me that there is a flaw in core; I think "suboptimal" is a better way of putting it. But we are trying to make a solution that works for many types of object caching.

I am willing to work on this ticket. Let's try to work towards a solution and keep the conversation focused on a workable patch.

This ticket was mentioned in PR #4077 on WordPress/wordpress-develop by @tillkruess.


22 months ago
#19

  • Keywords has-patch added; needs-patch removed

As per discussion with @spacedmonkey on Slack.

Trac ticket: https://core.trac.wordpress.org/ticket/57625

#20 @tillkruess
22 months ago

As per a discussion with @spacedmonkey on Slack, I opened a PR that changes all of the WordPress 6.1 query caching to be stored in dedicated query groups for more flexibility.

Any thoughts?

@tillkruess commented on PR #4077:


22 months ago
#21

@peterwilsoncc Done 👍

#22 @peterwilsoncc
22 months ago

I'm happy with the linked pull request as of 9c0c4fdee7d as an initial step.

It's possible a function similar to the suggestion in comment #15 will be needed in addition to assist with persistent cache authors but setting the groups as an initial step will help move WordPress towards that.

#23 @peterwilsoncc
22 months ago

  • Keywords needs-dev-note early added
  • Milestone changed from Future Release to 6.3
  • Type changed from defect (bug) to enhancement
  • Version 6.1 deleted

Let's aim to get this in to 6.3.

I've switched the type over to enhancement and tagged it as requiring a dev note. It would be good to get it in early to allow for testing.

#24 @tillkruess
22 months ago

Love it. Do we need a new Trac ticket for it, since this one won't be solved by the PR?

@tillkruess commented on PR #4077:


22 months ago
#25

I'm not 100% convinced, but I feel like this would be a candidate that requires needs-dev-note on the Trac ticket.

That seems reasonable.

@tillkruess commented on PR #4077:


22 months ago
#26

Yep, I went with dashes since it was used more recently it appears. Kills my OCD.

@tillkruess commented on PR #4077:


22 months ago
#27

I really like this change.

It was your idea after all ❤️‍🔥

#28 @spacedmonkey
22 months ago

  • Keywords commit added

#29 @mukesh27
22 months ago

This ticket was discussed in the recent Performance bug scrub.

PR got enough approval for merge early for 6.3.

Props @spacedmonkey

#30 @spacedmonkey
21 months ago

  • Owner set to spacedmonkey
  • Status changed from new to assigned

#31 @spacedmonkey
21 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 55526:

Cache API: Introduce new queries cache groups.

Give developers more control over how query caches are handled within an object cache. Now all caches that store the result of a query are cached in a group suffixed with -queries. Developers can use these groups to add custom cache invalidation rules or to make them non-persistent.

Props spacedmonkey, owi, tillkruess, skithund, peterwilsoncc, flixos90, sergeybiryukov, mukesh27.
Fixes #57625.
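With the new groups in place, a site or drop-in can treat query caches differently from other caches, for example (a sketch using existing Cache API functions; group names per this commit):

```php
<?php
// Keep query-result caches out of the persistent backend entirely...
wp_cache_add_non_persistent_groups( array( 'post-queries', 'term-queries' ) );

// ...or flush just the query caches where the backend supports it.
if ( wp_cache_supports( 'flush_group' ) ) {
	wp_cache_flush_group( 'post-queries' );
}
```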

#33 @spacedmonkey
21 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening, as I noticed a bug. Site and network cache groups will need to be global cache groups.

This ticket was mentioned in PR #4214 on WordPress/wordpress-develop by @tillkruess.


21 months ago
#34

This is a follow-up to #4077, as suggested by @peterwilsoncc: https://core.trac.wordpress.org/ticket/57625#comment:6

This PR adds the wp_cache_set_last_changed() function, which fires the wp_cache_set_last_changed action when a cache group's last_changed timestamp is updated.

Trac ticket: https://core.trac.wordpress.org/ticket/57625
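A persistent cache drop-in could then listen for the action along these lines (a sketch; the parameter order follows the PR's proposal, and my_drop_in_purge_stale() is a hypothetical helper the drop-in would provide):

```php
<?php
// Queue stale query keys for deletion whenever a group's last_changed
// timestamp is bumped. `my_drop_in_purge_stale()` is hypothetical.
add_action(
	'wp_cache_set_last_changed',
	function ( $group, $time, $previous_time ) {
		if ( false !== $previous_time ) {
			my_drop_in_purge_stale( $group, $previous_time );
		}
	},
	10,
	3
);
```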

#35 @tillkruess
21 months ago

@peterwilsoncc I turned your comment into a followup PR: https://github.com/WordPress/wordpress-develop/pull/4214

This ticket was mentioned in PR #4215 on WordPress/wordpress-develop by @spacedmonkey.


21 months ago
#36

  • Keywords has-unit-tests added

@spacedmonkey commented on PR #4215:


21 months ago
#37

@felixarntz Happy for me to commit?

#38 @spacedmonkey
21 months ago

In 55537:

Cache API: Make network-queries and site-queries global cache groups.

Fixes a bug introduced in [55526]. The cache groups network-queries and site-queries store query results for networks and sites. These are network-level data and should be stored in a global cache group.

Props spacedmonkey, tillkruess, flixos90.
See #57625.

@spacedmonkey commented on PR #4214:


21 months ago
#40

I think this should be its own ticket. See https://core.trac.wordpress.org/ticket/37464#comment:17

I will hopefully merge it as its own ticket.

@tillkruess commented on PR #4214:


21 months ago
#42

Thanks @tillkruss, left some nit-pick feedback.

Thanks. Should I always use the version number, instead of n.e.x.t?

@tillkruess commented on PR #4214:


21 months ago
#43

This is pretty much what I was thinking, does this need an update following 82cd434?

I don't think so, because the groups won't change. The posts last_changed will be stored in the post-queries group.

#44 @spacedmonkey
21 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

@spacedmonkey commented on PR #4214:


20 months ago
#45

@tillkruss This PR needs to be rebased after https://github.com/WordPress/wordpress-develop/commit/19bd759db57b4f1cc72157a446915ffbafdd838e. I will re-review after this is done.

@tillkruess commented on PR #4214:


20 months ago
#46

@peterwilsoncc @SergeyBiryukov Any more input on this?

@tillkruess commented on PR #4214:


19 months ago
#47

I think this is ready to be merged. Any feedback @SergeyBiryukov?

#50 @nickchomey
14 months ago

Perhaps I'm missing something fundamental, but it seems to me that (at least part of) the conversation here is focusing on the wrong thing - cache key TTL rather than having an appropriate cache eviction policy for the cache data (e.g. LRU, which I believe is the default for Redis). https://redis.io/docs/reference/eviction/

If you use something like allkeys-lru, then the only way actively-used keys could get evicted is if your cache is too small - a hosting error, not WP. If you do want to use TTL/expire on specific keys, then perhaps volatile-lru could be used. It seems to me that this decision should be up to the site owner, not WP Core.
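For reference, the eviction policies mentioned can be configured like this (the memory limit is illustrative):

```shell
# Cap Redis memory and evict the least-recently-used keys regardless
# of TTL (suitable when everything in the instance is a cache):
redis-cli CONFIG SET maxmemory 256mb
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# Alternatively, only evict keys that carry a TTL:
# redis-cli CONFIG SET maxmemory-policy volatile-lru
```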

This all seems to roughly echo what Peter Wilson said in his initial comment: https://core.trac.wordpress.org/ticket/57625#comment:3

Perhaps a solution to the present problems is to give users and plugins more control over caching policies, such as whether to apply a TTL to a specific cache key. Also, to provide documentation/guidance on proper server cache (Redis) configuration - both for users who manage their own servers and for hosting providers to do a better job with.

This article from Ruby on Rails goes into detail about using key-based cache expiration, so it seems worth reviewing for inspiration.
https://signalvnoise.com/posts/3113-how-key-based-cache-expiration-works

I hope this is helpful rather than irrelevant/a distraction!

Version 6, edited 14 months ago by nickchomey (previous) (next) (diff)

#51 in reply to: ↑ 12 @manos2000
9 months ago

With WP_REDIS_CACHE_TTL set, will expired keys be evicted before max memory is reached?

Replying to skithund:

There are other query caches that been in core for years, so is this such a problem now?

You can see pretty clearly in my attached graph when we upgraded to WP 6.0 on a busyish multisite WooCommerce installation. After couple of OOM hiccups on this site we added WP_REDIS_CACHE_TTL to WordPress and maxmemory to Redis as a quick-fix "eviction policy" to all our WP installations (200+).

I still haven't found time to debug further what's the root cause for terms invalidating pretty much on every timed cron run. Possibly a badly behaving "Something for WooCommerce Pro" plugin.

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors.

Last edited 9 months ago by manos2000 (previous) (diff)