WordPress.org

Make WordPress Core

Opened 18 months ago

Last modified 2 months ago

#37189 new enhancement

In wp_term_query on cache ids

Reported by: spacedmonkey Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.6
Component: Taxonomy Keywords: has-patch needs-testing
Focuses: performance Cc:

Description

Currently when a wp_term_query is run with fields => 'all', an array of the entire WP_Term objects is stored in cache. This is a bad for a number of results. With popular object caching memcache has a limit of 1MB per object in cache. By storing the whole term object, you might have a descriptions on the object that long. This may mean the terms array is unable to be stored in memcache.

Going forward we should only store the ids in cache, like the wp_site_query and wp_comment_query.

Attachments (5)

37189.diff (1.4 KB) - added by spacedmonkey 18 months ago.
37189.patch (5.6 KB) - added by spacedmonkey 6 months ago.
37189.2.patch (5.7 KB) - added by spacedmonkey 6 months ago.
37189.3.patch (16.0 KB) - added by boonebgorges 6 months ago.
37189.4.patch (14.8 KB) - added by boonebgorges 6 months ago.

Download all attachments as: .zip

Change History (24)

@spacedmonkey
18 months ago

This ticket was mentioned in Slack in #core by spacedmonkey. View the logs.


17 months ago

#2 @spacedmonkey
16 months ago

This should be in 4.7. What do you think @boonebgorges ?

#3 @boonebgorges
15 months ago

Yes, let's do it.

A couple comments on 37189.diff:

  • It's not necessary to delete the cache incrementor on WP update. WP flushes all caches during the update process.
  • When fields is all, two queries will take place: SELECT * ... WHERE [query conditions] and SELECT * ... WHERE t.term_id IN $non_cached_array. This seems inefficient. Can we fix this so that the first query will be SELECT t.term_id instead?
  • Related: You handle fields=all one way and and all other values of fields another. I guess this takes advantage of the fact that t.term_id is one of the SELECTed fields for each other value of fields. This seems like a hack to me. Would the hack go away if we were doing a SELECT t.term_id query for all values of fields? What do you think is the wisest thing to do here?
  • Values of fields other than 'all' and 'ids' will *set* the cache, but they won't *use* the cache. Can we fix this? Should we?

#4 @boonebgorges
6 months ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

Marking as needs-patch for a refresh based on my comments above.

@spacedmonkey
6 months ago

#5 @spacedmonkey
6 months ago

  • Keywords has-patch needs-unit-tests added; needs-patch removed

@boonebgorges thanks for the feedback. My latest patch is a rewrite so I am not sure that you feedback is valid. The new patch does the following.

  • There are now only two query types, everything and count. This makes the code much more simple.
  • All caches now are stored as a list of ids, no matter the format.
  • The cache key is generated differently. As all query are cached as a list of ids, the same cache can be used for all formats of return (apart from count). This means for example, two queries one with fields=>'ids' and one with fields=>'all' they will hit the same cache. This could be a massive performance win for many :D

Looping @flixos90 into this ticket, as he did the work on wp_term_query.

#6 @boonebgorges
6 months ago

  • Keywords needs-unit-tests removed
  • Milestone changed from Future Release to 4.9

Thanks for continuing to work on this, @spacedmonkey.

Your patch generated a couple of PHPUnit failures for me (outside of the taxonomy component). As I started digging into why, it became clear to me that our efforts would be better spent doing the right thing and simply splitting the query.

Specifically: The fact that you switched to SELECT t.*, t.** for most queries got me concerned that we'd cause memory issues for people using the fields parameter for bulk operations. Yet, because your patch worked within the existing parameters of term-query caching, there was no easy way to work around this problem - no way to actually utilize the single-term cache during get_terms() queries. The possibility of additional overhead in the main SELECT query only seems worth the trade-off if we also offer a more efficient caching layer for single terms. The way we handle this elsewhere is by doing a SELECT {id} query and then filling the objects from the cache.

37189.3.patch is a first pass at making this happen. It allows us to simplify the internals of WP_Term_Query::query(), but requires some other fixes. Some notes:

  • Your technique for building keys is a bit redundant. Once we've built the request, we have a key that's guaranteed to be unique. So let's use it, and do away with the default_args parsing.
  • I've written the patch in such a way that fields=ids queries generally do not require the second query to populate the term objects. The exception is in the case of hierarchical queries, where the tree must be descended after the initial query takes place. So this is a fairly big performance win for most fields=ids queries.
  • The patch uncovered a pretty deep bug (actually, two related bugs) in _split_shared_term() that was agonizing to figure out. Briefly, the way that taxonomy hierarchy caches were being regenerated ({$taxonomy}_children) was not consistent: certain child terms were not properly triggering taxonomy cache clears, and the calls to clean_term_cache() in the context of a foreach() loop was causing race conditions in cases where get_terms() descends the hierarchy tree. The fixes in _split_shared_term() should go in, no matter what happens with the rest of the patch.
  • As part of these fixes, I broke the taxonomy-cache busting functionality into a new clean_taxonomy_cache() function.
  • On an installation without a persistent cache (or with an empty cache), term queries now generate two queries instead of one. This requires a reworking of certain unit tests that count num_queries. I've refactored most of them so that they're not counting the queries specifically, but instead check for data invalidation (which is the important part anyway). A few tests - those that rely on examining a last_query that no longer points to the right thing - had to be removed.

I'd welcome a careful review from @spacedmonkey, @flixos90 or anyone else with some experience in how term caching has historically differed from caching for posts, comments, etc.

#7 @spacedmonkey
6 months ago

My feedback so.

  • How the key is generated is how over query classes with cache (such as site query) generate the key. Because of the nature of the code in wp_term_query, the request (sql query) is in place. I don't see a problem with using it is a cache key. It just makes it an outlier.
  • The patch remove the cache expiry time of DAY_IN_SECONDS on the term cache. I have never understood why this was put in, as if cache invalidation is correct, then it should never be needed. Caching for just one day might end up in surges on the databases when all caches invalidate at once on high traffic sites.
  • Querying by id then a second query to populate the terms, is how over query classes work. It results in more queries, but each query is simpler, returning faster from the db.
  • Completely forgot to mention the bug in WP_Term, my bad. Seems unrelated to the changes and was there before.
  • I like breaking out the cache invalidation term into two. Current the clean_term_cache action is broken. The current action looks like this do_action( 'clean_term_cache', $ids, $taxonomy, $clean_taxonomy ); in the patch it looks like this do_action( 'clean_term_cache', $taxonomy ); . The current action should be maintained and a new clean_taxonomy_cache added.
  • Any tests that query count should be removed. Whenever caching is added in core, lots of random tests start to fail.

#8 @flixos90
6 months ago

Great work on this so far, I agree it generally makes sense to follow the other query classes and query IDs only to make the caches applicable more generally.

I jumped into the code for a while and reviewed it. Here are my thoughts:

  • Even though it looks a bit less elegant, I prefer generating the cache key in the way that @spacedmonkey did in 37189.2.patch. His idea has parity with other query classes and I don't see any harm in manually removing the fields key from the arguments unless its value is count or all_with_object_id (see below).
  • The all_with_object_id value is (in addition of count) the only value for the fields key that requires a different information that simply an array of IDs being in the cache. Given my above point, I think it would make sense to add this field to the condition in 37189.2.patch, in order to leave it in the key generator array under these circumstances. Some logic would need to be added then in 37189.3.patch in the same location where the special count cache is handled, in order to set an array of objects in the cache where each object contains term_id and object_id properties. This information would be sufficient to create the objects. In case of a cache hit, the term_id field could be used to create an array of term objects via calls to get_term() and then the object_id simply needs to be set for each object.
  • In 37189.3.patch the check for ! empty( $args['object_ids'] ) is missing: The tr.object_id field should only be queried if actual object IDs have been passed, similar as before.
  • While I like the idea of not running the get_term() queries when only ids are queried a lot, I'm a bit wary this could cause unexpected issues, especially surrounding the hacky logic of manually creating stdClass objects with the term_id property set. Looking for any unexpected calls to other term properties (which would not exist in that case), I wasn't able to detect any issue since all such cases should be prevented by respective if-clauses. However, since this part is rather complex, let's pay particular attention to this change. We need thorough unit tests for this especially.
  • +1 on fixing the _split_shared_term() bug and introducing a clean_taxonomy_cache() function. The action name, as noticed by @spacedmonkey above, should be clean_taxonomy_cache, I assume this was just a typo.

#9 @boonebgorges
6 months ago

In 40919:

Introduce clean_taxonomy_cache() function.

Previously, taxonomy-wide caches were cleaned only inside of
clean_term_cache(). This made it hard to clean taxonmy caches
in isolation from specific taxonomy terms.

Props spacedmonkey.
See #37189.

#10 @boonebgorges
6 months ago

In 40920:

Improve cache invalidation when splitting shared terms.

This changeset addresses two related issues:

  • When splitting shared terms from hierarchical taxonomies, the process of regenerating the taxonomy hierarchy (_get_taxonomy_hierarchy()) requires recursive calls to get_terms() in order to descend the tree. By waiting until all shared terms in a term group have been invalidated before regenerating their taxonomy hierarchies, we avoid certain race conditions.
  • Previously, a coding error prevented single-term caches from being invalidated for children of split terms. This error dates from [31418].

See #37189.

#11 @boonebgorges
6 months ago

Thanks to @spacedmonkey and @flixos90 for the thorough feedback. I've gone ahead with the clean_taxonomy_cache() and _split_shared_term() fixes, since they're not dependent on the rest of this ticket.

Regarding cache keys: Yes, I've just reread a comment from previous-me and I think there are non-stylistic reasons why we need to hash the args: https://core.trac.wordpress.org/ticket/21267#comment:17 $taxonomies is part of the $args array, so that part is redundant. I will make the change.

The patch remove the cache expiry time of DAY_IN_SECONDS on the term cache. I have never understood why this was put in, as if cache invalidation is correct, then it should never be needed.

Good catch. The one-day timeout was introduced in [15583]; see #11431. Since this time, we have had a clearer position on invalidation: it's generally the job of the cache engine (and its administrator) to set a policy for evacuating old content. So I think it's fine to remove the restriction.

The all_with_object_id value is (in addition of count) the only value for the fields key that requires a different information that simply an array of IDs being in the cache.

Yes, this is a good point. It will take some non-trivial refactoring to make this work right. It also appears that we don't have the right kinds of tests to confirm the cache behavior of this sort of query (surprising, since it was added recently). I'll work on this.

While I like the idea of not running the get_term() queries when only ids are queried a lot, I'm a bit wary this could cause unexpected issues, especially surrounding the hacky logic of manually creating stdClass objects with the term_id property set.

Yes, this part of the patch did not feel great to write. I'm fairly sure that there is no current possibility for bugs due to the stdClass hack (as you note, none of the relevant if clauses later in the function are tripped). But future WordPressnauts may be tripped up by this bit of weirdness. There may be a way to accomplish much the same thing in a way that feels less unpleasant, maybe by explicitly checking for $_fields = 'ids' on some of those latter if clauses, and/or having a separate if block for formatting the return value when we've only got an array of integers (to avoid stdClass juggling). I'll think about it.

I'll try to wrestle all of this into a new patch in the upcoming days.

#12 @boonebgorges
6 months ago

37189.4.patch makes the following changes:

  • Cache key is now generated using a technique close to what @spacedmonkey originally suggested.
  • The cache is now populated by an array of stdClass objects (the raw results of the get_results() query). In most cases, it looks like this: [ { term_id: 15 }, { term_id: 17 } ]. In the case of all_with_object_id, it looks like this: [ { term_id: 15, object_id: 123 } ... ]. Storing data like this means it's possible to share the same logic when populating object_id from the cache vs from a database query. I've also added a test demonstrating cache support for these queries.

The second bullet point also means that I can drop the stdClass trick for fields=ids, since the results of $wpdb->get_results( "SELECT term_id FROM ..." ) will be an array of stdClass objects. There's still the matter of ensuring that post-query processing later in the method doesn't attempt to operate on these items, but I feel that this little bit of maintenance overhead is worth the benefit of a true fields=ids query.

#14 @boonebgorges
6 months ago

Ah, thanks for the link to that ticket - I thought I remembered creating it, but a quick search didn't turn it up.

#15 @boonebgorges
6 months ago

#34239 was marked as a duplicate.

#16 @jbpaul17
3 months ago

  • Keywords needs-testing added

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


3 months ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


2 months ago

#19 @jbpaul17
2 months ago

  • Milestone changed from 4.9 to Future Release

Punting to Future Release per today's 4.8 bug scrub.

Note: See TracTickets for help on using tickets.