#37189 closed enhancement (fixed)
In wp_term_query on cache ids
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | 6.0 | Priority: | normal |
Severity: | normal | Version: | 4.6 |
Component: | Taxonomy | Keywords: | has-patch needs-testing has-unit-tests has-dev-note |
Focuses: | performance | Cc: |
Description
Currently when a wp_term_query is run with fields => 'all', an array of the entire WP_Term objects is stored in cache. This is a bad for a number of results. With popular object caching memcache has a limit of 1MB per object in cache. By storing the whole term object, you might have a descriptions on the object that long. This may mean the terms array is unable to be stored in memcache.
Going forward we should only store the ids in cache, like the wp_site_query and wp_comment_query.
Attachments (7)
Change History (33)
This ticket was mentioned in Slack in #core by spacedmonkey. View the logs.
9 years ago
#3
@
8 years ago
Yes, let's do it.
A couple comments on 37189.diff:
- It's not necessary to delete the cache incrementor on WP update. WP flushes all caches during the update process.
- When
fields
isall
, two queries will take place:SELECT * ... WHERE [query conditions]
andSELECT * ... WHERE t.term_id IN $non_cached_array
. This seems inefficient. Can we fix this so that the first query will beSELECT t.term_id
instead? - Related: You handle
fields=all
one way and and all other values offields
another. I guess this takes advantage of the fact thatt.term_id
is one of the SELECTed fields for each other value offields
. This seems like a hack to me. Would the hack go away if we were doing aSELECT t.term_id
query for all values offields
? What do you think is the wisest thing to do here? - Values of
fields
other than 'all' and 'ids' will *set* the cache, but they won't *use* the cache. Can we fix this? Should we?
#4
@
8 years ago
- Keywords needs-patch added
- Milestone changed from Awaiting Review to Future Release
Marking as needs-patch for a refresh based on my comments above.
#5
@
8 years ago
- Keywords has-patch needs-unit-tests added; needs-patch removed
@boonebgorges thanks for the feedback. My latest patch is a rewrite so I am not sure that you feedback is valid. The new patch does the following.
- There are now only two query types, everything and count. This makes the code much more simple.
- All caches now are stored as a list of ids, no matter the format.
- The cache key is generated differently. As all query are cached as a list of ids, the same cache can be used for all formats of return (apart from count). This means for example, two queries one with fields=>'ids' and one with fields=>'all' they will hit the same cache. This could be a massive performance win for many :D
Looping @flixos90 into this ticket, as he did the work on wp_term_query.
#6
@
8 years ago
- Keywords needs-unit-tests removed
- Milestone changed from Future Release to 4.9
Thanks for continuing to work on this, @spacedmonkey.
Your patch generated a couple of PHPUnit failures for me (outside of the taxonomy component). As I started digging into why, it became clear to me that our efforts would be better spent doing the right thing and simply splitting the query.
Specifically: The fact that you switched to SELECT t.*, t.**
for most queries got me concerned that we'd cause memory issues for people using the fields
parameter for bulk operations. Yet, because your patch worked within the existing parameters of term-query caching, there was no easy way to work around this problem - no way to actually utilize the single-term cache during get_terms()
queries. The possibility of additional overhead in the main SELECT
query only seems worth the trade-off if we also offer a more efficient caching layer for single terms. The way we handle this elsewhere is by doing a SELECT {id}
query and then filling the objects from the cache.
37189.3.patch is a first pass at making this happen. It allows us to simplify the internals of WP_Term_Query::query()
, but requires some other fixes. Some notes:
- Your technique for building keys is a bit redundant. Once we've built the
request
, we have a key that's guaranteed to be unique. So let's use it, and do away with thedefault_args
parsing. - I've written the patch in such a way that
fields=ids
queries generally do not require the second query to populate the term objects. The exception is in the case of hierarchical queries, where the tree must be descended after the initial query takes place. So this is a fairly big performance win for mostfields=ids
queries. - The patch uncovered a pretty deep bug (actually, two related bugs) in
_split_shared_term()
that was agonizing to figure out. Briefly, the way that taxonomy hierarchy caches were being regenerated ({$taxonomy}_children
) was not consistent: certain child terms were not properly triggering taxonomy cache clears, and the calls toclean_term_cache()
in the context of aforeach()
loop was causing race conditions in cases whereget_terms()
descends the hierarchy tree. The fixes in_split_shared_term()
should go in, no matter what happens with the rest of the patch. - As part of these fixes, I broke the taxonomy-cache busting functionality into a new
clean_taxonomy_cache()
function. - On an installation without a persistent cache (or with an empty cache), term queries now generate two queries instead of one. This requires a reworking of certain unit tests that count
num_queries
. I've refactored most of them so that they're not counting the queries specifically, but instead check for data invalidation (which is the important part anyway). A few tests - those that rely on examining alast_query
that no longer points to the right thing - had to be removed.
I'd welcome a careful review from @spacedmonkey, @flixos90 or anyone else with some experience in how term caching has historically differed from caching for posts, comments, etc.
#7
@
8 years ago
My feedback so.
- How the key is generated is how over query classes with cache (such as site query) generate the key. Because of the nature of the code in wp_term_query, the request (sql query) is in place. I don't see a problem with using it is a cache key. It just makes it an outlier.
- The patch remove the cache expiry time of DAY_IN_SECONDS on the term cache. I have never understood why this was put in, as if cache invalidation is correct, then it should never be needed. Caching for just one day might end up in surges on the databases when all caches invalidate at once on high traffic sites.
- Querying by id then a second query to populate the terms, is how over query classes work. It results in more queries, but each query is simpler, returning faster from the db.
- Completely forgot to mention the bug in WP_Term, my bad. Seems unrelated to the changes and was there before.
- I like breaking out the cache invalidation term into two. Current the clean_term_cache action is broken. The current action looks like this
do_action( 'clean_term_cache', $ids, $taxonomy, $clean_taxonomy );
in the patch it looks like this
do_action( 'clean_term_cache', $taxonomy );
. The current action should be maintained and a new
clean_taxonomy_cache
added. - Any tests that query count should be removed. Whenever caching is added in core, lots of random tests start to fail.
#8
@
8 years ago
Great work on this so far, I agree it generally makes sense to follow the other query classes and query IDs only to make the caches applicable more generally.
I jumped into the code for a while and reviewed it. Here are my thoughts:
- Even though it looks a bit less elegant, I prefer generating the cache key in the way that @spacedmonkey did in 37189.2.patch. His idea has parity with other query classes and I don't see any harm in manually removing the
fields
key from the arguments unless its value iscount
orall_with_object_id
(see below). - The
all_with_object_id
value is (in addition ofcount
) the only value for thefields
key that requires a different information that simply an array of IDs being in the cache. Given my above point, I think it would make sense to add this field to the condition in 37189.2.patch, in order to leave it in the key generator array under these circumstances. Some logic would need to be added then in 37189.3.patch in the same location where the specialcount
cache is handled, in order to set an array of objects in the cache where each object containsterm_id
andobject_id
properties. This information would be sufficient to create the objects. In case of a cache hit, theterm_id
field could be used to create an array of term objects via calls toget_term()
and then theobject_id
simply needs to be set for each object. - In 37189.3.patch the check for
! empty( $args['object_ids'] )
is missing: Thetr.object_id
field should only be queried if actual object IDs have been passed, similar as before. - While I like the idea of not running the
get_term()
queries when onlyids
are queried a lot, I'm a bit wary this could cause unexpected issues, especially surrounding the hacky logic of manually creatingstdClass
objects with theterm_id
property set. Looking for any unexpected calls to other term properties (which would not exist in that case), I wasn't able to detect any issue since all such cases should be prevented by respective if-clauses. However, since this part is rather complex, let's pay particular attention to this change. We need thorough unit tests for this especially. - +1 on fixing the
_split_shared_term()
bug and introducing aclean_taxonomy_cache()
function. The action name, as noticed by @spacedmonkey above, should beclean_taxonomy_cache
, I assume this was just a typo.
#11
@
8 years ago
Thanks to @spacedmonkey and @flixos90 for the thorough feedback. I've gone ahead with the clean_taxonomy_cache()
and _split_shared_term()
fixes, since they're not dependent on the rest of this ticket.
Regarding cache keys: Yes, I've just reread a comment from previous-me and I think there are non-stylistic reasons why we need to hash the args: https://core.trac.wordpress.org/ticket/21267#comment:17 $taxonomies
is part of the $args
array, so that part is redundant. I will make the change.
The patch remove the cache expiry time of DAY_IN_SECONDS on the term cache. I have never understood why this was put in, as if cache invalidation is correct, then it should never be needed.
Good catch. The one-day timeout was introduced in [15583]; see #11431. Since this time, we have had a clearer position on invalidation: it's generally the job of the cache engine (and its administrator) to set a policy for evacuating old content. So I think it's fine to remove the restriction.
The all_with_object_id value is (in addition of count) the only value for the fields key that requires a different information that simply an array of IDs being in the cache.
Yes, this is a good point. It will take some non-trivial refactoring to make this work right. It also appears that we don't have the right kinds of tests to confirm the cache behavior of this sort of query (surprising, since it was added recently). I'll work on this.
While I like the idea of not running the get_term() queries when only ids are queried a lot, I'm a bit wary this could cause unexpected issues, especially surrounding the hacky logic of manually creating stdClass objects with the term_id property set.
Yes, this part of the patch did not feel great to write. I'm fairly sure that there is no current possibility for bugs due to the stdClass
hack (as you note, none of the relevant if
clauses later in the function are tripped). But future WordPressnauts may be tripped up by this bit of weirdness. There may be a way to accomplish much the same thing in a way that feels less unpleasant, maybe by explicitly checking for $_fields = 'ids'
on some of those latter if
clauses, and/or having a separate if
block for formatting the return value when we've only got an array of integers (to avoid stdClass
juggling). I'll think about it.
I'll try to wrestle all of this into a new patch in the upcoming days.
#12
@
8 years ago
37189.4.patch makes the following changes:
- Cache key is now generated using a technique close to what @spacedmonkey originally suggested.
- The cache is now populated by an array of
stdClass
objects (the raw results of theget_results()
query). In most cases, it looks like this:[ { term_id: 15 }, { term_id: 17 } ]
. In the case ofall_with_object_id
, it looks like this:[ { term_id: 15, object_id: 123 } ... ]
. Storing data like this means it's possible to share the same logic when populatingobject_id
from the cache vs from a database query. I've also added a test demonstrating cache support for these queries.
The second bullet point also means that I can drop the stdClass
trick for fields=ids
, since the results of $wpdb->get_results( "SELECT term_id FROM ..." )
will be an array of stdClass
objects. There's still the matter of ensuring that post-query processing later in the method doesn't attempt to operate on these items, but I feel that this little bit of maintenance overhead is worth the benefit of a true fields=ids
query.
#14
@
8 years ago
Ah, thanks for the link to that ticket - I thought I remembered creating it, but a quick search didn't turn it up.
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
7 years ago
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
7 years ago
#19
@
7 years ago
- Milestone changed from 4.9 to Future Release
Punting to Future Release per today's 4.8 bug scrub.
This ticket was mentioned in PR #2343 on WordPress/wordpress-develop by spacedmonkey.
3 years ago
#20
- Keywords has-unit-tests added
NOT passing tests
Trac ticket:
This ticket was mentioned in PR #2346 on WordPress/wordpress-develop by spacedmonkey.
3 years ago
#21
Trac ticket: https://core.trac.wordpress.org/ticket/37189
#22
@
3 years ago
In #2346, I have completely rewritten the patch. All tests are passing now, which is pretty neat.
Some notes on the patch.
- One just querying by term id, it means that another query has be used to prime term caches. This is too be exepected with this change. This change brings this query cache inline with other query classes like the post query class.
- One test had to removed, as the last query does not match what it used, making the rest invalid.
- Adding a groupby to query is needed to duplicate queries.
- Formatting and comments need work.
#23
@
3 years ago
@boonebgorges Is there any chance you could take a look at my PR #2346. It is very close to being mergeable.
Working noting, in my testing, this PR does increase the number of SQL queries. The use of _prime_term_caches
, means one query for getting the ids and another for get term data. This is likely to be a benefit in my cases. Get term data via _prime_term_caches
and get_term
means that any term loaded into cache on a request will not be reloaded again. This is an expected behaviour.
The key goal of this ticket is not to store massive term objects in cache. There are limits to what can be stored in object caches. Not to mention, the fact if terms are updated, these caches may not be correctly invalidated.
#24
@
3 years ago
- Keywords commit added
- Milestone changed from Future Release to 6.0
- Owner set to spacedmonkey
- Status changed from new to assigned
The linked pull request looks good for commit. I'll leave you to do the honors @spacedmonkey
This should be in 4.7. What do you think @boonebgorges ?