WordPress.org

Make WordPress Core

Opened 23 months ago

Last modified 2 days ago

#21760 assigned enhancement

get_term_by() calls are not cached

Reported by: wonderboymusic Owned by: wonderboymusic
Milestone: 4.0 Priority: normal
Severity: normal Version: 2.3
Component: Taxonomy Keywords: has-patch commit
Focuses: Cc:

Description (last modified by wonderboymusic)

get_term() is the simplest way to retrieve one term, but it requires term_id and taxonomy. Because of this, terms are cached with term_id as key and $taxonomy as bucket. As a result, you can't easily grab a term by slug, unless you use get_term_by( 'slug' ). get_term_by( 'slug' ) and get_term_by( 'name' ) don't even have a query cache, so they go to the database every time. Because you can't get a term by slug without hitting the db, every place you want to get a term by slug: you first have to transform it into a term_id where it will then be cached. This is inefficient because the user may query by slug constantly and never by term_id.

Attachments (15)

get-term-mixed.diff (11.5 KB) - added by wonderboymusic 23 months ago.
tests-posts-tax-cache-group.diff (1.3 KB) - added by wonderboymusic 22 months ago.
term-cache.diff (13.0 KB) - added by wonderboymusic 22 months ago.
term-cache-group.diff (549 bytes) - added by wonderboymusic 18 months ago.
21760.diff (10.8 KB) - added by wonderboymusic 18 months ago.
21760.2.diff (9.8 KB) - added by ryan 16 months ago.
Refresh patch, don't deprecate list_terms_exclusions, back compat cache keys
21760.patch (7.7 KB) - added by johnjamesjacoby 14 months ago.
21760.3.diff (2.6 KB) - added by wonderboymusic 6 days ago.
21760.4.diff (3.0 KB) - added by wonderboymusic 6 days ago.
21760.5.diff (5.2 KB) - added by wonderboymusic 6 days ago.
21760.6.diff (5.0 KB) - added by wonderboymusic 6 days ago.
21760.7.diff (6.8 KB) - added by wonderboymusic 6 days ago.
21760.8.diff (6.9 KB) - added by tollmanz 6 days ago.
s/$bucket/$group and hash name keys
21760.9.diff (7.0 KB) - added by tollmanz 6 days ago.
Add notion of forcing the incrementor update
21760.10.diff (9.2 KB) - added by tollmanz 5 days ago.

Download all attachments as: .zip

Change History (64)

comment:1 nacin23 months ago

Cache keys are cast to strings. That means there would be a collision of the term_id of 1, the slug of '1', and the name of '1'. Even without the casting woes, slug and name would conflict.

For users, we handle this with separate cache buckets: users (by ID), userlogins, useremail, and userslugs.

This can just as easily be handled in the same bucket with prefixes - IDs are stored as is, login-$login, email-$email, slug-$slug. Or for terms, you'd have IDs, slug-$slug, name-$name.

update_term_cache() should set slug and name cache keys, in addition to ID, and that should then be leveraged by get_term_by() and friends.

comment:2 nacin23 months ago

See also #11531.

We should probably rename these buckets. Rather than $taxonomy and {$taxonomy}_relationships, we should use something that isn't valid in taxonomy/CPT name. taxonomy:$taxonomy and relationships:$taxonomy seem sufficient. These are the only buckets I know of that have a dynamic component to them.

comment:3 wonderboymusic23 months ago

Per chat with Nacin - gonna work on separate cache buckets for each type and will resurrect get_term_by() which can leverage them. Also, get_term() would have a hard time distinguishing between id, name, and slug for the category named, slugged, term_id'd "1" etc

comment:4 scribu23 months ago

  • Keywords needs-patch added; has-patch removed

comment:5 scribu23 months ago

  • Summary changed from Allow get_term() to accept string as first argument to get_term_by() calls are not cached

comment:6 dd3223 months ago

FWIW, by overloading get_term( $id ... to accept int|string|object, it brings up conditions where

  • Slug is numeric
  • a term_id is passed in, but return term_ids as strings in some locations, so you can't rely upon type being an int for term_ids

comment:7 wonderboymusic23 months ago

Attached a new patch:

1) get_term_by() calls are cached
2) term_cache_group() abstracts cache bucket for terms
3) update_term_cache() adds to cache of all 3 buckets
4) delete_term_cache() deletes cache of all 3 buckets
5) get_terms() calls update_term_cache(), deprecates the buggy list_terms_excusions in favor of get_terms filter
6) Lots of cleanup and streamlining of clean_term_cache() calls

comment:8 scribu23 months ago

  • Keywords has-patch needs-unit-tests added; needs-patch removed

This looks like an awesome patch, but it's tiring to read, due to all the whitespace fixes. I'll try again in the morning.

In related news, the test suite doesn't seem to have a single call to get_term_by( 'slug', ... );

comment:9 scribu23 months ago

Also, the patch seems to fail an existing test:

There was 1 failure:

1) Tests_Term::test_set_object_terms_by_name
Failed asserting that '1' matches expected 5.

/home/scribu/svn/wp-tests/tests/term.php:135

Finally, we get some use out of these darn things. :)

comment:10 wonderboymusic22 months ago

Updated the patch - the problem was a $wpdb->prepare thing in the IN query in clean_term_cache(). All tests pass now.

WITH ONE CAVEAT:
tests-posts-tax-cache-group.diff needs to be applied to Unit Tests so that the Post tests read from the proper term cache group

comment:11 wonderboymusic22 months ago

  • Description modified (diff)

comment:12 scribu22 months ago

Right, except term_cache_group() is introduced in your main patch. :)

In this case, just add additional test cases to tests-posts-tax-cache-group.diff

comment:13 scribu22 months ago

Also, I'm not sure term_cache_group() should be a public function; maybe rename it to _term_cache_group() and mark it as @private.

comment:14 ryan22 months ago

Instead of caching multiple copies of an object, we usually cache IDs that can be used to look up a single cached copy. slug -> ID, ID -> object.

I'm not really digging term_cache_group(). I consider cache keys and groups public API that shouldn't change.

comment:15 nacin22 months ago

Some thoughts:

  • The code in delete_term_cache() seems like it belongs in clean_term_cache(). (This code is similarly in clean_user_cache().)
  • Not *entirely* sure that term_cache_group() is necessary. It only gets called in get_term(), get_term_by(), get_terms(), and a few relationship functions. Doing "taxonomy:$taxonomy:slugs" etc. seems pretty safe, and not unreasonably more complicated (or longer) than _term_cache_group(). Though... we might need to special-case the core taxonomies to avoid renaming those buckets, as ryan pointed out.
  • With user login, slug, and email buckets, we don't cache the user object, but rather, just the ID. That way, we can then consult the ID-based bucket for the object. We should do the same here. See WP_User::get_data_by() for how that works.
  • Removing list_terms_exclusions should probably come in the separate ticket, #21267.
  • In get_cat_ID(), you changed if ( $cat ) to an is_object() check. Is that necessary? Will get_term_by() ever turn a value that evaluates to true that is *not* an object?
  • I had to do an svn diff -x --ignore-all-space to be able to review the patch effectively. Whitespace changes, especially in functions not quite touched by the patch, make things a bit more difficult to review.

comment:16 wonderboymusic22 months ago

Updated patch to only cache IDs for slug and name - sorry about the whitespace, some of the term functions are esoteric and hard to read without some daylight. I can revert those changes in a subsequent .diff

My reasoning for term_cache_group was 1) standardizing group generation 2) flexibility in making changes in the future, only has to be done in one place. If this is deemed unnecessary, is it to retain the old groups?

comment:17 wonderboymusic22 months ago

Another piece of this - get_terms calls update_term_cache if you got all the fields, regardless of checking the hide_empty arg.

$terms = $wpdb->get_results( $query );
if ( 'all' == $fields ) {
	update_term_cache( $terms );
}

I did a performance test where I created 20,000 tags and called get_terms( 'post_tag', array( 'hide_empty' => false ). This will create 60,000 cache entries which exhausted memory until I bumped up to 256M. I tested posts with 1000 tags and they seemed fine, but this one may need to exclude terms with count of 0 before calling update_term_cache with every returned term

comment:18 wonderboymusic22 months ago

updated the patch:

1) removed all of my whitespace fixes (that was a gift from heaven, let's just say I learned my lesson)
2) restored all return-by-reference things
3) moved delete_term_cache code to clean_term_cache

The patch is WAY more light now - next up is removing term_cache_group as per Boren's feedback

comment:19 nacin22 months ago

Why were the return-by-references added back? I wonder if we still need them anymore.

We might need to keep term_cache_group() as we'll need to keep the standard category, post_tag, and link_category buckets, but we should be able to get away with prefixing the buckets of any custom taxonomy.

wonderboymusic22 months ago

comment:20 wonderboymusic22 months ago

return-by-references need to die everywhere, they are PHP4 fluff. I just added them back because I was trying to tighten to scope of my changes. The only scenario where they make sense to use is $a =& $b, not $a =& my_function();

Patch has been updated with ret-by-refs removed.

"we'll need to keep the standard category, post_tag, and link_category buckets" - I took a stab at this, does it accomplish what you want?

comment:21 bpetty21 months ago

  • Keywords needs-refresh added

I don't feel very comfortable with refreshing this patch myself with the latest changes, but I'll be happy to jump back in and do some testing if someone else can jump on this quickly for 3.5.

comment:22 scribu21 months ago

  • Milestone changed from Awaiting Review to Future Release

Considering it's an enhancement, the patch isn't trivial, it doesn't have unit tests yet and that we're 5 weeks away from release, it's not for 3.5.

comment:23 aaroncampbell18 months ago

  • Cc aaroncampbell added

wonderboymusic18 months ago

comment:24 wonderboymusic18 months ago

  • Keywords needs-refresh removed
  • Milestone changed from Future Release to 3.6

[21760.diff] Patch refreshed for trunk, passes all Unit Tests.

[term-cache-group.diff] Patch for Unit Tests so term_cache_group() works. i recall Ryan hating term_cache_group() - I am very open to suggestions on it, but I like it.

Version 1, edited 18 months ago by wonderboymusic (previous) (next) (diff)

comment:25 husobj16 months ago

  • Cc ben@… added

comment:26 follow-up: ryan16 months ago

I don't mind the function. I mind changing cache locations. There is too much plugin code that directly accesses the current cache locations for both builtin and custom taxonomies.

list_terms_exclusions shouldn't be deprecated. I see too many plugins using it.

ryan16 months ago

Refresh patch, don't deprecate list_terms_exclusions, back compat cache keys

comment:27 in reply to: ↑ 26 nacin15 months ago

Replying to ryan:

I don't mind the function. I mind changing cache locations. There is too much plugin code that directly accesses the current cache locations for both builtin and custom taxonomies.

At that point, the function isn't needed. But, would be nice to namespace these as much as possible (and thus keep the function)? The only existing buckets are $taxonomy and {$taxonomy}_relationships. Maybe new buckets can all be taxonomy:$taxonomy:$group. I'd personally love to escape the generic $taxonomy bucket, at least for custom taxonomies, as the problem there is gonna continue to exist — #11531 — but I'll take what I can get.

comment:28 nacin15 months ago

  • Milestone changed from 3.6 to Future Release

This is cool but let's review it early 3.7.

comment:29 johnjamesjacoby14 months ago

Late to this party, but already wrote up a patch based on a function we're using for WordPress.com VIP's.

Patch also cleans up get_term()

johnjamesjacoby14 months ago

comment:30 ethitter14 months ago

  • Cc erick@… added

comment:31 wonderboymusic11 months ago

  • Owner set to johnjamesjacoby
  • Status changed from new to assigned

Assigning to JJJ to investigate his patch

comment:32 Chouby10 months ago

  • Cc frederic.demarle@… added

comment:33 nacin9 months ago

Can someone step up to opine on this? How are we doing for unit tests in this area? Would be good to make sure there are some tests to assert there is no cache pollution.

comment:34 nacin9 months ago

#11531 was marked as a duplicate.

comment:35 wonderboymusic3 months ago

  • Keywords needs-refresh added
  • Milestone changed from Future Release to 4.0
  • Owner changed from johnjamesjacoby to wonderboymusic

this should definitely be tackled as part of our Taxonomy work in 4.0 - I will take another stab at this

comment:36 helen6 days ago

wonderboymusic - speak/unit test now or punt from 4.0, leaving it to you.

comment:37 wonderboymusic6 days ago

I will either look at this with gusto in the next few days or punt with gusto

wonderboymusic6 days ago

comment:38 wonderboymusic6 days ago

  • Keywords needs-refresh removed

I'm gonna do some profiling/testing, but 21760.3.diff should do it for get_term_by( 'slug', ... ).

wonderboymusic6 days ago

comment:39 wonderboymusic6 days ago

21760.4.diff is better.

Unit test query counts

Install (before patch): 403
Install (after patch): 384

Test suite (before patch): 208474
Test suite (after patch): 208257

wonderboymusic6 days ago

comment:40 wonderboymusic6 days ago

  • Keywords needs-unit-tests removed

21760.5.diff adds unit tests

wonderboymusic6 days ago

comment:41 wonderboymusic6 days ago

21760.6.diff is a simpler approach and passively invalidates slugs via an increment on the bucket. In my previous attempts, clean_term_cache() ended up producing 1000-2000 more database queries when running the entire test suite. No bueno.

wonderboymusic6 days ago

comment:42 wonderboymusic6 days ago

After adding entries for slug AND name in 21760.7.diff:

Test suite (after patch): 208,212 (-262 database calls)

The homepage of my test install makes 6 less queries.

Added unit tests for get_term_by( 'name', .... )

comment:43 wonderboymusic6 days ago

  • Keywords commit added

I like this, but I will throw some cold water on my face and wait

comment:44 tollmanz6 days ago

Nice work! I love the idea of wp_get_last_changed(). An incrementor API like this would be most excellent in core.

A couple thoughts on the patch:

  • Line 980 of src/wp-includes/taxonomy.php should use $term->term_id as the key, not the $term object.
  • $bucket is not a term used in core (other than fleetingly in a few comments). I think for clarity and consistency, we should stick with the $group nomenclature.
  • The keys for the term names will prove problematic. Whitespace is not allowed in Memcached keys. Additionally, these keys can only be a grand total of 250 characters in Memcached (and I'm assuming there are other limits in other object caches). These should be hashed or sanitized in someway before being used as keys.
  • Playing devil's advocate, if a term is not found in get_term_by(), should that result still be cached to avoid the redundant look up in the future? I think that would be nice, but complicates checking for the "non-existent" values, as well as the invalidation of that cache.
  • I don't think update_term_cache() will work. 1) Using wp_cache_add() will fail to save new data as it is already set. A quick and dirty fix for this is to use wp_cache_set(). It'll guarantee new data; however, there is a slight performance penalty. 2) Since we are using incrementors, we should...well...use them. We should generate a new incrementor and save the new values with the new incrementor. This causes a cache miss on the old incrementor and saves the new one. This could be accomplished by adding a $force parameter to wp_get_last_changed(). If set to true, it would not pull the incrementor from cache and instead regenerate it and save over the old one. Alternatively, it could just erase the old one.

Finally, as excited as I am about the incrementor idea, I do have some concerns. Incrementors work great with caches that use an LRU eviction policy (e.g., Memcached). You can fairly carelessly add incrementors to cause cache misses/invalidations. This works well because the old incrementors are no longer accessed and eventually fall off the slab. With some caching systems, the eviction policy is a bit problematic for an incrementor system (e.g., APC fills the cache until full, them dumps everything). We need to be really careful about how we do this to make sure it will consistently work across object caches. Ideally, when we change the incrementor, we delete the old one or overwrite it.

I think we need to be very careful as we implement this. An incrementor system can be extremely powerful, yet also really dangerous. Perhaps we should start a separate ticket to explore this? If we were to implement this in core, it would be great to have a more robust system for handling this with methods for setting, getting, and invalidating incrementors.

Last edited 6 days ago by tollmanz (previous) (diff)

comment:45 tollmanz6 days ago

Scratch my comment about using $term->term_id instead of $term. I see that $term is changed from the object into the term ID earlier in the code.

tollmanz6 days ago

s/$bucket/$group and hash name keys

tollmanz6 days ago

Add notion of forcing the incrementor update

comment:46 tollmanz6 days ago

21760.8.diff and 21760.9.diff change the $bucket nomenclature to $group, hash the term name key, and add a way to invalidate the incrementor. The first two changes are in .8 and all three are in .9.

comment:47 rmccue5 days ago

Psst, while we're in the term caching arena, #28743 (especially if we want to hit clean_term_cache anyway).

tollmanz5 days ago

comment:48 tollmanz5 days ago

21760.10.diff iterates on this patch by adding /updating the following:

  • I found a slight bug in the get_term_by() adds the $term object to the cache. The values were cached before the $term object was run through the sanitization function. The sanitization function converts ints to strings. Therefore, a term grabbed from the cache would be different than a term grabbed directly from MySQL since the term grabbed from cache would never be sanitized. I updated two of the unit tests to specifically test for this.
  • I added suspended cached invalidation awareness to address #28743.
  • A unit test was added to test #28743.

Thoughts/questions:

  • I'm starting to think that we shouldn't add the if ( $cache ) conditional. If the value is cached, the filters and sanitization functions are completely missed. I can see an argument that the terms will have already been run through the filters and sanitization routines; however, this would cause problems if a user, using a persistent cache, installs a plugin that should further manipulate these terms. The user would need to wait until the next cache invalidation for it to take effect. These things cause devs to go batty.
  • We should probably add suspended cached invalidation awareness to update_term_cache(), right? I looked at how this is handled in update_post_cache() and it is not aware of suspended cached invalidation; however, that is because the function is not an invalidation function. It will add items to cache, but does not overwrite it or delete cached items. It will never do this. update_term_cache() operates differently. It nukes the cached items and forces an update. As such, it should respect suspended cached invalidation, right?

comment:49 wonderboymusic2 days ago

tollmanz: good stuff, waiting for feedback from @nacin

Note: See TracTickets for help on using tickets.