Opened 14 years ago
Last modified 5 years ago
#16706 assigned enhancement
Queries using "category__and" are slow on large databases
Reported by: | tigertech | Owned by: | chriscct7 |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 3.1 |
Component: | Query | Keywords: | has-patch needs-testing needs-refresh bulk-reopened |
Focuses: | performance | Cc: |
Description
Summary: Using "category__and" in query_posts() generates a dependent subquery in MySQL with extremely poor performance ("Using temporary, Using filesort") when one of the categories has a large number of posts. The result is so slow that WordPress appears to completely hang. Changing the query structure avoids the filesort and solves it.
Details:
We have a customer doing this as part of a theme in WordPress 3.1:
query_posts(array( "category__and" => array(1, 461), "posts_per_page" => 6 ));
The database is fairly large. There are 45,610 posts in category 1, and 167 posts in category 461. The resulting database query runs for so long that it effectively hangs (it doesn't finish within 30 minutes).
The generated MySQL query looks like:
SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts WHERE 1=1 AND ( wp_posts.ID IN ( SELECT object_id FROM wp_term_relationships WHERE term_taxonomy_id IN (1,461) GROUP BY object_id HAVING COUNT(object_id) = 2 ) ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 6;
An "explain" on this query shows that it generates a dependent subquery that devolves to a filesort:
+----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+ | 1 | PRIMARY | wp_posts | ref | type_status_date | type_status_date | 44 | const,const | 8177 | Using where; Using index | | 2 | DEPENDENT SUBQUERY | wp_term_relationships | range | term_taxonomy_id | term_taxonomy_id | 8 | NULL | 25665 | Using where; Using index; Using temporary; Using filesort | +----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+
I've tried adding indexes, optimizing, and so on to get the filesort to go away, but I can't get it to do so, at least not with MySQL 5.0.x.
The filesort comes from the "GROUP BY ... HAVING". Changing the query to avoid those fixes this. For example, this appears to produce the same results:
SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts WHERE 1=1 AND ( SELECT COUNT(1) FROM wp_term_relationships WHERE term_taxonomy_id IN (1,461) AND object_id = wp_posts.ID ) = 2 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 6;
But it finishes in a fraction of a second and generates cleaner "explain" output with no filesort:
+----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+ | 1 | PRIMARY | wp_posts | ref | type_status_date | type_status_date | 44 | const,const | 8177 | Using where | | 2 | DEPENDENT SUBQUERY | wp_term_relationships | ref | PRIMARY,term_taxonomy_id | PRIMARY | 8 | database.wp_posts.ID | 1 | Using where; Using index | +----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+
So my suggestion is that the "category__and" query be changed to something like this that avoids filesorts. (I could provide a patch if people agree that this change is a reasonable approach.)
Attachments (4)
Change History (61)
#2
follow-up:
↓ 4
@
14 years ago
- Keywords close added
Sure, they seem to return the same posts, but they don't. That modified query is equivalent to using 'category__in'
.
#3
@
14 years ago
- Owner set to scribu
- Status changed from new to reviewing
A correct solution would involve doing an extra join for each category in the array.
#4
in reply to:
↑ 2
@
14 years ago
Replying to scribu:
Sure, they seem to return the same posts, but they don't. That modified query is equivalent to using
'category__in'
.
Hmmmm -- what makes you say that? I've tried it with some simpler queries than the example (which I unfortunately can't compare directly because the original WordPress version never finishes), and it seems to be correct.
In either version, the 'category__and'
nature of it comes from loading all the rows of wp_term_relationships that match both the object_id and term_taxonomy_id, then counting them and making sure that the number of found rows is the same as the number of categories passed into the query (2 in this example).
Even if my suggested change is wrong for a 'category__and'
fix, I'm pretty sure it's not equivalent to 'category__in'
. Using 'category__in'
on our customer's data with "461" as one of the categories always results in at least 167 posts if I remove the LIMIT. But using my replacement query with "461" as one of the categories always results in fewer.
#5
follow-up:
↓ 11
@
14 years ago
Here's an experiment to try on a clean install:
Make two categories A and B.
Make Post 1 and assign it both category A and category B.
Make Post 2 and assign it to just category A.
Do the category__and => (A, B)
query and compare.
The query without GROUP By will return Post 1 as well as Post 2, which is incorrect.
#6
@
14 years ago
- Component changed from Database to Performance
- Keywords needs-patch added; close removed
- Milestone changed from Awaiting Review to Future Release
- Type changed from defect (bug) to enhancement
By the way, you can get the more efficient query I was talking about by using the more verbose syntax:
query_posts(array( 'tax_query' => array( array( 'taxonomy' => 'category', 'terms' => 1 ), array( 'taxonomy' => 'category', 'terms' => 461 ), ) ));
SELECT SQL_CALC_FOUND_ROWS wp_posts . * FROM wp_posts INNER JOIN wp_term_relationships ON ( wp_posts.ID = wp_term_relationships.object_id ) INNER JOIN wp_term_relationships AS tt1 ON ( wp_posts.ID = tt1.object_id ) WHERE 1 =1 AND ( wp_term_relationships.term_taxonomy_id IN ( 38 ) AND tt1.term_taxonomy_id IN ( 37 )
SIMPLE | wp_term_relationships | ref | PRIMARY,term_taxonomy_id | term_taxonomy_id | 8 | const | 1 | Using temporary; Using filesort |
SIMPLE | tt1 | eq_ref | PRIMARY,term_taxonomy_id | PRIMARY | 16 | wp-trunk.wp_term_relationships.object_id,const | 1 | Using index |
SIMPLE | wp_posts | eq_ref | PRIMARY,type_status_date | PRIMARY | 8 | wp-trunk.wp_term_relationships.object_id | 1 | Using where |
So, it's just a matter of making the 'category__and'
rewrite internally to that more efficient way.
#8
@
14 years ago
16706.diff performs the transformation only if there are less than 10 terms involved per query, but I haven't done any testing to see at which point the number of JOINs causes worse performance than the GROUP BY method.
#9
@
14 years ago
Also, it only works if the relation between queries is also 'AND', because if it's 'OR', you have to group the IN queries that come from an AND one.
#11
in reply to:
↑ 5
@
14 years ago
Replying to scribu:
Here's an experiment to try on a clean install:
Make two categories A and B.
Make Post 1 and assign it both category A and category B.
Make Post 2 and assign it to just category A.
Do the
category__and => (A, B)
query and compare.
The query without GROUP BY will return Post 1 as well as Post 2, which is incorrect.
I tried a clean install and tested this. My change does not return both posts, though, and I'm not sure why you'd think it would. The results are identical.
Did you try it? If so, I think you accidentally tried something different.
It won't include Post 2 because when the outer query calls the inner query like this:
AND ( SELECT COUNT(1) FROM wp_term_relationships WHERE term_taxonomy_id IN (<A>,<B>) AND object_id = <Post_2_ID> ) = 2
.... the "COUNT(1)" result won't equal 2. It will equal 1, because only one row in wp_term_relationships will match both WHERE/AND clauses. This subquery will therefore be false and the post won't be included.
I'm attaching a patch to make 100% sure we're talking about the same change....
#12
follow-up:
↓ 14
@
14 years ago
Indeed, we weren't talking about the same thing. Thanks for the clarification.
I presume that COUNT(1)
can be safely replaced with the more common COUNT(*)
The question now is which is faster overall: the extra JOIN or your method.
#13
follow-ups:
↓ 15
↓ 16
@
14 years ago
I ran some tests with SQL_NO_CACHE on (22600+ posts):
GROUP BY: 0.9389 sec
COUNT: 0.1706 sec
JOINs: 0.0013 sec
If you turn on profiling, you see why the first two methods are so slow:
Sending data 0.000031 executing 0.000003 Sending data 0.000007 executing 0.000002
over and over again.
So, anyway, we can use both patches:
16706.diff avoids 'AND' queries whenever possible.
16706-tigertech.diff optimizes the SQL for 'AND' queries when we have no choice.
#14
in reply to:
↑ 12
@
14 years ago
Replying to scribu:
I presume that
COUNT(1)
can be safely replaced with the more commonCOUNT(*)
Sure, that's fine -- whichever is the WordPress style. (The "COUNT(1)
" style is faster in some cases involving NULLs, so I use it out of habit, but in this case it doesn't matter.)
#15
in reply to:
↑ 13
@
14 years ago
Replying to scribu:
I ran some tests with SQL_NO_CACHE on (22600+ posts):
GROUP BY: 0.9389 sec
COUNT: 0.1706 sec
JOINs: 0.0013 sec
Right, that makes sense. The fundamental problem with the first two is that the DEPENDENT SUBQUERY runs the inner query once for each outer row. My proposed change isn't the fastest possible; it merely makes sure that the subquery prefers the index that returns only a handful of rows (keeping the inner query manageable), instead of using the index that potentially returns tens of thousands of rows requiring a filesort that completely kills everything.
The JOIN method avoids running a DEPENDENT SUBQUERY and is superior, but is unfortunately not possible in every case.
I agree that using JOIN where possible, and my suggestion when it's not, seems to be the best possible solution.
#16
in reply to:
↑ 13
@
14 years ago
Replying to scribu:
I ran some tests with SQL_NO_CACHE on (22600+ posts):
By the way, that's still small enough that MySQL doesn't resort to a filesort.
If you try the same thing when one of the category__and
categories is a category containing more than, say, 50000 posts, MySQL will switch to a filesort and the GROUP BY version of the test will suddenly start to take minutes/hours.
(I'm not suggesting you actually do it -- just emphasizing that the problem can get much, much worse than these results indicate.)
#17
@
14 years ago
16706.split.diff is just some DRYing, letting WP_Tax_Query::get_sql() do the transformation.
#22
@
13 years ago
With the patch a tag intersection (foo+bar) on a testbed with over 2000 posts took 49.7ms. Without the patch the intersection took 7,360.7ms. Results seem to be identical. Sweet patch.
#27
@
13 years ago
Given that this can decimate a shared MySQL server, and there are a ton of WP installs running on shared hosting around the world, shouldn't this be worthy of shipping in the next minor release, instead of waiting for 3.2?
#28
@
13 years ago
Minor releases are reserved for security issues and major bugs that slipped through in the previous release.
This is not a major bug, as the majority of WP instalations don't use category__and
.
#29
@
13 years ago
I'd advise adding an extra test-case for this one.
As I read the initial query plan with the group by/having clause, MySQL is merge joining two large sets, and the resulting list of posts gets sorted in the best possible way is available.
The suggested optimization, by contrast, results in a nested loop over the posts' index, with a subquery check for each row. This is faster with your test data, but not always. For instance, if you've a reasonably large posts table (say 100k rows) each sprinkled with a term or two, and the final result is no match at all or one very far away in the index, you may very well end up visiting the whole index on the posts table (which both plans do anyway), along with a subquery for each row (which is then considerably slower than merge joining the mess).
Point is, the merge join plan is appropriate if there are very few resulting rows.
Out of curiosity, has the following plan been tried at any point while testing or in prior implementations?
SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts JOIN wp_term_relationships ON wp_posts.ID = wp_term_relationships.object_id AND term_taxonomy_id IN (1,461) WHERE 1=1 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID HAVING COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) = 2 ORDER BY wp_posts.post_date DESC LIMIT 0, 6;
The COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) rather than COUNT(*) should ensure we don't get bogus rows when additional joins kick in. Intuitively, the MySQL planner should do a nested loop from whichever of posts or terms will yield the least results.
If the above introduces two additional sorts, maybe try this one too (so as to aggregate over the pre-sorted result set):
SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts JOIN wp_term_relationships ON wp_posts.ID = wp_term_relationships.object_id AND term_taxonomy_id IN (1,461) WHERE 1=1 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.post_date DESC, wp_posts.ID HAVING COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) = 2 ORDER BY wp_posts.post_date DESC LIMIT 0, 6;
#32
@
13 years ago
In #17488, it was suggested that we should do two separate queries.
However, we know that if you have a large amount of posts, it can lead to segfaults, by having to concatenate all the ids into a single string.
#33
@
13 years ago
FYI, #mysql suggested a "derived table" query which would look like
SELECT a.name, b.name, IFNULL(b.cnt, 0) AS cnt FROM tblname AS a LEFT JOIN (SELECT name, COUNT(*) AS cnt FROM tblname2 GROUP BY name) AS b ON a.name = b.name;
Obviously not adapted to wp, and I didn't look further into it. I'll do two queries for now.
(and +1 on major bug, this obviously renders a built-in feature useless in quite common circumstances)
#34
@
13 years ago
Please try 16706.2.diff. It should be considerably faster.
#37
@
13 years ago
What are we doing here? This looks scary for 3.2, and we're already done a lot here. 3.3? Ryan?
#41
@
13 years ago
I am having this exact same problem on a busy 150k posts WP 3.1 install. Very happy to test this and post results but could someone explain to me which patches I require please?
#42
@
13 years ago
@milez: you should try applying this patch:
http://core.trac.wordpress.org/attachment/ticket/16706/16706.2.diff
Tutorials can be found here:
#43
@
13 years ago
- Cc milez added
Thanks for the quick reply but I have already applied the first 3 patches after reading through this thread closer. I actually tried the combined patch 16706.2.diff first but got an error. Also tried it last but same error:
$ patch -p0 < taxonomy-16706.2.diff patching file wp-includes/taxonomy.php Hunk #1 FAILED at 636.
Did I apply the wrong patches?
As far as performance goes a bit early to tell but I noticed *most* of my SQL_CALC_FOUND_ROWS queries to magically improve but not all.
#44
@
13 years ago
I actually tried the combined patch 16706.2.diff first but got an error. Also tried it last but same error
It applies cleanly for me.
Make sure you cd
-ed into the right directory (the wp root directory) and make sure you have a clean checkout:
svn revert --recursive .
#45
@
13 years ago
Hmm thanks scribu, but on a clean version of 3.1 taxonomy.php line 629:
foreach ( $this->queries as $query ) {
But in your .diff you have it as line 639. Also see the original patch you posted 16706.diff also has that line as 629. Am I missing something?
Thanks!
#47
@
13 years ago
Aha thanks scribu! Well the version specified at the top of the thread is 3.1. Honestly updating to 3.2.1 is not an option at this point either as too many customizations to work through :/
Any chance you could help me fix the patch for 3.1? I've made a few attempts but been unable to fix the .diff file you provided. Or am I right to assume that the other 3 .diff files you provided are for 3.1: 16706.diff, 16706-tigertech.diff and 16706.split.diff ???
Thanks again!
#48
@
12 years ago
- Keywords needs-testing added; 3.3-early removed
- Type changed from defect (bug) to enhancement
So, what's left are:
- 16706.split.diff - let
WP_Tax_Query::get_sql()
do the transformation from AND to IN - 16706.2.diff - converts AND to extra JOINs
#52
@
12 years ago
- Cc daevid added
Having the same issue on a 60k+ post table. Is this patch about to be released in a coming WP release quite soon? Or should I apply the mentioned patches?
#59
in reply to:
↑ 58
@
6 years ago
Replying to talldanwp:
On #46459 the workaround applied in #25238 was mentioned. It appears that the workaround for
category__and
could be reverted as part of this ticket.
Note, the workaround is not in ticket 25238, it's in changeset 25238: https://core.trac.wordpress.org/changeset/25238
By the way, I showed this explanation to a co-worker here who said it wasn't clear why the second version is so much faster. So:
The reason it's faster is that it can operate almost entirely on the wp_term_relationships index. For each row (post) in the outer query, the inner query uses the PRIMARY index to load a very small number of rows based on the object_id. So it ends up with something like five rows, which it can then examine to see if they match "term_taxonomy_id IN (1,461)". (A single index on "(object_id, term_taxonomy_id)" would probably speed this up even more and eliminate the "Using where" on the subquery, operating entirely on the index.)
In contrast, the "GROUP BY" version attacks it the other way round: it first uses the term_taxonomy_id index to search for rows in wp_term_relationships that match "term_taxonomy_id IN (1,461)". Unfortunately, it ends up with tens of thousands (instead of a handful) in the worst case, which it shoves into a temporary table. It then sorts the unindexed temporary table to do the COUNT. That might be fine if MySQL 5.0.x only did that once, but it does it once for each outer query row, even though the results of the inner query are identical (constant) each time through.