Make WordPress Core

Opened 14 years ago

Last modified 5 years ago

#16706 assigned enhancement

Queries using "category__and" are slow on large databases

Reported by: tigertech's profile tigertech Owned by: chriscct7's profile chriscct7
Milestone: Priority: normal
Severity: normal Version: 3.1
Component: Query Keywords: has-patch needs-testing needs-refresh bulk-reopened
Focuses: performance Cc:

Description

Summary: Using "category__and" in query_posts() generates a dependent subquery in MySQL with extremely poor performance ("Using temporary, Using filesort") when one of the categories has a large number of posts. The result is so slow that WordPress appears to completely hang. Changing the query structure avoids the filesort and solves it.

Details:

We have a customer doing this as part of a theme in WordPress 3.1:

query_posts(array(
  "category__and" => array(1, 461),
  "posts_per_page" => 6
));

The database is fairly large. There are 45,610 posts in category 1, and 167 posts in category 461. The resulting database query runs for so long that it effectively hangs (it doesn't finish within 30 minutes).

The generated MySQL query looks like:

SELECT SQL_CALC_FOUND_ROWS  wp_posts.* FROM wp_posts  WHERE 1=1  AND ( wp_posts.ID IN (
  SELECT object_id
  FROM wp_term_relationships
  WHERE term_taxonomy_id IN (1,461)
  GROUP BY object_id HAVING COUNT(object_id) = 2
) ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 6;

An "explain" on this query shows that it generates a dependent subquery that devolves to a filesort:

+----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+
| id | select_type        | table                 | type  | possible_keys    | key              | key_len | ref         | rows  | Extra                                                     |
+----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+
|  1 | PRIMARY            | wp_posts              | ref   | type_status_date | type_status_date | 44      | const,const |  8177 | Using where; Using index                                  |
|  2 | DEPENDENT SUBQUERY | wp_term_relationships | range | term_taxonomy_id | term_taxonomy_id | 8       | NULL        | 25665 | Using where; Using index; Using temporary; Using filesort |
+----+--------------------+-----------------------+-------+------------------+------------------+---------+-------------+-------+-----------------------------------------------------------+

I've tried adding indexes, optimizing, and so on to get the filesort to go away, but I can't get it to do so, at least not with MySQL 5.0.x.

The filesort comes from the "GROUP BY ... HAVING". Changing the query to avoid those fixes this. For example, this appears to produce the same results:

SELECT SQL_CALC_FOUND_ROWS  wp_posts.* FROM wp_posts  WHERE 1=1  AND
(
  SELECT COUNT(1)
  FROM wp_term_relationships
  WHERE term_taxonomy_id IN (1,461)
  AND object_id = wp_posts.ID
) = 2
AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 6;

But it finishes in a fraction of a second and generates cleaner "explain" output with no filesort:

+----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+
| id | select_type        | table                 | type | possible_keys            | key              | key_len | ref                  | rows | Extra                    |
+----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+
|  1 | PRIMARY            | wp_posts              | ref  | type_status_date         | type_status_date | 44      | const,const          | 8177 | Using where              |
|  2 | DEPENDENT SUBQUERY | wp_term_relationships | ref  | PRIMARY,term_taxonomy_id | PRIMARY          | 8       | database.wp_posts.ID |    1 | Using where; Using index |
+----+--------------------+-----------------------+------+--------------------------+------------------+---------+----------------------+------+--------------------------+

So my suggestion is that the "category__and" query be changed to something like this that avoids filesorts. (I could provide a patch if people agree that this change is a reasonable approach.)

Attachments (4)

16706.diff (930 bytes) - added by scribu 14 years ago.
16706-tigertech.diff (559 bytes) - added by tigertech 14 years ago.
16706.split.diff (802 bytes) - added by scribu 14 years ago.
Use 'AND' operator when splitting query vars
16706.2.diff (934 bytes) - added by scribu 13 years ago.
refresh

Download all attachments as: .zip

Change History (61)

#1 @tigertech
14 years ago

By the way, I showed this explanation to a co-worker here who said it wasn't clear why the second version is so much faster. So:

The reason it's faster is that it can operate almost entirely on the wp_term_relationships index. For each row (post) in the outer query, the inner query uses the PRIMARY index to load a very small number of rows based on the object_id. So it ends up with something like five rows, which it can then examine to see if they match "term_taxonomy_id IN (1,461)". (A single index on "(object_id, term_taxonomy_id)" would probably speed this up even more and eliminate the "Using where" on the subquery, operating entirely on the index.)

In contrast, the "GROUP BY" version attacks it the other way round: it first uses the term_taxonomy_id index to search for rows in wp_term_relationships that match "term_taxonomy_id IN (1,461)". Unfortunately, it ends up with tens of thousands (instead of a handful) in the worst case, which it shoves into a temporary table. It then sorts the unindexed temporary table to do the COUNT. That might be fine if MySQL 5.0.x only did that once, but it does it once for each outer query row, even though the results of the inner query are identical (constant) each time through.

Last edited 14 years ago by tigertech (previous) (diff)

#2 follow-up: @scribu
14 years ago

  • Keywords close added

Sure, they seem to return the same posts, but they don't. That modified query is equivalent to using 'category__in'.

Last edited 14 years ago by scribu (previous) (diff)

#3 @scribu
14 years ago

  • Owner set to scribu
  • Status changed from new to reviewing

A correct solution would involve doing an extra join for each category in the array.

#4 in reply to: ↑ 2 @tigertech
14 years ago

Replying to scribu:

Sure, they seem to return the same posts, but they don't. That modified query is equivalent to using 'category__in'.

Hmmmm -- what makes you say that? I've tried it with some simpler queries than the example (which I unfortunately can't compare directly because the original WordPress version never finishes), and it seems to be correct.

In either version, the 'category__and' nature of it comes from loading all the rows of wp_term_relationships that match both the object_id and term_taxonomy_id, then counting them and making sure that the number of found rows is the same as the number of categories passed into the query (2 in this example).

Even if my suggested change is wrong for a 'category__and' fix, I'm pretty sure it's not equivalent to 'category__in'. Using 'category__in' on our customer's data with "461" as one of the categories always results in at least 167 posts if I remove the LIMIT. But using my replacement query with "461" as one of the categories always results in fewer.

#5 follow-up: @scribu
14 years ago

Here's an experiment to try on a clean install:

Make two categories A and B.

Make Post 1 and assign it both category A and category B.

Make Post 2 and assign it to just category A.

Do the category__and => (A, B) query and compare.

The query without GROUP BY will return Post 1 as well as Post 2, which is incorrect.

Last edited 14 years ago by scribu (previous) (diff)

#6 @scribu
14 years ago

  • Component changed from Database to Performance
  • Keywords needs-patch added; close removed
  • Milestone changed from Awaiting Review to Future Release
  • Type changed from defect (bug) to enhancement

By the way, you can get the more efficient query I was talking about by using the more verbose syntax:

query_posts(array(
  'tax_query' => array(
    array( 'taxonomy' => 'category', 'terms' => 1 ),
    array( 'taxonomy' => 'category', 'terms' => 461 ),
));
SELECT SQL_CALC_FOUND_ROWS wp_posts . *
FROM wp_posts
INNER JOIN wp_term_relationships ON ( wp_posts.ID = wp_term_relationships.object_id )
INNER JOIN wp_term_relationships AS tt1 ON ( wp_posts.ID = tt1.object_id )
WHERE 1 =1
AND (
wp_term_relationships.term_taxonomy_id
IN ( 38 )
AND tt1.term_taxonomy_id
IN ( 37 )
SIMPLE wp_term_relationships ref PRIMARY,term_taxonomy_id term_taxonomy_id 8 const 1 Using temporary; Using filesort
SIMPLE tt1 eq_ref PRIMARY,term_taxonomy_id PRIMARY 16 wp-trunk.wp_term_relationships.object_id,const 1 Using index
SIMPLE wp_posts eq_ref PRIMARY,type_status_date PRIMARY 8 wp-trunk.wp_term_relationships.object_id 1 Using where

So, it's just a matter of making the 'category__and' rewrite internally to that more efficient way.

Version 1, edited 14 years ago by scribu (previous) (next) (diff)

#7 @scribu
14 years ago

  • Keywords has-patch added; needs-patch removed

@scribu
14 years ago

#8 @scribu
14 years ago

16706.diff performs the transformation only if there are less than 10 terms involved per query, but I haven't done any testing to see at which point the number of JOINs causes worse performance than the GROUP BY method.

#9 @scribu
14 years ago

Also, it only works if the relation between queries is also 'AND', because if it's 'OR', you have to group the IN queries that come from an AND one.

#10 @mikeschinkel
14 years ago

  • Cc mikeschinkel@… added

#11 in reply to: ↑ 5 @tigertech
14 years ago

Replying to scribu:

Here's an experiment to try on a clean install:

Make two categories A and B.

Make Post 1 and assign it both category A and category B.

Make Post 2 and assign it to just category A.

Do the category__and => (A, B) query and compare.

The query without GROUP BY will return Post 1 as well as Post 2, which is incorrect.

I tried a clean install and tested this. My change does not return both posts, though, and I'm not sure why you'd think it would. The results are identical.

Did you try it? If so, I think you accidentally tried something different.

It won't include Post 2 because when the outer query calls the inner query like this:

  AND (
  SELECT COUNT(1)
  FROM wp_term_relationships
  WHERE term_taxonomy_id IN (<A>,<B>)
  AND object_id = <Post_2_ID>
  ) = 2

.... the "COUNT(1)" result won't equal 2. It will equal 1, because only one row in wp_term_relationships will match both WHERE/AND clauses. This subquery will therefore be false and the post won't be included.

I'm attaching a patch to make 100% sure we're talking about the same change....

#12 follow-up: @scribu
14 years ago

Indeed, we weren't talking about the same thing. Thanks for the clarification.

I presume that COUNT(1) can be safely replaced with the more common COUNT(*)

The question now is which is faster overall: the extra JOIN or your method.

#13 follow-ups: @scribu
14 years ago

I ran some tests with SQL_NO_CACHE on (22600+ posts):

GROUP BY: 0.9389 sec

COUNT: 0.1706 sec

JOINs: 0.0013 sec

If you turn on profiling, you see why the first two methods are so slow:

Sending data 	0.000031
executing 	0.000003
Sending data 	0.000007
executing 	0.000002

over and over again.

So, anyway, we can use both patches:

16706.diff avoids 'AND' queries whenever possible.

16706-tigertech.diff optimizes the SQL for 'AND' queries when we have no choice.

#14 in reply to: ↑ 12 @tigertech
14 years ago

Replying to scribu:

I presume that COUNT(1) can be safely replaced with the more common COUNT(*)

Sure, that's fine -- whichever is the WordPress style. (The "COUNT(1)" style is faster in some cases involving NULLs, so I use it out of habit, but in this case it doesn't matter.)

#15 in reply to: ↑ 13 @tigertech
14 years ago

Replying to scribu:

I ran some tests with SQL_NO_CACHE on (22600+ posts):

GROUP BY: 0.9389 sec

COUNT: 0.1706 sec

JOINs: 0.0013 sec

Right, that makes sense. The fundamental problem with the first two is that the DEPENDENT SUBQUERY runs the inner query once for each outer row. My proposed change isn't the fastest possible; it merely makes sure that the subquery prefers the index that returns only a handful of rows (keeping the inner query manageable), instead of using the index that potentially returns tens of thousands of rows requiring a filesort that completely kills everything.

The JOIN method avoids running a DEPENDENT SUBQUERY and is superior, but is unfortunately not possible in every case.

I agree that using JOIN where possible, and my suggestion when it's not, seems to be the best possible solution.

#16 in reply to: ↑ 13 @tigertech
14 years ago

Replying to scribu:

I ran some tests with SQL_NO_CACHE on (22600+ posts):

By the way, that's still small enough that MySQL doesn't resort to a filesort.

If you try the same thing when one of the category__and categories is a category containing more than, say, 50000 posts, MySQL will switch to a filesort and the GROUP BY version of the test will suddenly start to take minutes/hours.

(I'm not suggesting you actually do it -- just emphasizing that the problem can get much, much worse than these results indicate.)

#17 @scribu
14 years ago

16706.split.diff is just some DRYing, letting WP_Tax_Query::get_sql() do the transformation.

@scribu
14 years ago

Use 'AND' operator when splitting query vars

#18 @scribu
14 years ago

Related: #16826

#19 @scribu
13 years ago

  • Milestone changed from Future Release to 3.2

#20 @scribu
13 years ago

  • Status changed from reviewing to accepted

#21 @GautamGupta
13 years ago

  • Cc gautam.2011@… added

#22 @ryan
13 years ago

With the patch a tag intersection (foo+bar) on a testbed with over 2000 posts took 49.7ms. Without the patch the intersection took 7,360.7ms. Results seem to be identical. Sweet patch.

#23 @scribu
13 years ago

All 3 patches should go in. :)

#24 @matt
13 years ago

Very cool.

#25 @aaroncampbell
13 years ago

  • Cc aaroncampbell added

#26 @ryan
13 years ago

(In [17652]) More efficient term intersection query. Props tigertech. see #16706

#27 @sexytyranno
13 years ago

Given that this can decimate a shared MySQL server, and there are a ton of WP installs running on shared hosting around the world, shouldn't this be worthy of shipping in the next minor release, instead of waiting for 3.2?

#28 @scribu
13 years ago

Minor releases are reserved for security issues and major bugs that slipped through in the previous release.

This is not a major bug, as the majority of WP instalations don't use category__and.

#29 @Denis-de-Bernardy
13 years ago

I'd advise adding an extra test-case for this one.

As I read the initial query plan with the group by/having clause, MySQL is merge joining two large sets, and the resulting list of posts gets sorted in the best possible way is available.

The suggested optimization, by contrast, results in a nested loop over the posts' index, with a subquery check for each row. This is faster with your test data, but not always. For instance, if you've a reasonably large posts table (say 100k rows) each sprinkled with a term or two, and the final result is no match at all or one very far away in the index, you may very well end up visiting the whole index on the posts table (which both plans do anyway), along with a subquery for each row (which is then considerably slower than merge joining the mess).

Point is, the merge join plan is appropriate if there are very few resulting rows.

Out of curiosity, has the following plan been tried at any point while testing or in prior implementations?

SELECT SQL_CALC_FOUND_ROWS 
wp_posts.*
FROM wp_posts 
JOIN wp_term_relationships
ON wp_posts.ID = wp_term_relationships.object_id
AND term_taxonomy_id IN (1,461)
WHERE 1=1  
AND wp_posts.post_type = 'post'
AND (wp_posts.post_status = 'publish')
GROUP BY wp_posts.ID
HAVING COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) = 2
ORDER BY wp_posts.post_date DESC
LIMIT 0, 6;

The COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) rather than COUNT(*) should ensure we don't get bogus rows when additional joins kick in. Intuitively, the MySQL planner should do a nested loop from whichever of posts or terms will yield the least results.

If the above introduces two additional sorts, maybe try this one too (so as to aggregate over the pre-sorted result set):

SELECT SQL_CALC_FOUND_ROWS 
wp_posts.*
FROM wp_posts 
JOIN wp_term_relationships
ON wp_posts.ID = wp_term_relationships.object_id
AND term_taxonomy_id IN (1,461)
WHERE 1=1  
AND wp_posts.post_type = 'post'
AND (wp_posts.post_status = 'publish')
GROUP BY wp_posts.post_date DESC, wp_posts.ID
HAVING COUNT(DISTINCT wp_term_relationships.term_taxonomy_id) = 2
ORDER BY wp_posts.post_date DESC
LIMIT 0, 6;

#30 @jane
13 years ago

  • Type changed from enhancement to defect (bug)

#31 @scribu
13 years ago

Marked #17488 as duplicate.

#32 @scribu
13 years ago

In #17488, it was suggested that we should do two separate queries.

However, we know that if you have a large amount of posts, it can lead to segfaults, by having to concatenate all the ids into a single string.

#33 @nkuttler
13 years ago

FYI, #mysql suggested a "derived table" query which would look like

SELECT a.name, b.name, IFNULL(b.cnt, 0) AS cnt FROM tblname AS a LEFT JOIN (SELECT name, COUNT(*) AS cnt FROM tblname2 GROUP BY name) AS b ON a.name = b.name;

Obviously not adapted to wp, and I didn't look further into it. I'll do two queries for now.

(and +1 on major bug, this obviously renders a built-in feature useless in quite common circumstances)

@scribu
13 years ago

refresh

#34 @scribu
13 years ago

Please try 16706.2.diff. It should be considerably faster.

#35 @toscho
13 years ago

  • Cc info@… added

#36 @nkuttler
13 years ago

  • Cc wp-trac@… added

#37 @nacin
13 years ago

What are we doing here? This looks scary for 3.2, and we're already done a lot here. 3.3? Ryan?

#38 @ryan
13 years ago

  • Keywords 3.3-early added
  • Milestone changed from 3.2 to Future Release

#39 @kirasong
13 years ago

  • Cc mike.schroder@… added

#40 @batmoo
13 years ago

  • Cc batmoo@… added

#41 @milez
13 years ago

I am having this exact same problem on a busy 150k posts WP 3.1 install. Very happy to test this and post results but could someone explain to me which patches I require please?

#43 @milez
13 years ago

  • Cc milez added

Thanks for the quick reply but I have already applied the first 3 patches after reading through this thread closer. I actually tried the combined patch 16706.2.diff first but got an error. Also tried it last but same error:

$ patch -p0 < taxonomy-16706.2.diff patching file wp-includes/taxonomy.php Hunk #1 FAILED at 636.

Did I apply the wrong patches?

As far as performance goes a bit early to tell but I noticed *most* of my SQL_CALC_FOUND_ROWS queries to magically improve but not all.

#44 @scribu
13 years ago

I actually tried the combined patch 16706.2.diff first but got an error. Also tried it last but same error

It applies cleanly for me.

Make sure you cd-ed into the right directory (the wp root directory) and make sure you have a clean checkout:

svn revert --recursive .

#45 @milez
13 years ago

Hmm thanks scribu, but on a clean version of 3.1 taxonomy.php line 629:

foreach ( $this->queries as $query ) {

But in your .diff you have it as line 639. Also see the original patch you posted 16706.diff also has that line as 629. Am I missing something?

Thanks!

#46 @scribu
13 years ago

Oh, you're running WP 3.1? You should upgrade to 3.2.1 first.

#47 @milez
13 years ago

Aha thanks scribu! Well the version specified at the top of the thread is 3.1. Honestly updating to 3.2.1 is not an option at this point either as too many customizations to work through :/

Any chance you could help me fix the patch for 3.1? I've made a few attempts but been unable to fix the .diff file you provided. Or am I right to assume that the other 3 .diff files you provided are for 3.1: 16706.diff, 16706-tigertech.diff and 16706.split.diff ???

Thanks again!

#48 @scribu
12 years ago

  • Keywords needs-testing added; 3.3-early removed
  • Type changed from defect (bug) to enhancement

So, what's left are:

Last edited 12 years ago by scribu (previous) (diff)

#49 @scribu
12 years ago

  • Owner scribu deleted
  • Status changed from accepted to assigned

#50 @johnbillion
12 years ago

  • Cc johnbillion added

#51 @Viper007Bond
12 years ago

  • Cc viper007bond added

#52 @daevid
12 years ago

  • Cc daevid added

Having the same issue on a 60k+ post table. Is this patch about to be released in a coming WP release quite soon? Or should I apply the mentioned patches?

#53 @nacin
11 years ago

  • Component changed from Performance to Query
  • Focuses performance added

#54 @chriscct7
9 years ago

  • Keywords needs-refresh added
  • Owner set to chriscct7

#57 @talldanwp
6 years ago

#46459 was marked as a duplicate.

#58 follow-up: @talldanwp
6 years ago

  • Keywords bulk-reopened added

On #46459 the workaround applied in #25238 was mentioned. It appears that the workaround for category__and could be reverted as part of this ticket.

#59 in reply to: ↑ 58 @seanleavey
6 years ago

Replying to talldanwp:

On #46459 the workaround applied in #25238 was mentioned. It appears that the workaround for category__and could be reverted as part of this ticket.

Note, the workaround is not in ticket 25238, it's in changeset 25238: https://core.trac.wordpress.org/changeset/25238

Note: See TracTickets for help on using tickets.