WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 6 months ago

#24837 new enhancement

querying optimization for category AND searching

Reported by: robertv123 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 3.6
Component: Taxonomy Keywords: has-patch needs-testing
Focuses: performance Cc:

Description (last modified by johnbillion)

querying category with an AND for two term_id's plus having a 50,000 + posts site causes the query produced by get_sql in wp-includes/taxonomy.php to take excessive amount of time.

On my site the current query in trunk takes 200 + seconds and the optimizes query takes 100 to 200 milliseconds.

Sub selects in mysql are not optimized on the server very well.

Example code to utilize query

query_posts(array('category__and' => array($term0_id,$term1_id)));

affected code starts on line 724 of wp-includes/taxonomy.php

Potential problems.

  1. I simply added the temp table to the $join variable not knowing if this will cause problems on other queries
  2. Clean up of the temporary tables may be needed after query is executed with a persistent connection but in my case they can be left there as mysql will garbage clean them after script disconnects.

Attached is a patch to wp-includes/taxonomy.php

Attachments (1)

taxonomy_optimization_patch_for_and_operations.patch (1.6 KB) - added by robertv123 2 years ago.
svn diff for modification of taxonomy.php

Download all attachments as: .zip

Change History (17)

comment:1 @johnbillion2 years ago

  • Description modified (diff)

comment:2 @SergeyBiryukov2 years ago

Please provide an svn diff if possible, rather than a standard diff.

@robertv1232 years ago

svn diff for modification of taxonomy.php

comment:3 @robertv1232 years ago

svn diff uploaded to replace standard diff

comment:4 @wonderboymusic2 years ago

Can you paste the before / after queries in the comments here?

comment:5 @robertv1232 years ago

before

SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts  WHERE 1=1  AND ( (
                                        SELECT COUNT(1)
                                        FROM wp_term_relationships
                                        WHERE term_taxonomy_id IN (5,453)
                                        AND object_id = wp_posts.ID
                                ) = 2 ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20

after I split it up into many queries that mysql can handle easier. My guess is mysql does not cache a sub queries result. This creates a temporary table (mysql could do this automatically but that's another story, sub queries are well known to be poor in mysql)

//create the temporary tables
create temporary table $table ( object_id bigint(20) unsigned, term_taxonomy_id bigint(20) unsigned) 
create temporary table $table1 ( object_id bigint(20) unsigned, c int )

// collect all the objects + terms together into temp table
insert into $table select  object_id , term_taxonomy_id from  wp_term_relationships where term_taxonomy_id IN ($terms)
// insert the counts of how many objects that have all $terms
insert into $table1 select object_id,count(*) from $table group by object_id
// drop the first temp table that was use to collect the data
DROP TABLE $table

// add the second table to the where query
$join .= " , $table1 AS tt "; 
$where = " tt.c = $num_terms AND tt.object_id = wp_posts.ID "
}}
This would make the finial query (I did this manually so hopefully I got it right)
{{
SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts , $table1 AS tt   WHERE 1=1  AND (  tt.c = $num_terms AND tt.object_id = wp_posts.ID  ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20

The concerns I have is that the second temp table is currently not if persistent connections are used. This would cause problems over time if the same connect was used many times either due to high volume or the connection being open for a long term.

So a hook is needed to drop the second temporary table after the sql query is completed

OR

a better query that does not require temporary tables

Version 0, edited 2 years ago by robertv123 (next)

comment:6 @nacin2 years ago

It may require a greater rewrite, but I imagine there is also an alternative solution for this that does not require temporary tables?

comment:7 @robertv1232 years ago

There may be other ways of doing this that gets mysql to handle this internally but I have not as yet found it yet. If someone knows a sql super star I'd like to hear alternatives, my mysql is quite basic.

Portability comes into question as wordpress uses a wide range of quries, this pattern optimizes sub queries works across the board for earlier 4.x mysql as well.

These types of quires have been on the mysql todo list for quite some time now, a decade or more. Latest promise is mysql 6.0, there is some that has been completed for 5.6 but it's far from being done.

Word press will likely still be run on plenty of 5.x mysql installs for quite some time so any sub queries in the code are land mines for users who happen to create a larger data set. In my case it was 120000 records in wp_term_relationships , 60000 posts.

For low end virtual machines or shared hosting the threshold would be lower.

Here is more info on sub queries
http://www.mysqlperformanceblog.com/2010/10/25/mysql-limitations-part-3-subqueries/

comment:8 @wonderboymusic2 years ago

When ordering by post_date - probably the most effective optimization is forcing an index: FROM wp_posts USE INDEX (type_status_date). This will force the join type to be range instead of index. On a query with GROUP BY and ORDER BY, the index forcing, for me, has made queries faster un-cached than the cached queries without it.

Subqueries are sometimes rewritten internally by MySQL to perform worse than JOINs.

comment:9 @robertv1232 years ago

I took a look and the query was all ready using index type_status_date. My feeling is that the query is simply too complex for mysql. Two group by's and a sub query. Mysql is just trying to sort out simple sub queries in the current unstable db.

+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+
| id | select_type        | table                 | type  | possible_keys            | key              | key_len | ref                      | rows  | Extra                    |
+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+
|  1 | PRIMARY            | wp_posts              | index | type_status_date         | type_status_date | 140     | NULL                     | 47456 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | wp_term_relationships | ref   | PRIMARY,term_taxonomy_id | PRIMARY          | 8       | test.wp_posts.ID |  1213 | Using where; Using index | 
+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+

comment:10 @wonderboymusic2 years ago

no, you want to force the index so that type becomes range instead of index. http://dev.mysql.com/doc/refman/5.1/en/explain-output.html#jointype_range

comment:11 @robertv1232 years ago

I have not figure out away to use a range index I used FROM wp_posts USE INDEX (type_status_date) in the query in this manner to no luck. The documentation seams to imply that the range index only works with constants.

 SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts use index ( type_status_date) WHERE 1=1  AND ( (                                         SELECT COUNT(1)                                         FROM wp_term_relationships                                         WHERE term_taxonomy_id IN (5,453)                                         AND object_id = wp_posts.ID                                 ) = 2 ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20;

And explain indicates type index.

The tests where done on mysql 5.5.31

comment:12 @jaybna19 months ago

This may not be the exact same query, but for something we were doing recently we had to create a new compound index that includes ID and post_type (http://screencast.com/t/dGDNzoSQceQk)

Then used FROM wp_posts USE INDEX (post_type_id) to force the new index to be used. It eliminated the temp table and improved query performances from 6 seconds to 0.32 seconds. Using mysql 5.5.30

comment:13 @archon8107 months ago

I just ran into a very slow query (13-25s) just dealing with the simple category page. Upon further analysis, I found that mysql isn't using the type_status_date index and PRIMARY (when it does, the query runs a lot faster and says "using index" in both the query and subquery), but rather PRIMARY and term_taxonomy_id, which is a LOT slower.

At the very least, FORCE or USE INDEX should be added here to force optimization and make it run faster.

I'm playing with adding another index right now.

The query in question btw (FORCE INDEX should be added after FROM wp_posts:

SELECT SQL_CALC_FOUND_ROWS
	wp_posts.ID
FROM
	wp_posts
INNER JOIN wp_term_relationships ON (
	wp_posts.ID = wp_term_relationships.object_id
)
WHERE
	(
		wp_term_relationships.term_taxonomy_id IN (3, 5201, 9699, 11025, 20317)
	)
AND wp_posts.post_type = 'post'
AND (
	wp_posts.post_status = 'publish'
	OR wp_posts.post_status = 'private'
)
GROUP BY
	wp_posts.ID
ORDER BY
	wp_posts.post_date DESC
LIMIT 0,
 10;

comment:14 @archon8106 months ago

WP Query is extremely complicated. What's the easiest way to insert such a simple USE INDEX directive into the SQL call? There are many filters - what do the pros recommend? @SergeyBiryukov?

comment:15 @archon8106 months ago

I was wrong - USE INDEX(type_status_date) here doesn't actually resolve the issue, and the query still runs for a very long time.

I've been trying various index approaches but so far I haven't been able to get this thing to run fast.

comment:16 @archon8106 months ago

  • Focuses performance added
Note: See TracTickets for help on using tickets.