WordPress.org

Make WordPress Core

Opened 12 months ago

Last modified 6 months ago

#24837 new enhancement

querying optimization for category AND searching

Reported by: robertv123 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 3.6
Component: Taxonomy Keywords: has-patch needs-testing
Focuses: Cc:

Description (last modified by johnbillion)

querying category with an AND for two term_id's plus having a 50,000 + posts site causes the query produced by get_sql in wp-includes/taxonomy.php to take excessive amount of time.

On my site the current query in trunk takes 200 + seconds and the optimizes query takes 100 to 200 milliseconds.

Sub selects in mysql are not optimized on the server very well.

Example code to utilize query

query_posts(array('category__and' => array($term0_id,$term1_id)));

affected code starts on line 724 of wp-includes/taxonomy.php

Potential problems.

  1. I simply added the temp table to the $join variable not knowing if this will cause problems on other queries
  2. Clean up of the temporary tables may be needed after query is executed with a persistent connection but in my case they can be left there as mysql will garbage clean them after script disconnects.

Attached is a patch to wp-includes/taxonomy.php

Attachments (1)

taxonomy_optimization_patch_for_and_operations.patch (1.6 KB) - added by robertv123 12 months ago.
svn diff for modification of taxonomy.php

Download all attachments as: .zip

Change History (13)

comment:1 johnbillion12 months ago

  • Description modified (diff)

comment:2 SergeyBiryukov12 months ago

Please provide an svn diff if possible, rather than a standard diff.

robertv12312 months ago

svn diff for modification of taxonomy.php

comment:3 robertv12312 months ago

svn diff uploaded to replace standard diff

comment:4 wonderboymusic11 months ago

Can you paste the before / after queries in the comments here?

comment:5 robertv12311 months ago

before

SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts  WHERE 1=1  AND ( (
                                        SELECT COUNT(1)
                                        FROM wp_term_relationships
                                        WHERE term_taxonomy_id IN (5,453)
                                        AND object_id = wp_posts.ID
                                ) = 2 ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20

after I split it up into many queries that mysql can handle easier. My guess is mysql does not cache a sub queries result. This creates a temporary table (mysql could do this automatically but that's another story, sub queries are well known to be poor in mysql)

//create the temporary tables
create temporary table $table ( object_id bigint(20) unsigned, term_taxonomy_id bigint(20) unsigned) 
create temporary table $table1 ( object_id bigint(20) unsigned, c int )

// collect all the objects + terms together into temp table
insert into $table select  object_id , term_taxonomy_id from  wp_term_relationships where term_taxonomy_id IN ($terms)
// insert the counts of how many objects that have all $terms
insert into $table1 select object_id,count(*) from $table group by object_id
// drop the first temp table that was use to collect the data
DROP TABLE $table

// add the second table to the where query
$join .= " , $table1 AS tt "; 
$where = " tt.c = $num_terms AND tt.object_id = wp_posts.ID "
}}
This would make the finial query (I did this manually so hopefully I got it right)
{{
SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts , $table1 AS tt   WHERE 1=1  AND (  tt.c = $num_terms AND tt.object_id = wp_posts.ID  ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20

The concerns I have is that the second temp table is currently not if persistent connections are used. This would cause problems over time if the same connect was used many times either due to high volume or the connection being open for a long term.

So a hook is needed to drop the second temporary table after the sql query is completed

OR

a better query that does not require temporary tables

Version 0, edited 11 months ago by robertv123 (next)

comment:6 nacin11 months ago

It may require a greater rewrite, but I imagine there is also an alternative solution for this that does not require temporary tables?

comment:7 robertv12311 months ago

There may be other ways of doing this that gets mysql to handle this internally but I have not as yet found it yet. If someone knows a sql super star I'd like to hear alternatives, my mysql is quite basic.

Portability comes into question as wordpress uses a wide range of quries, this pattern optimizes sub queries works across the board for earlier 4.x mysql as well.

These types of quires have been on the mysql todo list for quite some time now, a decade or more. Latest promise is mysql 6.0, there is some that has been completed for 5.6 but it's far from being done.

Word press will likely still be run on plenty of 5.x mysql installs for quite some time so any sub queries in the code are land mines for users who happen to create a larger data set. In my case it was 120000 records in wp_term_relationships , 60000 posts.

For low end virtual machines or shared hosting the threshold would be lower.

Here is more info on sub queries
http://www.mysqlperformanceblog.com/2010/10/25/mysql-limitations-part-3-subqueries/

comment:8 wonderboymusic11 months ago

When ordering by post_date - probably the most effective optimization is forcing an index: FROM wp_posts USE INDEX (type_status_date). This will force the join type to be range instead of index. On a query with GROUP BY and ORDER BY, the index forcing, for me, has made queries faster un-cached than the cached queries without it.

Subqueries are sometimes rewritten internally by MySQL to perform worse than JOINs.

comment:9 robertv12311 months ago

I took a look and the query was all ready using index type_status_date. My feeling is that the query is simply too complex for mysql. Two group by's and a sub query. Mysql is just trying to sort out simple sub queries in the current unstable db.

+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+
| id | select_type        | table                 | type  | possible_keys            | key              | key_len | ref                      | rows  | Extra                    |
+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+
|  1 | PRIMARY            | wp_posts              | index | type_status_date         | type_status_date | 140     | NULL                     | 47456 | Using where; Using index | 
|  2 | DEPENDENT SUBQUERY | wp_term_relationships | ref   | PRIMARY,term_taxonomy_id | PRIMARY          | 8       | test.wp_posts.ID |  1213 | Using where; Using index | 
+----+--------------------+-----------------------+-------+--------------------------+------------------+---------+--------------------------+-------+--------------------------+

comment:10 wonderboymusic11 months ago

no, you want to force the index so that type becomes range instead of index. http://dev.mysql.com/doc/refman/5.1/en/explain-output.html#jointype_range

comment:11 robertv12311 months ago

I have not figure out away to use a range index I used FROM wp_posts USE INDEX (type_status_date) in the query in this manner to no luck. The documentation seams to imply that the range index only works with constants.

 SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts use index ( type_status_date) WHERE 1=1  AND ( (                                         SELECT COUNT(1)                                         FROM wp_term_relationships                                         WHERE term_taxonomy_id IN (5,453)                                         AND object_id = wp_posts.ID                                 ) = 2 ) AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish') GROUP BY wp_posts.ID ORDER BY wp_posts.post_date DESC LIMIT 0, 20;

And explain indicates type index.

The tests where done on mysql 5.5.31

comment:12 jaybna6 months ago

This may not be the exact same query, but for something we were doing recently we had to create a new compound index that includes ID and post_type (http://screencast.com/t/dGDNzoSQceQk)

Then used FROM wp_posts USE INDEX (post_type_id) to force the new index to be used. It eliminated the temp table and improved query performances from 6 seconds to 0.32 seconds. Using mysql 5.5.30

Note: See TracTickets for help on using tickets.