Make WordPress Core

Opened 16 years ago

Closed 12 years ago

Last modified 12 years ago

#7415 closed defect (bug) (fixed)

"using filesort" in default install

Reported by: dbuser123's profile dbuser123 Owned by:
Milestone: 2.8 Priority: high
Severity: major Version: 2.7
Component: Optimization Keywords:
Focuses: Cc:

Description

I took the database of a large existing WP blog (10k posts, 30k comments) and got it to work with a fresh and clean WP install with the Default theme without any plugins.

There are ridiculously many queries executed: 50. Here are some recommendations:

Almost half of them are like "SELECT post_parent FROM wp_posts WHERE ID= '...' LIMIT 1", just after a "SELECT * FROM wp_posts WHERE (post_type = 'page' AND post_status = 'publish') ORDER BY menu_order, post_title ASC". A left join would be appropriate. Even if you don't want to join, you could cache the post_parent because many posts could have the same parent.

The slowest query by far (40% of total query time) is this one:
SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts WHERE 1=1 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish' OR wp_posts.post_status = 'private') ORDER BY wp_posts.post_date DESC LIMIT 0, 6
This query had to sort 3000 rows BY HAND (only if logged in as admin). Because of the OR, the post_date part of the index (post_type,post_status,post_date,ID) could not be used for sorting. A UNION would solve this problem:
(SELECT wp_posts.* FROM wp_posts WHERE wp_posts.post_type = 'post' AND wp_posts.post_status = 'publish' ORDER BY wp_posts.post_date DESC LIMIT 0, 6)
UNION
(SELECT wp_posts.* FROM wp_posts WHERE wp_posts.post_type = 'post' AND wp_posts.post_status = 'private' ORDER BY wp_posts.post_date DESC LIMIT 0, 6)
ORDER BY post_date DESC LIMIT 0,6

The second problem with this query is the SQL_CALC_FOUND_ROWS. It makes MySQL retrieve all rows, thus causing much disk i/o. It's way faster to do a second query that does a count(*) because it can be done by reading the index instead of the rows. This made the query 10 times as fast for me.

Next one: SELECT post_id, meta_key, meta_value FROM wp_postmeta WHERE post_id IN (9109,9103,9052,9112,9100,9096) ORDER BY post_id, meta_key
I have no idea why it has to be sorted by meta_key as well, but as the table is not indexed on (post_id,meta_key), it causes a manual sort.

Yet another one: SELECT * FROM wp_posts WHERE (post_type = 'page' AND post_status = 'publish') ORDER BY menu_order, post_title ASC
Has to be ordered manually again because there's no appropriate index.

I hope somebody will look at the db-optimization because there's still lots of room for improvement. To assist you, set SAVEQUERIES to true and change the destructor of wp-db.php:

function destruct() {

$ttime = 0;
foreach($this->queries as $q) $ttime += $q[1];
echo '<div style="text-align:left; font-family:courier new; font-size: 14;">';
foreach($this->queries as $q) {

$i++;
echo '. $i .? (' . round(100*$q[1]/$ttime,5) . '% - ' . round($q[1],5) . 's) ' . htmlspecialchars($q[0]) . '<br />';

}
echo '</div>';
return true;

}

Attachments (1)

no_found_rows.diff (1.2 KB) - added by ryan 16 years ago.
Eliminate FOUND_ROWS for post query.

Download all attachments as: .zip

Change History (44)

#1 @ryan
16 years ago

(In [8469]) No need to order results. Props dbuser123. see #7415

#2 @ryan
16 years ago

The repeated post_parent SELECTs should be fixed in 2.6. Are you using 2.6?

Good suggestions. I already applied one of them. I'll try out the rest.

#3 @dbuser123
16 years ago

I'm sorry, nl.wordpress.org gave me an old version. The post_parent is fixed indeed, and this version only needs 23 queries.

One comment on my initial report: instead of the UNION, post_status could be left out of the query. That way, the index could still be used for sorting.

There are two new queries that could be optimized by taking out the ORDER BY clause:
SELECT t.*, tt.*, tr.object_id FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON tt.term_id = t.term_id INNER JOIN iphone_term_relationships AS tr ON tr.term_taxonomy_id = tt.term_taxonomy_id WHERE tt.taxonomy IN ('category', 'post_tag') AND tr.object_id IN (9109, 9103, 9052, 9112, 9100, 9096) ORDER BY t.name ASC
SELECT t.*, tt.* FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('category') ORDER BY t.name ASC

This query is also new and slow:
SELECT object_id, term_taxonomy_id FROM iphone_term_relationships INNER JOIN iphone_posts ON object_id = ID WHERE term_taxonomy_id IN (3,27,10,8,5,28,4,26,4286,13,39,41,3296,2169,2699,36,1920,35,2881,23,42,14,21,15,24,32,34,48,25,22,20,2158,17,977,47,18,31,37,40,16,2159,19,45,30,46,1182,33,3431,11,49,12,7,6,29,2539,43,38,4456,9,44,4138,1) AND post_type = 'post' AND post_status = 'publish'
On iphone_term_relationships it uses an index on (object_id,term_taxonomy_id), while an index on (term_taxonomy_id,object_id) would be way better. The reason why object_id is in the index, is because then rows don't have to be read.

And for this query, an index on tt.taxonomy would be nice:
SELECT t.*, tt.* FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('link_category') AND tt.count > 0 ORDER BY t.name ASC

#4 @dbuser123
16 years ago

Oh, there are some code-tags. Here's everything that still needs attention; everything mentioned above can be ignored.

The slowest query by far (40% of total query time) is this one

SELECT SQL_CALC_FOUND_ROWS wp_posts.* FROM wp_posts WHERE 1=1 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish' OR wp_posts.post_status = 'private') ORDER BY wp_posts.post_date DESC LIMIT 0, 6

This query had to sort 3000 rows BY HAND (only if logged in as admin). Because of the OR, the post_date part of the index (post_type,post_status,post_date,ID) could not be used for sorting. If you leave out the column post_status out of the index, the index can be used for sorting.

Yet another one:
SELECT * FROM wp_posts WHERE (post_type = 'page' AND post_status = 'publish') ORDER BY menu_order, post_title ASC Has to be ordered manually again because there's no appropriate index.

The second problem with this query is the SQL_CALC_FOUND_ROWS. It makes MySQL retrieve all rows, thus causing much disk i/o. It's way faster to do a second query that does a count(*) because it can be done by reading the index instead of the rows. This made the query 10 times as fast for me.

There are two more queries that could be optimized by taking out the ORDER BY clause: SELECT t.*, tt.*, tr.object_id FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON tt.term_id = t.term_id INNER JOIN iphone_term_relationships AS tr ON tr.term_taxonomy_id = tt.term_taxonomy_id WHERE tt.taxonomy IN ('category', 'post_tag') AND tr.object_id IN (9109, 9103, 9052, 9112, 9100, 9096) ORDER BY t.name ASC
SELECT t.*, tt.* FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('category') ORDER BY t.name ASC

This query is also new and slow:
SELECT object_id, term_taxonomy_id FROM iphone_term_relationships INNER JOIN iphone_posts ON object_id = ID WHERE term_taxonomy_id IN (3,27,10,8,5,28,4,26,4286,13,39,41,3296,2169,2699,36,1920,35,2881,23,42,14,21,15,24,32,34,48,25,22,20,2158,17,977,47,18,31,37,40,16,2159,19,45,30,46,1182,33,3431,11,49,12,7,6,29,2539,43,38,4456,9,44,4138,1) AND post_type = 'post' AND post_status = 'publish'
On iphone_term_relationships it uses an index on (object_id,term_taxonomy_id), while an index on (term_taxonomy_id,object_id) would be way better. The reason why object_id is still in the index, is because then rows don't have to be read.

And for this query, an index on tt.taxonomy would be nice:
SELECT t.*, tt.* FROM iphone_terms AS t INNER JOIN iphone_term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy IN ('link_category') AND tt.count > 0 ORDER BY t.name ASC

#5 @djr
16 years ago

  • Cc djr added

#6 @ryan
16 years ago

  • Milestone changed from 2.7 to 2.8

Did get as many done as I wanted for 2.7. Moving to 2.8 for further work.

#7 @rawalex
16 years ago

  • Priority changed from normal to high

Just touched this up to "high" priority, as this is a bug that stops wordpress from scaling. At a certain point (number of posts / pages / old versions saved etc) archive pages take so long to pull up that a simple bot visit for multiple month archives can freeze even a fairly large server. I have a few wordpress installs with more than 10,000 posts over 3 years, and something as simple as someone with a cache bot download the site for offline reading can kill it dead.

#8 @rawalex
16 years ago

  • Severity changed from normal to major

#9 @Denis-de-Bernardy
16 years ago

correct me if I'm wrong, but... on the post meta ordering, the query plan would almost certainly be an index scan on ID followed by a quick sort on the tiny number of rows returned. it's no big deal to run a quick sort at that stage -- but I still agree that it was not necessary either.

On this: "SELECT * FROM wp_posts WHERE (post_type = 'page' AND post_status = 'publish') ORDER BY menu_order, post_title ASC Has to be ordered manually again because there's no appropriate index." I believe your assumption is wrong.

the true constraint here is post_type = page and post_status = publish, and the goal is to fetch them all, sorted. adding an "appropriate" index to (menu_order, post_title), as you seem suggest, would be completely pointless unless you limit the results to a tiny number of items. (then, and only then, would the optimizer identify that a nested loop over the index would be appropriate.)

you do raise a point in that there is no appropriate index though. the wp_posts table has:

KEY type_status_date (post_type,post_status,post_date,ID)

but that one is probably so large, and the db stats are so, that the optimizer ignores it and prefers a seq scan for this particular query. it could, and probably should, be replaced by:

KEY status_type_date (post_status,post_type,post_date)
KEY status_type_menu (post_status,post_type,menu_order)

that way, the optimizer would at least consider the first for posts, and the second for pages. this smaller index probably gets the job done too:

KEY type_status_menu (post_status, post_type)

I'm not 100% on the MySQL internals. But PostgreSQL would almost certainly prefer the smaller index, and do a top-N search in the resulting set where necessary, if the db stats show that using an index at all makes any sense.

Speaking of db stats, it seems to me that using an index only makes sense when it lets you go straight at the a small subset of data. If you need to sort 9,900 out of 10,000 entries that match post_type = post and post_status = published, the optimizer will prefer a seq scan, load the entire mess in memory, and manually sort it no matter what. It'll only consider the index once you've added several thousands of post revisions, or if you're retrieving a small subset of rows using a LIMIT clause.

As a complete aside, the optimization using the union introduces a unique and an extra sort operator in the query plan, and it probably isn't an option due to the complicated stuff in the WP_Query class. the same optimization can probably be achieved, however, by using IN ('publish', 'private') rather than the OR statement.

#10 @Denis-de-Bernardy
16 years ago

  • Cc Denis-de-Bernardy added

#11 @Denis-de-Bernardy
16 years ago

  • Version set to 2.7

#12 @dbuser123
16 years ago

I believe your assumption is wrong.

True.

load the entire mess in memory

Loading 10.000 entries in memory isn't a good idea. Sort buffers mostly aren't that large, and the hdd will be used. Does this query ever show up without a LIMIT clause anyways?

by using IN ('publish', 'private') rather than the OR statement.

MySQL doesn't distinguish between IN and OR.

#13 @rawalex
16 years ago

as there are a limited number of queries against the wp-posts file, it isn't unreasonable to have an index for each of the major combinations. So having an index on post type, post status, post date isn't a bad idea, example.

That however is really a nit pick, the true problematic queries use SQL_CALC_FOUND_ROWS in them. MySQL cannot use any index for these queries, and as a reuslt, they are non-scalable queries. They are a neat sort of code when you are playing with a small database, but get to 10,000 records and they turn into pigs. Doing a standard select followed by a simple mysql_num_rows is hugely more effecient. it is potentially a "lock" query that also won't allow other simultanious queries, which makes it even worse.

I would suggest a developer load up a single blog install with 10k posts spread over 2 years, and then use a bot to request each of the archive pages (1 a second or so). By the 5th or 6th request, your server performance will tank. I have had googlebot all but wipe out a strong dual core server asking for archive pages.

#14 @dbuser123
16 years ago

Another slow query (0.20s for a 30 MB comments table) I regularly encounter is this one:

SELECT comment_post_ID
FROM comments
WHERE LCASE(comment_author_email) = 'user@host.com' AND comment_subscribe='Y' AND comment_approved = '1'
GROUP BY comment_post_ID

It would be better to always store e-mail as lowercase, stop using the LCASE function, and add an index on (comment_author_email, comment_post_ID).

Furthermore I agree with rawalex. Using SQL_CALC_FOUND_ROWS and/or LIMIT high_number,25 both are performance killers. With 15.000 posts and a modern server, this query that shows the posts on an old page, takes 0.25s to complete. This is due to both the SQL_CALC_FOUND_ROWS and the LIMIT-clause. In an ideal situation, each post gets assigned a number that increases by one for every post and has an index on it. Then you can use SELECT max(number) instead of the SQL_CALC_FOUND_ROWS and use WHERE number BETWEEN max-5,max ORDER BY number DESC to circumvent the LIMIT-clause. This works on any page with posts, and should make even old pages very performant. It might be very time consuming to implement, but it's the best way to increase performance for large websites.

SELECT SQL_CALC_FOUND_ROWS posts.*
FROM posts
WHERE 1=1 AND posts.ID NOT IN (
    SELECT tr.object_id
    FROM term_relationships AS tr
    INNER JOIN term_taxonomy AS tt
    ON tr.term_taxonomy_id = tt.term_taxonomy_id
    WHERE tt.taxonomy = 'category' AND tt.term_id IN ('1924', '3428') )
  AND posts.post_type = 'post'
  AND (posts.post_status = 'publish')
ORDER BY posts.post_date DESC
LIMIT 2500, 5

#15 @Denis-de-Bernardy
16 years ago

  • Cc Denis-de-Bernardy removed

the max(number) option wouldn't work: keep in mind that every chunk of that query is pluginable... if the number is valid for one where clause, it isn't necessarily for the next one.

#16 @ryan
16 years ago

(In [10422]) Eliminate SQL_CACLC_FOUND_ROWS for edit-comments.php query. Use results of earlier wp_count_comments() if appropriate otherwise perform COUNT query. see #7415

#17 @ryan
16 years ago

(In [10424]) Don't use order and limit for count query. see #7415

#18 @ryan
16 years ago

(In [10446]) Force the index only for 0 OR 1 queries. see #7415

@ryan
16 years ago

Eliminate FOUND_ROWS for post query.

#19 @ryan
16 years ago

Patch eliminates found rows for the big query in WP_Query.

#20 @Denis-de-Bernardy
16 years ago

the count query you just posted is erroneous. it should be:

SELECT $distinct $fields FROM $wpdb->posts $join WHERE 1=1 $where $groupby

Adding to this, a $having clause would be a bonus. ;-)

#21 follow-up: @Denis-de-Bernardy
16 years ago

Err.. silly pasting. This even:

SELECT COUNT($distinct $wpdb->posts.ID) FROM $wpdb->posts $join WHERE 1=1 $where $groupby

#22 @Denis-de-Bernardy
16 years ago

$wpdb->posts.* might even be useful in some cases, in case someone retrieves extra fields with one or more left joins on $wpdb->postmeta.

alternatively, we'd need:

$count_fields = apply_filters('count_fields', "$wpdb->posts.ID");

or something like that... in order to retrieve the correct number of posts.

#23 in reply to: ↑ 21 ; follow-up: @mrmist
16 years ago

Replying to Denis-de-Bernardy:

Err.. silly pasting. This even:

SELECT COUNT($distinct $wpdb->posts.ID) FROM $wpdb->posts $join WHERE 1=1 $where $groupby

posts.ID is an autoinc, so it's always unique. $distinct is superfluous.

#24 @ryan
16 years ago

SELECT * FROM wp_comments WHERE comment_post_ID = 1643 AND comment_approved = '1' ORDER BY comment_date

That's the comment query that single post pages use when there is not a logged in user. It uses filesort. Ordering by comment_date_gmt and adding a comment_post_ID,comment_approved,comment_date_gmt key gets rid of filesort.

#25 follow-up: @ryan
16 years ago

Some casual testing against a posts table with 26k rows doesn't show much difference between SQL_CALC_FOUND_ROWS and COUNT for me, even when paging way back into the archives. SQL_CALC_FOUND_ROWS + FOUND ROWS is probably a little faster, on average.

#26 in reply to: ↑ 23 ; follow-up: @Denis-de-Bernardy
16 years ago

Replying to mrmist:

Replying to Denis-de-Bernardy:

Err.. silly pasting. This even:

SELECT COUNT($distinct $wpdb->posts.ID) FROM $wpdb->posts $join WHERE 1=1 $where $groupby

posts.ID is an autoinc, so it's always unique. $distinct is superfluous.

of course not. suppose this results in two rows per id:

select t1.id natural join

then this will return one row per id:

select distinct t1.id natural join t2

#27 in reply to: ↑ 25 @Denis-de-Bernardy
16 years ago

Replying to ryan:

Some casual testing against a posts table with 26k rows doesn't show much difference between SQL_CALC_FOUND_ROWS and COUNT for me, even when paging way back into the archives. SQL_CALC_FOUND_ROWS + FOUND ROWS is probably a little faster, on average.

I'm not surprised, personally. for a query with meaningful join/where/group by/having clauses, it'll actually run the aggregate -- meaning you end up running the query twice.

#28 in reply to: ↑ 26 @mrmist
16 years ago

Replying to Denis-de-Bernardy:

of course not. suppose this results in two rows per id:

select t1.id natural join

then this will return one row per id:

select distinct t1.id natural join t2

Apologies I did not see $join in your query.

#29 follow-up: @rawalex
16 years ago

If you need the data and the count, why not just do the query and then a mysql_count_rows? If you are just looking for the count, then severely limit the fields you are pulling in (wp_post.* is a bad choice if you don't need the data). Less fields means less data space required.

#30 @rawalex
16 years ago

One other thing is the use in the archive page queries of year(date) month(date) in queries. It would be much more effecient to create a "first date" and "last date" and have the query look for post dates between them. The natural date comparison is much more effecient than requiring two date conversions to a year number / month number on each item, and would allow the query to use a date index to be somewhat more effecient.

Basically, if a query can't use an index, it normally should be re-written to use one.

#31 in reply to: ↑ 29 ; follow-up: @ryan
16 years ago

Replying to rawalex:

If you need the data and the count, why not just do the query and then a mysql_count_rows? If you are just looking for the count, then severely limit the fields you are pulling in (wp_post.* is a bad choice if you don't need the data). Less fields means less data space required.

We don't need to count the number of rows in the result, we need to count the number of rows the query would return if it had no limits. Without SQL_CALC_FOUND_ROWS, we have to do a separate COUNT query.

#32 in reply to: ↑ 31 ; follow-up: @rawalex
16 years ago

Replying to ryan:

Replying to rawalex:


We don't need to count the number of rows in the result, we need to count the number of rows the query would return if it had no limits. Without SQL_CALC_FOUND_ROWS, we have to do a separate COUNT query.

Because of the overhead of calc_found_rows, wouldn't it just be better to make the first query full without limits and count it, and control the number of items displayed elsewhere? Right now you are doing two full queries, which would be more intensive, no?

#33 in reply to: ↑ 32 ; follow-up: @mrmist
16 years ago

Replying to rawalex:

Because of the overhead of calc_found_rows, wouldn't it just be better to make the first query full without limits and count it, and control the number of items displayed elsewhere? Right now you are doing two full queries, which would be more intensive, no?

The mysql docs say

If you are using SELECT SQL_CALC_FOUND_ROWS, MySQL must calculate how many rows are in the full result set. However, this is faster than running the query again without LIMIT, because the result set need not be sent to the client.


And I'm tempted to agree. If you have a massive results set the last thing you would want to do is return the entire thing to the client.

#34 in reply to: ↑ 33 @rawalex
16 years ago

Replying to mrmist:

Replying to rawalex:

Because of the overhead of calc_found_rows, wouldn't it just be better to make the first query full without limits and count it, and control the number of items displayed elsewhere? Right now you are doing two full queries, which would be more intensive, no?

The mysql docs say

If you are using SELECT SQL_CALC_FOUND_ROWS, MySQL must calculate how many rows are in the full result set. However, this is faster than running the query again without LIMIT, because the result set need not be sent to the client.


And I'm tempted to agree. If you have a massive results set the last thing you would want to do is return the entire thing to the client.

The point is run the query (normally without the "count" and then just do a mysql_num_rows() on the result. That means you only do the query once, and then ask it how many (which is a very low load request). The limit is meaningless in the current query (because the entire database must be considered), so the limit would be just as well done in software as put here.

Moving also to replace the year() and month() parts of the queries with actual start and end dates would also improve the query dramatically.

#35 @ryan
16 years ago

(In [10733]) Add index on taxonomy. see #7415

#36 @janeforshort
16 years ago

  • Milestone changed from 2.8 to Future Release

Punting due to feature freeze. Reconsider with next release.

#37 @Denis-de-Bernardy
16 years ago

  • Milestone changed from Future Release to 2.9

I'm itching to close this one and open new tickets, as it's getting a bit hard to follow due to its length.

I also opened a new ticket on db optimizations (#9642), which I've cross-referenced to here. I'll look into them in 2.9

#38 @aaroncampbell
15 years ago

  • Cc aaroncampbell added

#39 @ryan
15 years ago

  • Milestone changed from 2.9 to Future Release

#40 @simonwheatley
14 years ago

  • Cc simon@… added

#41 @wonderboymusic
12 years ago

  • Keywords changed from database, optimization, slow queries, filesort to database optimization, slow queries, filesort
  • Milestone Future Release deleted
  • Resolution set to wontfix
  • Status changed from new to closed

This ticket hasn't been updated in 3 years and is way too general - individual queries should be targeted in separate tickets.

#42 @wonderboymusic
12 years ago

  • Keywords changed from database optimization, slow queries, filesort to database optimization slow queries, filesort
  • Resolution changed from wontfix to fixed

Changing to "fixed" as per Helen - a lot of these issues are in other tickets as well.

#43 @helenyhou
12 years ago

  • Keywords database optimization slow queries filesort removed
  • Milestone set to 2.8

Only suggested close as fixed on 2.8 because of the existing commits :) Agreed that it's very much a blanket ticket.

Note: See TracTickets for help on using tickets.