Make WordPress Core

Opened 13 years ago

Last modified 6 months ago

#15499 reviewing enhancement

Add an index for get_lastpostmodified query

Reported by: simonwheatley's profile simonwheatley Owned by: olliejones's profile OllieJones
Milestone: Future Release Priority: normal
Severity: normal Version: 3.0.1
Component: Database Keywords: has-patch dev-feedback
Focuses: performance Cc:

Description (last modified by mukesh27)

I had a friend (Jools Wills) look over a WordPress site recently, to get a fresh view on what might be optimised, and he noticed a query which might benefit from an additional index on WP_Posts. The query SELECT post_modified_gmt FROM $wpdb->posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1 in get_lastpostmodified is run for last modified date in GMT, and currently doesn't use an index. This SQL is run whenever certain types of feed are requested as far as I can see.

We added CREATE INDEX type_status_modified ON wp_posts (post_type, post_status, post_modified_gmt); and CREATE INDEX type_status_modified_no_id ON wp_posts (post_type, post_status, post_date_gmt); and the query runs a lot faster now. The following timings were taken running the first query (post_modified_gmt) on a 36,362 row posts table. Note that it doesn't use filesort after the index has been added.

Before:

mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1;
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
| id | select_type | table      | type | possible_keys    | key              | key_len | ref         | rows  | Extra                       |
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
|  1 | SIMPLE      | slgr_posts | ref  | type_status_date | type_status_date | 124     | const,const | 24718 | Using where; Using filesort |
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
1 row in set (0.03 sec)
  • 0.21290683746338ms
  • 0.25690102577209ms
  • 0.230553150177ms
  • 0.2274341583252ms
  • 0.23083996772766ms

After:

mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1;
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
| id | select_type | table      | type | possible_keys                         | key                  | key_len | ref         | rows  | Extra       |
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | slgr_posts | ref  | type_status_date,type_status_modified | type_status_modified | 124     | const,const | 24718 | Using where |
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
1 row in set (0.00 sec)
  • 0.00082707405090332ms
  • 0.00072288513183594ms
  • 0.00074386596679688ms
  • 0.00066494941711426ms
  • 0.00066208839416504ms

In get_lastpostmodified both these queries are run, so the total savings in my case on a quiet server are nearly 0.5 seconds... worth having, I reckon.

I've not created a patch for schema changes before, but I think the only place the change would need to go would be scheme.php? Suggested patch attached.

Attachments (2)

new-keys.diff (569 bytes) - added by simonwheatley 13 years ago.
New indexes for wp_posts
15499.diff (528 bytes) - added by simonwheatley 10 years ago.
Diff to cut the get_lastpostdate call from within get_lastpostmodified

Download all attachments as: .zip

Change History (59)

@simonwheatley
13 years ago

New indexes for wp_posts

#1 @dd32
13 years ago

  • Keywords 3.2-early added
  • Milestone changed from Awaiting Review to Future Release

#2 @pento
12 years ago

  • Keywords 3.2-early removed

These are good keys and should be added.

#3 @nacin
12 years ago

  • Milestone changed from Future Release to 3.5

#4 @simonwheatley
12 years ago

Let me know if there's anything further you need from me for this.

#5 @nacin
11 years ago

I recall barry hesitating for the idea of adding an index to improve a single query.

This query is persistently cached, so this is definitely less of a concern than if we couldn't reasonably cache it.

get_lastpostmodified() is only used in core with 'GMT' (and only for feeds), so I think at most we need to add type_status_modified.

#6 @pento
11 years ago

An index to improve a single query is fine, if that query is run frequently enough.

I don't think using the object cache is a good enough reason to not do this - do that many sites really have a persistent cache? Similarly, I think these indexes would be more helpful on self-hosted sites than they would be harmful on WP.com.

I'm not following why we'd do one but not the other - both queries are run every time get_lastpostmodified() is called.

Finally, nobody likes long read locks. :)

#7 follow-ups: @nacin
11 years ago

I didn't catch that get_lastpostmodified() also calls get_lastpostdate(), triggering the other query. Lame.

Here's the thing. I don't know why. get_lastpostdate() wants to know the newest published post. Since it's published, it can't be in the future. get_lastpostmodified() wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

This compare-modified-and-last-post-date thing was introduced, no joke, in [908]. Probably time it got challenged.

#8 @nacin
11 years ago

  • Keywords 3.6-early added
  • Milestone changed from 3.5 to Future Release

#9 @brokentone
11 years ago

The great majority of the time these functions are called are within a feed context, and in fact, they're called multiple times there. Once, in the WP::send_headers() then once for the feed generation. I'm not sure if this query is caching within WP or mysql at all and thus whether the second call is significant, but I would expect it is significant.

I'm looking to remove the first call, and set the headers later in #22742. If there are hesitations to adding the index, then perhaps reducing the query by 1/2 would be another way to deal with this issue. (Although, without a full investigation, I like the index).

#10 @nacin
10 years ago

See also #25358. Would be nice to fix all of this at some point.

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.


10 years ago

@simonwheatley
10 years ago

Diff to cut the get_lastpostdate call from within get_lastpostmodified

#12 in reply to: ↑ 7 @simonwheatley
10 years ago

Replying to nacin:

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

15499.diff just attached addresses this. The only test which exercises get_lastpostmodified directly, test_channel in the feed group, fails on an earlier (unrelated) assertion.

This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.


10 years ago

#14 @SergeyBiryukov
10 years ago

  • Milestone changed from Future Release to 4.0

#15 in reply to: ↑ 7 @DrewAPicture
10 years ago

  • Keywords commit added

Replying to nacin:

Here's the thing. I don't know why. get_lastpostdate() wants to know the newest published post. Since it's published, it can't be in the future. get_lastpostmodified() wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

Seems logical. 15499.diff still applies cleanly.

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.


10 years ago

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.


10 years ago

#18 @DrewAPicture
10 years ago

  • Keywords 4.1-early added; 3.6-early removed
  • Milestone 4.0 deleted

We've reached the end of line, at least for 4.0. Punting.

#19 @helen
10 years ago

  • Milestone set to Future Release

#20 @nacin
9 years ago

#31171 was marked as a duplicate.

#21 @archon810
9 years ago

  • Focuses performance added

#31171 is indeed a duplicate, but the proposed solution there is far more optimized than the one in this ticket because it ends up with Using where; Using index and a much shorter table scan.

Please consider the index in #31171.

Can we perhaps push this issue to be resolved in 4.2?

Last edited 9 years ago by archon810 (previous) (diff)

#22 follow-ups: @Denis-de-Bernardy
9 years ago

There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:

https://core.trac.wordpress.org/ticket/4575#comment:14

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

#23 in reply to: ↑ 22 @archon810
9 years ago

Replying to Denis-de-Bernardy:

There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:

https://core.trac.wordpress.org/ticket/4575#comment:14

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

If feeds are the only users of this unoptimized function, and by resolving that ticket you can deprecate it, then by all means, I'll be glad to remove the unnecessary index I added to our db. However, if it's only going to solve it for feeds but other places in the code may still use it, then the index should be put in place.

Either way, hopefully this saga comes to its end in 4.2.

Last edited 9 years ago by archon810 (previous) (diff)

#24 @Denis-de-Bernardy
9 years ago

It's the case if a cursory search is anything to go by.

get_lastpostmodified() is used once in each feed template file over in wp-include, and another time in the wp-include/class-wp.php to create an ETag for post feeds.

get_lastcommentmodified() gets an identical treatment.

#25 in reply to: ↑ 22 @mdgl
9 years ago

Replying to Denis-de-Bernardy:

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

I agree it would be much better if we used the last modified date of the data we have actually queried, rather than the last modified date of all posts/comments in the database. As far as I can see, the current code will presently return the wrong value for most category, tag and custom post type feeds (see also #22742).

Having queried the data, a simple loop around the post objects in memory will almost certainly be much faster than querying the database for a second time, regardless of which columns are indexed.

This seems do-able for the calls to get_lastpostmodified() and get_lastcommentmodified() in the various feed templates. Unfortunately, these functions are also used to generate the HTTP headers Last-Modified and ETag for feeds and which are output before we have even run the main query.

What do you think? Is there a safe and efficient way of generating those HTTP header values?

See also #13066.

Last edited 9 years ago by mdgl (previous) (diff)

#28 @McBoof
5 years ago

  • Keywords bulk-reopened added

Hi there,

Just to add something from what I'm seeing. We're running a large multi-tenanted WordPress setup (all on close to the latest 4.x versions at the moment). I've been analyzing the slow query log (showing queries longer than 2 secs on a large but busy MySQL RDS instance).

This query is by far and away the worst culprit across all my databases. It is doing a Table Scan of often over 100,000 rows, sometimes taking 20 seconds or so. Of the top 100 slowest query on my instance today for example (running about 60 databases), rougly 70% of the slowest queries where this one.

From my point of view, it is certainly worth a dedicated index if this query is needed.

#29 @mwidmann
5 years ago

Same issue here. This functionality doesn't behave well with large post tables (in our case 300k+ posts).

There are two changes I'd love to see in post.php:

get_lastpostmodified has a pre and after filter, where especially the pre filter is useful as it allows to provide different data, instead of running this extremely slow query. get_lastpostdate sadly doesn't provide such functionality.

Worse is the issue in _get_last_post_time where when $timezone is set to Blog it selects the value of either post_date or post_modified, which at least for post_date would use the index type_status_date on wp_posts, but then orders by post_{$field}_gmt which skips the index.

#30 @Krstarica
5 years ago

Still horribly slow in WordPress 5.2.3:

# Query_time: 2.345178 Lock_time: 0.000041 Rows_sent: 1 Rows_examined: 542040
SELECT post_date_gmt FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment') ORDER BY post_date_gmt DESC LIMIT 1;

The same result, but 4 times faster (but still slow: 0.65 sec):
SELECT MAX(post_date_gmt) FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment');

#31 in reply to: ↑ 7 @SergeyBiryukov
4 years ago

  • Keywords bulk-reopened removed
  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

Replying to comment:7:

I didn't catch that get_lastpostmodified() also calls get_lastpostdate(), triggering the other query. Lame.

Introduced in [952].

#32 @johnbillion
3 years ago

  • Keywords 4.1-early commit removed

#34 @mukesh27
22 months ago

  • Description modified (diff)

Hi there!

@simonwheatley PR updated against the Trunk version can you please test if it works for you?

@SergeyBiryukov is this ticket is in your to-do list? Do you think it will marge in the upcoming release?

#35 @aliragab
19 months ago

Before the index

# Query_time: 20.442996  Lock_time: 0.000056  Rows_sent: 1  Rows_examined: 214661
# Rows_affected: 0  Bytes_sent: 139
SET timestamp=1660498161;
SELECT post_modified_gmt FROM ***_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment') ORDER BY post_modified_gmt DESC LIMIT 1;

These 20sec droped to only 0.4sec after applying index.

@mukesh27 commented on PR #3117:


17 months ago
#37

Ping @SergeyBiryukov @costdev @craigfrancis for review.

@mukesh27 commented on PR #3117:


16 months ago
#38

The DB performance has been run both before and after the patch. This is the outcome: https://docs.google.com/spreadsheets/d/1o_sALzt-OvpDs7Cba14Ib8Yrz-zeO5UmrM4OCd-65tE/edit#gid=0 of the analysis across various WordPress setup post sets.

You can access the analysis' findings at https://docs.google.com/spreadsheets/d/1o_sALzt-OvpDs7Cba14Ib8Yrz-zeO5UmrM4OCd-65tE/edit#gid=198509179. This demonstrates unequivocally that we significantly improve query executions for medium- to large-sized sites.

For a review, ping @dd32.

craigfrancis commented on PR #3117:


16 months ago
#39

Taking a quick look at this, I think it's a good addition, but I will note that I'm not overly familiar with this query.

As to removing the call to get_lastpostdate(), I'm fairly sure nacin is right, and post_date_gmt shouldn't be greater than post_modified_gmt... but, this change does mean "get_lastpostdate" filter won't be called, which I don't think it will be a problem, but it's worth noting.

---

And just to add the usual notes about adding an INDEX... well used queries often benefit from them, but the caveats are:

  1. Indexes do negatively affect queries that modify the indexed values (i.e. it takes time to update the index, usually this is a very small performance issue, but it can be noticeable when lots of modifications are being performed);
  2. Each index does take up disk and RAM space.
  3. While not relevant in this case (as they cover datetime fields); if the index does not contain enough unique values, the database is unlikely to use it.

@mukesh27 commented on PR #3117:


16 months ago
#40

Thank you for your input, @craigfrancis . Which choice is best in this situation? Should we take into account RAM and disc use instead of adding an index to speed up query execution?

craigfrancis commented on PR #3117:


16 months ago
#41

I suspect you know the WordPress audience better than me... it really depends on how often this function used, if the majority of websites use it (ref send_headers() for feeds, get_feed_build_date() fallback, and maybe extensions?), then they should all benefit from the speed improvement, and would probably be ok with the relatively small disk/ram usage increase... but if it's rarely used, then those affected could add this index themselves (in a similar way to how they will be tuning their install for other queries that are slow for them).

@costdev commented on PR #3117:


16 months ago
#42

Hey @craigfrancis, on the "relatively small disk/ram usage increase", I note that you say "probably be ok". Is that you just being safe, or would you have concerns about merging this PR if you were a committer?

As this PR ultimately hits a lot of Core via send_headers(), get_feed_build_date(), their "Used by" chain, and extender usage, I just want to clarify whether a RAM increase in particular would be enough (cumulatively too) to warrant concern?

craigfrancis commented on PR #3117:


16 months ago
#43

Hi @costdev, I'm just being paranoid.

For your typical website, with a few hundred/thousand records, I doubt a single person will notice, it's when you get to the bigger websites (100,000+ records) there *might* be problems.

I've just created a simplistic test with 100,000 records in a InnoDB table on a fairly quick SSD... to run the ALTER TABLE took 0.9 seconds to add both indexes (ref upgrade.php)... and the SHOW TABLE STATUS reported the Index_length went from 11 MB to 18 MB (~7 MB; and yes, I know, skipping over some details there)... running an UPDATE to change post_modified_gmt = DATE_SUB(NOW(), INTERVAL `id` SECOND) (so each record has a different value), when that's for a specific/single record (e.g. WHERE id = 123) it barely makes a difference in the processing time, but updating every record (which I doubt anyone will do) is ~0.6 without the indexes, and ~2.0 sec with the indexes.

@mukesh27 commented on PR #3117:


16 months ago
#44

I appreciate your feedback and sharing the test findings, @craigfrancis. It is quite beneficial.

My and @costdev understanding of your result is:

  • ALTER TABLE will only be run once - upon installation/upgrade
  • The ~0.6 without, ~2.0 sec with was when updating 100,000 records all at once. How common a case would this be, and would ~1.4 be a terrible wait for someone waiting for 100,000 updates to run?

Should we consider 0.9 seconds to alter the table upon upgrade/installation, and ~1.4 seconds when updating 100,000 records acceptable.

Previously, @pento added an index to wp_options.autoload #24044 which was a good addition and improved performance as per the follow-up comments on the ticket.

If it is acceptable, then it would be good to include it now so we get more time to test it.

craigfrancis commented on PR #3117:


16 months ago
#45

I suspect it will be good to include, but I don't have enough experience with WordPress users to say for certain (my patch had been committed for most of 6.1 development time, and it wasn't until the last RC that someone found some undocumented feature that caused it to be pulled, so I'm just being cautious).

#46 @mukesh27
16 months ago

  • Milestone changed from Future Release to 6.2

Thanks @craigfrancis for confirmation.

Moving to 6.2 for visibility.

@costdev commented on PR #3117:


16 months ago
#47

Thanks @craigfrancis! Let's ping some committers for their thoughts.

@SergeyBiryukov, @peterwilsoncc, @dream-encode

#48 @OllieJones
16 months ago

This looks good to me. About the bulk-update slowdown objection, bulk updates are very rare compared to page views, and aren't generally a good reason to avoid otherwise useful indexes.

I think it's a good idea to make these new index names follow the convention in others, where column names appear in the index names. There's zero penalty for long index names.

 KEY type_status_post_date_gmt (post_type,post_status,post_date_gmt)
 KEY type_status_modified_date_gmt (post_type,post_status,post_modified_gmt)

#49 @rjasdfiii
16 months ago

Some technical info on MySQL. These will explain some of the timing differences mentioned in this thread. And may help in designing INDEXes in the future.

  • "IN ('post', 'page', 'attachment')" is less optimizable than "= 'post'". (Note in the slowlog "Rows examined".)
  • It is best to put columns tested with "=" first in the INDEX. (The order of those columns in the index does not matter.) So, ask which of these is usually tested with "=": post_status or post_type.
  • A "covering index" is one where _all_ the columns that are used _anywhere_ in the SELECT are present in the INDEX. (This explains some of the improved times.) The reason that covering is better is that all the work is done in the B+Tree containing the INDEX; no need to reach over to the data's B+Tree to fetch more columns.
  • INDEX(post_status, post_type, post_modified_gmt, post_date_gmt) would possibly the best overall. Status is first because of "=". For all the mentioned queries, it would be "covering".
  • WHERE post_status = '...' AND post_type = '...' ... and get post_modified_gmt -- This query would be very fast (Rows_examined = 1) with the above index. All other queries would be less efficient -- either due to IN or due to not having the index's columns ordered optimally. But they would be somewhat efficient due to "covering".

@mukesh27 commented on PR #3117:


15 months ago
#50

Thanks @OllieJones. The index name changes look good to me, so I have updated PR with the suggested name.

#51 @mukesh27
15 months ago

Thanks @rjasdfiii for the info. So you are suggesting to use single INDEX instead of two?

INDEX(post_status, post_type, post_modified_gmt, post_date_gmt)

#52 @rjasdfiii
15 months ago

Ollie's two 3-column indexes versus my one 4-column index:

  • It is hard to predict which would be better overall.
  • More indexes has a slight burden on disk space and INSERT/UPDATE time.
  • The longer index is slightly slower for the GMT usage for it.

Without a good enough benchmark to choose between them, I would flip a coin.

This ticket was mentioned in Slack in #core by costdev. View the logs.


14 months ago

#54 @SergeyBiryukov
14 months ago

It's been a long time since I last looked into this, if anyone else would like to own the ticket, happy to transfer :) Otherwise, will try to find some time to review the latest PRs.

Curious, does the index addition needs a $wp_db_version bump?

#55 @OllieJones
14 months ago

Yes, a $wp_db_version bump is needed when adding an index to schema.php.

Conversations about this in @core-performance have tended in the direction of avoiding core changes to indexes specifically for large sites, but rather putting such changes in a a "canonical plugin". (The meaning of "canonical plugin" is not yet defined as of this writing.)

Suggestion: add this to @rjasdfiii and my [reindexing plugin](https://wordpress.org/plugins/index-wp-mysql-for-speed/) rather than core.

At any rate, I'll take responsibility for this if you like.

Workaround -- use wp-cli to add the index like this:

wp db query "ALTER TABLE wp_posts ADD KEY type_status_post_date_gmt (post_status, post_type, post_modified_gmt, post_date_gmt)

(Maybe it's time to develop an EXPLAIN addon to Query Monitor to show query plans.)

This ticket was mentioned in Slack in #core by costdev. View the logs.


14 months ago

#57 @costdev
14 months ago

  • Milestone changed from 6.2 to Future Release

This ticket was discussed during the bug scrub. As discussion about the future of this change is still ongoing, and we're approaching 6.2 Beta 1 in the next few days, I'm moving this to Future Release so that discussion can continue.

Additional props: @audrasjb @petitphp

#58 @SergeyBiryukov
14 months ago

  • Owner changed from SergeyBiryukov to OllieJones

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


6 months ago

Note: See TracTickets for help on using tickets.