Opened 14 years ago
Last modified 4 weeks ago
#15499 reviewing enhancement
Add an index for get_lastpostmodified query
Reported by: | simonwheatley | Owned by: | OllieJones |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 3.0.1 |
Component: | Database | Keywords: | has-patch dev-feedback |
Focuses: | performance | Cc: |
Description (last modified by )
I had a friend (Jools Wills) look over a WordPress site recently, to get a fresh view on what might be optimised, and he noticed a query which might benefit from an additional index on WP_Posts
. The query SELECT post_modified_gmt FROM $wpdb->posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1
in get_lastpostmodified
is run for last modified date in GMT, and currently doesn't use an index. This SQL is run whenever certain types of feed are requested as far as I can see.
We added CREATE INDEX type_status_modified ON wp_posts (post_type, post_status, post_modified_gmt);
and CREATE INDEX type_status_modified_no_id ON wp_posts (post_type, post_status, post_date_gmt);
and the query runs a lot faster now. The following timings were taken running the first query (post_modified_gmt
) on a 36,362 row posts table. Note that it doesn't use filesort after the index has been added.
Before:
mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1; +----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+ | 1 | SIMPLE | slgr_posts | ref | type_status_date | type_status_date | 124 | const,const | 24718 | Using where; Using filesort | +----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+ 1 row in set (0.03 sec)
- 0.21290683746338ms
- 0.25690102577209ms
- 0.230553150177ms
- 0.2274341583252ms
- 0.23083996772766ms
After:
mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1; +----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+ | 1 | SIMPLE | slgr_posts | ref | type_status_date,type_status_modified | type_status_modified | 124 | const,const | 24718 | Using where | +----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+ 1 row in set (0.00 sec)
- 0.00082707405090332ms
- 0.00072288513183594ms
- 0.00074386596679688ms
- 0.00066494941711426ms
- 0.00066208839416504ms
In get_lastpostmodified
both these queries are run, so the total savings in my case on a quiet server are nearly 0.5 seconds... worth having, I reckon.
I've not created a patch for schema changes before, but I think the only place the change would need to go would be scheme.php
? Suggested patch attached.
Attachments (2)
Change History (61)
#5
@
12 years ago
I recall barry hesitating for the idea of adding an index to improve a single query.
This query is persistently cached, so this is definitely less of a concern than if we couldn't reasonably cache it.
get_lastpostmodified() is only used in core with 'GMT' (and only for feeds), so I think at most we need to add type_status_modified.
#6
@
12 years ago
An index to improve a single query is fine, if that query is run frequently enough.
I don't think using the object cache is a good enough reason to not do this - do that many sites really have a persistent cache? Similarly, I think these indexes would be more helpful on self-hosted sites than they would be harmful on WP.com.
I'm not following why we'd do one but not the other - both queries are run every time get_lastpostmodified() is called.
Finally, nobody likes long read locks. :)
#7
follow-ups:
↓ 12
↓ 15
↓ 31
@
12 years ago
I didn't catch that get_lastpostmodified() also calls get_lastpostdate(), triggering the other query. Lame.
Here's the thing. I don't know why. get_lastpostdate() wants to know the newest published post. Since it's published, it can't be in the future. get_lastpostmodified() wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.
I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.
This compare-modified-and-last-post-date thing was introduced, no joke, in [908]. Probably time it got challenged.
#9
@
12 years ago
The great majority of the time these functions are called are within a feed context, and in fact, they're called multiple times there. Once, in the WP::send_headers() then once for the feed generation. I'm not sure if this query is caching within WP or mysql at all and thus whether the second call is significant, but I would expect it is significant.
I'm looking to remove the first call, and set the headers later in #22742. If there are hesitations to adding the index, then perhaps reducing the query by 1/2 would be another way to deal with this issue. (Although, without a full investigation, I like the index).
This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.
10 years ago
#12
in reply to:
↑ 7
@
10 years ago
Replying to nacin:
I think we can just axe the call to
get_lastpostdate()
fromget_lastpostmodified()
for the same amount of functionality.
15499.diff just attached addresses this. The only test which exercises get_lastpostmodified
directly, test_channel
in the feed
group, fails on an earlier (unrelated) assertion.
This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.
10 years ago
#15
in reply to:
↑ 7
@
10 years ago
- Keywords commit added
Replying to nacin:
Here's the thing. I don't know why.
get_lastpostdate()
wants to know the newest published post. Since it's published, it can't be in the future.get_lastpostmodified()
wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.
I think we can just axe the call to
get_lastpostdate()
fromget_lastpostmodified()
for the same amount of functionality.
Seems logical. 15499.diff still applies cleanly.
This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.
10 years ago
This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.
10 years ago
#18
@
10 years ago
- Keywords 4.1-early added; 3.6-early removed
- Milestone 4.0 deleted
We've reached the end of line, at least for 4.0. Punting.
#21
@
10 years ago
- Focuses performance added
#22
follow-ups:
↓ 23
↓ 25
@
10 years ago
There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:
https://core.trac.wordpress.org/ticket/4575#comment:14
In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).
#23
in reply to:
↑ 22
@
10 years ago
Replying to Denis-de-Bernardy:
There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:
https://core.trac.wordpress.org/ticket/4575#comment:14
In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).
If feeds are the only users of this unoptimized function, and by resolving that ticket you can deprecate it, then by all means, I'll be glad to remove the unnecessary index I added to our db. However, if it's only going to solve it for feeds but other places in the code may still use it, then the index should be put in place.
Either way, hopefully this saga comes to its end in 4.2.
#24
@
10 years ago
It's the case if a cursory search is anything to go by.
get_lastpostmodified()
is used once in each feed template file over in wp-include, and another time in the wp-include/class-wp.php to create an ETag for post feeds.
get_lastcommentmodified()
gets an identical treatment.
#25
in reply to:
↑ 22
@
10 years ago
Replying to Denis-de-Bernardy:
In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).
I agree it would be much better if we used the last modified date of the data we have actually queried, rather than the last modified date of all posts/comments in the database. As far as I can see, the current code will presently return the wrong value for most category, tag and custom post type feeds (see also #22742).
Having queried the data, a simple loop around the post objects in memory will almost certainly be much faster than querying the database for a second time, regardless of which columns are indexed.
This seems do-able for the calls to get_lastpostmodified()
and get_lastcommentmodified()
in the various feed templates. Unfortunately, these functions are also used to generate the HTTP headers Last-Modified
and ETag
for feeds and which are output before we have even run the main query.
What do you think? Is there a safe and efficient way of generating those HTTP header values?
See also #13066.
#28
@
6 years ago
- Keywords bulk-reopened added
Hi there,
Just to add something from what I'm seeing. We're running a large multi-tenanted WordPress setup (all on close to the latest 4.x versions at the moment). I've been analyzing the slow query log (showing queries longer than 2 secs on a large but busy MySQL RDS instance).
This query is by far and away the worst culprit across all my databases. It is doing a Table Scan of often over 100,000 rows, sometimes taking 20 seconds or so. Of the top 100 slowest query on my instance today for example (running about 60 databases), rougly 70% of the slowest queries where this one.
From my point of view, it is certainly worth a dedicated index if this query is needed.
#29
@
5 years ago
Same issue here. This functionality doesn't behave well with large post tables (in our case 300k+ posts).
There are two changes I'd love to see in post.php
:
get_lastpostmodified
has a pre and after filter, where especially the pre filter is useful as it allows to provide different data, instead of running this extremely slow query. get_lastpostdate
sadly doesn't provide such functionality.
Worse is the issue in _get_last_post_time
where when $timezone
is set to Blog
it selects the value of either post_date
or post_modified
, which at least for post_date would use the index type_status_date
on wp_posts, but then orders by post_{$field}_gmt
which skips the index.
#30
@
5 years ago
Still horribly slow in WordPress 5.2.3:
# Query_time: 2.345178 Lock_time: 0.000041 Rows_sent: 1 Rows_examined: 542040
SELECT post_date_gmt FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment') ORDER BY post_date_gmt DESC LIMIT 1;
The same result, but 4 times faster (but still slow: 0.65 sec):
SELECT MAX(post_date_gmt) FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment');
#31
in reply to:
↑ 7
@
5 years ago
- Keywords bulk-reopened removed
- Owner set to SergeyBiryukov
- Status changed from new to reviewing
This ticket was mentioned in PR #2805 on WordPress/wordpress-develop by mukeshpanchal27.
2 years ago
#33
Trac ticket: https://core.trac.wordpress.org/ticket/15499
#34
@
2 years ago
- Description modified (diff)
Hi there!
@simonwheatley PR updated against the Trunk version can you please test if it works for you?
@SergeyBiryukov is this ticket is in your to-do list? Do you think it will marge in the upcoming release?
#35
@
2 years ago
Before the index
# Query_time: 20.442996 Lock_time: 0.000056 Rows_sent: 1 Rows_examined: 214661 # Rows_affected: 0 Bytes_sent: 139 SET timestamp=1660498161; SELECT post_modified_gmt FROM ***_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment') ORDER BY post_modified_gmt DESC LIMIT 1;
These 20sec droped to only 0.4sec after applying index.
This ticket was mentioned in PR #3117 on WordPress/wordpress-develop by mukeshpanchal27.
2 years ago
#36
Trac ticket: https://core.trac.wordpress.org/ticket/15499
@mukesh27 commented on PR #3117:
23 months ago
#37
Ping @SergeyBiryukov @costdev @craigfrancis for review.
@mukesh27 commented on PR #3117:
22 months ago
#38
The DB performance has been run both before and after the patch. This is the outcome: https://docs.google.com/spreadsheets/d/1o_sALzt-OvpDs7Cba14Ib8Yrz-zeO5UmrM4OCd-65tE/edit#gid=0 of the analysis across various WordPress setup post sets.
You can access the analysis' findings at https://docs.google.com/spreadsheets/d/1o_sALzt-OvpDs7Cba14Ib8Yrz-zeO5UmrM4OCd-65tE/edit#gid=198509179. This demonstrates unequivocally that we significantly improve query executions for medium- to large-sized sites.
For a review, ping @dd32.
craigfrancis commented on PR #3117:
22 months ago
#39
Taking a quick look at this, I think it's a good addition, but I will note that I'm not overly familiar with this query.
As to removing the call to get_lastpostdate()
, I'm fairly sure nacin is right, and post_date_gmt
shouldn't be greater than post_modified_gmt
... but, this change does mean "get_lastpostdate" filter won't be called, which I don't think it will be a problem, but it's worth noting.
---
And just to add the usual notes about adding an INDEX... well used queries often benefit from them, but the caveats are:
- Indexes do negatively affect queries that modify the indexed values (i.e. it takes time to update the index, usually this is a very small performance issue, but it can be noticeable when lots of modifications are being performed);
- Each index does take up disk and RAM space.
- While not relevant in this case (as they cover datetime fields); if the index does not contain enough unique values, the database is unlikely to use it.
@mukesh27 commented on PR #3117:
22 months ago
#40
Thank you for your input, @craigfrancis . Which choice is best in this situation? Should we take into account RAM and disc use instead of adding an index to speed up query execution?
craigfrancis commented on PR #3117:
22 months ago
#41
I suspect you know the WordPress audience better than me... it really depends on how often this function used, if the majority of websites use it (ref send_headers() for feeds, get_feed_build_date() fallback, and maybe extensions?), then they should all benefit from the speed improvement, and would probably be ok with the relatively small disk/ram usage increase... but if it's rarely used, then those affected could add this index themselves (in a similar way to how they will be tuning their install for other queries that are slow for them).
22 months ago
#42
Hey @craigfrancis, on the "relatively small disk/ram usage increase", I note that you say "probably be ok". Is that you just being safe, or would you have concerns about merging this PR if you were a committer?
As this PR ultimately hits a lot of Core via send_headers()
, get_feed_build_date()
, their "Used by" chain, and extender usage, I just want to clarify whether a RAM increase in particular would be enough (cumulatively too) to warrant concern?
craigfrancis commented on PR #3117:
22 months ago
#43
Hi @costdev, I'm just being paranoid.
For your typical website, with a few hundred/thousand records, I doubt a single person will notice, it's when you get to the bigger websites (100,000+ records) there *might* be problems.
I've just created a simplistic test with 100,000 records in a InnoDB table on a fairly quick SSD... to run the ALTER TABLE
took 0.9 seconds to add both indexes (ref upgrade.php
)... and the SHOW TABLE STATUS
reported the Index_length
went from 11 MB to 18 MB (~7 MB; and yes, I know, skipping over some details there)... running an UPDATE to change post_modified_gmt = DATE_SUB(NOW(), INTERVAL `id` SECOND)
(so each record has a different value), when that's for a specific/single record (e.g. WHERE id = 123
) it barely makes a difference in the processing time, but updating every record (which I doubt anyone will do) is ~0.6 without the indexes, and ~2.0 sec with the indexes.
@mukesh27 commented on PR #3117:
22 months ago
#44
I appreciate your feedback and sharing the test findings, @craigfrancis. It is quite beneficial.
My and @costdev understanding of your result is:
ALTER TABLE
will only be run once - upon installation/upgrade- The
~0.6 without, ~2.0 sec
with was when updating 100,000 records all at once. How common a case would this be, and would ~1.4 be a terrible wait for someone waiting for 100,000 updates to run?
Should we consider 0.9 seconds
to alter the table upon upgrade/installation, and ~1.4 seconds
when updating 100,000 records acceptable.
Previously, @pento added an index to wp_options.autoload
#24044 which was a good addition and improved performance as per the follow-up comments on the ticket.
If it is acceptable, then it would be good to include it now so we get more time to test it.
craigfrancis commented on PR #3117:
22 months ago
#45
I suspect it will be good to include, but I don't have enough experience with WordPress users to say for certain (my patch had been committed for most of 6.1 development time, and it wasn't until the last RC that someone found some undocumented feature that caused it to be pulled, so I'm just being cautious).
#46
@
22 months ago
- Milestone changed from Future Release to 6.2
Thanks @craigfrancis for confirmation.
Moving to 6.2
for visibility.
22 months ago
#47
Thanks @craigfrancis! Let's ping some committers for their thoughts.
@SergeyBiryukov, @peterwilsoncc, @dream-encode
#48
@
22 months ago
This looks good to me. About the bulk-update slowdown objection, bulk updates are very rare compared to page views, and aren't generally a good reason to avoid otherwise useful indexes.
I think it's a good idea to make these new index names follow the convention in others, where column names appear in the index names. There's zero penalty for long index names.
KEY type_status_post_date_gmt (post_type,post_status,post_date_gmt) KEY type_status_modified_date_gmt (post_type,post_status,post_modified_gmt)
#49
@
22 months ago
Some technical info on MySQL. These will explain some of the timing differences mentioned in this thread. And may help in designing INDEXes in the future.
- "IN ('post', 'page', 'attachment')" is less optimizable than "= 'post'". (Note in the slowlog "Rows examined".)
- It is best to put columns tested with "=" first in the INDEX. (The order of those columns in the index does not matter.) So, ask which of these is usually tested with "=": post_status or post_type.
- A "covering index" is one where _all_ the columns that are used _anywhere_ in the SELECT are present in the INDEX. (This explains some of the improved times.) The reason that covering is better is that all the work is done in the B+Tree containing the INDEX; no need to reach over to the data's B+Tree to fetch more columns.
- INDEX(post_status, post_type, post_modified_gmt, post_date_gmt) would possibly the best overall. Status is first because of "=". For all the mentioned queries, it would be "covering".
- WHERE post_status = '...' AND post_type = '...' ... and get post_modified_gmt -- This query would be very fast (Rows_examined = 1) with the above index. All other queries would be less efficient -- either due to IN or due to not having the index's columns ordered optimally. But they would be somewhat efficient due to "covering".
@mukesh27 commented on PR #3117:
22 months ago
#50
Thanks @OllieJones. The index name changes look good to me, so I have updated PR with the suggested name.
#51
@
22 months ago
Thanks @rjasdfiii for the info. So you are suggesting to use single INDEX instead of two?
INDEX(post_status, post_type, post_modified_gmt, post_date_gmt)
#52
@
21 months ago
Ollie's two 3-column indexes versus my one 4-column index:
- It is hard to predict which would be better overall.
- More indexes has a slight burden on disk space and INSERT/UPDATE time.
- The longer index is slightly slower for the GMT usage for it.
Without a good enough benchmark to choose between them, I would flip a coin.
This ticket was mentioned in Slack in #core by costdev. View the logs.
20 months ago
#54
@
20 months ago
It's been a long time since I last looked into this, if anyone else would like to own the ticket, happy to transfer :) Otherwise, will try to find some time to review the latest PRs.
Curious, does the index addition needs a $wp_db_version
bump?
#55
@
20 months ago
Yes, a $wp_db_version
bump is needed when adding an index to schema.php.
Conversations about this in @core-performance have tended in the direction of avoiding core changes to indexes specifically for large sites, but rather putting such changes in a a "canonical plugin". (The meaning of "canonical plugin" is not yet defined as of this writing.)
Suggestion: add this to @rjasdfiii and my [reindexing plugin](https://wordpress.org/plugins/index-wp-mysql-for-speed/) rather than core.
At any rate, I'll take responsibility for this if you like.
Workaround -- use wp-cli to add the index like this:
wp db query "ALTER TABLE wp_posts ADD KEY type_status_post_date_gmt (post_status, post_type, post_modified_gmt, post_date_gmt)
(Maybe it's time to develop an EXPLAIN addon to Query Monitor to show query plans.)
This ticket was mentioned in Slack in #core by costdev. View the logs.
20 months ago
#57
@
20 months ago
- Milestone changed from 6.2 to Future Release
This ticket was discussed during the bug scrub. As discussion about the future of this change is still ongoing, and we're approaching 6.2 Beta 1 in the next few days, I'm moving this to Future Release so that discussion can continue.
Additional props: @audrasjb @petitphp
This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.
12 months ago
#60
@
5 months ago
Hello everyone has this issue been resolved in the latest wordpress version? If it causes lag, why don't we just remove this function?
New indexes for wp_posts