WordPress.org

Make WordPress Core

Opened 9 years ago

Last modified 2 months ago

#15499 new enhancement

Add an index for get_lastpostmodified query

Reported by: simonwheatley Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 3.0.1
Component: Database Keywords: has-patch dev-feedback 4.1-early commit bulk-reopened
Focuses: performance Cc:
PR Number:

Description

I had a friend (Jools Wills) look over a WordPress site recently, to get a fresh view on what might be optimised, and he noticed a query which might benefit from an additional index on WP_Posts. The query SELECT post_modified_gmt FROM $wpdb->posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1 in get_lastpostmodified is run for last modified date in GMT, and currently doesn't use an index. This SQL is run whenever certain types of feed are requested as far as I can see.

We added CREATE INDEX type_status_modified ON wp_posts (post_type, post_status, post_modified_gmt); and CREATE INDEX type_status_modified_no_id ON wp_posts (post_type, post_status, post_date_gmt); and the query runs a lot faster now. The following timings were taken running the first query (post_modified_gmt) on a 36,362 row posts table. Note that it doesn't use filesort after the index has been added.

Before:

mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1;
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
| id | select_type | table      | type | possible_keys    | key              | key_len | ref         | rows  | Extra                       |
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
|  1 | SIMPLE      | slgr_posts | ref  | type_status_date | type_status_date | 124     | const,const | 24718 | Using where; Using filesort |
+----+-------------+------------+------+------------------+------------------+---------+-------------+-------+-----------------------------+
1 row in set (0.03 sec)
  • 0.21290683746338ms
  • 0.25690102577209ms
  • 0.230553150177ms
  • 0.2274341583252ms
  • 0.23083996772766ms

After:

mysql> EXPLAIN SELECT post_modified_gmt FROM slgr_posts WHERE post_status = 'publish' AND post_type = 'post' ORDER BY post_modified_gmt DESC LIMIT 1;
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
| id | select_type | table      | type | possible_keys                         | key                  | key_len | ref         | rows  | Extra       |
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | slgr_posts | ref  | type_status_date,type_status_modified | type_status_modified | 124     | const,const | 24718 | Using where |
+----+-------------+------------+------+---------------------------------------+----------------------+---------+-------------+-------+-------------+
1 row in set (0.00 sec)
  • 0.00082707405090332ms
  • 0.00072288513183594ms
  • 0.00074386596679688ms
  • 0.00066494941711426ms
  • 0.00066208839416504ms

In get_lastpostmodified both these queries are run, so the total savings in my case on a quiet server are nearly 0.5 seconds... worth having, I reckon.

I've not created a patch for schema changes before, but I think the only place the change would need to go would be scheme.php? Suggested patch attached.

Attachments (2)

new-keys.diff (569 bytes) - added by simonwheatley 9 years ago.
New indexes for wp_posts
15499.diff (528 bytes) - added by simonwheatley 5 years ago.
Diff to cut the get_lastpostdate call from within get_lastpostmodified

Download all attachments as: .zip

Change History (30)

@simonwheatley
9 years ago

New indexes for wp_posts

#1 @dd32
9 years ago

  • Keywords 3.2-early added
  • Milestone changed from Awaiting Review to Future Release

#2 @pento
7 years ago

  • Keywords 3.2-early removed

These are good keys and should be added.

#3 @nacin
7 years ago

  • Milestone changed from Future Release to 3.5

#4 @simonwheatley
7 years ago

Let me know if there's anything further you need from me for this.

#5 @nacin
7 years ago

I recall barry hesitating for the idea of adding an index to improve a single query.

This query is persistently cached, so this is definitely less of a concern than if we couldn't reasonably cache it.

get_lastpostmodified() is only used in core with 'GMT' (and only for feeds), so I think at most we need to add type_status_modified.

#6 @pento
7 years ago

An index to improve a single query is fine, if that query is run frequently enough.

I don't think using the object cache is a good enough reason to not do this - do that many sites really have a persistent cache? Similarly, I think these indexes would be more helpful on self-hosted sites than they would be harmful on WP.com.

I'm not following why we'd do one but not the other - both queries are run every time get_lastpostmodified() is called.

Finally, nobody likes long read locks. :)

#7 follow-ups: @nacin
7 years ago

I didn't catch that get_lastpostmodified() also calls get_lastpostdate(), triggering the other query. Lame.

Here's the thing. I don't know why. get_lastpostdate() wants to know the newest published post. Since it's published, it can't be in the future. get_lastpostmodified() wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

This compare-modified-and-last-post-date thing was introduced, no joke, in [908]. Probably time it got challenged.

#8 @nacin
7 years ago

  • Keywords 3.6-early added
  • Milestone changed from 3.5 to Future Release

#9 @brokentone
7 years ago

The great majority of the time these functions are called are within a feed context, and in fact, they're called multiple times there. Once, in the WP::send_headers() then once for the feed generation. I'm not sure if this query is caching within WP or mysql at all and thus whether the second call is significant, but I would expect it is significant.

I'm looking to remove the first call, and set the headers later in #22742. If there are hesitations to adding the index, then perhaps reducing the query by 1/2 would be another way to deal with this issue. (Although, without a full investigation, I like the index).

#10 @nacin
6 years ago

See also #25358. Would be nice to fix all of this at some point.

Last edited 6 years ago by SergeyBiryukov (previous) (diff)

This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.


5 years ago

@simonwheatley
5 years ago

Diff to cut the get_lastpostdate call from within get_lastpostmodified

#12 in reply to: ↑ 7 @simonwheatley
5 years ago

Replying to nacin:

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

15499.diff just attached addresses this. The only test which exercises get_lastpostmodified directly, test_channel in the feed group, fails on an earlier (unrelated) assertion.

This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.


5 years ago

#14 @SergeyBiryukov
5 years ago

  • Milestone changed from Future Release to 4.0

#15 in reply to: ↑ 7 @DrewAPicture
5 years ago

  • Keywords commit added

Replying to nacin:

Here's the thing. I don't know why. get_lastpostdate() wants to know the newest published post. Since it's published, it can't be in the future. get_lastpostmodified() wants to know the latest modified date. There is never going to be a situation that I can think of (aside from manually messing around with the values) where the latest post_modified date in a table is going to be less than the latest post_date in a table.

I think we can just axe the call to get_lastpostdate() from get_lastpostmodified() for the same amount of functionality.

Seems logical. 15499.diff still applies cleanly.

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.


5 years ago

This ticket was mentioned in IRC in #wordpress-dev by DrewAPicture. View the logs.


5 years ago

#18 @DrewAPicture
5 years ago

  • Keywords 4.1-early added; 3.6-early removed
  • Milestone 4.0 deleted

We've reached the end of line, at least for 4.0. Punting.

#19 @helen
5 years ago

  • Milestone set to Future Release

#20 @nacin
5 years ago

#31171 was marked as a duplicate.

#21 @archon810
5 years ago

  • Focuses performance added

#31171 is indeed a duplicate, but the proposed solution there is far more optimized than the one in this ticket because it ends up with Using where; Using index and a much shorter table scan.

Please consider the index in #31171.

Can we perhaps push this issue to be resolved in 4.2?

Last edited 5 years ago by archon810 (previous) (diff)

#22 follow-ups: @Denis-de-Bernardy
5 years ago

There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:

https://core.trac.wordpress.org/ticket/4575#comment:14

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

#23 in reply to: ↑ 22 @archon810
5 years ago

Replying to Denis-de-Bernardy:

There's another salient ticket over at #4575. The need for a last modified date is based on the feed, and what could be done instead of adding an index is this:

https://core.trac.wordpress.org/ticket/4575#comment:14

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

If feeds are the only users of this unoptimized function, and by resolving that ticket you can deprecate it, then by all means, I'll be glad to remove the unnecessary index. However, if it's only going to solve it for feeds but other places in the code may still use it, then the index should be put in place.

Either way, hopefully this saga comes to its end in 4.2.

Version 1, edited 5 years ago by archon810 (previous) (next) (diff)

#24 @Denis-de-Bernardy
5 years ago

It's the case if a cursory search is anything to go by.

get_lastpostmodified() is used once in each feed template file over in wp-include, and another time in the wp-include/class-wp.php to create an ETag for post feeds.

get_lastcommentmodified() gets an identical treatment.

#25 in reply to: ↑ 22 @mdgl
5 years ago

Replying to Denis-de-Bernardy:

In essence, rather than serving the absolute last modified, we could fetch the posts or comments that appear in a feed, and then simply loop through the array to find their last modified date (which would then indeed concern that specific feed, and that only, solving this ticket and #4575 in one go).

I agree it would be much better if we used the last modified date of the data we have actually queried, rather than the last modified date of all posts/comments in the database. As far as I can see, the current code will presently return the wrong value for most category, tag and custom post type feeds (see also #22742).

Having queried the data, a simple loop around the post objects in memory will almost certainly be much faster than querying the database for a second time, regardless of which columns are indexed.

This seems do-able for the calls to get_lastpostmodified() and get_lastcommentmodified() in the various feed templates. Unfortunately, these functions are also used to generate the HTTP headers Last-Modified and ETag for feeds and which are output before we have even run the main query.

What do you think? Is there a safe and efficient way of generating those HTTP header values?

See also #13066.

Last edited 5 years ago by mdgl (previous) (diff)

#28 @McBoof
8 months ago

  • Keywords bulk-reopened added

Hi there,

Just to add something from what I'm seeing. We're running a large multi-tenanted WordPress setup (all on close to the latest 4.x versions at the moment). I've been analyzing the slow query log (showing queries longer than 2 secs on a large but busy MySQL RDS instance).

This query is by far and away the worst culprit across all my databases. It is doing a Table Scan of often over 100,000 rows, sometimes taking 20 seconds or so. Of the top 100 slowest query on my instance today for example (running about 60 databases), rougly 70% of the slowest queries where this one.

From my point of view, it is certainly worth a dedicated index if this query is needed.

#29 @mwidmann
7 months ago

Same issue here. This functionality doesn't behave well with large post tables (in our case 300k+ posts).

There are two changes I'd love to see in post.php:

get_lastpostmodified has a pre and after filter, where especially the pre filter is useful as it allows to provide different data, instead of running this extremely slow query. get_lastpostdate sadly doesn't provide such functionality.

Worse is the issue in _get_last_post_time where when $timezone is set to Blog it selects the value of either post_date or post_modified, which at least for post_date would use the index type_status_date on wp_posts, but then orders by post_{$field}_gmt which skips the index.

#30 @Krstarica
2 months ago

Still horribly slow in WordPress 5.2.3:

# Query_time: 2.345178 Lock_time: 0.000041 Rows_sent: 1 Rows_examined: 542040
SELECT post_date_gmt FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment') ORDER BY post_date_gmt DESC LIMIT 1;

The same result, but 4 times faster (but still slow: 0.65 sec):
SELECT MAX(post_date_gmt) FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment');

Note: See TracTickets for help on using tickets.