Fix feed inefficiencies - Possibly Extend Etag/Last-Modified/Conditional Get functionality
|Reported by:||brokentone||Owned by:|
Description (last modified by SergeyBiryukov)
First the story.
Attempting to generate what are effectively feeds on our WP site, we've been using the WP rewrite functions in a page/template context for some time, but recently it's made more sense to use the feed context for this. It seems as if there is some liability to using pages--requires a page to be created and not changed by non-dev users and requires a coordinated database and code solution to create. And it introduces some complexity/mess with creating template files for feed logic/display, and again, these dummy pages have to exist. Finally--assigning this feed to a page required the WP URL canonical functionality to occur, so I couldn't use "feed.json", I had to use "feed.json/" which is just not cool. An entirely code-based solution that effectively rewrites a URL pattern to a function is perfect--which is effectively what the feed context is.
As an aside--for my purposes could I use actual Apache rewrites? Well sure, but I am using WP functions to retrieve the data for my feed--in which case I would have to separately bootstrap my function, and I would assume incur fairly similar performance implications. Also, adding a feed still feels like something that should be code-addressable to me and rolling rewrites to all of our production web servers isn't necessarily fun. I'll be benchmarking this and get back to you though.
In my page->feed conversion, I was curious what the performance implications were. I figured with less files and functions (related to the theme layer) involved, it could only be better using a feed. Not so, it was worse at about 5x to run the same feed creating logic. I sat down with xdebug and query logging to figure out what was going on and found that there were a few extra queries being run on the feed side:
SELECT post_modified_gmt FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment', 'story') ORDER BY post_modified_gmt DESC LIMIT 1; SELECT post_date_gmt FROM wp_posts WHERE post_status = 'publish' AND post_type IN ('post', 'page', 'attachment', 'story') ORDER BY post_date_gmt DESC LIMIT 1; SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND wp_posts.post_type = 'post' AND (wp_posts.post_status = 'publish' OR wp_posts.post_status = 'private') ORDER BY wp_posts.post_date DESC LIMIT 0, 50; SELECT FOUND_ROWS();
These are coming from a branch in /wp-includes/class-wp.php WP::send_headers(). It's doing very basic logic to determine via the query string if we expect this request to be a feed, then it uses either get_lastcommentmodified() or get_lastpostmodified() and uses the result of that to set and send some headers for last modified and etag, and evaluate request values for conditional get (304) responses.
For my application--creating arbitrary feeds actually based on WP options--the last time any post or comment was updated/added has no correlation to my feed. It may not be updated at the point a post is updated, but it would show as being updated--more dangerously, it would show as not being updated and return a 304 if a post hasn't been updated. Also, it's incurring overhead doing this lookup that has absolutely no value.
How about the most common WP core uses? I'm thinking it's still going to result in a lot of false positives (saying it's updated more recently than it really is) that certainly is going to depend on number of post types, but even with say a default install with pages and posts, any time a page is updated (which is not in any default feed), it will be affecting the default posts feed.
There is no hook to allow for the disabling or changing of this logic. And, there is no way to apply some form of this logic to heavier page generations that might be kind of nice.
- Add a hook to this send headers function that would allow it to be fully disabled for certain feeds.
- Move the functionality to a discrete function that can be passed the actual page/feed updated time and it can then send the proper headers. Then the default feed generation functions can call back to this functionality when they know the actual updated dates.
I actually favor the latter--this is kinda nice functionality that is trapped inside a less useful branch and is "stuck on" even when it's not helpful. I would be happy to help write a patch for some of this if some core people can jump on here and provide some direction and blessing.
I've noticed a little discussion on this issue elsewhere:
#19466 - This person is attempting to apply the existing logic--page/comment last modified across every request which would be a massive mistake. Assuming that's why there has been no response.
#15499 - This person is suggesting an index to get the results quicker--that could be beneficial additionally, but perhaps my modifications could reduce the need for such an index.