WordPress.org

Make WordPress Core

Opened 4 years ago

Last modified 5 weeks ago

#14759 new enhancement

Improve the way oEmbed deals with caching

Reported by: Viper007Bond Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 3.0.1
Component: Embeds Keywords:
Focuses: Cc:

Description (last modified by Viper007Bond)

As Matt pointed to me today, caching oEmbed results to post meta is kinda lame.

I originally did this because I didn't want the HTML in old posts changing on it's own, say for example if the oEmbed provider got compromised. However I think that's extremely unlikely since we only whitelist major websites that can be trusted.

Perhaps instead we should use transients to cache this data. That way when embed HTML is updated, posts will eventually be updated too due to expiring caches. I'm thinking a cache time of a week would be good, but with some random +/- to make sure multiple caches don't all expire at once.

A good example of this is Vimeo. Anyone who embedded a Vimeo embed in their post before a few weeks ago got <object>-based embeds. However now Vimeo gives out <iframe> based ones. The only way currently to update the old embeds to the new HTML (without using a few lines of code) is to edit all of the posts and re-save them to trigger a cache update. That's painful.

Alternate suggestions welcome though.

Attachments (3)

14759.patch (4.0 KB) - added by leewillis77 13 months ago.
Initial patch for discusssion
14759_2.patch (6.3 KB) - added by leewillis77 13 months ago.
Revised patch for discussion
14759.diff (2.3 KB) - added by markjaquith 7 weeks ago.
Attempt at slightly better caching behavior

Download all attachments as: .zip

Change History (45)

comment:1 Viper007Bond4 years ago

  • Description modified (diff)

comment:2 Viper007Bond4 years ago

  • Description modified (diff)

comment:3 nacin4 years ago

I like the idea of these expiring, but my first take is that this storage solution has its own drawbacks. Storing them with postmeta is flexible and is tied to the post, versus having an ever-growing number of transients, which often will be stored in the options table.

From a more practical perspective, that also means new queries for the transient, as they're neither autoloaded like options or queried as part of the post cache like postmeta.

But again, I do like the idea of these expiring and being updated (assuming of course proper oEmbed is returned on background refresh).

comment:4 follow-up: filosofo4 years ago

Why is it lame? For the vast majority of people, the content of a post (and hence its video markup) won't change ever again from a short time after the post is published. It doesn't make much sense to keep refreshing the data as each successive transient expires for something (the post) that remains static over that time.

Actually I think it would be neat to expand that idea and offer a transient-like cache and API specific to each post, so that you can expire the cache only when a post object itself changes.

comment:5 scribu4 years ago

  • Cc scribu added

comment:6 in reply to: ↑ description ; follow-up: Denis-de-Bernardy4 years ago

Replying to Viper007Bond:

A good example of this is Vimeo. Anyone who embedded a Vimeo embed in their post before a few weeks ago got <object>-based embeds. However now Vimeo gives out <iframe> based ones. The only way currently to update the old embeds to the new HTML (without using a few lines of code) is to edit all of the posts and re-save them to trigger a cache update. That's painful.

Alternatively, WP could flush postmeta by key when we know a few that are invalid during upgrades. Btw, is it that much of a big deal? I haven't tested, but I would assume that Vimeo made sure the old oEmbed code would continue to work.

If you need postmeta to expire, the easy approach is to an expire date tied to it in a separate postmeta (much like we do for transients in the options table: _$key_expires). Ideally, the stuff key in question gets detected and its parent key gets stored accordingly in memcache.

I'm in agreement with filosofo though: it makes little sense to keep refreshing the data.

comment:7 in reply to: ↑ 6 ; follow-up: Viper007Bond4 years ago

The other issue is when embeds are used outside of posts, in the sidebar for example. Where do we cache to? Currently it caches sorta randomly (whatever $post happens to be set to at the time).

Replying to Denis-de-Bernardy:

Alternatively, WP could flush postmeta by key when we know a few that are invalid during upgrades. Btw, is it that much of a big deal? I haven't tested, but I would assume that Vimeo made sure the old oEmbed code would continue to work.

We can't flush by service. The meta key is _oembed_md5-hash-of-url-and-args. On post save, _oembed_* are deleted for a post.

Here's some code though that deletes all oEmbed caches: http://core.trac.wordpress.org/attachment/ticket/10337/10337.10.patch

If you need postmeta to expire, the easy approach is to an expire date tied to it in a separate postmeta (much like we do for transients in the options table: _$key_expires). Ideally, the stuff key in question gets detected and its parent key gets stored accordingly in memcache.

That'd require a heavy rewrite of post meta.

comment:8 in reply to: ↑ 7 Denis-de-Bernardy4 years ago

Replying to Viper007Bond:

The other issue is when embeds are used outside of posts, in the sidebar for example. Where do we cache to? Currently it caches sorta randomly (whatever $post happens to be set to at the time).

Don't we have an in_the_loop() function for that? In a sidebar, it would make a lot of sense.

Replying to Denis-de-Bernardy:

Alternatively, WP could flush postmeta by key when we know a few that are invalid during upgrades. Btw, is it that much of a big deal? I haven't tested, but I would assume that Vimeo made sure the old oEmbed code would continue to work.

We can't flush by service. The meta key is _oembed_md5-hash-of-url-and-args. On post save, _oembed_* are deleted for a post.

If memory serves, the ticket I had looked into related to implementing delete_post_meta_by_key() had code to allow the use of a like statement.

Here's some code though that deletes all oEmbed caches: http://core.trac.wordpress.org/attachment/ticket/10337/10337.10.patch

If I may suggest, the huge IN statement could have been replaced by a join or a subquery. It would make the query much faster on larger sites.

If you need postmeta to expire, the easy approach is to an expire date tied to it in a separate postmeta (much like we do for transients in the options table: _$key_expires). Ideally, the stuff key in question gets detected and its parent key gets stored accordingly in memcache.

That'd require a heavy rewrite of post meta.

For the store in memcached part, yes. For the rest, it seems like a one-liner in get_meta() that checks for a separate key before returning the value. Or a similar one liner in the oEmbed code before using the cached value, no?

comment:9 Denis-de-Bernardy4 years ago

Btw, I run into this kind of situation with a number of plugins of mine that cache their output when it's complex to generate. I use transients when out of the loop, and postmeta within it. The hard part is flushing, not storing or retrieving.

In our case we've a reasonably simple approach that works: All oEmbed caches could be flushed on WP upgrades. Seems frequent enough IMO. We've a recent ticket about disabling security filters on the front end to make things faster; surely we don't want to replace them with extra API calls to external services. ;-)

comment:10 nacin3 years ago

  • Milestone changed from Awaiting Review to Future Release

comment:11 in reply to: ↑ 4 r-a-y3 years ago

  • Cc r-a-y@… added

comment:12 Mamaduka19 months ago

  • Cc georgemamadashvili@… added

comment:13 helen14 months ago

Some related discussion on #17210

comment:14 leewillis7713 months ago

#23763 was marked as a duplicate.

comment:15 leewillis7713 months ago

I posted a patch to #23763 with the main aim of supporting cache_age in oEmbed responses, but it seems there's a lot of crossover with this bug so I've closed that in favour of this. Details and patch copied below:

Currently oEmbed results are cached permanently in the postmeta table. These cached results never expire, and will be used forever, or until the WordPress post is updated. This is reasonable behaviour for resources that do not specify a cache_age - although I'd argue we should re-check the oEmbed source occassionally to cope with the case where the original resource is updated / removed etc.

The oEmbed documentation (​http://oembed.com/ - section 2.3.4) lists cache_age as optional - so providers may not provide one. If it is provided, consumers still aren't bound to honour it ("Consumers may choose to use this value or not.") - so the current behaviour is perfectly in line with the spec, but I believe we can do better.

Several of the currently accepted oEmbed providers provide cache_ages (E.g. flickr, hulu, and twitter). Allowing cache_ages also helps when embedding information that can change over time (See ​http://wordpress.org/extend/plugins/github-embed/ and ​http://wordpress.org/extend/plugins/wporg-plugin-embed/ for example).

Patch to follow which:

  • Moves the caching of oEmbed responses from class-wp-embed.php down into class-oembed.php where it has access to the oEmbed response
  • Caches oEmbed HTML in transients rather than postmeta (The same embedded resource will be used no matter which post it is embedded in)
  • Sets a default cache period of 12 hours for oEmbed failures
  • Sets a default cache period of 30 days if no cache_age is specified
  • Sets a cache period according to cache_age if it is specified
  • Adds a new third argument to WP_oEmbed->get_html() and wp_oembed_get() to pass the value of $usecache through from classs-wp-embed.php

More than happy to work this patch up as needed, I'm aware there's still some stuff to tidy up with respect to clearing out the functions that currently drop the postmeta on save/update etc. I'd like to get agreement on the broad approach first though to save duplication of effort.

leewillis7713 months ago

Initial patch for discusssion

comment:16 leewillis7713 months ago

  • Cc junk@… added

comment:17 DrewAPicture13 months ago

  • Cc xoodrew@… added

comment:18 Viper007Bond13 months ago

Transients unfortunately probably aren't a viable solution. The problem is that if you have a persistent object cache enabled, transients are never written to the database, only to memcached or whatever system you have set up. These caches come and go and will almost certainly not last 30 days (likely much, much less). This would result in more frequent cache fetching. This is why I chose post meta in the first place -- it is persistent.

There's also the issue of race conditions. If a cache expires and your site is very popular, you will easily have hundreds of users triggering the cache repopulation before it is completed. This is why caching is done preemptively right now via an AJAX request (after you come back to the edit page after saving). It's only done on-demand via the frontend if that AJAX request somehow fails (or potentially in a race condition).

Asynchronous cache repopulation is probably the solution here, i.e. WP-Cron. It could loop through the post meta entries and regenerate any that have expired.

Post meta isn't a great solution but it's the best one I've come up with. :/

Last edited 13 months ago by Viper007Bond (previous) (diff)

comment:19 leewillis7713 months ago

Thanks for the explanation. With that in mind I think I'm going to re-open my original ticket, and I'll post a revised patch that concentrates solely on respecting cache_age in oEmbed responses, taking into account the concerns here about cache re-population - would that be sensible?

comment:20 helen13 months ago

I think it could all be dealt with together on this ticket. I've been thinking about the flip side - whether we could indicate that a cached oEmbed response is extremely new and doesn't need to be dumped and re-requested (somebody is obsessively fixing small things and updates 10+ times after publishing a post, which kills the Twitter rate limit pretty quickly, and implementation of things like #15490). I guess something indicating the age of a cached response? Not sure.

comment:21 leewillis7713 months ago

The attached patch:

  • Respects cache_age if it is provided by the oEmbed provider
  • Caches for 30 days if not (Should we make this filterable?)
  • Caches failures for 12 hours
  • Stores the oEmbed response in postmeta (_oembed_md5)
  • Stores the expiry time for that embed in postmeta also (_oemexp_md5)
  • Deletes oEmbed data on post update (As per current code)

The missing bit for me currently is how do we preload the oEmbed resource again once it expires (Or is near to expiry). I'm happy to add a cron job that retrieves any postmeta entries that have expired/are due to expire in the next hour(?) and preloads them if we agree that's sensible?

leewillis7713 months ago

Revised patch for discussion

comment:22 SergeyBiryukov13 months ago

Please use tabs rather than spaces for indentation (per WordPress Coding Standards).

comment:23 jtsternberg7 months ago

Using postmeta for caching limits the ability to use oEmbed in places like options panels, user profile pages, and other places where there is no access to a global $post. I'm currently trying to figure out a way to bypass the post ID requirement in order to enable oEmbeds on profile edit page (for a client).

comment:24 jtsternberg7 months ago

  • Cc justin@… added

comment:25 follow-up: sanchothefat5 months ago

  • Cc robert@… added

I recently managed to brick a server by incorrectly assuming that caching would occur in the WP_oEmbed class. Confessional aside I make use of wp_oembed_get() fairly often and have had to implement my own transient based caches similar to Lee's in various plugins.

So the issue is permanence (hence using post meta instead of transients) and using regular options is not possible because it would use up too much memory when auto loaded.

Can I get some opinions on whether it's a horrible idea to think of oembed strings as media that can be stored as text files in the uploads folder?

Here's my thinking:

  • WP_oEmbed checks cache using file_exists (php caches the result in stat cache)
  • Cache hits are returned using $wp_fileSystem->get_contents() or readfile() whichever is quicker
  • Cache misses are requested and written to disk with a name like /uploads/oembeds/%provider%-md5(%url%+%args%).txt
  • cache_age for each provider can be stored as an option
  • A wp-cron task cleans the expired oembed files based on cache_age and filemtime() using glob('%provider%-*')

Is this just a bad idea? It would mean that embeds are available to all sites in a multisite installation. Some data on the number of files this could produce might show it to be impractical unless date based folders could be used.

EDIT: for one of our bigger sites we're talking about 6000 embeds across 30 sites in a network so maybe it'd be better done per site.

Let me know your thoughts on whether this worth coding up or if I should crawl back under my rock :)

Last edited 5 months ago by sanchothefat (previous) (diff)

comment:26 in reply to: ↑ 25 ; follow-ups: rmccue5 months ago

Replying to sanchothefat:

I recently managed to brick a server by incorrectly assuming that caching would occur in the WP_oEmbed class. Confessional aside I make use of wp_oembed_get() fairly often and have had to implement my own transient based caches similar to Lee's in various plugins.

So the issue is permanence (hence using post meta instead of transients) and using regular options is not possible because it would use up too much memory when auto loaded.

add_option() takes an autoload parameter that can be set to 'no' instead, plus this stores it per-site rather than network-wide. Seems better than files to me. :)

comment:27 in reply to: ↑ 26 ; follow-up: Denis-de-Bernardy5 months ago

Replying to rmccue:

add_option() takes an autoload parameter that can be set to 'no' instead, plus this stores it per-site rather than network-wide. Seems better than files to me. :)

But then you're adding extra queries. And storing this kind of stuff in wp_options isn't so great an idea — recall Magpie.

Btw, what's wrong with the idea I suggested earlier, i.e. on WP upgrade, delete all cached oEmbeds? Not by service: all of them. No ifs, no buts. Problem solved if we do that, no?

comment:28 in reply to: ↑ 26 sanchothefat5 months ago

Replying to rmccue:

add_option() takes an autoload parameter that can be set to 'no' instead, plus this stores it per-site rather than network-wide. Seems better than files to me. :)

I'd thought about that but took into account nacin's comment:

I like the idea of these expiring, but my first take is that this storage solution has its own drawbacks. Storing them with postmeta is flexible and is tied to the post, versus having an ever-growing number of transients, which often will be stored in the options table.
From a more practical perspective, that also means new queries for the transient, as they're neither autoloaded like options or queried as part of the post cache like postmeta.

So the goal is to keep queries down as well.

comment:29 in reply to: ↑ 27 sanchothefat5 months ago

Replying to Denis-de-Bernardy:

Replying to rmccue:

add_option() takes an autoload parameter that can be set to 'no' instead, plus this stores it per-site rather than network-wide. Seems better than files to me. :)

But then you're adding extra queries. And storing this kind of stuff in wp_options isn't so great an idea — recall Magpie.

Btw, what's wrong with the idea I suggested earlier, i.e. on WP upgrade, delete all cached oEmbeds? Not by service: all of them. No ifs, no buts. Problem solved if we do that, no?

How would you go about caching within WP_oEmbed? Same as in Lee's patch and passing in $usecache as false in the context of WP_Embed?

comment:30 follow-up: peterbooker5 months ago

Would it be unreasonable to store an expiry time inside the post_meta, and use that to judge expiry instead of deleting all the oembed meta each update?

This would maintain the advantages of storage in post_meta and gain more control over when to refresh the stored content?

comment:31 in reply to: ↑ 30 sanchothefat5 months ago

Replying to peterbooker:

Would it be unreasonable to store an expiry time inside the post_meta, and use that to judge expiry instead of deleting all the oembed meta each update?

This would maintain the advantages of storage in post_meta and gain more control over when to refresh the stored content?

I think the issue is that if you call WP_oEmbed directly using wp_oembed_get() there is no caching - it can be used outside the context of posts so should ideally be cached at a lower level than WP_Embed which is used on post save.

comment:32 follow-up: Viper007Bond5 months ago

Correct, the ideal solution would be a caching system not related to posts but rather the call itself.

Caching to the file system will not work or scale. Many filesystems are not local and reading and writing to filesystems heavily will be slow and not good. Look back at what happened when WordPress tried to store WP_Cache results to the filesystem back around WordPress 2.0 if you don't believe me.

Sadly there isn't really a good solution here, at least one that I've thought of. Post meta was the best I could come up with.

comment:33 in reply to: ↑ 32 ; follow-up: Denis-de-Bernardy5 months ago

Replying to Viper007Bond:

Correct, the ideal solution would be a caching system not related to posts but rather the call itself.

Caching to the file system will not work or scale. Many filesystems are not local and reading and writing to filesystems heavily will be slow and not good. Look back at what happened when WordPress tried to store WP_Cache results to the filesystem back around WordPress 2.0 if you don't believe me.

Sadly there isn't really a good solution here, at least one that I've thought of. Post meta was the best I could come up with.

Methinks it'll remain the best option until we ever mimic what we did with options by creating some kind of post_transient API (and while we're at it, user_transient, comment_transient, etc.).

comment:34 in reply to: ↑ 33 ; follow-ups: sanchothefat5 months ago

Replying to Denis-de-Bernardy:

Replying to Viper007Bond:

Correct, the ideal solution would be a caching system not related to posts but rather the call itself.

Caching to the file system will not work or scale. Many filesystems are not local and reading and writing to filesystems heavily will be slow and not good. Look back at what happened when WordPress tried to store WP_Cache results to the filesystem back around WordPress 2.0 if you don't believe me.

I believe you :)

Sadly there isn't really a good solution here, at least one that I've thought of. Post meta was the best I could come up with.

Methinks it'll remain the best option until we ever mimic what we did with options by creating some kind of post_transient API (and while we're at it, user_transient, comment_transient, etc.).

Rather than leaving it with no optimal solution would it not at least make sense to use transients in so far as it's better than no caching at all?

The only other thing I can think of is that you could register a hidden post type, create a single post in it and use that as the fallback post ID to cache anything coming from a direct wp_oembed_get() call. It seems completely crazy but might be the least disruptive.

Last edited 5 months ago by sanchothefat (previous) (diff)

comment:35 in reply to: ↑ 34 ; follow-up: Denis-de-Bernardy5 months ago

Replying to sanchothefat:

Rather than leaving it with no optimal solution would it not at least make sense to use transients in so far as it's better than no caching at all?

The only other thing I can think of is that you could register a hidden post type, create a single post in it and use that as the fallback post ID to cache anything coming from a direct wp_oembed_get() call. It seems completely crazy but might be the least disruptive.

The current solution is the most sane we have at the moment imo. The only issue is expiring it periodically.

The benefit of post_meta is that it adds no extra queries: post_meta gets queried automatically by the API when a post is loaded by WP (because it's in the loop or because you're fetching any of its meta key/value pairs). This makes it a very sane place to store anything that relates to a specific post.

In contrast, a transient in wp_options yields an extra DB hit for each oembed on a page — whether it's cached or not — on sites with no persistent in-memory storage. Likewise, a custom post type would allow to keep the cache in sync if two or more posts embed the same resource, but this would be at the cost of extra DB queries to actually fetch them. Tbh, I'm suspicious that either approach would be a win in the end.

comment:36 in reply to: ↑ 34 leewillis775 months ago

Replying to sanchothefat:

Rather than leaving it with no optimal solution would it not at least make sense to use transients in so far as it's better than no caching at all?

Not really - transients disappear without notice. Particularly I'm pretty sure that transients get cleared specifically on WP upgrade nowadays - that would leave some epic re-embedding going on which I imagine wouldn't be pretty for sites that embed even reasonable quantities.

The non-guaranteed persistence of transients makes this unworkable.

The only other thing I can think of is that you could register a hidden post type, create a single post in it and use that as the fallback post ID to cache anything coming from a direct wp_oembed_get() call. It seems completely crazy but might be the least disruptive.

I had wondered about child entries in the posts table (In a similar method to "attachments"), but not sure it helps us - there's still no expiry framework built in.

comment:37 in reply to: ↑ 35 ; follow-up: leewillis775 months ago

Replying to Denis-de-Bernardy:

The current solution is the most sane we have at the moment imo. The only issue is expiring it periodically.

Indeed - which I thinks brings us back around to http://core.trac.wordpress.org/ticket/14759#comment:21

Is that a sensible avenue to pursue?

comment:38 in reply to: ↑ 37 Denis-de-Bernardy5 months ago

Replying to leewillis77:

Replying to Denis-de-Bernardy:

The current solution is the most sane we have at the moment imo. The only issue is expiring it periodically.

Indeed - which I thinks brings us back around to http://core.trac.wordpress.org/ticket/14759#comment:21

Is that a sensible avenue to pursue?

I haven't peeked at the patch, but if it does what is described, it seems sensible imo.

Aside: the same functionality could be wrapped in a function called get_post_transient() and, well... that should likely be a separate ticket, but I'm sure you and others see the point.

Last edited 5 months ago by Denis-de-Bernardy (previous) (diff)

comment:39 ircbot8 weeks ago

This ticket was mentioned in IRC in #wordpress-ui by helen. View the logs.

markjaquith7 weeks ago

Attempt at slightly better caching behavior

comment:40 follow-up: markjaquith7 weeks ago

14759.diff is a rough swing.

  • Gets rid of the "delete all oEmbed postmeta caches on post save" concept
  • Triggers regeneration request even for drafts
  • Cache updated only if a valid oEmbed response comes back
  • Doesn't attempt regeneration for draft re-saves if the cache is populated
  • Reduces the stampede possibilities noted in #17210

I still think that the AJAX method is kind of weird. There is still some stampede possibility, for when you update or publish-without-making-a-draft an oembed URL. In that case, both front-end and back-end attempts could be made.

Some thoughts:

  • Is it worth locking the regeneration for a little bit to let the AJAX regeneration to fire?
  • Should we look at an alternative method of firing the regeneration?
  • Should we just do a synchronous oEmbed attempt if updating a published post (just for new URLs)? Would slow things down, but could stop this "front-end requests result in HTTP requests and UPDATE queries" nonsense.
  • Could we recognize oEmbed URLs client-side in the editor and proactively fire AJAX cache requests before the post is even saved with that URL?
  • Should we be doing TLC transients style soft expiration, obeying the oEmbed cache time if it provides one and using a long cache time (a year) otherwise?
  • Should we have a "post transients" API?

comment:41 in reply to: ↑ 40 helen7 weeks ago

Replying to markjaquith:

  • Could we recognize oEmbed URLs client-side in the editor and proactively fire AJAX cache requests before the post is even saved with that URL?

I think this would go a long way toward solving the initial publish problem, and since I also think we should prepare ourselves for a view/preview of sorts even if we don't tackle it this release (although we should probably do at least part of #15490 if we land this soon), we might as well at least do the caching part.

  • Should we be doing TLC transients style soft expiration, obeying the oEmbed cache time if it provides one and using a long cache time (a year) otherwise?

I'm not sure how much I really care about oEmbed expiry, especially if we go with only replacing cache values when a valid response is returned and continue with only doing it on post save (or rather, firing the call to do it after post save).

comment:42 ircbot5 weeks ago

This ticket was mentioned in IRC in #wordpress-dev by helen. View the logs.

Note: See TracTickets for help on using tickets.