
Make WordPress Core

Opened 17 months ago

Last modified 4 weeks ago

#25623 new defect (bug)

The design of update_option etc makes race conditions likely to happen

Reported by: joehoyle
Owned by: (none)
Milestone: Future Release
Priority: normal
Severity: normal
Version: (none)
Component: Cache API
Keywords: reporter-feedback needs-patch
Focuses: (none)
Cc: (none)

Description

After spending a few hours chasing a nasty race condition bug in production today, I thought it would be good to discuss the design of the $alloptions object caching and how it can lead to data loss very easily when using any persistent object caching.

  1. Thread A: Loads options from cache
  2. Thread B: Loads options from cache
  3. Thread A: Updates options to cache
  4. Thread B: Updates options to cache, which includes the "outdated" option loaded in step 2

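The lost update in the sequence above can be sketched as a small simulation. This is a language-agnostic sketch, not WordPress code: the cache dictionary stands in for a persistent object cache (e.g. Memcached), and the `load_alloptions` / `update_option` helpers are hypothetical stand-ins mirroring the behaviour described, where all options live under one `alloptions` key and every update writes the whole blob back.

```python
cache = {}  # stands in for the persistent object cache backend

def load_alloptions():
    # Each "thread" gets its own in-memory copy of the whole options blob.
    return dict(cache.get("alloptions", {}))

def update_option(local_options, name, value):
    # Updating one option writes the entire (possibly stale) blob back.
    local_options[name] = value
    cache["alloptions"] = dict(local_options)

cache["alloptions"] = {"blogname": "old", "siteurl": "http://example.org"}

opts_a = load_alloptions()  # 1. Thread A: loads options from cache
opts_b = load_alloptions()  # 2. Thread B: loads options from cache
update_option(opts_a, "blogname", "new")                 # 3. Thread A: updates
update_option(opts_b, "siteurl", "https://example.org")  # 4. Thread B: updates

# Thread B's stale copy has clobbered Thread A's write:
print(cache["alloptions"]["blogname"])  # → "old"
```

Thread A's `blogname` update is silently lost, even though the two threads touched entirely different options.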
This is not a new problem, and was discussed in some part in #17519.

With most persistent object caching plugins, this problem is not limited to options, as they use an internal (in-memory) cache which is consulted before going to the cache backend (such as memcached) - this means getting / updating anything in the cache will not propagate between threads, as each thread has potentially already populated its internal cache pool with now-stale data.

The "fix" for that scenario is to not use internal cache pools and instead query memcached directly (aside: this is too slow, so not really a solution).

However, options are a much worse case, because a single cache key, alloptions, is used. So updating a single option pushes all (potentially stale) options back to the cache store. To draw a parallel with MySQL, this would be like the MySQL query in update_option pushing every option in memory back to the db.

As for a persistent object caching plugin fixing this internally - I don't see how it could while maintaining the use of an internal cache. It's not so bad with updating a post_meta key, for example, as the "race condition" scenario can only mess up that single post meta entry - with options, a race condition can lead to a lot of data loss.

Having said that - what is the "solution" here? Presumably the options are cached as one large array for performance reasons so it's one cache get / set from memcached / apc etc to load everything in. I am not sure how much of an impact that really has - would having a single cache key for each option make things much worse?

Has anyone else come up against this much before? I know I have run into it a lot in API-driven stuff, as writes can often happen in concurrently running threads.

Change History (8)

comment:1 @jdgrimes 15 months ago

  • Cc jdg@… added

comment:2 follow-up: @wonderboymusic 13 months ago

  • Keywords reporter-feedback added; dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release

Would it be possible for you to create a unit test demonstrating this issue, potentially with a fix?

comment:3 @wonderboymusic 13 months ago

  • Keywords needs-patch added

comment:4 @sphoid 13 months ago

I'm fairly certain I'm actually having this very problem. We have 7 web servers in a production environment using the memcached plugin. I've been thinking for the longest time that there is a bug in php-memcache or in memcached itself, because we intermittently get stubborn stale cache data. But it sounds very likely that cache writes are stepping all over each other, causing old data to get written back to the cache, which resets the expiry for the key. This is a very difficult thing to prove, because it only seems to happen in high-load environments with many servers (this is not a common WP configuration outside of wordpress.com). I'm also thinking the size of the data blob might be a factor, since we only ever experience this with alloptions, which is upwards of 500KB in our environment. I'm watching this ticket with great interest.

comment:5 in reply to: ↑ 2 @joehoyle 13 months ago

Replying to wonderboymusic:

Would it be possible for you to create a unit test demonstrating this issue, potentially with a fix?

IMO it is too early to start writing any code - this was more about opening a discussion to see if this is something that _should_ be tackled. The problem is not necessarily a "bug" in the usual sense that there is something that can be fixed. Running code in parallel is always going to be a problem with WordPress; however, it's more that alloptions makes overwrites a lot more likely. I was hoping to get some dev feedback as to whether this issue is "serious" enough to warrant an upheaval of how the options object caching is handled.

Unit tests would at least be good to demonstrate the side-effects of calling update_option with a race condition. However, I am not sure exactly how we'd do that from PHPUnit with the current setup. One idea: use a remote request back to the install to update an option, then call update_option after the remote script has returned, and observe that the remote request's update_option is lost from the object cache. This requires a web server in the unit tests (which I don't *think* we currently have) and also running the tests with persistent object caching (which I am not sure we currently have).

comment:6 @joehoyle 7 weeks ago

Quick update on this - I took a bash at writing this into the Memcached plugin we use, as a proof of concept. This is at https://github.com/humanmade/memcache-object-cache/pull/1. The tradeoff is that loading WordPress will add about 30 memcached get() calls - depending entirely on your setup, of course. In our environment, that's a tradeoff I'm willing to make to avoid data corruption / deletion, which is something that happens far too often with the current design of alloptions.

comment:7 @joehoyle 5 weeks ago

Ok, new idea: rather than getting rid of alloptions entirely, which would add too many object cache get calls for people using a persistent object cache, this issue would pretty much be eradicated if we just deleted the alloptions cache before each add/delete/update of an option. This would cause a re-fetch of the options from the DB before pushing those options back into the object cache. Though this isn't 100% atomic, it vastly reduces the chances of data corruption, as the possible window for a race condition becomes Fetch DB Rows -> Push to Object Cache rather than Start of Script -> Push to Object Cache, as it currently is (which is a huge issue when running long background tasks etc).
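The delete-then-refetch idea above can be sketched as follows. This is a simplified simulation, not actual WordPress internals: the `db` dict stands in for the authoritative options table, and `update_option` shows the proposed ordering (invalidate the blob, write the row, rebuild the blob from fresh DB state rather than from the thread's in-memory copy).

```python
db = {"blogname": "new", "siteurl": "http://example.org"}  # authoritative rows
cache = {"alloptions": {"blogname": "old", "siteurl": "http://example.org"}}  # stale blob

def update_option(name, value):
    cache.pop("alloptions", None)   # proposed: drop the alloptions blob first
    db[name] = value                # write the single row to the DB
    fresh = dict(db)                # re-fetch current rows from the DB...
    cache["alloptions"] = fresh     # ...and push a fresh blob, not a stale one

update_option("siteurl", "https://example.org")

# The stale in-memory "blogname" was never written back:
print(cache["alloptions"]["blogname"])  # → "new"
```

The race window shrinks to the gap between the re-fetch and the cache set; a write landing in that gap can still be lost, but a blob held since the start of a long-running script no longer can.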

I don't think the performance hit will be significant, as update/add/delete options are already pretty expensive operations so aren't used in "view" contexts.

This should be a pretty straightforward patch - any thoughts before I submit one?

comment:8 @slackbot 4 weeks ago

This ticket was mentioned in Slack in #core by rmccue. View the logs.
