Make WordPress Core

Opened 11 years ago

Closed 9 years ago

Last modified 9 years ago

#25623 closed defect (bug) (duplicate)

The design of update_option etc makes race conditions likely to happen

Reported by: joehoyle
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Cache API
Keywords:
Focuses:
Cc:

Description

After spending a few hours chasing a nasty race condition bug in production today, I thought it would be good to discuss the design of the $alloptions object caching and how easily it can lead to data loss when using any persistent object cache:

  1. Thread A: Loads options from cache
  2. Thread B: Loads options from cache
  3. Thread A: Updates options to cache
  4. Thread B: Updates options to cache, which includes the "outdated" option loaded in step 2

This is not a new problem, and was discussed in part in #17519.

With most persistent object caching plugins, this problem is not limited to options: they use an internal (in-memory) cache which is consulted before going to the cache backend (such as memcached). This means that getting / updating anything in the cache will not propagate between threads, as each thread has potentially already populated its internal cache pool with now-stale data.
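
Roughly, the internal-pool pattern looks like this (an illustrative sketch, not any particular plugin's code; it assumes the PHP Memcached extension and omits cache groups):

{{{#!php
<?php
// Illustrative sketch of a persistent object cache with an internal pool.
class Sketch_Object_Cache {
	private $memcached;
	private $local = array(); // per-request, in-memory pool

	public function __construct() {
		$this->memcached = new Memcached();
		$this->memcached->addServer( '127.0.0.1', 11211 );
	}

	public function get( $key ) {
		// The local pool wins, so a value fetched early in the request is
		// reused even if another process has changed it in memcached since.
		if ( array_key_exists( $key, $this->local ) ) {
			return $this->local[ $key ];
		}
		$value = $this->memcached->get( $key );
		$this->local[ $key ] = $value;
		return $value;
	}

	public function set( $key, $value ) {
		$this->local[ $key ] = $value;
		return $this->memcached->set( $key, $value );
	}
}
}}}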

The "fix" for that scenario is to not use internal cache pools, so query memcached directly (aside: this is too slow so not really a solution).

However, options are a much worse case, because a single cache key, alloptions, is used. So updating a single option pushes all (potentially stale) options back to the cache store. To draw a parallel with MySQL, this would be like the MySQL query in update_option pushing every option in memory back to the db.
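
For reference, this is roughly the shape of the cache write in update_option (simplified from wp-includes/option.php):

{{{#!php
// Simplified excerpt of the update_option() cache write.
$alloptions = wp_load_alloptions(); // may come straight from the (stale) object cache
if ( isset( $alloptions[ $option ] ) ) {
	$alloptions[ $option ] = $serialized_value;
	// The ENTIRE array -- including any entries loaded earlier in the
	// request -- is written back under the single 'alloptions' key.
	wp_cache_set( 'alloptions', $alloptions, 'options' );
} else {
	wp_cache_set( $option, $serialized_value, 'options' );
}
}}}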

As for a persistent object caching plugin fixing this internally - I don't see how it could while maintaining the use of an internal cache. It's not so bad when updating a post_meta key, for example, as the "race condition" scenario can only mess up that single post meta entry - with options, a race condition can lead to a lot of data loss.

Having said that - what is the "solution" here? Presumably the options are cached as one large array for performance reasons, so it's one cache get / set from memcached / APC etc. to load everything in. I am not sure how much of an impact that really has - would having a separate cache key for each option make things much worse?

Has anyone else come up against this much before? I know I have run into it in a lot of API-driven stuff, as writes can often happen in concurrently running threads.

Change History (11)

#1 @jdgrimes
11 years ago

  • Cc jdg@… added

#2 follow-up: @wonderboymusic
11 years ago

  • Keywords reporter-feedback added; dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release

Would it be possible for you to create a unit test demonstrating this issue, potentially with a fix?

#3 @wonderboymusic
11 years ago

  • Keywords needs-patch added

#4 @sphoid
11 years ago

I'm fairly certain I'm actually having this very problem. We have 7 web servers in a production environment using the memcached plugin. I've been thinking for the longest time that there is a bug in php-memcache or in memcached itself, because we intermittently get stubborn stale cache data, but it sounds very likely that cache writes are stepping all over each other, causing old data to get written back to the cache, which resets the expiry for the key. This is a very difficult thing to prove because it only seems to happen in high-load environments with many servers (this is not a common WP configuration outside of wordpress.com). I also think the size of the data blob might be a factor, since we only ever experience this with alloptions, which is upwards of 500 KB in our environment. I'm watching this ticket with great interest.

#5 in reply to: ↑ 2 @joehoyle
11 years ago

Replying to wonderboymusic:

Would it be possible for you to create a unit test demonstrating this issue, potentially with a fix?

IMO this is too early to start writing any code - it was more about the discussion to see if this is something that _should_ be tackled. The problem is not necessarily a "bug" in the usual sense that there is something that can be fixed. Running code in parallel is always going to be a problem with WordPress; however, it's more that alloptions makes overwrites a lot more likely. I was hoping to get some dev feedback as to whether this issue is "serious" enough to warrant an upheaval of how the options object caching is handled.

Unit tests would at least be good to demonstrate the side effects of calling update_option with a race condition. However, I am not sure exactly how we would do that from PHPUnit with the current setup. One idea: use a remote request back to the install to update an option, then call update_option after that request has returned, and observe that the remote request's update_option is lost from the object cache. This requires a web server in the unit tests (which I don't *think* we currently have) and also running the tests with a persistent object cache (which I am not sure we currently do).
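
Purely as an illustration, such a test might look something like this (the /?set-race-test=remote endpoint is made up, and it assumes both a running web server and a persistent object cache behind the test suite):

{{{#!php
<?php
// Hypothetical outline only -- not a working core test. It assumes a web
// server serving the install, an active persistent object cache drop-in, and
// a made-up endpoint that calls update_option( 'race_test', 'remote' ).
class Tests_Option_Race extends WP_UnitTestCase {
	function test_concurrent_update_option_is_not_lost_from_cache() {
		update_option( 'race_test', 'initial' );
		get_option( 'race_test' ); // primes this request's alloptions cache

		// A second request updates the option out of band.
		wp_remote_get( home_url( '/?set-race-test=remote' ) );

		// Updating an unrelated option pushes this request's stale alloptions
		// array back into the object cache, clobbering the remote write.
		update_option( 'race_other', 'foo' );

		// With the current design this fails: the DB row says 'remote', but
		// the object cache (and this request's local pool) still say 'initial'.
		$this->assertSame( 'remote', get_option( 'race_test' ) );
	}
}
}}}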

#6 @joehoyle
10 years ago

Quick update on this - I took a bash at writing this into the Memcached plugin we use, as a proof of concept. This is at https://github.com/humanmade/memcache-object-cache/pull/1. The tradeoff is that loading WordPress will add about 30 memcached get() calls - depending entirely on your setup, of course. In our environment, that's a tradeoff I'm willing to make to avoid data corruption / deletion, which is something that happens far too often with the current design of alloptions.
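
To sketch the general idea (illustrative only - the names and the diff-on-write detail here are mine, not necessarily what the linked PR does): the drop-in can store each option under its own key, reassemble the array on read, and only write back options that actually changed during the request:

{{{#!php
<?php
// Conceptual sketch of splitting the alloptions blob into per-option keys.
class Sketch_AllOptions_Split {
	private $memcached;
	private $read_values = array(); // option values as first seen this request

	public function __construct( Memcached $memcached ) {
		$this->memcached = $memcached;
	}

	public function get_alloptions() {
		$names = $this->memcached->get( 'alloptions:index' );
		$keys  = array();
		foreach ( (array) $names as $name ) {
			$keys[] = 'alloptions:' . $name;
		}
		// The extra gets mentioned above, batched here via getMulti().
		$values  = $this->memcached->getMulti( $keys );
		$options = array();
		foreach ( (array) $names as $name ) {
			$options[ $name ] = isset( $values[ 'alloptions:' . $name ] )
				? $values[ 'alloptions:' . $name ]
				: false;
		}
		$this->read_values = $options;
		return $options;
	}

	public function set_alloptions( array $options ) {
		foreach ( $options as $name => $value ) {
			// Only push back options that changed in this request, leaving
			// stale copies of everything else untouched.
			if ( ! array_key_exists( $name, $this->read_values ) || $this->read_values[ $name ] !== $value ) {
				$this->memcached->set( 'alloptions:' . $name, $value );
			}
		}
		$this->memcached->set( 'alloptions:index', array_keys( $options ) );
	}
}
}}}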

#7 follow-up: @joehoyle
10 years ago

Ok, new idea: rather than getting rid of the idea of alloptions, which would add too many object cache get calls for people using a persistent object cache, this issue would pretty much be eradicated if we just deleted the alloptions cache before add/delete/update option. This would cause a re-fetch of the options from the DB before pushing those options back into the object cache. Though this isn't 100% atomic, it vastly reduces the chances of data corruption, as the possible window for a race condition becomes "fetch DB rows -> push to object cache" rather than "start of script -> push to object cache", as it currently is (which is a huge issue when running long background tasks etc).
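
Something along these lines (a simplified sketch of the change - the exact placement inside wp-includes/option.php would be decided in the actual patch):

{{{#!php
// Sketch: at the top of update_option() / add_option() / delete_option(),
// drop the cached blob so it is rebuilt from a fresh DB read rather than
// from a copy loaded at the start of the request.
wp_cache_delete( 'alloptions', 'options' );

// ... the option row is written to the database ...

// The existing code then re-populates the cache from the fresh read, so the
// race window shrinks to "DB read -> cache write".
$alloptions = wp_load_alloptions(); // now comes from the database
if ( isset( $alloptions[ $option ] ) ) {
	$alloptions[ $option ] = $serialized_value;
	wp_cache_set( 'alloptions', $alloptions, 'options' );
}
}}}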

I don't think the performance hit will be significant, as update/add/delete option are already pretty expensive operations, so they aren't typically used in "view" contexts.

This should be a pretty straightforward patch - any thoughts before I submit one?

#8
10 years ago

This ticket was mentioned in Slack in #core by rmccue. View the logs.

#9 in reply to: ↑ 7 @datafeedr.com
10 years ago

Replying to joehoyle:

This should be a pretty straightforward patch - any thoughts before I submit one?

We're also having the same issue you described in your initial post. The patch you suggest will most likely fix the issues we're having related to this.

Any update on when this patch might be committed to core?

Thanks!

#10 @joehoyle
9 years ago

  • Resolution set to duplicate
  • Status changed from new to closed

Closing in favour of #31245 as they are essentially describing the same problem and virtually the same solution.

#11 @swissspidy
9 years ago

  • Keywords reporter-feedback needs-patch removed
  • Milestone Future Release deleted