Make WordPress Core

Opened 13 years ago

Closed 13 years ago

Last modified 13 years ago

#17462 closed defect (bug) (fixed)

Improve cron locking

Reported by: ryan's profile ryan Owned by: ryan's profile ryan
Milestone: 3.3 Priority: normal
Severity: major Version: 3.1.2
Component: Cron API Keywords: has-patch
Focuses: Cc:

Description

wp-cron.php loops that run longer than the one minute lock timeout can result in multiple cron processes looping over the same events. To avoid this the loop in wp-cron.php should check to see if it still has the lock after firing each action.

Attachments (6)

17462.diff (4.1 KB) - added by ryan 13 years ago.
17462.2.diff (791 bytes) - added by duck_ 13 years ago.
Example patch with only cursory testing
17462.3.diff (5.4 KB) - added by ryan 13 years ago.
17462.4.diff (5.5 KB) - added by ryan 13 years ago.
Typo fix. Props duck_
17462.5.diff (5.4 KB) - added by ryan 13 years ago.
Add force arg to wp_cache_get()
17462.6.diff (5.9 KB) - added by ryan 13 years ago.
Use doing_wp_cron instead of lock

Download all attachments as: .zip

Change History (28)

@ryan
13 years ago

#1 @ryan
13 years ago

wp_cache_reset() was augmented to allow clearing the internal cache for particular groups and keys. This would need to be rolled out to the memcached and APC backends if we go with it.

#2 @scribu
13 years ago

  • Keywords has-patch added

#3 follow-up: @nacin
13 years ago

Short discussion in IRC with duck_ and ryan led to this suggestion that I think we all liked: Make the timeout longer and clear the transient when wp-cron.php completes.

#4 in reply to: ↑ 3 ; follow-up: @Denis-de-Bernardy
13 years ago

Replying to nacin:

Short discussion in IRC with duck_ and ryan led to this suggestion that I think we all liked: Make the timeout longer and clear the transient when wp-cron.php completes.

Can't this lead to runaway crons just like the current situation?

Suppose you've a batch of 100k registration emails to send, and a fussy shared host that kills lengthy processes. Cron A locks and starts sending emails. The next couple of crons see things are locked so don't start. Server kills cron A because it's taking too long. Cron B locks and starts sending emails (to the same people). The next couple of crons see things are locked (etc.).

To avoid the problem, locking needs to be done for individual cron jobs, rather than as a group. This would allow to mark individual items as being locked/in progress, and concurrent crons could then kick in and process other items accordingly.

#5 in reply to: ↑ 4 ; follow-up: @duck_
13 years ago

Replying to Denis-de-Bernardy:

Suppose you've a batch of 100k registration emails to send, and a fussy shared host that kills lengthy processes. Cron A locks and starts sending emails. The next couple of crons see things are locked so don't start. Server kills cron A because it's taking too long. Cron B locks and starts sending emails (to the same people). The next couple of crons see things are locked (etc.).

The job is unscheduled before it's actually started off.

Suppose you have three jobs 1, 2, 3 where 2 takes over a minute. I believe the problem is, Ryan correct me if I'm wrong, that when A is taking too long on 2 B can be spawned as the flag has timed out. That means that both A and B process job 3.

As well as the longer time out period we could, as part of the loop in wp-cron.php, check the difference in time since the start and if it's approaching the lock time out then break. That means that processes that have taken too long stop as if they have died though I suppose they would clear the lock in doing so.

@duck_
13 years ago

Example patch with only cursory testing

#6 @Viper007Bond
13 years ago

Something to think about with Ryan's patch is that some people have real cron jobs set up to hit wp-cron.php, so any URL query arguments should probably be optional and the WP cron continue to work without them.

#7 in reply to: ↑ 5 @duck_
13 years ago

Replying to duck_:

As well as the longer time out period we could, as part of the loop in wp-cron.php, check the difference in time since the start and if it's approaching the lock time out then break. That means that processes that have taken too long stop as if they have died though I suppose they would clear the lock in doing so.

Poked a hole in this one if applied alone. First long running process may clear the transient from under a second process which has followed on after $flag has timed out. This would allow a third cron process to run without being blocked (and, possibly, so on) leading to the current problem. Haven't tested to confirm, but seems like it's possible. However, it's a good complement to Ryan's patch.

Edit to clarify.

Last edited 13 years ago by duck_ (previous) (diff)

#8 @mintindeed
13 years ago

It's a minor thing, but: with Ryan's patch I would recommend swapping $key and $data for consistency, every other wp_cache_* operation accepts the key first.

#9 @archon810
13 years ago

  • Cc admin@… added

Since I just filed a similar bug (#17561), I'd love to see a fix for this. What if each cron job is taken on an individual basis and marked as completed 1-by-1. In that case, you won't have 2 cron processes fighting for the same job and multiple crons may execute multiple jobs.

Though, that's part of the problem - with a lot of jobs, things can pile up and you'll end up with too many wp-cron processes all running at the same time. I think I'd prefer to delay execution a bit instead of having too many thins all running on time.

#10 @archon810
13 years ago

If you guys need an example of how bad this gets, I just posted a new post and took this screenshot of the server-status: http://farm4.static.flickr.com/3631/5769124991_4dc3dc279b_b.jpg.

#11 @ryan
13 years ago

  • Milestone changed from Future Release to 3.3

#12 @tott
13 years ago

  • Cc tott added

#13 @GDragoN
13 years ago

This is really a big problem, especially for tasks that are critical, like backups. For GD Press Tools Pro plugin I made a in job handler protection to avoid the problem. First, I set sleep(1), and than check for the transient value in the database. This value is set to 60 seconds. And for the most part even if the hook is executed twice by the cron it will still allow only one process to run. It was OK for a while, and today I have noticed that this is not really enough, wp-cron.php managed to run same cron job 4 seconds apart. And that is the big difference for cron data to remain invalid. I can increase sleep time, to allow for bigger gap, but that is not really a solution. All cron jobs have the same problem, but most of them are not that critical.

@ryan
13 years ago

#14 @ryan
13 years ago

Allows calling wp-cron.php directly from a script without the lock arg. Introduces WP_CRON_LOCK_TIMEOUT. Works well so far in my lock stealing tests.

@ryan
13 years ago

Typo fix. Props duck_

@ryan
13 years ago

Add force arg to wp_cache_get()

#15 @ryan
13 years ago

Since wp_cache_reset() is a hack that is used only by the internal cache, I decided not to leverage it and cast about for alternatives. 17462.5.diff adds a force flag to wp_cache_get().

#16 @nacin
13 years ago

Looks good. I definitely prefer the wp_cache_get() force rather than wp_cache_reset(), after digging into everything.

spawn_cron() relies on 'doing_wp_cron', and then there's alternate cron support in there. Might be worth it to add the lock value to the doing_wp_cron key rather than changing up the variables. Likewise some of the locking stuff for the alternate cron may be worth fixing up.

@ryan
13 years ago

Use doing_wp_cron instead of lock

#17 @ryan
13 years ago

  • Owner set to ryan
  • Resolution set to fixed
  • Status changed from new to closed

In [18659]:

Improve cron locking. Avoid multiple cron processes looping over the same events. fixes #17462

#18 @greyham
13 years ago

  • Cc greyham added
  • Severity changed from normal to major

Can someone please confirm if the fix for this bug also fixes 15014?

#19 @tott
13 years ago

Did not run into any bug yet but while checking the code got wondering if the check

if ( $lock > $local_time + 10*60 )

in http://core.trac.wordpress.org/browser/trunk/wp-includes/cron.php#L209 should not honor the WP_CRON_LOCK_TIMEOUT to avoid conflicts when the WP_CRON_LOCK_TIMEOUT is bigger than 10 minutes.

#20 @archon810
13 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

I don't think the bug is fixed. After posting a new post, I experienced a load spike (2 -> 20) and found this in the server status: http://minus.com/m1VuUQcxP

http://farm8.staticflickr.com/7019/6583909247_5dc4bc1917_b.jpg

#21 @thee17
13 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

This was closed for an already realized version, it is better to create a new ticket then reopen a fixed and released one.

Note: See TracTickets for help on using tickets.