#17462 closed defect (bug) (fixed)
Improve cron locking
Reported by: | ryan | Owned by: | ryan |
---|---|---|---|
Milestone: | 3.3 | Priority: | normal |
Severity: | major | Version: | 3.1.2 |
Component: | Cron API | Keywords: | has-patch |
Focuses: | Cc: |
Description
wp-cron.php loops that run longer than the one minute lock timeout can result in multiple cron processes looping over the same events. To avoid this the loop in wp-cron.php should check to see if it still has the lock after firing each action.
Attachments (6)
Change History (28)
#3
follow-up:
↓ 4
@
13 years ago
Short discussion in IRC with duck_ and ryan led to this suggestion that I think we all liked: Make the timeout longer and clear the transient when wp-cron.php completes.
#4
in reply to:
↑ 3
;
follow-up:
↓ 5
@
13 years ago
Replying to nacin:
Short discussion in IRC with duck_ and ryan led to this suggestion that I think we all liked: Make the timeout longer and clear the transient when wp-cron.php completes.
Can't this lead to runaway crons just like the current situation?
Suppose you've a batch of 100k registration emails to send, and a fussy shared host that kills lengthy processes. Cron A locks and starts sending emails. The next couple of crons see things are locked so don't start. Server kills cron A because it's taking too long. Cron B locks and starts sending emails (to the same people). The next couple of crons see things are locked (etc.).
To avoid the problem, locking needs to be done for individual cron jobs, rather than as a group. This would allow to mark individual items as being locked/in progress, and concurrent crons could then kick in and process other items accordingly.
#5
in reply to:
↑ 4
;
follow-up:
↓ 7
@
13 years ago
Replying to Denis-de-Bernardy:
Suppose you've a batch of 100k registration emails to send, and a fussy shared host that kills lengthy processes. Cron A locks and starts sending emails. The next couple of crons see things are locked so don't start. Server kills cron A because it's taking too long. Cron B locks and starts sending emails (to the same people). The next couple of crons see things are locked (etc.).
The job is unscheduled before it's actually started off.
Suppose you have three jobs 1, 2, 3 where 2 takes over a minute. I believe the problem is, Ryan correct me if I'm wrong, that when A is taking too long on 2 B can be spawned as the flag has timed out. That means that both A and B process job 3.
As well as the longer time out period we could, as part of the loop in wp-cron.php, check the difference in time since the start and if it's approaching the lock time out then break. That means that processes that have taken too long stop as if they have died though I suppose they would clear the lock in doing so.
#6
@
13 years ago
Something to think about with Ryan's patch is that some people have real cron jobs set up to hit wp-cron.php
, so any URL query arguments should probably be optional and the WP cron continue to work without them.
#7
in reply to:
↑ 5
@
13 years ago
Replying to duck_:
As well as the longer time out period we could, as part of the loop in wp-cron.php, check the difference in time since the start and if it's approaching the lock time out then break. That means that processes that have taken too long stop as if they have died though I suppose they would clear the lock in doing so.
Poked a hole in this one if applied alone. First long running process may clear the transient from under a second process which has followed on after $flag has timed out. This would allow a third cron process to run without being blocked (and, possibly, so on) leading to the current problem. Haven't tested to confirm, but seems like it's possible. However, it's a good complement to Ryan's patch.
Edit to clarify.
#8
@
13 years ago
It's a minor thing, but: with Ryan's patch I would recommend swapping $key and $data for consistency, every other wp_cache_* operation accepts the key first.
#9
@
13 years ago
- Cc admin@… added
Since I just filed a similar bug (#17561), I'd love to see a fix for this. What if each cron job is taken on an individual basis and marked as completed 1-by-1. In that case, you won't have 2 cron processes fighting for the same job and multiple crons may execute multiple jobs.
Though, that's part of the problem - with a lot of jobs, things can pile up and you'll end up with too many wp-cron processes all running at the same time. I think I'd prefer to delay execution a bit instead of having too many thins all running on time.
#10
@
13 years ago
If you guys need an example of how bad this gets, I just posted a new post and took this screenshot of the server-status: http://farm4.static.flickr.com/3631/5769124991_4dc3dc279b_b.jpg.
#13
@
13 years ago
This is really a big problem, especially for tasks that are critical, like backups. For GD Press Tools Pro plugin I made a in job handler protection to avoid the problem. First, I set sleep(1), and than check for the transient value in the database. This value is set to 60 seconds. And for the most part even if the hook is executed twice by the cron it will still allow only one process to run. It was OK for a while, and today I have noticed that this is not really enough, wp-cron.php managed to run same cron job 4 seconds apart. And that is the big difference for cron data to remain invalid. I can increase sleep time, to allow for bigger gap, but that is not really a solution. All cron jobs have the same problem, but most of them are not that critical.
#14
@
13 years ago
Allows calling wp-cron.php directly from a script without the lock arg. Introduces WP_CRON_LOCK_TIMEOUT. Works well so far in my lock stealing tests.
#15
@
13 years ago
Since wp_cache_reset() is a hack that is used only by the internal cache, I decided not to leverage it and cast about for alternatives. 17462.5.diff adds a force flag to wp_cache_get().
#16
@
13 years ago
Looks good. I definitely prefer the wp_cache_get() force rather than wp_cache_reset(), after digging into everything.
spawn_cron() relies on 'doing_wp_cron', and then there's alternate cron support in there. Might be worth it to add the lock value to the doing_wp_cron key rather than changing up the variables. Likewise some of the locking stuff for the alternate cron may be worth fixing up.
#17
@
13 years ago
- Owner set to ryan
- Resolution set to fixed
- Status changed from new to closed
In [18659]:
#18
@
13 years ago
- Cc greyham added
- Severity changed from normal to major
Can someone please confirm if the fix for this bug also fixes 15014?
#19
@
13 years ago
Did not run into any bug yet but while checking the code got wondering if the check
if ( $lock > $local_time + 10*60 )
in http://core.trac.wordpress.org/browser/trunk/wp-includes/cron.php#L209 should not honor the WP_CRON_LOCK_TIMEOUT to avoid conflicts when the WP_CRON_LOCK_TIMEOUT is bigger than 10 minutes.
#20
@
13 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
I don't think the bug is fixed. After posting a new post, I experienced a load spike (2 -> 20) and found this in the server status: http://minus.com/m1VuUQcxP
wp_cache_reset() was augmented to allow clearing the internal cache for particular groups and keys. This would need to be rolled out to the memcached and APC backends if we go with it.