WordPress.org

Make WordPress Core

Opened 11 years ago

Closed 11 years ago

Last modified 8 years ago

#7068 closed enhancement (fixed)

cron improvement

Reported by: hailin Owned by: hailin
Milestone: 2.7 Priority: high
Severity: normal Version:
Component: General Keywords:
Focuses: Cc:

Description

There are several key issues associated with current cron implementation.

  1. cron is not atomic.


Every page load will call wp_cron(), check the first timestamp in cron array, if it has expired, it calls spawn_cron which calls wp-cron.php to do fire up the jobs.

This runs into massive concurrency issue on a large system with hundreds of servers, where millions of pages views are generated every day.

The current method to address this issue is in wp-cron.php:

if ( get_option('doing_cron') > $local_time )

exit;

update_option('doing_cron', $local_time + 30);

However, the check does not solve the issues resulted from concurrency.

Example:

On a busy site, in the particular second when first cron timestamp is expiring, there are 10 blog page loads on 10 different servers.

Suppose process#1 on server #1 goes first, yet before it has reached update_option('doing_cron', $local_time + 30), process #2 on server#2 begins the sequence too.

Since ‘doing_cron” is still being updated by the process#1, or the updated value has not taken effect yet (due to db or cache delays, several milliseconds or longer usually) , process#2 will pass if ( get_option('doing_cron') > $local_time )
Check and also update_option('doing_cron', $local_time + 30). So both processes will proceed to fire up the cron job.

I’ve observed that on a popular blog on a busy production site, ANY cron job was executed 5-7 times! That may be ok for publish_future_post operation, but may not be good for other cron tasks.

An ideal solution is to guarantee every cron is executed once and once only.
I can envision storing all cron jobs in a central table, then a daemon processes it on a PARTICULAR server. Yet this approach may not be as flexible as it may not handle blog-specific jobs well.

A practical solution is to make the cron operation as atomic as possible, knowing that we can never make it truly atomic as there will be database and cross-data center communication delays.

  1. Server timers are not always correct

Because cron job condition is tested on every blog page load on every server. Any server with a bad clock can ruin the cron jobs, causing future posts being published earlier or never being published.

We can build in some protection mechanism to guard against this.

  1. Minor issue

Calling time() in multiple places in cron operation chain can be tricky on a busy server, as each call can give different values if the server is overloaded. Passing the first timestamp at cron entry point is logically sound.

  1. Lack of a central standard time source

Server timer drifting issue caused by power outage, etc poses a fundamental challenge. Software can not prevent hardware failure, and can only do so much to adapt to those failure cases.

Attachments (2)

7068_cron.diff (1.6 KB) - added by hailin 11 years ago.
revised patch part1
7068_cron_part2.diff (2.8 KB) - added by hailin 11 years ago.
revised patch part2

Download all attachments as: .zip

Change History (16)

#1 @hailin
11 years ago

  • Owner changed from anonymous to hailin
  • Status changed from new to assigned

#2 @hailin
11 years ago

Also observed that very large timestamps slipped in wp_schedule_single_event

Eg:
sec=1509044450 => Thu Oct 26 19:00:50 UTC 2017
sec=4129387222 => Mon Nov 8 20:00:22 UTC 2100
sec=4351359650 => Mon Nov 21 23:00:50 UTC 2107

@hailin
11 years ago

revised patch part1

@hailin
11 years ago

revised patch part2

#3 @hailin
11 years ago

Cleaned up the patch a little bit. Not sure why 7068_cron.diff does not display in HTML preview.

7068_cron.diff patches wp-cron.php
7068_cron_part2.diff patches cron.php

The patch has been tested at a very large site for over a month now, and it appears it's doing better than before.

#4 @jacobsantos
11 years ago

  • Type changed from defect to enhancement

#5 @jacobsantos
11 years ago

You know what the best way to make the wp-cron atomic. Setting up a cron job using crontab or Windows Scheduler.

#6 @azaozz
11 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [8927]) Cron improvement, props hailin, fixes #7068

#7 @westi
11 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

What is the point of check_server_timer if it is always going to return true?

If we want something pluggable we should be using a filter.

I'm not sure this new code really improves cron that much.

All it does is reduce the window for the race condition to occur rather than fixing the bug.

We are still going to fight with any object cache that is in existance.

#8 follow-up: @Otto42
11 years ago

I believe this race condition is unavoidable as long as it is driven off of hits to the site itself, as those hits are unpredictable. Two in quick succession can always cause wp-cron to hit twice so the most you can do is reduce it.

PHP does not really have a good interprocess mutex mechanism. There's one over here that uses semaphores or file locking, but I have no information on how reliable such a thing would be: http://in2.php.net/manual/en/ref.sem.php

#9 in reply to: ↑ 8 @westi
11 years ago

Replying to Otto42:

I believe this race condition is unavoidable as long as it is driven off of hits to the site itself, as those hits are unpredictable. Two in quick succession can always cause wp-cron to hit twice so the most you can do is reduce it.

Indeed. But for a single server hosted site we should be able to do something very reliable.

I don't think in core wordpress.org we should be trying to cope for multi server multi db scenarios as they would be better served by running the cron jobs manually triggered from a true cron job allowing you to do all the semaphore or mutex work you wanted.

#10 follow-up: @Otto42
11 years ago

I don't quite follow you there, westi. Where exactly are we talking about multi-server or multi-db setups here? I never considered them at all.

#11 in reply to: ↑ 10 @westi
11 years ago

Replying to Otto42:

I don't quite follow you there, westi. Where exactly are we talking about multi-server or multi-db setups here? I never considered them at all.

Indeed.

The new code and comments on this ticket look to me like Halin is trying to solve problems in that kind of setup (as seen on wp.com) in wp.org which doesn't feel correct to me.

#12 @Otto42
11 years ago

Mmmm... I don't think that he is, really. A single server setup can still get high traffic, and the problem is that high traffic creates the race condition. Sure, he only reduced it instead of fixing it, but still, that reduction is enough to prevent a large load on the system.

By moving the doing_cron check into the main code, you reduce the number of hits that occur to wp-cron.php in the first place. This is a substantial change, because it prevents wp-cron.php from running and then having to suddenly load all of WordPress up a few times just to check doing_cron.

Because until the cron job waiting has actually processed, every hit to the site would previously cause a hit to wp-cron. So until the job finished, you doubled your load. Not a great way to do things.

Sure the best method is a real cron job, but not everybody can set those up easily.

#13 @ryan
11 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

#14 @nacin
8 years ago

Related, #15602. check_server_timer() is gone.

Note: See TracTickets for help on using tickets.