Make WordPress Core

Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#25660 closed defect (bug) (fixed)

If an administrator triggers cron, a background update could temporarily lock them out

Reported by: nacin Owned by: nacin
Milestone: 3.7 Priority: high
Severity: major Version:
Component: Upgrade/Install Keywords: has-patch needs-testing
Focuses: Cc:


Thank goodness for local autosaves in 3.6.

If an administrator triggers a cron event, a background update could temporarily lock them out. The first page they open will trigger the cron, and then the very next thing they do will hit the maintenance wall. We could block or delay updates whenever it's a logged-in user that triggers a cron event, but that's kind of lame — especially if the site is only accessible to logged-in users.

I think we should do a few things to mitigate this. In order of priority and ease:

  • High priority, pretty simple. We need to test how heartbeat — auth checks, post locks, and interrupted connection notices — handles this. It's possible we need to account for a 503 and say "Hang tight, your site is updating" or something, in lieu of the intermittent connection issue.
  • Medium priority, takes a little more work. We should show a notice. "This site is scheduled to update WordPress in less than %s minutes. Update now, it's easy and takes a few moments: (link)". Show it to users capable of updating WordPress core (it's just something we'd add to the existing nag). Say, if we are within 30 minutes of the next scheduled event? Jaquith mocked up a little timer countdown in the toolbar, which could be cool for 3.8.
  • Low priority, decent amount of work. If a user with the ability to update_core triggers a cron event that would normally cause a background update, *delay* the update by 30 minutes, and show them said notice. (Do not allow it to be delayed a second time, which we can solve by just passing some random argument to the cron event.) It's not ideal, as they're then told to click something that should be automatic, but it'd be really nice to not interrupt what they are doing (potentially the saving of a post, etc.). It's possible an update could come at a *really* inopportune time, obviously, and we should try to prevent surprises. It's kind of similar to "Install and Relaunch" that a bunch of Mac apps nag me about all the time. Note, this would likely require us to put some one-off code pretty deep into the cron API.

It's possible that this is not a big deal. I don't think I've seen it reported more than once or twice, though both dd32 and I have triggered it on our own. (It should be noted that when dealing with local installs, it will *always* be your visit that triggers the cron, so it's more likely to be noticed.)

Why? It's possible/probable that the download and unzipping takes most of the time (we're seeing averages of about 24 seconds for nightly builds) and that we only have maintenance mode set for a few seconds.

We currently have data on how long an update takes, but we should specifically try to collect a better data point: exactly how long maintenance mode was set for. (We should be able to do this by including the file before we leave maintenance mode and retrieve the time that was set.) Such a short timeframe might actually render this ticket moot, as the person's page will be loading by the time the cron HTTP request is fired, and maintenance mode will be off again by the time the person tries to advance to the next page.

Attachments (5)

25660.patch (1.8 KB) - added by azaozz 6 years ago.
25660.diff (2.0 KB) - added by dd32 6 years ago.
25660.2.diff (1.7 KB) - added by nacin 6 years ago.
25660.3.diff (2.4 KB) - added by azaozz 6 years ago.
25660.4.diff (474 bytes) - added by azaozz 6 years ago.

Download all attachments as: .zip

Change History (15)

6 years ago

#1 @azaozz
6 years ago

Hearbeat and autosave happily continue after the maintenance mode finishes. 25660.patch adds detection of a 503 XHR response status code in heartbeat and passes it in the custom jQuery event. The bit in autosave that disables the buttons on connection timeout, now disables them on 503 as well and shows the error message on the Edit Post screen:

Connection lost. Saving has been disabled until you’re reconnected. We’re backing up this post in your browser, just in case.

#2 @azaozz
6 years ago

  • Keywords has-patch needs-testing added

6 years ago

#3 @dd32
6 years ago

25660.diff attempts to cover the Low priority angle of this, by hooking into the cron lock transient's actions and re-scheduling the update checks to happen in 30 minutes (and +12-ish hours for the next normally scheduled check).

It's completely untested, but attempts to block it delaying twice (over a greater than 5 minute period of time, to cover cases where cron is triggered is triggered multiple times within a few seconds/minutes). It seems like we can probably do that without too much crazy code, and cron modifications and is a easy win that's silent to the user.

Adding a pre-update notification could be nice, however, we'd need to keep in mind that it may be 5 minutes before the twicedaily cron that we realise that we're going to apply a autoupdate (but, it could also be 11hrs, 55minutes before hand that we know).

6 years ago

#4 @nacin
6 years ago

I like azaozz's 25660.patch. The only thing we should be concerned about is that we wanted to be certain in 3.6 that we weren't accidentally blocking someone from publishing just because their heartbeat or something is broken. 503 is specific enough, and meaningful enough, that we should be okay always treating a 503 as a connection issue. But, that code is also particular to the resource, not to the site. It's tough. I think I would like to see heartbeat work once before treating a 503 as an error. That way we at least know there aren't persistent 503 errors spewing from admin-ajax.php.

Originally, heartbeat-connection-lost was to fire whenever heartbeat has been successful once, plus started failing one or more times. Only 'timeout' would work regardless. But, the callback hooked into the heartbeat-connection-lost event specifically checks for 'timeout', and now also 503. In 3.8 it would be interesting to expand heartbeat-connection-lost to more than just timeout. We just have to be very careful that we are not flagging false positives.

So, okay with 25660.patch. I toyed with it a bit and came up with (untested) 25660.2.diff.

dd32's patch is clever. Do we need it right now? I don't know. #25655 might be enough alone.

#5 @dd32
6 years ago

my thoughts:

  • We could be overstating this whole issue, both me and nacin hit it because we were testing, and the only requests to the site were us.
  • 503 is a pretty specific status that should cause a heartbeat failure no matter what IMHO, a server shouldn't be spewing those
  • We can add a specific header to maintenance mode and trigger based off that 503 isn't specific enough
  • I don't believe for 3.7 we need a pre-update nag
  • I don't believe for 3.7 we need a update-delaying handler such as in my patch 25660.diff - It was more as a self POC

I think as long as heartbeat is taken care of, we can iterate with the rest in future releases, no need to add a bunch of fancy smart logic at the last minute.

#6 @dd32
6 years ago

Forgot to mention, in testing for #25655, maintenance mode will be active for a handful of seconds, currently about 13 seconds, but that can be reduced to 5 seconds, and once it's dealing with a partial build instead of a full zip, i suspect it'll be much less than that.

6 years ago

#7 @azaozz
6 years ago

25660.3.diff merges 25660.2.diff and 25660.patch.

I think I would like to see heartbeat work once before treating a 503 as an error. That way we at least know there aren't persistent 503 errors spewing from admin-ajax.php.

That's sensible. The only thing is that sites with only one user would take longer to trigger the error as we don't check post locks there, so heartbeat runs every couple of minutes instead of every 15 seconds (once an error is triggered heartbeat runs every 15 sec. until the error is cleared). And if maintenance mode lasts about 20 sec. chances are we will miss is.

...we wanted to be certain in 3.6 that we weren't accidentally blocking someone from publishing just because their heartbeat or something is broken. 503 is specific enough, and meaningful enough, that we should be okay always treating a 503 as a connection issue.

Was thinking about that too. When doing AJAX, most errors are some type of PHP notice/warning that prevents parsing the response (we currently bypass these). As @dd32 mentions above, responses with HTTP error 503 are quite specific and we should be preventing the user from saving while they persist. We could add a custom header when in maintenance mode but thinking all 503 should be treated the same.

Also agree we can enhance that part of heartbeat and use it better. For (unrelated) example: when an error is detected, we could set a target on the form and keep the current Edit Post screen open while the user attempts to save the post. If it fails, the original content is still intact and the user can try again later.

Last edited 6 years ago by azaozz (previous) (diff)

#8 @azaozz
6 years ago

Another (theoretical) idea for 3.8: "delay" POST requests coming from the admin while the site is in maintenance mode. The benefit is this would be completely transparent for the user (apart from some delay when submitting a form).

This could work by detecting the .maintenance file on POST and setting something like sleep(3) in PHP, then checking again, etc. The browsers would wait for several minutes for a response and we can use set_time_limit() several times so PHP doesn't time out.

As auto updates typically run less than a minute, the sleep(3) would be in a loop that runs 20 times before giving up. Can also do this only for critical POST requests, for example when a user is attempting to save a post.

Last edited 6 years ago by azaozz (previous) (diff)

#9 @nacin
6 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 25874:

Have heartbeat's connection lost notice handle 503 errors send by the upgrader's maintenance mode.

fixes #25660 for 3.7.

6 years ago

This ticket was mentioned in IRC in #wordpress-dev by nacin. View the logs.

6 years ago

Note: See TracTickets for help on using tickets.