WordPress.org

Make WordPress Core

Opened 4 weeks ago

Last modified 6 days ago

#44458 new task (blessed)

Catch WSODs and provide a means for recovery for end users

Reported by: schlessera Owned by:
Milestone: 5.0 Priority: normal
Severity: normal Version:
Component: Bootstrap/Load Keywords: has-patch needs-unit-tests needs-testing
Focuses: multisite Cc:

Description

While discussing PHP version updates at WCEU18, @pento introduced the notion of "sandboxing" plugins & themes to avoid a WSOD for end users doing a PHP update and having a broken plugin active on their site.

The goal of this "sandboxing" should be to catch/track WSODs and then provide a mechanism for users to still get into the site's admin backend to update/deactivate/replace the offending plugin.

Technical details still have to be figured out, but it looks like our best bet so far is to not do actual "sandboxing", but tracking fatal errors, and then entering some form of "safe mode" on subsequent requests.

Attachments (5)

44458-1.diff (4.9 KB) - added by schlessera 2 weeks ago.
First proof-of-concept implementation (still missing UI integrations)
44458-2.diff (16.0 KB) - added by schlessera 2 weeks ago.
Second iteration, adding all the basic UI integration (for plugins only for now) [Updated]
44458-3.diff (23.0 KB) - added by schlessera 2 weeks ago.
Refactored persistence code, added error output, "paused" filter in plugin list and redirect to filtered view
44458-4.diff (23.1 KB) - added by schlessera 6 days ago.
Avoid array derefencing for PHP 5.2 compat
44458.diff (23.4 KB) - added by flixos90 6 days ago.

Download all attachments as: .zip

Change History (24)

This ticket was mentioned in Slack in #core-php by schlessera. View the logs.


4 weeks ago

#2 @schlessera
4 weeks ago

(copied from Slack for future reference)

I've started looking into this briefly, and here's what I've found out so far:

  • Exceptions don't help, as they are not fully integrated into the error handling at PHP 5.2.
  • We can use a shutdown handler to detect fatal errors and know where they were triggered: https://3v4l.org/4jWAs .
  • Such a shutdown handler could record a fatal error, and the next page request could then detect a recorded fatal error and decide based on some heuristics whether to initiate "safe mode". (edited)

There's an obvious security-related issue here: This "safe mode" will skip loading plugins & themes, so it will also skip any security-related enhancements that were done to the site. An attacker might forcefully trigger the "safe mode" to get past security mechanisms.

Some other assumptions I'm making right now:

  • We cannot just act on plugin activation/deactivation, as this will still take the site down if we update PHP.
  • We cannot disable a single plugin, as we cannot reliably detect who the actual culprit is in all cases.
  • We might be able to disable a single plugin in those cases where we hit a parse error in a file of a plugin.

#3 @schlessera
3 weeks ago

Another proposal: Only allow for safe mode on logged-in /wp-admin/* requests, and build in the following way:

  • a shutdown handler is registered that will redirect to the same URL with a nonce-based safe mode URL argument added.
  • If page loads correctly, shutdown handler is removed again before it is executed.
  • If page would WSOD, it will trigger the shutdown handler, which will redirect to the safe-mode-version of that same page.

This ticket was mentioned in Slack in #core-php by flixos90. View the logs.


3 weeks ago

#5 @MKSnMKS
2 weeks ago

Hi schlessera,

I was referred by Joy (Joyously) who replied to my topic at

https://wordpress.org/support/topic/wp-core-plugin-update-fail-white-screen-od-method-to-detect-and-repair/

to make a comment in your topic here.

My suggestion in short, is to ; a) let the whole process of failure take place. b) detect the white screen of death, by using a server generated page request c) check the result of the page request d) if the result is bad, then automatically to the recovery process (which presently must be manually done) e) this retrieves front and back end access to the website, so the admin can diagnose from the wordpress dashboard, and authors can carry on undisturbed, and (most importantly) the website is back on line.

I describe several instances where this approach could be used.

This should be fairly easy to do right now - the fix is known i.e. rename the folders of recently updated plugins (but automate the timing/event of this process, and use a script to do it automatically).

e.g. for script flow to begin the recovery action i) and no action from recent actioning admin ii) and 'delay time' has passed iii) so automatically have the server send a page request to admin or front-end iv) detect bad page response to page request, so (if possible) display a simple text page (much like the maintenance notice - with an estimate of the number of minutes till back online, then run the recovery script (below)

for script flow to recover from white screen of death i) search plugins folders for plugins whose folders were altered in the last time period note - this time period could be set to a user chosen default, or left to a system default of the 'delay time (above)' plus 3 hours ii) rename all those folders (found in (i)), by prepending "Disabled_due_to_WSOD--" to the names of each folder (the main folder for each plugin). iii) optionally (user chosen setting or leave for prompting during live run) send an email to admin, saying what happened and the site should be live now.

The web site should then be back to near normal.

The first script could be made to be manually run, in various situations, and made easy to run by use of a button, for example in cPanel admin "auto repair WordPress White Screen of Death", or send a coded email to a special email address that then carries out the action.

Also It would be handy to have an automated system to re-update a failed update, that could be done from the wordpress dashboard after access is recovered, or from cPanel by button, which would update the folders previously detected as recently altered. Precautions may be needed to not loose user's custom settings, or database, but they may be able to be scripted also.

Plugin authors could be invited to supply scripts for recovery in the event that their plugin needs to be repaired after after a WSOD event (this could be made a requirement at some future time).

I hope this helps

#6 @MKSnMKS
2 weeks ago

Hi,

Having made the suggestion described mostly above, originating from the original suggestion at

https://wordpress.org/support/topic/wp-core-plugin-update-fail-white-screen-od-method-to-detect-and-repair/

You may see how difficult it can be to get suggestions through to where ever they need to go. The topic has been closed, and there is no indication of where it can go within the wordpress development cycle, that would be of any use.

The support forum allows a topic to be set as “this topic is not a support question” in the submission, which I had done. So clearly the support forum can take topics which are 'not support questions'. And the particular forum was for making suggestions (requests and feedback), yet I am now told "but it’s not really appropriate for support forum." and the topic has been closed.

@schlessera
2 weeks ago

First proof-of-concept implementation (still missing UI integrations)

#7 @schlessera
2 weeks ago

I uploaded a first proof-of-concept of the safe mode mechanism.

I had to find out that the debug_backtrace is useless, as the stack is completely unrolled before the shutdown handler is being called.

However, error_get_last() returns both the type of error as well as the file where it was triggered in.

With that information, we can do some deduction, like a parse error only ever happens in the file that is actually the cause of the problem, so that can be directly tied to a plugin/mu-plugin/theme.

Also, I built it so that it can be customized on sites just like with advanced-cache.php. You can drop a shutdown-handler.php file into the wp-contents folder that provides a wp_handle_shutdown() function. This lets you customize/override/disable the shutdown handling logic.

I'm currently adding the detected culprits to a DB option called skip_on_admin. When loading the admin or when you're on the login page, these plugins/mu-plugins/themes will always be skipped until you manually declare them to be safe again.

The front-end will not be impacted at all by this. We're only filtering the loading of these when we're inside the admin backend.

The basic system seems to work. When I produce a parse error in a plugin, my frontend breaks and I get a WSOD. The plugin that causes the error is stored in the DB in the skip_on_admin option, though. Then, when I go to either a login or an admin screen, that particular plugin is filtered out of the active plugins list and doesn't get loaded. So although the frontend is broken and shows a WSOD, logging in and working in the admin backend still works correctly.

What needs to be added now is:

  • an admin notice to let the user know that something is broken
  • an integration into the plugin list that lets the user remove the plugin from the skip_on_admin list.

Also, I'm still not sure whether I should include must-use plugins in that mechanism or not...

This ticket was mentioned in Slack in #core-php by schlessera. View the logs.


2 weeks ago

#9 follow-up: @schlessera
2 weeks ago

The second iteration contains basic UI integration, but currently only for plugins.

On the frontend, when an error is triggered, WordPress will display a corresponding message, and provide a link to log into the admin dashboard to fix the issue.

The message is filterable via 'wp_technical_issues_display', and the entire page can be replaced via a drop-in called php-error.php.

https://i.imgur.com/S06ErFV.png

Then, once the site owner has logged into the admin backend, they see an admin notice letting them know about the problem and providing a link to go to the plugins screen to manipulate the plugins.

https://i.imgur.com/alGWnNQ.png

Finally, once they are on the Plugins screen, they can see the plugins that have been "paused" within the admin backend because of an error. The site owner will be able to do all sorts of manipulations, like updating the plugin, deactivating it, deleting, etc... Once the site owner thinks they solved th problem, they can choose to "Resume execution" for the problematic plugin. If the plugin still contains an error, it will be caught and recorded again.

https://i.imgur.com/y3k8ZDc.png

Questions & remaining issues:

  • We should decide on specific language (on the technical side just as well as facing the site owners). Right now, the code talks about "pausing" a plugin and "resuming its execution" (with caps like 'resume_plugin' and similar).
  • Once the language has been decided, we should implement the same UI mechanisms for themes as well.
  • Multisite has been largely ignored for now. This will probably require specific handling of network-activated plugins.
  • We need to decide whether must-use plugins should be pausable or not. I have no clear answer for this yet.
  • In an event like a PHP update, multiple plugins might break at the same time. This would lead to a very confusing back & forth until all problematic plugins have been properly detected. We should think about how we can handle this (an AJAX loop with the paused plugins actually paused, to detect new issues?).

This ticket was mentioned in Slack in #core-php by schlessera. View the logs.


2 weeks ago

@schlessera
2 weeks ago

Second iteration, adding all the basic UI integration (for plugins only for now) [Updated]

#11 in reply to: ↑ 9 @flixos90
2 weeks ago

  • Focuses multisite added
  • Keywords has-patch needs-unit-tests needs-testing added

Replying to schlessera:

The second iteration contains basic UI integration, but currently only for plugins.

I really like the direction this approach is going, thanks for doing all that work on it!

  • Once the language has been decided, we should implement the same UI mechanisms for themes as well.
  • Multisite has been largely ignored for now. This will probably require specific handling of network-activated plugins.

I can see this project becoming really big and complex for a single patch and commit. In order to maintain a good overview as well as easy workflow, I think it would be a good idea to move work on this over to a GitHub fork of WordPress. Then we can handle the respective integrations with themes, multisite and possibly MU plugins via separate branches/PRs.

  • We need to decide whether must-use plugins should be pausable or not. I have no clear answer for this yet.

I would advise against this, for three reasons:

  • MU plugins often contain very foundational functionality of a website.
  • Someone who uses MU plugins usually has a little more technical knowledge than who the safe mode is most beneficial for.
  • There is no way to manage MU plugins through the admin. Of course we could introduce a way to resume a MU plugin, but for the two previous reasons, I don't think it's worth the hassle.
  • In an event like a PHP update, multiple plugins might break at the same time. This would lead to a very confusing back & forth until all problematic plugins have been properly detected. We should think about how we can handle this (an AJAX loop with the paused plugins actually paused, to detect new issues?).

For now, in my opinion we should try to focus on detecting in the most accurate way possible which plugin(s) or theme(s) are causing the error. An AJAX loop is what I thought about too.

A general suggestion for improvement I have is that we should consider keeping the user in the backend when they are already there while something breaks. For example, if you're in the Plugins screen, resuming a plugin, and then it's still broken, it would be great if the user could automatically stay on that page, getting the feedback message there that it was paused again (instead of seeing the message like in the frontend and having to click a link). We could try to redirect the user to the page they're currently on after pausing the broken plugin(s) and theme(s), if that is already a logged-in admin page. Of course this only works under certain circumstances, if we know who the current user is and if we're still before output. This could very well happen a little later, when we improve the UX further.

#12 @schlessera
2 weeks ago

I can see this project becoming really big and complex for a single patch and commit. In order to maintain a good overview as well as easy workflow, I think it would be a good idea to move work on this over to a GitHub fork of WordPress. Then we can handle the respective integrations with themes, multisite and possibly MU plugins via separate branches/PRs.

I'm not sure that is the case. I think right now, we cover already 60-70% of the needed code. Half of the multisite plugin stuff is already covered, and themes can mostly reuse the same code.

I would advise against this, for three reasons:

  1. MU plugins often contain very foundational functionality of a website.
  2. Someone who uses MU plugins usually has a little more technical knowledge than who the safe mode is most beneficial for.
  3. There is no way to manage MU plugins through the admin. Of course we could introduce a way to resume a MU plugin, but for the two previous reasons, I don't think it's worth the hassle.

I'm not entirely sure these are valid reasons.

  1. Yes, but a parse error means they just take down the site either way. The importance of the plugin is irrelevant in that case.
  2. Sometimes MU plugins ar not installed by the site owner themselves, but by hosting or another plugin.
  3. This is the only argument that plays a role here I think. What's the point of making people enter th eplugin screen if they have no valid options. I wonder though (not tested yet) whether they could modify the MU plugin through the built in code editor...

For now, in my opinion we should try to focus on detecting in the most accurate way possible which plugin(s) or theme(s) are causing the error. An AJAX loop is what I thought about too.

Yes, agree.

A general suggestion for improvement I have is that we should consider keeping the user in the backend when they are already there while something breaks. For example, if you're in the Plugins screen, resuming a plugin, and then it's still broken, it would be great if the user could automatically stay on that page, getting the feedback message there that it was paused again (instead of seeing the message like in the frontend and having to click a link). We could try to redirect the user to the page they're currently on after pausing the broken plugin(s) and theme(s), if that is already a logged-in admin page. Of course this only works under certain circumstances, if we know who the current user is and if we're still before output. This could very well happen a little later, when we improve the UX further.

Yes, I had already thought about adding something like that. It could be similar to the Shiny Updates plugin, where it triggers a sandboxed request to see whether the plugin now works correctly again.

@schlessera
2 weeks ago

Refactored persistence code, added error output, "paused" filter in plugin list and redirect to filtered view

#13 @schlessera
2 weeks ago

I refactored the persistence code. Recorded errors are now stored in the database. To avoid filling up the database, I added automatic pruning that get rids of any records that are not active plugins.

The recorded error message is now shown in the plugin list. Not sure whether that's a good or a bad thing, but it definitely needs better styling at the very least. I had planned to make the error display collapsible to not scare away users immediately, but decided to skip introducing JS code for now.

The plugin list is now filterable by 'paused' status, and the admin notice will bring users automatically to a filtered view of the plugin list. This helps find the problem right away on sites that have multiple pages of plugins.

https://i.imgur.com/wlZ6TB2.png

#14 @ibenic
2 weeks ago

Such a great feature! Thank you for working on this.

What would happen if such an error comes from a function/method hooked on an action/filter in the admin area?

My guess would be the first run would cause WOSD and a page refresh would then get the user back in the admin area (unless another plugin causes WOSD again in the admin area). I have patched my local install and will do some runs tomorrow since it's already 1:30am so I might miss something :)

This ticket was mentioned in Slack in #core-php by schlessera. View the logs.


2 weeks ago

#16 @schlessera
13 days ago

@ibenic Parse errors happen when the PHP file is parsed so that the functions and methods can be declared.

For an action/filter to work, the function/method it uses as a callback needs to be declared before the action/filter is triggered, and this is usually done right when the plugin is being loaded (but not always). So parse errors should in almost all cases be detected during the plugin loading stage.

We're not currently doing anything with runtime errors. The main issue with runtime errors is that they are neither reliably broken (it could just have been a temporary hiccup) nor can they be properly traced back to their root cause (as the bug is often contained in one plugin, but another plugin then trips over the previous plugin's bad data and throws a fatal error).

We'll have to discuss in more detail what types of errors we want to catch and how we want to handle them. For now, I opted to go with the safest route to test the mechanism.

#17 @MKSnMKS
12 days ago

Hi, Any thoughts on my thoughts, or are they not worth commenting on? They are not elegant, nor too deep in the system, but they are probably easier to get going for now. The central process is done manually by many users, and is practically the only way it done (that I have found - described in many places by numerous authors, on all sorts of websites).

The diagnosis method is a) swap in/swap out, till it works, so the 'problem area has been located' rather than b) system level fault reporting to find the 'cause'.

The rest of it is the user assisted/automated processes. At any time in future, the method of diagnosis can be changed, or combined, with this process.

This ticket was mentioned in Slack in #core-php by flixos90. View the logs.


6 days ago

@schlessera
6 days ago

Avoid array derefencing for PHP 5.2 compat

@flixos90
6 days ago

#19 @flixos90
6 days ago

44458.diff is an update based on 44458-4.diff that fixes two issues:

  • In wp_record_extension_error() support the plugin directory constants and registered theme directories correctly, as we can't rely on them having the default mu-plugins, plugins and themes names.
  • In wp_forget_extension_error() fix a bug where an extension would never be resumed because the option itself would not be changed.
Note: See TracTickets for help on using tickets.