Make WordPress Core

Opened 3 months ago

Last modified 3 months ago

#62176 new defect (bug)

No stable error detection after Automatic Updates with more than one update

Reported by: georgwordpress's profile georgwordpress Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 6.6
Component: Upgrade/Install Keywords: 2nd-opinion has-patch
Focuses: Cc:

Description

While testing rollback of buggy plugins updates I found some times the buggy plugin was not detected.
But even with the same test setup this happens only sometimes.

In class-wp-automatic-updater.php a risk of a racing condition is already noticed as a comment.
To lower the risk a sleep(2) was introduced as a quick fix.

In another discussion @stoyangeorgiev added a comment:
"... another thing we found is that when you update two plugins that have fatal errors, It may be possible that the checks for enabling/disabling maintenance mode in load.php may not be properly getting the state of the maintenance mode. I've managed to replicate it on my setup ... I'll try to test it more so I can find a stable way to replicate it."

After digging deep into the related functions I think I have found the problem.

The detection of errors after automatic plugin updates is done with a loopback request to the homepage.
While a user request is still answered with the maintenance screen this loopback request must bypass the maintenance mode to receive his dedicated scraping result produced by another function.

There are two very ancient WP functions related to the wp maintenance mode:

1.) the function maintenance_mode()
Was introduced in WP over 15 years ago.
This function creates a new PHP script every time it is called, which is saved as the well-known .maintenance file.

Here the greatly simplifies file operations of this function:

@unlink( '.maintenance' );
$fp = @fopen( '.maintenance', 'wb' );
fwrite( $fp, '<?php $upgrading = ' . time() . '; ?>' );
fclose( $fp );

Nothing complicated - but there is a racing condition:
If in parallel another loading process accesses the maintencance file exactly between fopen and fwrite, the file is still empty.

This hasn't been a problem for the last 15 years because the other maintenance function has been very fault tolerant.

2.) the function wp_is_maintenance_mode()

global $upgrading;
if (file_exists(ABSPATH . '.maintenance') ) require ABSPATH . '.maintenance';

At this point, it can happen that the maintenance file already exists but does not yet contain any PHP code - so the value of $upgrading is not updated.
But thanks to $upgrading has already been declared as global the next line of code is running without any problems.

if ( ( time() - $upgrading ) >= 10 * MINUTE_IN_SECONDS ) return false;

Even with $upgrading beeing NULL when this edge case happens - no php errors - simply return false.

As result: this edge request at the beginning of a maintence phase is not answered with the maintenance screen.

Does it matter?
No - because putting wp into maintenance mode is only a preparation before any changes and updates will really take place. For more than 15 years nobody complains.

What has changed?
Some times ago the automatic updates of core/themes etc was introduced.
In the of loop of Automatic Updates the function maintenance_mode is called a lot of times now.
(Before it enters the plugins section the function maintenance_mode is called even twice in every round.)

And as mentioned above with every call of maintenance_mode the sequence of @unlink/@fopen/fwrite/fclose is deleting and writing the maintenance file.

So the risk becomes higher that another request could hit the gap betweens fopen and fwrite.

But still no problems.

Until in WP 6.6 the Rollback part 3 comes with three related changes:

1.) in the old function wp_is_maintenance_mode() a new, additional condition check was introduced.
The current value of $upgrading is now crucial in this new check.

A new function has_fatal_error was introduced with
2.) require ABSPATH . '.maintenance', so the value of $upgrading as global was now updated at a second place - and during an update loop every time when a plugin was updated.

3.) the value of $upgrading was used as part of the loopback request

So - in the first round of an update the loopback request will bypass the maintenance screen in every case:

  • the value of $upgrading used in the loopback request is the same as expected in the new maintenance bypass condition.
  • Even if the edge case happens the 15 Year old bail out is ensuring the maintenance bypass - so the new condition is not checked in this edge case.

But after the first round of an update under some racing conditions it could happen that the value of $upgrading used in the loopback request does not match the new maintenance bypass check.
Due to $upgrading to be declared global it is possible, that after the first round $upgrading has already a value, even when require is loading a empty maintenance file.

As a result at the long end the function has_fatal_error returns false ==> no errors found - without making a full assessment of the loopback request.

In this PR the use of $upgrading in the new functions was removed/replaced.
And the sleep(2) was also removed.

@afragen, @costdev, @stoyangeorgiev
Do you think I'm on the right path with this PR?

Follow-up to https://core.trac.wordpress.org/changeset/58128
See https://core.trac.wordpress.org/ticket/58281

Change History (12)

#1 @afragen
3 months ago

Is the site you’re testing on a clean installation with only a failing plugin during update?

If not could you provide more specifics about your testing set up?

#2 @afragen
3 months ago

Including the relevant sections of the debug.log would be very helpful.

#4 @costdev
3 months ago

  • Keywords reporter-feedback added

Hi @georgwordpress, thanks for opening this ticket and for your investigations!

A few things I'd recommend before we consider the PR:

  1. This should be tested on a clean installation, with the exception of at least two plugins that will fatal on update. Can you confirm if this is the case, or can you re-test with a clean installation?
  2. The automatic updater should run as normal. It should not be triggered manually or through any other trigger.
  3. We'll need to get full environment information to ensure we track similarities between environments that can reproduce this issue.
  4. We need to see a log of the whole process that demonstrates which plugin is being updated at that time, what its expected $upgrading value should be and what the actual $upgrading value is. This will allow us to see this mismatch in the full context of an update run for multiple plugins.

Update processes can be triggered manually/via other triggers, and sites may not be clean installations. At this stage of testing, reducing all this to the base functionality helps us determine what causes this to occur when thousands of tests prior to the feature's release didn't produce this. The issue is of course happening, so now we need to identify the specific cause(s). It may be certain environmental conditions, testing conditions, or something else that explains why pre-release testing was fine, but this post-release testing is not.

Last edited 3 months ago by costdev (previous) (diff)

#5 @georgwordpress
3 months ago

Hi @costdev
Thanks for the quick reply!

My journey started a while ago.
I'm a big fan of the automatic updates because of the ever faster exploitation of vulnerabilities.
If the risk of activating faulty updates can be minimized by automatic rollback - all the better.

Yes - I fell into all sorts of traps at the beginning that make it difficult to reproduce the problem:
Special features of an existing installation, manual triggering of the cron job etc.

Why the latter can make a difference is partly due to the cookies sent by the loopback.
Depending on how I start the cron job, my admin cookies may be sent along with the loopback.
Then the wp load runs a little differently and the homescreen also looks a little different than with a loopback request as an anonymous user.
I have therefore also ensured in the PR that the loopback request is now always carried out as an anonymous user.

But that's just a side note.

And to avoid any misunderstandings:
after my research in the old tickets, I was able to understand how much effort was put into testing by everyone involved! Because I've done a lot of testing myself in the meantime, I can understand this very well.

@costdev
And I especially have a lot of respect for Your efforts!

There is no question that everything was tested very carefully before commissioning!

But it is an edge case that rarely occurs, especially because the current sleep(2) has really reduced the probability.

I will report later on exactly how I am currently testing.

But before anyone gets nervous, you should sit back and realize what this is all about.

The normal operation of a WP installation is not at risk!
Even with automatic updates of core and plugins, I have had very good experiences in recent years.
With careful selection of quality-assured plugins, the release of a plugin with fatal errors is very, very unlikely.

Part 3 of the project is only about further minimizing the operational risk by automatically detecting a buggy plugin update and preventing the update.

The current detection cannot detect all errors anyway.
Plugin errors that only occur when logging into the admin backend cannot be detected by the loopback request either - are therefore outside the scope of part 3.

But I hope you can see from my efforts that I am a big fan of the automatic rollback project!

#6 @costdev
3 months ago

Hey @georgwordpress!

I realise that my previous comment may have led to a misunderstanding! By mentioning the thousands of tests that were done prior to release, I just meant that this shows how even with many different environments and configurations tested, issues can still occur and that's why environment and testing information is so important to help identify potential causes. For example, during the Rollback project, we saw issues with VirtualBox that we could only identify once we realised that VirtualBox was the common denominator in the bug reports. I just want to clarify that I didn't mean to suggest anything negative!

I really appreciate your efforts so far and look forward to your report about your testing approach and your environment information! 🙂

Last edited 3 months ago by costdev (previous) (diff)

#7 @georgwordpress
3 months ago

  • Keywords reporter-feedback removed

Environment

  • WordPress: 6.6.2
  • PHP: 8.3.8-nmm1
  • Server: Apache
  • Database: mysqli (Server: 10.5.26-MariaDB-ubu2004-log / Client: mysqlnd 8.3.8-nmm1)
  • Browser: Chrome 129.0.0.0 (Windows 10/11)
  • Theme: Twenty Twenty-Four 1.2
  • MU-Plugins: Keine aktiviert
  • Plugins:
    • AAA Plugin Update A 1.1.32
    • AAA Plugin Update B 1.2.32
    • Cron Jobs 1.2.10
    • WooCommerce 9.3.3
    • WordPress Beta Tester 3.5.6

remarks:
Plugin Beta Tester is only activated for scaping above environment
shared hosting at all-inkl

Preparation:
Fresh installation of wordpress
wp-config.php:

define( 'WP_DEBUG', true);
define( 'WP_DEBUG_LOG', true );

update translations
install plugins see above
AAA Plugin Update A/B are selfhosted plugins without further functionalities,
only a unknown functions is used to throw a fatal error,
at first installation this line of code is commented out

After everthing beeing stable a backup of all files is made.
This Backup is the starting point for testing.

Steps for testing

Positive test, no buggy plugins

  1. editing header of installed plugins to lower the version number
  2. Browse to /wp-admin/plugins.php to check wheather available updates is recognized
  3. use a cron plugin to edited the next start time of wp_version_check to 2 Minutes in future
  4. logout
  5. use another browser to request the homepage again and again to trigger the cron jobs.

==> Expected result: all plugins updated, no error in log

Negative tests, with buggy plugins
On my plugin update server I change the plugin zip packages to a version which will throwing a fatal error after update
Then the test steps as described above.

==> Expected result: detection of buggy plugins, rollback of buggy plugins

#8 @georgwordpress
3 months ago

I have just realized that there may be a misunderstanding.
I have not been able to reliably repeat the error either.
If my description above is correct, that will also be impossible because it really is pure coincidence by luck.

When I came across hints in the logs while testing that using $upgrading can lead to problems, I developed this PR.

My approach:
if no problems occur with this PR despite removing sleep(2), then this is an indication that my description above is correct.

But I just thought of another approach to show reproducibly that the error detection of the function has_fatal_error is not always correct (without the PR).

If I edit the function maintenance_mode
(in /wp-admin/includes/class-wp-upgrader.php)
and comment out the line with the update of the variable:

$maintenance_string = '<?php //$upgrading = ' . time() . '; ?>';

then all plugin updates are carried out without any problems.
But in a negative test with a buggy plugin this is not deteced.
Here for example with plugin A beeing buggy.

[08-Oct-2024 05:27:40 UTC] Automatic updates starting...
[08-Oct-2024 05:27:40 UTC] Automatic plugin updates starting...
[08-Oct-2024 05:27:40 UTC] Upgrading plugin 'gri_plugin_update_A'...
[08-Oct-2024 05:27:41 UTC] Plugin 'gri_plugin_update_A' has been upgraded.
[08-Oct-2024 05:27:41 UTC] PHP Fatal error: Uncaught Error:
Call to undefined function noSuchFunctionThrewError()
in .../wp-content/plugins/gri_plugin_update_A/gri_plugin_update_A.php:221
Stack trace:
...
  thrown in ../wp-content/plugins/gri_plugin_update_A/gri_plugin_update_A.php on line 221
[08-Oct-2024 05:27:44 UTC] The update for 'gri_plugin_update_A' has no fatal errors.
[08-Oct-2024 05:27:44 UTC] Upgrading plugin 'gri_plugin_update_B'...
[08-Oct-2024 05:27:44 UTC] Plugin 'gri_plugin_update_B' has been upgraded.
[08-Oct-2024 05:27:47 UTC] The update for 'gri_plugin_update_B' has no fatal errors.
[08-Oct-2024 05:27:47 UTC] Automatic plugin updates complete.
[08-Oct-2024 05:27:47 UTC] Automatic updates complete.

#9 @georgwordpress
3 months ago

@costdev
not sure if you will receive a notification of my updates without beeing mentioned 😊

#10 @afragen
3 months ago

@georgwordpress I see your reports testing your PR. What we would like to see is the demonstration of the error in WP 6.6.x that you are trying to correct.

Is there some reason WooCommerce needs to be present/active?

Let’s try to confirm the issue/error first and then work on the solution.

#11 @georgwordpress
3 months ago

@afragen
Did you notice my comment?
https://core.trac.wordpress.org/ticket/62176#comment:8

WooCommerce:
No needs - I had taken a big plugin with many files during update operations.

Any known problems with WooCommerce related to automatic updates?

#12 @georgwordpress
3 months ago

@costdev 

"thousands of tests prior to the feature's release"


@afragen
I can understand that nobody wants to make major changes to the existing code without good reason. After all, it was a very long way until part 3 went live.

Testing autoupdates/rollback can be terribly time-consuming and frustrating.

Hence my initial question in this ticket to the developer:
Is the result of my code review and therefore a need for action understandable?

Maybe my PR is too big a leap?

Then maybe a smaller step:
We need at least log entries when the current function does not work as expected.

In some places it simply reports back that no error was found, even though the real error check did not take place.

For example, right at the beginning there are some bailouts with return false ==> no error founds.
Here I would at least expect a warning in the error log that the error detection cannot be carried out as expected.

The current code:

    protected function has_fatal_error() {
        global $upgrading;

        $maintenance_file = ABSPATH . '.maintenance';
        if ( ! file_exists( $maintenance_file ) ) {
            return false;
        }

        require $maintenance_file;
        if ( ! is_int( $upgrading ) ) {
            return false;
        }
        ....

My suggestion therefore:
1.) We don't change anything in the current processes.
2.) But we only introduce a few additional log entries in the function has_fatal_error if something does not work as intended in the concept when checking for errors.

@afragen, @costdev 
Do you support this approach?

Note: See TracTickets for help on using tickets.