WordPress.org

Make WordPress Core

Opened 9 years ago

Closed 2 weeks ago

Last modified 4 hours ago

#19879 closed task (blessed) (fixed)

Better caching for get_dirsize

Reported by: batmoo Owned by: helen
Milestone: 5.6 Priority: normal
Severity: normal Version: 3.3.1
Component: Filesystem API Keywords: has-patch needs-dev-note has-unit-tests commit
Focuses: multisite, performance Cc:

Description

In a multisite install, when trying to determine whether a site has exceeded its storage quota, WordPress will scan through a blog's upload directory and sum up the file sizes, by running filesize against each one. With a large number of files, this can significantly slow down the upload process or certain portions of the Dashboard.

get_dirsize has transient caching in place but this is a single cache entry for all folders. It might be better if WordPress has a separate cache entry for each folder and was invalidated based on context so that get_dirsize does not need to be run constantly on older, unchanged directories as frequently.

Attachments (9)

19879.diff (5.0 KB) - added by A5hleyRich 5 years ago.
19879_WordPress_56.patch (5.6 KB) - added by janthiel 2 months ago.
Updated patch against WordPress 5.5 master
19879_WordPress_56-AddedFilterForDirSizeCalculation.patch (7.2 KB) - added by janthiel 2 months ago.
Added new filter to allow for custom dir size calculation without losing the new caching
19879_WordPress_56-AddedFilterForDirSizeCalculation.2.patch (7.2 KB) - added by janthiel 2 months ago.
New filter comment wrongly stated to expect the dir size in megabytes and not in bytes
19879_Better-dirsize-caching-and-invalidation.patch (7.5 KB) - added by janthiel 7 weeks ago.
"Final" patch proposition; Renamed cache invalidation method and added cache invalidation to wp_upload_bits which did not invalidate the cache before acording to the Unit Tests
19879_Unit_Tests.patch (9.6 KB) - added by janthiel 7 weeks ago.
Unit Tests for the new dirsize caching
19879_Unit_Tests_comment-fix.patch (9.6 KB) - added by janthiel 7 weeks ago.
Updated function comment from copied test case to what the function really does. Missed that in the first place.
19879.final.patch (7.6 KB) - added by janthiel 7 weeks ago.
PHPCS Fixes
19879_Unit_Tests.final.patch (9.6 KB) - added by janthiel 7 weeks ago.
PHPCS Fixes

Download all attachments as: .zip

Change History (58)

#1 @DJPaul
9 years ago

  • Cc djpaul@… added

#2 @danielbachhuber
9 years ago

  • Cc wordpress@… added

#3 @DJPaul
9 years ago

We worked around this bug by making sure that upload_space_check_disabled was set; this resulted in an immediate speed-up.

#4 in reply to: ↑ description @kurtpayne
9 years ago

  • Cc kpayne@… added
  • Type changed from defect (bug) to enhancement

My first thought would be to change recurse_dirsize() like so:

} elseif (is_dir($path)) {
	$handlesize = recurse_dirsize( $path );
	if ($handlesize > 0)
		$size += $handlesize;
}

to:

} elseif (is_dir($path)) {
	$handlesize = get_dirsize( $path );
	if ($handlesize > 0)
		$size += $handlesize;
}

This change would check the cache as it's doing the recursive iteration. This is very optimistic, though. Cache misses would result in a database SELECT and UPDATE/INSERT. For most sites, this change would probably slow things down. Further, this could result in a lot of extra rows in the db that wouldn't serve a useful purpose.

#5 @jamescollins
9 years ago

See http://mu.trac.wordpress.org/ticket/1175 for the origins of the get_dirsize() caching.

#6 @nacin
9 years ago

In [19747]:

Remove copy-pasted comment from upload_is_user_over_quota()'s phpdoc. It came from recurse_dirsize(). see #19879.

#8 follow-ups: @jeremyfelt
7 years ago

  • Keywords 2nd-opinion close added

I think (???) this is a non issue now.

In [21387], get_space_used() was introduced to allow for an alternate method of checking used space to work via filter. I believe this takes care of the original issue.

When that was introduced, it started passing the directory to check via wp_upload_dir() rather than BLOGUPLOADDIR, which theoretically should mean nice things for checking directory sizes in smaller pieces.

Most likely worksforme, but further input is welcome.

#9 in reply to: ↑ 8 @SergeyBiryukov
7 years ago

Replying to jeremyfelt:

When that was introduced, it started passing the directory to check via wp_upload_dir() rather than BLOGUPLOADDIR

Related: [21823] (first I tried to find that in the original changeset, [21387]).

#10 in reply to: ↑ 8 @nacin
7 years ago

Replying to jeremyfelt:

In [21387], get_space_used() was introduced to allow for an alternate method of checking used space to work via filter. I believe this takes care of the original issue.

While it allows someone to do their own caching, this is still a bit lame.

One thing we could do would be to remove the timeout. The dirsize_cache transient is invalidated in wp_handle_upload() (though not wp_handle_sideload()) and wp_delete_attachment(). If we're confident in this, we should be able to use the transient as long as it exists.

It doesn't help with repeated invalidations followed by repeated recursive directory searches, but it'll at least lessen the number of times the directory is recursed for no reason at all.

When that was introduced, it started passing the directory to check via wp_upload_dir() rather than BLOGUPLOADDIR, which theoretically should mean nice things for checking directory sizes in smaller pieces.

BLOGUPLOADDIR contained a path to a site's individual /files/ folder, rather than all sites. So no change here.

#11 @wpdavis
7 years ago

Related: #26135

#12 @jeremyfelt
7 years ago

  • Keywords close removed
  • Milestone changed from Awaiting Review to Future Release

We should see what we can do with this in 3.9.

#13 @jeremyfelt
7 years ago

  • Component changed from Multisite to Filesystem
  • Focuses multisite added

#14 @jeremyfelt
7 years ago

  • Focuses performance added

#15 @chriscct7
5 years ago

  • Keywords needs-patch added; 2nd-opinion removed

This ticket was mentioned in Slack in #feature-respimg by jeremyfelt. View the logs.


5 years ago

@A5hleyRich
5 years ago

#17 @A5hleyRich
5 years ago

Supplied patch removes transient timeout as suggested by @nacin. However, instead of purging the entire cache the dirsize_cache transient now caches each directory like so:

<?php
Array
(
    [wp-content/uploads/2013/09] => 2082994
    [wp-content/uploads/2013] => 2082994
    [wp-content/uploads/2014/01] => 2082994
    [wp-content/uploads/2014] => 2082994
    [wp-content/uploads/2015/04] => 486873
    [wp-content/uploads/2015/10] => 2919469
    [wp-content/uploads/2015/11] => 34562447
    [wp-content/uploads/2015] => 37968789
    [wp-content/uploads] => 42142973
)

When wp_handle_upload() or wp_delete_attachment() are called only the directories that are effected are purged. For example if we upload wp-content/uploads/2015/11/test.png the _transient_dirsize_cache becomes:

<?php
Array
(
    [wp-content/uploads/2013/09] => 2082994
    [wp-content/uploads/2013] => 2082994
    [wp-content/uploads/2014/01] => 2082994
    [wp-content/uploads/2014] => 2082994
    [wp-content/uploads/2015/04] => 486873
    [wp-content/uploads/2015/10] => 2919469
)

This ensures that only the changed directories are checked when get_dirsize is called. Still needs unit tests, but wanted to get feedback before doing any more work on this patch.

#18 @A5hleyRich
5 years ago

  • Keywords has-patch dev-feedback needs-unit-tests added; needs-patch removed

This ticket was mentioned in Slack in #core-multisite by jeremyfelt. View the logs.


4 years ago

#20 @janthiel
2 months ago

Just wanted to push the importance of this feature.

We already replaced the default WordPress PHP calculation with OS native features ("du" on Linux) which reduced the amount of time massively last year. Sadly now we moved our uploads to NFS (aka "a slow filesystem") and the upload folder grew even more over time. So we are back to quota calculation times of >30s.
As some sites see regular uploads which resets the transient frequently throughout the day, the impact of this is massive to the WP-Admin backend, as well as to REST API calls to the /media/ endpoint.
Calculating this on a per month base and only on a "as needed" base is a perfect way to address this issue permanently! Then the calculation times should go down instantly independet of filesystem and way to measure the space used (OS tools or default PHP).

So this is a definite +1 for approaching this soon.

Anything we can do to support?

This ticket was mentioned in Slack in #core-multisite by janthiel. View the logs.


2 months ago

@janthiel
2 months ago

Updated patch against WordPress 5.5 master

#22 @janthiel
2 months ago

Just FYI on how much this has an impact on the file upload:

Currently on a huge site on NFS with 25G of files the Quota processing would take >30 seconds(!) to calculate the whole site. This time is required for each upload to finish besides the actual upload processing.
Calculating only the one required folder would reduce this to <1s.

This time also adds on top to wp-admin dashboard calls if there is no cache present.

Here are some basic stats using Unix file system tools. The actual PHP quota calculation will be even worse.

-bash-4.2$ cd wp-content/uploads/sites/25/
-bash-4.2$ time du -sh ./
25G ./

real 0m31.757s
user 0m0.522s
sys 0m7.391s

-bash-4.2$ cd wp-content/uploads/sites/25/2020/09/
-bash-4.2$ time du -sh ./
7,9G ./

real 0m0.898s
user 0m0.068s
sys 0m0.250s

Last edited 2 months ago by janthiel (previous) (diff)

This ticket was mentioned in Slack in #core by thelmachido. View the logs.


2 months ago

#24 @helen
2 months ago

  • Milestone set to 5.6

Milestoning for visibility, seems reasonable at first glance but not at all my area of expertise.

@janthiel
2 months ago

Added new filter to allow for custom dir size calculation without losing the new caching

#25 @janthiel
2 months ago

I just added another patch file. It includes one additional filter which will alow one to change the calculation of the dirsize without losing the new caching benefits.

Currently one can only use the pre_get_space_used filter which then required to either duplicate large parts of the actual get_dirsize and recurse_dirsize functions or lose the benefits of the new caching.
The new filter allows for custom code to run per directory and the result will fill the dirsize_cache. Yet the previous existing filter pre_get_space_used can still be used to implement a complete own space calculation which might or might not rely on the dirsize_cache. So there will be complete backward compatibility for those who implemented their own logic.

Particular use cases for the new filter are:

  • Calling CDN APIs to get the actual usage of the requested directory
  • Implement custom logic benefiting from deep integration into native operating system tools for much faster calculations (like du on unix)

@janthiel
2 months ago

New filter comment wrongly stated to expect the dir size in megabytes and not in bytes

#26 @hellofromTonya
8 weeks ago

  • Keywords needs-dev-note added

Mention in the Misc Dev Note for the new filter.

@janthiel
7 weeks ago

"Final" patch proposition; Renamed cache invalidation method and added cache invalidation to wp_upload_bits which did not invalidate the cache before acording to the Unit Tests

@janthiel
7 weeks ago

Unit Tests for the new dirsize caching

#27 @janthiel
7 weeks ago

  • Keywords has-unit-tests added; needs-unit-tests removed

@janthiel
7 weeks ago

Updated function comment from copied test case to what the function really does. Missed that in the first place.

This ticket was mentioned in Slack in #core-multisite by hellofromtonya. View the logs.


7 weeks ago

#29 @janthiel
7 weeks ago

This code should be backward compatible without breaking anything.

Still there are some plugins which might be affected by this. Because up until now the only way to implement any way of custom space calculation was using the pre_get_space_used filter.
This forced one to re-implement the dirsize caching oneself which will not be compatible with the new code. If there is a custom implementation around the pre_get_space_used filter working with the dirsize_cache transient, this patch won't break anything but the new optimized caching would not kick in. So the worst case is that there is no performance gain. This should be well documented and communicated so that plugin authors can adapt.

If a plugin does anything around get_dirsize, recurse_dirsize the pre_get_space_used filter or the dirsize_cache transient, it might be affected and requires changes to fully leverage this optimizations. Those plugin authors should consider moving to the new calculate_current_dirsize filter instead.

I reached out to some of the Media Offloading Plugin vendors and asked for some feedback based on their codebase.

#30 @ianmjones
7 weeks ago

@janthiel, thanks for getting in touch.

WP Offload Media does use the pre_get_space_used filter. It queries its own data to get size information for files that have been offloaded and removed from the server, and then adds that to the total retrieved from a call to get_dirsize for the uploads basedir.​​​​

From the sounds of things that would not be broken by this proposed change?

Would it however cause performance issues?

#31 @janthiel
7 weeks ago

@ianmjones Thanks for getting back to this so quickly :-)

From what you describe, this change will only affect you in all the positive ways it should. The call to get_dirsize will not change in any way. But the caching under the hood of it will be much more scalable for growing upload folders. So as long as you don't handle the dirsize_cache by yourself, everything will continue without any change required on your end.

Performancewise: On "new" or small sites you will most probably not notice anything at all. But for larger ones or those running on NFS you should notice the increased performance as well.

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


7 weeks ago

#34 @helen
7 weeks ago

I mirrored the two latest patches onto GitHub to allow the tests to run: https://github.com/WordPress/wordpress-develop/pull/608

You'll find some PHPCS errors in there :) the tests seem to pass though. I'd like to do a little more asking around about people who use these filters but otherwise I think we can attempt this for the 5.6 beta.

@janthiel
7 weeks ago

PHPCS Fixes

@janthiel
7 weeks ago

PHPCS Fixes

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


6 weeks ago

#36 @hellofromTonya
6 weeks ago

  • Keywords needs-testing added

Notes from today's scrub:

Per @helen:

I did not get any feedback from anybody that they’re actually interacting with any hooks in the area, so it’s probably a matter of committing and seeing if anything breaks at this point…

Per @SergeyBiryukov:

I'm not sure normalize_dirsize_cache_path() is needed for just two instances, I would look into using wp_normalize_path() and getting rid of DIRECTORY_SEPARATOR, as we don't use that anywhere else for similar actions. I can take a closer look and test on Windows.

#37 @helen
6 weeks ago

  • Owner set to helen
  • Resolution set to fixed
  • Status changed from new to closed

In 49212:

Multisite: More specific caching for get_dirsize.

Instead of one cache entry for all upload folders for a site on multisite, this now caches for each folder and invalidates that cache based on context. In multisite, this should speed up get_dirsize calls since older directories that are much less likely to change will no longer have the size recalculated.

Props janthiel, A5hleyRich, batmoo.
Fixes #19879.

#39 @SergeyBiryukov
6 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Type changed from enhancement to task (blessed)

Reopening for some documentation fixes and testing.

  1. I believe this should be renamed to clean_dirsize_cache(), for consistency with other functions of a similar purpose:
    • clean_bookmark_cache()
    • clean_category_cache()
    • clean_comment_cache()
    • clean_site_details_cache()
    • clean_network_cache()
    • clean_blog_cache()
    • clean_post_cache()
    • clean_attachment_cache()
    • clean_object_term_cache()
    • clean_term_cache()
    • clean_taxonomy_cache()
    • clean_user_cache()
  1. As noted above, I'm not sure normalize_dirsize_cache_path() is needed for just two instances, seems a bit too specific for just two lines of code.
  2. tests/multisite/dirsizeCache.php file name should match the name of the function being tested.
Last edited 6 weeks ago by SergeyBiryukov (previous) (diff)

#41 @helen
4 weeks ago

@SergeyBiryukov I agree with all of the above, there are other instances of us doing the same PHP that normalize_dirsize_cache_path() is doing anyway so it doesn't make sense to name it for some usage but not all. Opened https://github.com/WordPress/wordpress-develop/pull/686 for review on those changes and to make sure tests pass.

This ticket was mentioned in Slack in #core by daisyo. View the logs.


4 weeks ago

#43 @hellofromTonya
4 weeks ago

  • Keywords commit added; needs-testing dev-feedback removed

#44 @helen
2 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 49616:

Multisite: More consistency for clean_dirsize_cache().

Props SergeyBiryukov.
Fixes #19879.

#46 @SergeyBiryukov
2 weeks ago

In 49628:

Docs: Adjust comments for recurse_dirsize() and related tests per the documentation standards.

Follow-up to [49212], [49616].

See #19879.

#47 @SergeyBiryukov
2 weeks ago

In 49629:

Multisite: Rename the calculate_current_dirsize filter to pre_recurse_dirsize.

Set the default value to false. This brings some consistency with the pre_get_space_used filter.

Follow-up to [49212], [49616], [49628].

See #19879.

#48 @SergeyBiryukov
2 weeks ago

In 49630:

Docs: Further remove tautology from comments in recurse_dirsize() tests.

Follow-up to [49212], [49616], [49628].

See #19879.

#49 @iandunn
4 hours ago

It looks like the difference in transient structure can sometimes cause fatals on PHP 8. #51913 has details. Please add thoughts if you have any.

Note: See TracTickets for help on using tickets.