Make WordPress Core

Opened 10 months ago

Closed 4 months ago

Last modified 2 months ago

#58919 closed enhancement (fixed)

Add a hook to set_cached_mo_files() to allow flexible caching strategies for globbing *.mo files

Reported by: mreishus's profile mreishus Owned by: swissspidy's profile swissspidy
Milestone: 6.5 Priority: normal
Severity: normal Version:
Component: I18N Keywords: has-patch has-unit-tests has-dev-note
Focuses: Cc:

Description

WordPress uses MO files to manage translations in plugins, themes, and core itself. One of the methods in the core class WP_Textdomain_Registry, set_cached_mo_files(), uses the glob() function to retrieve all MO files in a specific directory. On sites with a large number of number of language files, the glob() operation can become expensive if run on a directory with thousands or tens of thousands of files.

Currently, the set_cached_mo_files() method doesn't allow for a persistent caching strategy. The solution proposed here adds a hook in this method that would allow hosting providers to come up with their own caching strategy for the glob( $path . '/*.mo' ) call. This would allow for improving overall performance in scenarios where there are a large number of language files.

This change is about providing flexibility for different setups and does not suggest a specific caching implementation. We've used wp_cache_set/wp_cache_get, but it doesn't yet feel viable for core because of its lack of invalidation mechanism. Other hosting providers could leverage different strategies that best suit their environments.

Here's a potential approach:

        private function set_cached_mo_files( $path ) {
                $this->cached_mo_files[ $path ] = array();

                /**
                 * Filters the .mo files retrieved from a specified path.
                 *
                 * This filter allows you to change the way .mo files are fetched from a given path.
                 * This can be useful in situations where the directory contains a large number of files
                 * and the default glob() function becomes expensive in terms of performance.
                 *
                 * @since TBD
                 *
                 * @param null|array $mo_files Null by default, allowing the usual glob() function to run. If a non-null value
                 *                             is returned by the filter, this value is used instead of running glob().
                 *                             Should be an array of .mo files if not null.
                 * @param string $path The path from which .mo files are being fetched.
                **/
                $mo_files = apply_filters( 'get_mo_files_from_path', null, $path );
                if ( null === $mo_files ) {
                        $mo_files = glob( $path . '/*.mo' );
                }

                if ( $mo_files ) {
                        $this->cached_mo_files[ $path ] = $mo_files;
                }
        }

With this implementation, a filter get_mo_files_from_path is added, allowing a return of null to fall back to the current behavior, or an array of MO files from an alternative cache or storage system.

In our testing, with a directory containing around 12,000 files, we've found this modification to provide a noticeable improvement in response times. The glob takes 5-10ms, even in a production environment. Fetching the same data from from APCu can take ~0.25ms. While the exact performance benefit may vary based on the specifics of the caching mechanism and the number of files involved, the added flexibility allows WordPress to scale better in larger setups.

Attachments (2)

add_get_mo_files_from_path_1.diff (1.4 KB) - added by mreishus 10 months ago.
add_get_mo_files_from_path_2.diff (5.0 KB) - added by mreishus 9 months ago.

Download all attachments as: .zip

Change History (25)

#1 @mreishus
10 months ago

Here is an example use of this hook:

function wpcom_get_mo_files_from_path( $mo_files, $path ) {
        // We're only concerned with this one directory.
        if ( ABSPATH . 'wp-content/languages/plugins' !== $path ) {
                return null;
        }

        $cache_group = 'lang-files';
        $cache_key   = 'mo_files_' . md5( $path );
        $cached      = wp_cache_get( $cache_key, $cache_group );

        if ( false !== $cached ) {
                return $cached;
        }

        $mo_files = glob( $path . '/*.mo' );
        wp_cache_set( $cache_key, $mo_files, $cache_group, 300 );
        return $mo_files;
}

add_filter( 'get_mo_files_from_path', 'wpcom_get_mo_files_from_path', 10, 2 );

I didn't include it because it has no invalidation, and it assumes a persistent cache is set up. However, I thought it might be useful to show a potential way the hook could be used.

#2 @swissspidy
10 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to Future Release

Thanks for your suggestion!

We'd need to slightly change the filter name & usage to match existing patterns in core, but otherwise it looks reasonable.

We could also make the caching built-in and simply revalidate the cache whenever translations get installed.

Also, get_available_languages might need a similar change, though there are usually much less files at play for core translations.

#3 @mreishus
10 months ago

Thank you for the feedback:

  1. I agree with aligning the filter name and its usage with existing core practices. Could you provide some guidance on a better-suited name and any changes in usage that would more closely follow core patterns?
  1. I like the idea of built-in caching with invalidation whenever translations get installed. Would the 'upgrader_process_complete' hook be the place for this? Also, I'm curious if we should include this in the current ticket's scope or create a separate one?
  1. Regarding get_available_languages, my exploration revealed fewer files in play for core translations compared to set_cached_mo_files(). The impact was less significant, but it might still warrant some attention.

In the WordPress.com context, here's what I've found:

  • set_cached_mo_files() on ./wp-content/languages/plugins examines ~12000 files and takes ~7-10ms.
  • get_available_languages() on ./wp-content/languages examines ~600 files and takes ~0.75-1ms.

One note is the effectiveness of the caching system could play into the benefits realized here.

#4 @swissspidy
9 months ago

Could you provide some guidance on a better-suited name and any changes in usage that would more closely follow core patterns?

Typically in core we would do something like this:

/**
 * Filters the .mo files retrieved from a specified path before the actual lookup.
 *
 * Returning a non-null value from the filter will effectively short-circuit
 * the MO files lookup, returning that value instead.
 *
 * This can be useful in situations where the directory contains a large number of files
 * and the default glob() function becomes expensive in terms of performance.
 *
 * @since 6.4.0
 *
 * @param null|array $mo_files List of .mo files. Default null.
 * @param string $path The path from which .mo files are being fetched.
 **/
$mo_files = apply_filters( 'pre_set_cached_mo_files', null, $path );
if ( null !== $mo_files ) {
        $this->cached_mo_files[ $path ] = $mo_files;
        return;
}

Sometimes accompanied by a second filter to allow modifying the list after the regular lookup, but I don't think that's necessary here.

I like the idea of built-in caching with invalidation whenever translations get installed. Would the 'upgrader_process_complete' hook be the place for this?

Yeah that seems like a good hook, especially since translations can't really be uninstalled (unless a plugin/theme gets uninstalled), so this should suffice.

Also, I'm curious if we should include this in the current ticket's scope or create a separate one?

I'd say let's explore it in this ticket. It would help speed this function up by default for all sites, not just ones using this filter.
If it turns out to be too complex, we can always create a separate one.

Regarding get_available_languages, my exploration revealed fewer files in play for core translations compared to set_cached_mo_files(). The impact was less significant, but it might still warrant some attention.

Thanks for testing!

I find caching get_available_languages() interesting because it's used in many different places and could speed up things like the settings page, user edit form, etc.

Plus, the function does basically the same as set_cached_mo_files() (if you ignore the filtering).

So there's an opportunity to reduce repetition and feed 2 birds with 1 stone by introducing a shared function that does the globbing & caching. This function can then be used by get_available_languages() and set_cached_mo_files().

Does that make sense?

@SergeyBiryukov @ocean90 curious to hear your thoughts too

#5 @mreishus
9 months ago

Thanks for the feedback. I've given it another shot with covering both functions, built in caching and invalidation. And sorry for the delay as I'm not too used to working on core.

My uncertainties are:

  • How to organize the built-in caching: Where the functions should be added, where the add_filter and add_actions calls are, and what the functions are named.
  • The cache group to be used.
  • The invalidation strategy - I store a list of paths in another cache key, so we can retrieve them later when invalidating. Is there a better way to do this?
  • The invalidation strategy - I do it on all upgrader_process_complete events without checking the parameters. It seems safest and there shouldn't be much of a downside since I assume these events don't happen too often.
  • Persistent cache presence - Should the entire thing be turned off if no persistent object cache exists? And is this detectable?
  • The first parameter naming to cache_mo_files() - currently $pre.

Thanks again for taking a look.

Last edited 9 months ago by mreishus (previous) (diff)

#6 @swissspidy
9 months ago

  • Focuses performance added
  • Owner set to swissspidy
  • Status changed from new to assigned

Currently on vacation but will look into this more closely once I'm back. I do have some WIP improvements to the patch that I worked on during WCUS, so will be able to submit a PR soon.

This ticket was mentioned in PR #5664 on WordPress/wordpress-develop by @swissspidy.


6 months ago
#7

Expands scope of WP_Textdomain_Registry to cache list of language file paths in object cache and provide a way to invalidate that cache upon translation updates. glob() can be expensive if involving thousands of files, hence the caching.

Tools like WP-CLI could directly call wp_cache_delete( 'cached_mo_files_' . md5( $path ), 'translations' ); where needed to invalidate the cache.

Trac ticket: https://core.trac.wordpress.org/ticket/58919

#8 @swissspidy
6 months ago

@mreishus Finally got around to look into this again. Please check out https://github.com/WordPress/wordpress-develop/pull/5664

#9 @mreishus
6 months ago

Nice, thanks. I tried it out on a WordPress.com environment and it effectively solves the problem of slow globs while giving us extra flexibility via the pre_ hook.

#10 @swissspidy
6 months ago

  • Milestone changed from Future Release to 6.5

#11 @swissspidy
6 months ago

  • Keywords has-unit-tests added

This ticket was mentioned in Slack in #core-i18n by swissspidy. View the logs.


6 months ago

This ticket was mentioned in Slack in #core-performance by mukeshpanchal27. View the logs.


5 months ago

#14 @swissspidy
5 months ago

  • Focuses performance removed

#15 @swissspidy
4 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 57287:

I18N: Cache list of language file paths in WP_Textdomain_Registry.

Loading a list of language file paths using glob() can be expensive if involving thousands of files.

Expands scope of WP_Textdomain_Registry to cache list of language file paths in object cache and provides a way to invalidate that cache upon translation updates. Plugins can clear the cache using calls such as wp_cache_delete( 'cached_mo_files_' . md5( $path ), 'translations' );

Props mreishus, swissspidy
Fixes #58919

#17 @swissspidy
4 months ago

In 57290:

I18N: Do not use trailingslashit in WP_Textdomain_Registry.

This usage of trailingslashit(), introduced in [57287], is not only redundant, but also discouraged in order to avoid formatting.php dependency (which might not always be loaded).

Props SergeyBiryukov.
See #58919.

#18 @SergeyBiryukov
4 months ago

In 57298:

Tests: Remove leftover trailingslashit() calls in WP_Textdomain_Registry tests.

Follow-up to [57287], [57290].

See #58919.

#19 @SergeyBiryukov
4 months ago

In 57299:

I18N: Correctly invalidate language file paths in WP_Textdomain_Registry.

Since the cache key in ::get_language_files_from_path() is based on a path that always includes a trailing slash, the path in ::invalidate_mo_files_cache() should include the trailing slash as well.

Includes adjusting the test expectations accordingly.

Follow-up to [57287], [57290], [57298].

See #58919.

#20 @swissspidy
4 months ago

In 57303:

I18N: Prevent PHP warning in WP_Textdomain_Registry.

Prevents a warning upon cache invalidation after language pack updates if the arguments don’t have the expected format.

Follow-up to [57287], [57290], [57298], [57299].

See #58919.

#21 @stevenlinx
3 months ago

  • Keywords add-to-field-guide added

#22 @swissspidy
3 months ago

  • Keywords needs-dev-note added; add-to-field-guide removed

@stevenlinx No field guide. I'm adding this to the general i18n dev-note for #59656 instead.

Note: See TracTickets for help on using tickets.