Make WordPress Core


Timestamp: 09/23/2025 03:34:20 AM
Author: dmsnell
Message:

Charset: Improve UTF-8 scrubbing ability via new UTF-8 scanning pipeline.

This is the fourth in a series of patches to modernize and standardize UTF-8 handling.

wp_check_invalid_utf8() has long depended on the runtime configuration of the system running it, which has led to hard-to-diagnose issues with text containing invalid UTF-8. The function has also had an apparent defect since its inception: when asked to strip invalid bytes, it returns an empty string.

This patch updates the function to remove all dependency on the system running it. It defers to the mbstring extension if that’s available, falling back to the new UTF-8 scanning pipeline.

To support this work, wp_scrub_utf8() is created with a proper fallback so that the remaining logic inside wp_check_invalid_utf8() can be minimized. The defect in wp_check_invalid_utf8() has been fixed, but instead of stripping the invalid bytes it now replaces them with the Unicode replacement character (U+FFFD) for stronger security guarantees.
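
The deferral to mbstring described in this message could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the committed implementation: `example_scrub_utf8()` and its `$fallback` parameter are hypothetical names, and the sketch assumes PHP 7.2 or newer, where `mb_scrub()` exists whenever the mbstring extension is loaded.

```php
<?php
/**
 * Hypothetical sketch: replace invalid UTF-8 with U+FFFD, preferring mbstring.
 *
 * @param string   $text     Bytes which might not be valid UTF-8.
 * @param callable $fallback Hypothetical pure-PHP scrubber used when mbstring is missing.
 * @return string Scrubbed text.
 */
function example_scrub_utf8( string $text, callable $fallback ): string {
    if ( ! function_exists( 'mb_scrub' ) ) {
        // No mbstring: hand the bytes to the caller-supplied fallback.
        return $fallback( $text );
    }

    /*
     * mb_scrub() substitutes with the globally configured character
     * ("?" by default), so set U+FFFD for this call only and restore
     * the previous setting afterwards.
     */
    $previous = mb_substitute_character();
    mb_substitute_character( 0xFFFD );
    $scrubbed = mb_scrub( $text, 'UTF-8' );
    mb_substitute_character( $previous );

    return $scrubbed;
}
```

Inside WordPress, the fallback argument would presumably be `'_wp_scrub_utf8_fallback'`. Saving and restoring the substitute character keeps the result independent of whatever the host has configured, which is the kind of runtime dependency this patch sets out to remove.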

Developed in https://github.com/WordPress/wordpress-develop/pull/9498
Discussed in https://core.trac.wordpress.org/ticket/63837

Follow-up to: [60768].
Props askapache, chriscct7, Cyrille37, desrosj, dmsnell, helen, jonsurrell, kitchin, miqrogroove, pbearne, shailu25.
Fixes #63837, #29717.
See #63863.

File: 1 edited

  • trunk/src/wp-includes/compat-utf8.php

--- trunk/src/wp-includes/compat-utf8.php (r60768)
+++ trunk/src/wp-includes/compat-utf8.php (r60793)
@@ -228,8 +228,8 @@
  * Fallback mechanism for safely validating UTF-8 bytes.
  *
- * @see wp_is_valid_utf8()
- *
  * @since 6.9.0
  * @access private
+ *
+ * @see wp_is_valid_utf8()
  *
  * @param string $bytes String which might contain text encoded as UTF-8.
@@ -249,2 +249,45 @@
     return $bytes_length === $next_byte_at && 0 === $invalid_length;
 }
+
+/**
+ * Fallback mechanism for replacing invalid spans of UTF-8 bytes.
+ *
+ * Example:
+ *
+ *     'Pi�a' === _wp_scrub_utf8_fallback( "Pi\xF1a" ); // “ñ” is 0xF1 in Windows-1252.
+ *
+ * @since 6.9.0
+ * @access private
+ *
+ * @see wp_scrub_utf8()
+ *
+ * @param string $bytes UTF-8 encoded string which might contain spans of invalid bytes.
+ * @return string Input string with spans of invalid bytes swapped with the replacement character.
+ */
+function _wp_scrub_utf8_fallback( string $bytes ): string {
+    $bytes_length   = strlen( $bytes );
+    $next_byte_at   = 0;
+    $was_at         = 0;
+    $invalid_length = 0;
+    $scrubbed       = '';
+
+    while ( $next_byte_at <= $bytes_length ) {
+        _wp_scan_utf8( $bytes, $next_byte_at, $invalid_length );
+
+        if ( $next_byte_at >= $bytes_length ) {
+            if ( 0 === $was_at ) {
+                return $bytes;
+            }
+
+            return $scrubbed . substr( $bytes, $was_at, $next_byte_at - $was_at - $invalid_length );
+        }
+
+        $scrubbed .= substr( $bytes, $was_at, $next_byte_at - $was_at );
+        $scrubbed .= "\u{FFFD}";
+
+        $next_byte_at += $invalid_length;
+        $was_at        = $next_byte_at;
+    }
+
+    return $scrubbed;
+}
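
The loop in `_wp_scrub_utf8_fallback()` copies each valid span, appends U+FFFD, then skips the invalid bytes. The same pattern can be tried outside WordPress with the sketch below. It rests on two assumptions that are not part of the changeset: `example_valid_utf8_span()` is a hypothetical, regex-based stand-in for `_wp_scan_utf8()`, and `example_scrub_bytes()` emits one replacement character per invalid byte, whereas the committed function emits one per invalid span.

```php
<?php
/**
 * Hypothetical stand-in for a UTF-8 scanner.
 *
 * Returns the byte length of the longest run of well-formed UTF-8
 * starting at offset $at, or 0 if the byte at $at starts no valid sequence.
 */
function example_valid_utf8_span( string $bytes, int $at ): int {
    // Byte-wise pattern for well-formed UTF-8 (no "u" flag on purpose).
    $valid = '/\G(?:[\x00-\x7F]'
        . '|[\xC2-\xDF][\x80-\xBF]'
        . '|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'
        . '|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'
        . '|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})++/';

    return 1 === preg_match( $valid, $bytes, $match, 0, $at ) ? strlen( $match[0] ) : 0;
}

/**
 * Hypothetical, simplified scrubber: copy valid spans, substitute the rest.
 */
function example_scrub_bytes( string $bytes ): string {
    $length   = strlen( $bytes );
    $at       = 0;
    $scrubbed = '';

    while ( $at < $length ) {
        // Copy the valid span (possibly empty) that starts here.
        $span      = example_valid_utf8_span( $bytes, $at );
        $scrubbed .= substr( $bytes, $at, $span );
        $at       += $span;

        // Anything left at this offset is an invalid byte: replace and skip it.
        if ( $at < $length ) {
            $scrubbed .= "\u{FFFD}";
            ++$at;
        }
    }

    return $scrubbed;
}
```

With this sketch, `example_scrub_bytes( "Pi\xF1a" )` returns `"Pi\u{FFFD}a"`, matching the docblock example in the diff. A truncated sequence such as `"\xE2\x82"` yields two replacement characters here because each byte is substituted separately.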