Make WordPress Core

Changeset 53754


Timestamp:
07/21/2022 09:09:56 PM
Author:
audrasjb
Message:

Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC, via the normalizer_normalize() PHP function which is available with the recommended intl PHP extension.

This fixes an issue where NFD characters were not properly sanitized. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.
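
As an illustration of the problem described in the message above, the following minimal sketch shows why an NFD sequence does not match a precomposed (NFC) character until it is normalized. The helper name `example_to_nfc()` is hypothetical and not part of WordPress; it only mirrors the `function_exists()` guard used in this changeset, and it assumes PHP 7 or later for the `\u{...}` escapes.

```php
<?php
// Hypothetical helper for illustration only (not a WordPress function).
// Normalizes to NFC when the intl extension is available; otherwise
// returns the input unchanged, as the changeset's guard does.
function example_to_nfc( string $text ): string {
	if ( function_exists( 'normalizer_normalize' )
		&& ! normalizer_is_normalized( $text, Normalizer::FORM_C )
	) {
		$normalized = normalizer_normalize( $text, Normalizer::FORM_C );
		if ( false !== $normalized ) {
			$text = $normalized;
		}
	}

	return $text;
}

$nfd = "e\u{0301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT (3 bytes).
$nfc = "\u{00E9}";  // Precomposed U+00E9 (2 bytes).

// Both render as the same glyph, but the byte sequences differ, so a
// lookup table keyed on precomposed characters misses the NFD form.
var_dump( $nfd === $nfc );

if ( function_exists( 'normalizer_normalize' ) ) {
	// With intl available, the normalized NFD input equals the NFC form.
	var_dump( example_to_nfc( $nfd ) === $nfc );
}
```

Without the intl extension the helper is a no-op, which is consistent with the unit test in this changeset being marked `@requires extension intl`.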

Location:
trunk
Files:
2 edited

  • trunk/src/wp-includes/formatting.php

    r53455 r53754
    @@ -1585,4 +1585,5 @@
      * @since 5.7.0 Added locale support for `de_AT`.
      * @since 6.0.0 Added the `$locale` parameter.
    + * @since 6.1.0 Added Unicode NFC encoding normalization support.
      *
      * @param string $string Text that might have accent characters.
    @@ -1598,4 +1599,13 @@

         if ( seems_utf8( $string ) ) {
    +
    +        // Unicode sequence normalization from NFD (Normalization Form Decomposed)
    +        // to NFC (Normalization Form [Pre]Composed), the encoding used in this function.
    +        if ( function_exists( 'normalizer_normalize' ) ) {
    +            if ( ! normalizer_is_normalized( $string, Normalizer::FORM_C ) ) {
    +                $string = normalizer_normalize( $string, Normalizer::FORM_C );
    +            }
    +        }
    +
             $chars = array(
                 // Decompositions for Latin-1 Supplement.
  • trunk/tests/phpunit/tests/formatting/removeAccents.php

    r53562 r53754
    @@ -10,4 +10,22 @@
         public function test_remove_accents_simple() {
             $this->assertSame( 'abcdefghijkl', remove_accents( 'abcdefghijkl' ) );
    +    }
    +
    +    /**
    +     * @ticket 24661
    +     *
    +     * Tests Unicode sequence normalization from NFD (Normalization Form Decomposed)
    +     * to NFC (Normalization Form [Pre]Composed), the encoding used in `remove_accents()`.
    +     *
    +     * For more information on Unicode normalization, see
    +     * https://unicode.org/faq/normalization.html.
    +     *
    +     * @requires extension intl
    +     */
    +    public function test_remove_accents_latin1_supplement_nfd_encoding() {
    +        $input  = 'ªºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ';
    +        $output = 'aoAAAAAAAECEEEEIIIIDNOOOOOOUUUUYTHsaaaaaaaeceeeeiiiidnoooooouuuuythy';
    +
    +        $this->assertSame( $output, remove_accents( $input ), 'remove_accents replaces Latin-1 Supplement with NFD encoding' );
         }
