Make WordPress Core


Timestamp:
09/02/2025 11:51:40 PM (5 months ago)
Author:
dmsnell
Message:

Charset: Add explanatory note about what constitutes “valid” UTF-8.

This patch adds a note clarifying what constitutes a valid UTF-8 byte stream. Review flagged “valid” as a potentially ambiguous term, so the note links to the Unicode specification to pin the behavior to the standard.

Developed in https://github.com/WordPress/wordpress-develop/pull/9716
Discussed in https://core.trac.wordpress.org/ticket/38044

Follow-up to [60630].

Props dmsnell, agulbra.
See #38044.

File:
1 edited

  • trunk/src/wp-includes/formatting.php

    r60695  r60702
       940     940  *                                                     // E.g. The “ü” in ISO-8859-1 is a single byte 0xFC,
       941     941  *                                                     // but in UTF-8 is the two-byte sequence 0xC3 0xBC.
               942  *
               943  * A “valid” string consists of “well-formed UTF-8 code unit sequence[s],” meaning
               944  * that the bytes conform to the UTF-8 encoding scheme, all characters use the minimal
               945  * byte sequence required by UTF-8, and that no sequence encodes a UTF-16 surrogate
               946  * code point or any character above the representable range.
               947  *
               948  * @see https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-3/#G32860
       942     949  *
       943     950  * @see _wp_is_valid_utf8_fallback
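
The added docblock text names three conditions: bytes that follow the UTF-8 encoding scheme, minimal-length (non-overlong) sequences, and no surrogates or code points above U+10FFFF. As a rough illustration of how those conditions map onto byte ranges, here is a minimal sketch based on the well-formed byte sequence ranges in the Unicode Standard. The function name `example_is_well_formed_utf8` is hypothetical and this is not the code in `formatting.php` or `_wp_is_valid_utf8_fallback`; it only shows one way the rules could be checked.

```php
<?php
/**
 * Illustrative only: a hypothetical helper, not WordPress code.
 * Returns true when $bytes is a well-formed UTF-8 code unit sequence.
 */
function example_is_well_formed_utf8( string $bytes ): bool {
	$length = strlen( $bytes );
	$i      = 0;

	while ( $i < $length ) {
		$lead = ord( $bytes[ $i ] );

		// ASCII: a complete one-byte sequence.
		if ( $lead <= 0x7F ) {
			++$i;
			continue;
		}

		// Continuation-byte count, plus the allowed range for the second byte.
		if ( $lead >= 0xC2 && $lead <= 0xDF ) {
			list( $extra, $min, $max ) = array( 1, 0x80, 0xBF );
		} elseif ( 0xE0 === $lead ) {
			list( $extra, $min, $max ) = array( 2, 0xA0, 0xBF ); // Excludes overlong three-byte forms.
		} elseif ( 0xED === $lead ) {
			list( $extra, $min, $max ) = array( 2, 0x80, 0x9F ); // Excludes surrogates U+D800..U+DFFF.
		} elseif ( $lead >= 0xE1 && $lead <= 0xEF ) {
			list( $extra, $min, $max ) = array( 2, 0x80, 0xBF );
		} elseif ( 0xF0 === $lead ) {
			list( $extra, $min, $max ) = array( 3, 0x90, 0xBF ); // Excludes overlong four-byte forms.
		} elseif ( $lead >= 0xF1 && $lead <= 0xF3 ) {
			list( $extra, $min, $max ) = array( 3, 0x80, 0xBF );
		} elseif ( 0xF4 === $lead ) {
			list( $extra, $min, $max ) = array( 3, 0x80, 0x8F ); // Excludes code points above U+10FFFF.
		} else {
			// 0x80..0xC1 and 0xF5..0xFF can never start a well-formed sequence.
			return false;
		}

		// The sequence must not be cut off by the end of the string.
		if ( $i + $extra >= $length ) {
			return false;
		}

		$second = ord( $bytes[ $i + 1 ] );
		if ( $second < $min || $second > $max ) {
			return false;
		}

		// Any remaining bytes must be plain continuation bytes (0x80..0xBF).
		for ( $k = 2; $k <= $extra; $k++ ) {
			$next = ord( $bytes[ $i + $k ] );
			if ( $next < 0x80 || $next > 0xBF ) {
				return false;
			}
		}

		$i += $extra + 1;
	}

	return true;
}
```

Under this sketch, the docblock's own example behaves as described: the two-byte sequence `"\xC3\xBC"` is accepted, while the lone ISO-8859-1 byte `"\xFC"` is rejected. The overlong form `"\xC0\xAF"`, the surrogate encoding `"\xED\xA0\x80"`, and the out-of-range sequence `"\xF4\x90\x80\x80"` are rejected as well.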