Make WordPress Core

Opened 18 years ago

Closed 16 years ago

Last modified 15 months ago

#5337 closed defect (bug) (invalid)

Wrong timezone with comment_time and the_time since DST ended

Reported by: jrawle
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 2.3.1
Component: General
Keywords: time date timezone dst utc has-patch has-unit-tests
Focuses:
Cc:

Description

I am designing a theme where I want the current timezone to be displayed after the time in comment and post timestamps. For example, I use <?php comment_time() ?> with my default time format set to H:i T.

I am in the UK, and until the clocks changed on 28 October everything was fine. Posts or comments written during the summer gave the time as e.g. "14:00 BST" while those written in the winter said "13:00 GMT".

However, since BST ended, I now see "UTC" for all datestamps, irrespective of whether they were posted during the summer or winter. So the two examples given above would now show "14:00 UTC" and "13:00 UTC" respectively, despite the fact that the first one is incorrect. I also don't understand why it switched from using GMT to UTC.

I do not know whether this bug is present in other timezones.

Change History (6)

#1 @jrawle
18 years ago

  • Keywords time date timezone dst utc added
  • Milestone changed from 2.3.2 to 2.4

From the PHP manual http://www.php.net/mktime :

is_dst

This parameter can be set to 1 if the time is during daylight savings time (DST), 0 if it is not, or -1 (the default) if it is unknown whether the time is within daylight savings time or not. If it's unknown, PHP tries to figure it out itself. This can cause unexpected (but not incorrect) results.

The parameter is deprecated in PHP 5.1.0 and "the new timezone handling features should be used instead."

For servers in the UK, it seems PHP is "figuring out" that they are on British time in the summer, but in the winter it gets it slightly wrong and assumes the machine is permanently on UTC (hence UTC not GMT).

Perhaps this should be addressed in WordPress along with other issues surrounding timezones, for example automatic DST?

Other timezones could be affected if PHP makes the wrong guess as to the timezone.
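To illustrate how the timezone PHP settles on changes the `T` abbreviation, here is a minimal sketch in plain PHP. The helper name `format_with_default_zone()` is hypothetical and exists only for this example. It does not claim to reproduce what WordPress 2.3.1 did internally; it only shows that the same instant prints as "GMT" or "UTC" depending on the active default zone.

```php
<?php
// Hypothetical helper: format a Unix timestamp under an explicit default timezone,
// then restore whatever default was active before.
function format_with_default_zone( int $timestamp, string $zone ): string {
	$previous = date_default_timezone_get();
	date_default_timezone_set( $zone );
	$formatted = date( 'H:i T', $timestamp );
	date_default_timezone_set( $previous );

	return $formatted;
}

// Two instants expressed in UTC: one in British summer, one in winter.
$summer = gmmktime( 13, 0, 0, 7, 1, 2007 );
$winter = gmmktime( 13, 0, 0, 12, 1, 2007 );

echo format_with_default_zone( $summer, 'Europe/London' ), "\n"; // 14:00 BST
echo format_with_default_zone( $winter, 'Europe/London' ), "\n"; // 13:00 GMT
echo format_with_default_zone( $winter, 'UTC' ), "\n";           // 13:00 UTC
```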

#2 @lloydbudd
18 years ago

  • Milestone changed from 2.4 to 2.5

#3 @Viper007Bond
16 years ago

  • Milestone 2.9 deleted
  • Resolution set to invalid
  • Status changed from new to closed

comment_time() reports what your clock said when you wrote the comment. Changing the timezone setting in your settings does not change that value for past comments.

Use the data from the GMT column in the comments object if you wish to ignore or update the timezone.
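As a rough sketch of that suggestion, the stored GMT value can be converted into whichever zone is wanted at display time. The function name `format_gmt_in_zone()` and the sample strings are assumptions for illustration; in a theme the input would come from the comment's GMT date field rather than a literal.

```php
<?php
// Hypothetical helper: interpret a "Y-m-d H:i:s" string as UTC and
// render it in the requested timezone.
function format_gmt_in_zone( string $gmt_datetime, string $zone, string $format = 'H:i T' ): string {
	$utc = new DateTimeImmutable( $gmt_datetime, new DateTimeZone( 'UTC' ) );

	return $utc->setTimezone( new DateTimeZone( $zone ) )->format( $format );
}

// Sample values standing in for a stored GMT comment date.
echo format_gmt_in_zone( '2007-07-01 13:00:00', 'Europe/London' ), "\n"; // 14:00 BST
echo format_gmt_in_zone( '2007-12-01 13:00:00', 'Europe/London' ), "\n"; // 13:00 GMT
```

Because the conversion happens per value, a summer comment and a winter comment each get the abbreviation that applied on their own date.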

#4 This ticket was mentioned in PR #6387 on WordPress/wordpress-develop by @dmsnell.
15 months ago

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-61072
Token Map Trac ticket: Core-60698

Takes the HTML text decoder from #5337.

Replaces WordPress/gutenberg#47040

## Status

With this change the code should be working and fully spec-compliant.
Tests are covered generally by the html5lib test suite.

### Performance

After some initial testing this appears to be around 20% slower in its current state at decoding text values compared to using html_entity_decode(). I tested against a set of 296,046 web pages at the root domain for a list of the top-ranked domains that I found online.

The decoder itself, when run against a worst-case scenario document with one million randomly-generated hexadecimal and decimal numeric character references is much slower than html_entity_decode() (which itself is moot given the reliability issues with that function). html_entity_decode() processes that test file at 195 MB/s while the HTML decoder processes it at 6.40 MB/s. With the pre-decoder patch this rises to 9.78 MB/s.
Understanding context is important here because most documents aren't worst-case documents. Typical web performance is reported in this PR based on real web pages on high-ranked domains. The fact that this step is slow does not make page rendering slow because it's such a small part of a real web path. In WordPress this would be even less impactful because we're not parsing all text content of all HTML on the server; only a fraction of it.

The impact is quite marginal, adding around 60 µs per page. For the set of close to 300k pages, that took the total runtime from 87s to 105s. I tested with the following main loop, using microtime( true ) before and after the loop to add to the total time in an attempt to eliminate the I/O wait time from the results. This is a worst-case scenario where we decode every attribute and every text node. Again, in practice, WordPress would likely experience only a fraction of that 60 µs because it's not decoding every text node and every attribute of the HTML it ships to a browser.

I attempted to avoid string allocations and this raised a challenge: strpos() doesn't provide a way to stop at a given index. This led me to try replacing it with a simple loop to advance character by character until finding a &. This slowed it down to about 25% slower than html_entity_decode() so I removed that and instead relied on using strpos() with the possibility that it scans much further past the end of the value. On the test set of data it was still faster.
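The trade-off can be sketched as follows: let strpos() scan as far as it likes, then discard any hit at or beyond the end of the value. This is only an illustration under assumed names (`find_ampersand_before()` is hypothetical), not the code from the PR.

```php
<?php
// Hypothetical helper: find the next "&" in $text starting at $start,
// but treat matches at or beyond $end as "not found".
// strpos() cannot stop early, so it may scan past $end before returning.
function find_ampersand_before( string $text, int $start, int $end ) {
	$at = strpos( $text, '&', $start );

	return ( false !== $at && $at < $end ) ? $at : false;
}

$text = 'ab&cd&';
var_dump( find_ampersand_before( $text, 0, 2 ) ); // bool(false): the "&" at 2 is outside [0, 2)
var_dump( find_ampersand_before( $text, 0, 3 ) ); // int(2)
var_dump( find_ampersand_before( $text, 3, 6 ) ); // int(5)
```

No substring is allocated; the cost is that a value without any `&` may cause a scan into the rest of the document.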

while ( $p->next_token() ) {
        $token_name = $p->get_token_name();
        $token_type = $p->get_token_type();

        if ( '#tag' === $token_type && ! $p->is_tag_closer() ) {
                foreach ( $p->get_attribute_names_with_prefix( '' ) ?? array() as $name ) {
                        $chunk = $p->get_attribute( $name );
                        if ( is_string( $chunk ) ) {
                                $total_code_points += mb_strlen( $chunk );
                        }
                }
        }

        $text = $p->get_modifiable_text();
        if ( '' !== $text ) {
                $total_code_points += mb_strlen( $text );
        }
}
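For readers who want to reproduce this kind of measurement, the following is a minimal sketch of a timing wrapper that accumulates only the processing time and leaves I/O outside the timed region. The names `time_pages()` and `per_page_overhead_us()` are hypothetical; the second function simply redoes the arithmetic behind the figures quoted above (87s versus 105s across 296,046 pages).

```php
<?php
// Hypothetical harness: time only the processing callback for each page,
// so that reading the page from disk or network is not counted.
function time_pages( iterable $pages, callable $process ): array {
	$total_seconds = 0.0;
	$page_count    = 0;

	foreach ( $pages as $html ) {
		$start = microtime( true );
		$process( $html );
		$total_seconds += microtime( true ) - $start;
		++$page_count;
	}

	return array(
		'seconds' => $total_seconds,
		'pages'   => $page_count,
	);
}

// Average added cost per page, in microseconds.
function per_page_overhead_us( float $baseline_seconds, float $measured_seconds, int $pages ): float {
	return ( $measured_seconds - $baseline_seconds ) / $pages * 1e6;
}

// html_entity_decode() stands in for the real workload here.
$result = time_pages( array( '<p>1 &lt; 2</p>', '<p>&amp;</p>' ), 'html_entity_decode' );
printf( "%d pages in %.6f s\n", $result['pages'], $result['seconds'] );

// (105 - 87) / 296046 is roughly 60.8 microseconds per page.
printf( "%.1f us per page\n", per_page_overhead_us( 87.0, 105.0, 296046 ) );
```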

For comparison, I built a version that skips the WP_Token_Map and instead relies on a basic associative array whose keys are the character reference names and whose values are the replacements. This was 840% slower than html_entity_decode() and increased the average page processing time by 2.175 ms. The token map is thus approximately 36x faster than the naive implementation.
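A naive associative-array lookup of the kind described might look like the sketch below. It is not the comparison code from the PR: the table is a tiny illustrative subset, `naive_lookup()` is a hypothetical name, and no attempt is made at spec-compliant handling of references that lack a semicolon.

```php
<?php
// Tiny illustrative subset; the real HTML5 table has far more entries.
$naive_references = array(
	'amp;'   => '&',
	'amp'    => '&',
	'lt;'    => '<',
	'gt;'    => '>',
	'notin;' => '∉',
	'not;'   => '¬',
	'not'    => '¬',
);

// Hypothetical helper: try the longest candidate first so that "notin;"
// wins over "not". Returns array( replacement, matched byte length ) or false.
function naive_lookup( array $references, string $text, int $offset, int $max_length ) {
	$longest = min( $max_length, strlen( $text ) - $offset );

	for ( $length = $longest; $length > 0; $length-- ) {
		$candidate = substr( $text, $offset, $length );
		if ( isset( $references[ $candidate ] ) ) {
			return array( $references[ $candidate ], $length );
		}
	}

	return false;
}

// Offset 1 skips the leading "&".
var_dump( naive_lookup( $naive_references, '&notin; x', 1, 6 ) ); // '∉', 6 bytes matched
var_dump( naive_lookup( $naive_references, '&notit;', 1, 6 ) );   // '¬', 3 bytes matched
```

Each attempt allocates a new substring and performs a hash lookup, which is one plausible reason such an approach is slow compared with scanning packed strings.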

##### Pre-decoding

In an attempt to rely more on html_entity_decode() I added a pre-decoding step that would handle all well-formed numeric character encodings. The logic here is that if we can use a quick preg_replace_callback() pass to get as much into C-code as we can, by means of html_entity_decode(), then maybe it would be worth it even with the additional pass.

Unfortunately the results were instantly slower, adding another 20% slowdown in my first 100k domains under test. That is, it's over 40% slower than a pure html_entity_decode() whereas the code without the pre-decoding step is only 20% slower.

<details><summary>The Pre-Decoder</summary>

// pre-decode certain known numeric character references.
                $text = preg_replace_callback(
                        '~&#((?P<is_hex>[Xx])0*(?P<hex_digits>[1-9A-Fa-f][0-9A-Fa-f]{0,5})|0*(?P<dec_digits>[1-9][0-9]{0,6}));~',
                        static function ( $matches ) {
                                $is_hex = strlen( $matches['is_hex'] ) > 0;
                                $digits = $matches[ $is_hex ? 'hex_digits' : 'dec_digits' ];

                                if ( ( $is_hex ? 2 : 3 ) === strlen( $digits ) ) {
                                        $code_point = intval( $digits, $is_hex ? 16 : 10 );

                                        /*
                                         * Noncharacters, 0x0D, and non-ASCII-whitespace control characters.
                                         *
                                         * > A noncharacter is a code point that is in the range U+FDD0 to U+FDEF,
                                         * > inclusive, or U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF,
                                         * > U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE,
                                         * > U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF,
                                         * > U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE,
                                         * > U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, or U+10FFFF.
                                         *
                                         * A C0 control is a code point that is in the range of U+00 to U+1F,
                                         * but ASCII whitespace includes U+09, U+0A, U+0C, and U+0D.
                                         *
                                         * These characters are invalid but still decode as any valid character.
                                         * This comment is here to note and explain why there's no check to
                                         * remove these characters or replace them.
                                         *
                                         * @see https://infra.spec.whatwg.org/#noncharacter
                                         */

                                        /*
                                         * Code points in the C1 controls area need to be remapped as if they
                                         * were stored in Windows-1252. Note! This transformation only happens
                                         * for numeric character references. The raw code points in the byte
                                         * stream are not translated.
                                         *
                                         * > If the number is one of the numbers in the first column of
                                         * > the following table, then find the row with that number in
                                         * > the first column, and set the character reference code to
                                         * > the number in the second column of that row.
                                         */
                                        if ( $code_point >= 0x80 && $code_point <= 0x9F ) {
                                                $windows_1252_mapping = array(
                                                        '€', // 0x80 -> EURO SIGN (€).
                                                        "\xC2\x81",   // 0x81 -> (no change).
                                                        '‚', // 0x82 -> SINGLE LOW-9 QUOTATION MARK (‚).
                                                        'ƒ', // 0x83 -> LATIN SMALL LETTER F WITH HOOK (ƒ).
                                                        '„', // 0x84 -> DOUBLE LOW-9 QUOTATION MARK („).
                                                        '…', // 0x85 -> HORIZONTAL ELLIPSIS (…).
                                                        '†', // 0x86 -> DAGGER (†).
                                                        '‡', // 0x87 -> DOUBLE DAGGER (‡).
                                                        'ˆ', // 0x88 -> MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ).
                                                        '‰', // 0x89 -> PER MILLE SIGN (‰).
                                                        'Š', // 0x8A -> LATIN CAPITAL LETTER S WITH CARON (Š).
                                                        '‹', // 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK (‹).
                                                        'Œ', // 0x8C -> LATIN CAPITAL LIGATURE OE (Œ).
                                                        "\xC2\x8D",   // 0x8D -> (no change).
                                                        'Ž', // 0x8E -> LATIN CAPITAL LETTER Z WITH CARON (Ž).
                                                        "\xC2\x8F",   // 0x8F -> (no change).
                                                        "\xC2\x90",   // 0x90 -> (no change).
                                                        '‘', // 0x91 -> LEFT SINGLE QUOTATION MARK (‘).
                                                        '’', // 0x92 -> RIGHT SINGLE QUOTATION MARK (’).
                                                        '“', // 0x93 -> LEFT DOUBLE QUOTATION MARK (“).
                                                        '”', // 0x94 -> RIGHT DOUBLE QUOTATION MARK (”).
                                                        '•', // 0x95 -> BULLET (•).
                                                        '–', // 0x96 -> EN DASH (–).
                                                        '—', // 0x97 -> EM DASH (—).
                                                        '˜', // 0x98 -> SMALL TILDE (˜).
                                                        '™', // 0x99 -> TRADE MARK SIGN (™).
                                                        'š', // 0x9A -> LATIN SMALL LETTER S WITH CARON (š).
                                                        '›', // 0x9B -> SINGLE RIGHT-POINTING ANGLE QUOTATION MARK (›).
                                                        'œ', // 0x9C -> LATIN SMALL LIGATURE OE (œ).
                                                        "\xC2\x9D",   // 0x9D -> (no change).
                                                        'ž', // 0x9E -> LATIN SMALL LETTER Z WITH CARON (ž).
                                                        'Ÿ', // 0x9F -> LATIN CAPITAL LETTER Y WITH DIAERESIS (Ÿ).
                                                );

                                                return $windows_1252_mapping[ $code_point - 0x80 ];
                                        }
                                }

                                return html_entity_decode( $matches[0], ENT_SUBSTITUTE, 'UTF-8' );
                        },
                        $text
                );

</details>
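To make the pattern in the pre-decoder easier to follow, here is a small sketch that applies the same regular expression to single references and reports which branch matched. The helper `classify_reference()` is hypothetical and only mirrors the first few lines of the callback; it does not perform the Windows-1252 remapping.

```php
<?php
// Same pattern as in the pre-decoder: hex or decimal digits with leading
// zeros skipped, and a mandatory trailing semicolon.
$pattern = '~&#((?P<is_hex>[Xx])0*(?P<hex_digits>[1-9A-Fa-f][0-9A-Fa-f]{0,5})|0*(?P<dec_digits>[1-9][0-9]{0,6}));~';

// Hypothetical helper: returns array( 'hex'|'dec', code point ) or false.
function classify_reference( string $pattern, string $reference ) {
	if ( 1 !== preg_match( $pattern, $reference, $matches ) ) {
		return false;
	}

	// When the decimal branch matches, the earlier named groups are empty strings.
	$is_hex = strlen( $matches['is_hex'] ) > 0;
	$digits = $matches[ $is_hex ? 'hex_digits' : 'dec_digits' ];

	return array( $is_hex ? 'hex' : 'dec', intval( $digits, $is_hex ? 16 : 10 ) );
}

var_dump( classify_reference( $pattern, '&#x00A0;' ) ); // 'hex', 160
var_dump( classify_reference( $pattern, '&#0160;' ) );  // 'dec', 160
var_dump( classify_reference( $pattern, '&#x0;' ) );    // false: no non-zero digit
var_dump( classify_reference( $pattern, '&#65' ) );     // false: missing semicolon
```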

##### Faster integer decoding.

I attempted to parse the code point inline while scanning the digits in hopes of saving some computation time, but this dramatically slowed down the interpreter. I think that the per-character parsing is much slower than intval().
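The two approaches being compared can be sketched like this. The per-character function is a hypothetical reconstruction for illustration, not the code that was benchmarked; it shows why the userland version executes many more PHP operations than a single intval() call.

```php
<?php
// Hypothetical per-character hex parser: one loop iteration, several
// comparisons, and a shift for every digit, all executed in PHP.
function parse_hex_per_character( string $digits ): int {
	$code_point = 0;
	$length     = strlen( $digits );

	for ( $i = 0; $i < $length; $i++ ) {
		$byte = ord( $digits[ $i ] );

		if ( $byte >= 0x30 && $byte <= 0x39 ) {
			$value = $byte - 0x30;        // 0-9
		} elseif ( $byte >= 0x41 && $byte <= 0x46 ) {
			$value = $byte - 0x41 + 10;   // A-F
		} elseif ( $byte >= 0x61 && $byte <= 0x66 ) {
			$value = $byte - 0x61 + 10;   // a-f
		} else {
			break;                        // Stop at the first non-hex byte.
		}

		$code_point = ( $code_point << 4 ) | $value;
	}

	return $code_point;
}

// Both produce the same code point; intval() does the work in C.
var_dump( parse_hex_per_character( '1F600' ) === intval( '1F600', 16 ) ); // bool(true)
```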

##### Faster digit detection.

I attempted to replace strspn( $text, $numeric_digits ) with a custom loop examining each character for whether it was in the digit ranges, but this was just as slow as the custom integer decoder.
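As a sketch of the comparison, the following shows strspn() with an offset next to an equivalent hand-written loop. The loop function `count_digits_loop()` is hypothetical and stands in for the custom version that was tried.

```php
<?php
// Hypothetical userland equivalent of strspn( $text, '0123456789', $offset ).
function count_digits_loop( string $text, int $offset ): int {
	$count  = 0;
	$length = strlen( $text );

	while ( $offset + $count < $length ) {
		$byte = ord( $text[ $offset + $count ] );
		if ( $byte < 0x30 || $byte > 0x39 ) {
			break; // Not an ASCII digit.
		}
		++$count;
	}

	return $count;
}

$text = '&#8212;';

// Both count the four digits that follow "&#".
var_dump( strspn( $text, '0123456789', 2 ) ); // int(4)
var_dump( count_digits_loop( $text, 2 ) );    // int(4)
```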

##### Quick table lookup of group/small token indexing.

On the idea that looking up the group or small word in the lookup strings might be slow, given that it's required to iterate every time, I tried adding a patch to introduce an index table for direct lookup into where words of the given starting letter start, and if they even exist in the table at all.
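The index idea can be illustrated in isolation: a 512-byte string holds one big-endian two-byte offset per possible first byte, with `0xFFFF` meaning "no entries start with this byte". This is a minimal sketch using the same pack()/unpack() calls as the patch; the offset 481 for `B` is an example value, not data from the real table.

```php
<?php
// 256 slots of two bytes each, all initialized to the "absent" marker 0xFFFF.
$index = str_repeat( "\xFF\xFF", 256 );

// Record that entries starting with "B" begin at byte offset 481 (0x01 0xE1).
$offset           = pack( 'n', 481 );
$slot             = 2 * ord( 'B' );
$index[ $slot ]     = $offset[0];
$index[ $slot + 1 ] = $offset[1];

// Look up where to start searching for a given first byte.
$start_b = unpack( 'n', $index, 2 * ord( 'B' ) )[1];
$start_a = unpack( 'n', $index, 2 * ord( 'A' ) )[1];

var_dump( bin2hex( $offset ) );   // string(4) "01e1"
var_dump( $start_b );             // int(481)
var_dump( 0xFFFF === $start_a );  // bool(true): nothing starts with "A"
```

A lookup then costs one unpack() call, after which strpos() can begin at the returned offset instead of at the start of the packed string.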

<details><summary>Table-lookup patch</summary>

  • src/wp-includes/class-wp-token-map.php

    diff --git a/src/wp-includes/class-wp-token-map.php b/src/wp-includes/class-wp-token-map.php
    index c7bb9316ed..60a189b37d 100644
    a b class WP_Token_Map { 
    182182         */
    183183        private $groups = '';
    184184
     185        /**
     186         * Indicates where the first group key starting with a given
     187         * letter is to be found in the groups string.
     188         *
     189         *  - Each position in the string corresponds to the first byte
     190         *    of a lookup key. E.g. keys starting with `A` will look
     191         *    at the 65th index (A is U+0041) to see if it's in the set
     192         *    and where.
     193         *
     194         *  - Each index is stored as a two-byte unsigned integer
     195         *    indicating where to start looking. This limits the
     196         *    key size to 256 bytes.
     197         *
     198         *  - A null value indicates that there are no words present
     199         *    starting with the given byte.
     200         *
     201         * Example:
     202         *
     203         *     Suppose that there exists a key starting with `B` at
     204         *     offset 481 (0x01 0xE1) but none starting with `A`.
     205         *     ┌────────┬───────┬───────┬───────┐
     206         *     │  ...   │   A   │   B   │  ...  │
     207         *     ├────────┼───────┼───────┼───────┤
     208         *     │        │ 00 00 │ 01 E1 │       │
     209         *     └────────┴───────┴───────┴───────┘
     210         *
     211         * @since 6.6.0
     212         *
     213         * @var string
     214         */
     215        private $group_index;
     216
    185217        /**
    186218         * Stores an optimized row of small words, where every entry is
    187219         * `$this->key_size + 1` bytes long and zero-extended.
    class WP_Token_Map { 
    200232         */
    201233        private $small_words = '';
    202234
     235        /**
     236         * Indicates where the first word starting with a given letter
     237         * is to be found in the small words string.
     238         *
     239         *  - Each position in the string corresponds to the first byte
     240         *    of a lookup word. E.g. words starting with `A` will look
     241         *    at the 65th index (A is U+0041) to see if it's in the set
     242         *    and where.
     243         *
     244         *  - Each index is stored as a two-byte unsigned integer
     245         *    indicating where to start looking. This limits the
     246         *    key size to 256 bytes.
     247         *
     248         *  - A null value indicates that there are no words present
     249         *    starting with the given byte.
     250         *
     251         * Example:
     252         *
     253         *     Suppose that there exists a word starting with `B` at
     254         *     offset 481 (0x01 0xE1) but none starting with `A`.
     255         *     ┌────────┬───────┬───────┬───────┐
     256         *     │  ...   │   A   │   B   │  ...  │
     257         *     ├────────┼───────┼───────┼───────┤
     258         *     │        │ 00 00 │ 01 E1 │       │
     259         *     └────────┴───────┴───────┴───────┘
     260         *
     261         * @since 6.6.0
     262         *
     263         * @var string
     264         */
     265        private $small_index;
     266
    203267        /**
    204268         * Replacements for the small words, in the same order they appear.
    205269         *
    class WP_Token_Map { 
    287351                        );
    288352                }
    289353
     354                // Prime the search indices.
     355                $map->small_index = str_repeat( "\xFF\xFF", 256 );
     356                $map->group_index = str_repeat( "\xFF\xFF", 256 );
     357
    290358                // Finally construct the optimized lookups.
    291359
     360                $last_byte = "\x00";
    292361                foreach ( $shorts as $word ) {
     362                        if ( $last_byte !== $word[0] ) {
     363                                $last_byte                         = $word[0];
     364                                $index_at                          = 2 * ord( $last_byte );
     365                                $offset                            = pack( 'n', strlen( $map->small_words ) );
     366                                $map->small_index[ $index_at ]     = $offset[0];
     367                                $map->small_index[ $index_at + 1 ] = $offset[1];
     368                        }
     369
    293370                        $map->small_words     .= str_pad( $word, $key_length + 1, "\x00", STR_PAD_RIGHT );
    294371                        $map->small_mappings[] = $mappings[ $word ];
    295372                }
    class WP_Token_Map { 
    297374                $group_keys = array_keys( $groups );
    298375                sort( $group_keys );
    299376
     377                $last_byte = "\x00";
    300378                foreach ( $group_keys as $group ) {
     379                        if ( $last_byte !== $group[0] ) {
     380                                $last_byte                         = $group[0];
     381                                $index_at                          = 2 * ord( $last_byte );
     382                                $offset                            = pack( 'n', strlen( $map->groups ) );
     383                                $map->group_index[ $index_at ]     = $offset[0];
     384                                $map->group_index[ $index_at + 1 ] = $offset[1];
     385                        }
     386
    301387                        $map->groups .= "{$group}\x00";
    302388
    303389                        $group_string = '';
    class WP_Token_Map { 
    327413         *
    328414         * @param int    $key_length     Group key length.
    329415         * @param string $groups         Group lookup index.
     416         * @param string $group_index    Locations in the group lookup where each character starts.
    330417         * @param array  $large_words    Large word groups and packed strings.
    331418         * @param string $small_words    Small words packed string.
     419         * @param string $small_index    Locations in the small word lookup where each character starts.
    332420         * @param array  $small_mappings Small word mappings.
    333421         *
    334422         * @return WP_Token_Map Map with precomputed data loaded.
    335423         */
    336         public static function from_precomputed_table( $key_length, $groups, $large_words, $small_words, $small_mappings ) {
     424        public static function from_precomputed_table( $key_length, $groups, $group_index, $large_words, $small_words, $small_index, $small_mappings ) {
    337425                $map = new WP_Token_Map();
    338426
    339427                $map->key_length     = $key_length;
    340428                $map->groups         = $groups;
     429                $map->group_index    = $group_index;
    341430                $map->large_words    = $large_words;
    342431                $map->small_words    = $small_words;
     432                $map->small_index    = $small_index;
    343433                $map->small_mappings = $small_mappings;
    344434
    345435                return $map;
    class WP_Token_Map { 
    454544                if ( $text_length > $this->key_length ) {
    455545                        $group_key = substr( $text, $offset, $this->key_length );
    456546
    457                         $group_at = $ignore_case ? stripos( $this->groups, $group_key ) : strpos( $this->groups, $group_key );
     547                        $group_index = unpack( 'n', $this->group_index, 2 * ord( $text[ $offset ] ) )[1];
     548                        if ( 0xFFFF === $group_index && ! $ignore_case ) {
     549                                // Perhaps a short word then.
     550                                return strlen( $this->small_words ) > 0
     551                                        ? $this->read_small_token( $text, $offset, $skip_bytes, $case_sensitivity )
     552                                        : false;
     553                        }
     554
     555                        $group_at = $ignore_case
     556                                ? stripos( $this->groups, $group_key )
     557                                : strpos( $this->groups, $group_key, $group_index );
     558
    458559                        if ( false === $group_at ) {
    459560                                // Perhaps a short word then.
    460561                                return strlen( $this->small_words ) > 0
    class WP_Token_Map { 
    499600         * @return string|false Mapped value of lookup key if found, otherwise `false`.
    500601         */
    501602        private function read_small_token( $text, $offset, &$skip_bytes, $case_sensitivity = 'case-sensitive' ) {
    502                 $ignore_case  = 'case-insensitive' === $case_sensitivity;
     603                $ignore_case = 'case-insensitive' === $case_sensitivity;
     604
     605                // Quickly eliminate impossible matches.
     606                $small_index = unpack( 'n', $this->small_index, 2 * ord( $text[ $offset ] ) )[1];
     607                if ( 0xFFFF === $small_index && ! $ignore_case ) {
     608                        return false;
     609                }
     610
    503611                $small_length = strlen( $this->small_words );
    504612                $search_text  = substr( $text, $offset, $this->key_length );
    505613                if ( $ignore_case ) {
    class WP_Token_Map { 
    507615                }
    508616                $starting_char = $search_text[0];
    509617
    510                 $at = 0;
     618                $at = $ignore_case ? 0 : $small_index;
    511619                while ( $at < $small_length ) {
    512620                        if (
    513621                                $starting_char !== $this->small_words[ $at ] &&
    class WP_Token_Map { 
    621729                $group_line = str_replace( "\x00", "\\x00", $this->groups );
    622730                $output    .= "{$i1}\"{$group_line}\",\n";
    623731
     732                $group_index        = '';
     733                $group_index_length = strlen( $this->group_index );
     734                for ( $i = 0; $i < $group_index_length; $i++ ) {
     735                        $group_index .= '\\x' . str_pad( dechex( ord( $this->group_index[ $i ] ) ), 2, '0', STR_PAD_LEFT );
     736                }
     737                $output .= "{$i1}\"{$group_index}\",\n";
     738
    624739                $output .= "{$i1}array(\n";
    625740
    626741                $prefixes = explode( "\x00", $this->groups );
    class WP_Token_Map { 
    685800                $small_text = str_replace( "\x00", '\x00', implode( '', $small_words ) );
    686801                $output    .= "{$i1}\"{$small_text}\",\n";
    687802
     803                $small_index        = '';
     804                $small_index_length = strlen( $this->small_index );
     805                for ( $i = 0; $i < $small_index_length; $i++ ) {
     806                        $small_index .= '\\x' . str_pad( dechex( ord( $this->small_index[ $i ] ) ), 2, '0', STR_PAD_LEFT );
     807                }
     808                $output .= "{$i1}\"{$small_index}\",\n";
     809
    688810                $output .= "{$i1}array(\n";
    689811                foreach ( $this->small_mappings as $mapping ) {
    690812                        $output .= "{$i2}\"{$mapping}\",\n";
  • src/wp-includes/html-api/html5-named-character-references.php

    diff --git a/src/wp-includes/html-api/html5-named-character-references.php b/src/wp-includes/html-api/html5-named-character-references.php
    index 41c467e268..b7827f9b8e 100644
    a b global $html5_named_character_references; 
    3131$html5_named_character_references = WP_Token_Map::from_precomputed_table(
    3232        2,
    3333        "AE\x00AM\x00Aa\x00Ab\x00Ac\x00Af\x00Ag\x00Al\x00Am\x00An\x00Ao\x00Ap\x00Ar\x00As\x00At\x00Au\x00Ba\x00Bc\x00Be\x00Bf\x00Bo\x00Br\x00Bs\x00Bu\x00CH\x00CO\x00Ca\x00Cc\x00Cd\x00Ce\x00Cf\x00Ch\x00Ci\x00Cl\x00Co\x00Cr\x00Cs\x00Cu\x00DD\x00DJ\x00DS\x00DZ\x00Da\x00Dc\x00De\x00Df\x00Di\x00Do\x00Ds\x00EN\x00ET\x00Ea\x00Ec\x00Ed\x00Ef\x00Eg\x00El\x00Em\x00Eo\x00Ep\x00Eq\x00Es\x00Et\x00Eu\x00Ex\x00Fc\x00Ff\x00Fi\x00Fo\x00Fs\x00GJ\x00GT\x00Ga\x00Gb\x00Gc\x00Gd\x00Gf\x00Gg\x00Go\x00Gr\x00Gs\x00Gt\x00HA\x00Ha\x00Hc\x00Hf\x00Hi\x00Ho\x00Hs\x00Hu\x00IE\x00IJ\x00IO\x00Ia\x00Ic\x00Id\x00If\x00Ig\x00Im\x00In\x00Io\x00Is\x00It\x00Iu\x00Jc\x00Jf\x00Jo\x00Js\x00Ju\x00KH\x00KJ\x00Ka\x00Kc\x00Kf\x00Ko\x00Ks\x00LJ\x00LT\x00La\x00Lc\x00Le\x00Lf\x00Ll\x00Lm\x00Lo\x00Ls\x00Lt\x00Ma\x00Mc\x00Me\x00Mf\x00Mi\x00Mo\x00Ms\x00Mu\x00NJ\x00Na\x00Nc\x00Ne\x00Nf\x00No\x00Ns\x00Nt\x00Nu\x00OE\x00Oa\x00Oc\x00Od\x00Of\x00Og\x00Om\x00Oo\x00Op\x00Or\x00Os\x00Ot\x00Ou\x00Ov\x00Pa\x00Pc\x00Pf\x00Ph\x00Pi\x00Pl\x00Po\x00Pr\x00Ps\x00QU\x00Qf\x00Qo\x00Qs\x00RB\x00RE\x00Ra\x00Rc\x00Re\x00Rf\x00Rh\x00Ri\x00Ro\x00Rr\x00Rs\x00Ru\x00SH\x00SO\x00Sa\x00Sc\x00Sf\x00Sh\x00Si\x00Sm\x00So\x00Sq\x00Ss\x00St\x00Su\x00TH\x00TR\x00TS\x00Ta\x00Tc\x00Tf\x00Th\x00Ti\x00To\x00Tr\x00Ts\x00Ua\x00Ub\x00Uc\x00Ud\x00Uf\x00Ug\x00Um\x00Un\x00Uo\x00Up\x00Ur\x00Us\x00Ut\x00Uu\x00VD\x00Vb\x00Vc\x00Vd\x00Ve\x00Vf\x00Vo\x00Vs\x00Vv\x00Wc\x00We\x00Wf\x00Wo\x00Ws\x00Xf\x00Xi\x00Xo\x00Xs\x00YA\x00YI\x00YU\x00Ya\x00Yc\x00Yf\x00Yo\x00Ys\x00Yu\x00ZH\x00Za\x00Zc\x00Zd\x00Ze\x00Zf\x00Zo\x00Zs\x00aa\x00ab\x00ac\x00ae\x00af\x00ag\x00al\x00am\x00an\x00ao\x00ap\x00ar\x00as\x00at\x00au\x00aw\x00bN\x00ba\x00bb\x00bc\x00bd\x00be\x00bf\x00bi\x00bk\x00bl\x00bn\x00bo\x00bp\x00br\x00bs\x00bu\x00ca\x00cc\x00cd\x00ce\x00cf\x00ch\x00ci\x00cl\x00co\x00cr\x00cs\x00ct\x00cu\x00cw\x00cy\x00dA\x00dH\x00da\x00db\x00dc\x00dd\x00de\x00df\x00dh\x00di\x00dj\x00dl\x00do\x00dr\x00ds\x00dt\x00du\x00dw\x00dz\x00eD\x00ea\x00ec\x00ed\x00ee\x00ef\x00eg\x00el\x00em\
x00en\x00eo\x00ep\x00eq\x00er\x00es\x00et\x00eu\x00ex\x00fa\x00fc\x00fe\x00ff\x00fi\x00fj\x00fl\x00fn\x00fo\x00fp\x00fr\x00fs\x00gE\x00ga\x00gb\x00gc\x00gd\x00ge\x00gf\x00gg\x00gi\x00gj\x00gl\x00gn\x00go\x00gr\x00gs\x00gt\x00gv\x00hA\x00ha\x00hb\x00hc\x00he\x00hf\x00hk\x00ho\x00hs\x00hy\x00ia\x00ic\x00ie\x00if\x00ig\x00ii\x00ij\x00im\x00in\x00io\x00ip\x00iq\x00is\x00it\x00iu\x00jc\x00jf\x00jm\x00jo\x00js\x00ju\x00ka\x00kc\x00kf\x00kg\x00kh\x00kj\x00ko\x00ks\x00lA\x00lB\x00lE\x00lH\x00la\x00lb\x00lc\x00ld\x00le\x00lf\x00lg\x00lh\x00lj\x00ll\x00lm\x00ln\x00lo\x00lp\x00lr\x00ls\x00lt\x00lu\x00lv\x00mD\x00ma\x00mc\x00md\x00me\x00mf\x00mh\x00mi\x00ml\x00mn\x00mo\x00mp\x00ms\x00mu\x00nG\x00nL\x00nR\x00nV\x00na\x00nb\x00nc\x00nd\x00ne\x00nf\x00ng\x00nh\x00ni\x00nj\x00nl\x00nm\x00no\x00np\x00nr\x00ns\x00nt\x00nu\x00nv\x00nw\x00oS\x00oa\x00oc\x00od\x00oe\x00of\x00og\x00oh\x00oi\x00ol\x00om\x00oo\x00op\x00or\x00os\x00ot\x00ou\x00ov\x00pa\x00pc\x00pe\x00pf\x00ph\x00pi\x00pl\x00pm\x00po\x00pr\x00ps\x00pu\x00qf\x00qi\x00qo\x00qp\x00qs\x00qu\x00rA\x00rB\x00rH\x00ra\x00rb\x00rc\x00rd\x00re\x00rf\x00rh\x00ri\x00rl\x00rm\x00rn\x00ro\x00rp\x00rr\x00rs\x00rt\x00ru\x00rx\x00sa\x00sb\x00sc\x00sd\x00se\x00sf\x00sh\x00si\x00sl\x00sm\x00so\x00sp\x00sq\x00sr\x00ss\x00st\x00su\x00sw\x00sz\x00ta\x00tb\x00tc\x00td\x00te\x00tf\x00th\x00ti\x00to\x00tp\x00tr\x00ts\x00tw\x00uA\x00uH\x00ua\x00ub\x00uc\x00ud\x00uf\x00ug\x00uh\x00ul\x00um\x00uo\x00up\x00ur\x00us\x00ut\x00uu\x00uw\x00vA\x00vB\x00vD\x00va\x00vc\x00vd\x00ve\x00vf\x00vl\x00vn\x00vo\x00vp\x00vr\x00vs\x00vz\x00wc\x00we\x00wf\x00wo\x00wp\x00wr\x00ws\x00xc\x00xd\x00xf\x00xh\x00xi\x00xl\x00xm\x00xn\x00xo\x00xr\x00xs\x00xu\x00xv\x00xw\x00ya\x00yc\x00ye\x00yf\x00yi\x00yo\x00ys\x00yu\x00za\x00zc\x00zd\x00ze\x00zf\x00zh\x00zi\x00zo\x00zs\x00zw\x00",
     34        "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\x00\x30\x00\x48\x00\x72\x00\x93\x00\xc3\x00\xd2\x00\xf6\x01\x0e\x01\x38\x01\x47\x01\x5c\x01\x7d\x01\x95\x01\xb0\x01\xda\x01\xf5\x02\x01\x02\x25\x02\x4c\x02\x6d\x02\x97\x02\xb2\x02\xc1\x02\xcd\x02\xe8\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x03\x00\x03\x30\x03\x60\x03\x8d\x03\xc6\x03\xfc\x04\x20\x04\x53\x04\x71\x04\x9e\x04\xb0\x04\xc8\x05\x0d\x05\x37\x05\x7f\x05\xb5\x05\xd9\x05\xeb\x06\x2a\x06\x63\x06\x8a\x06\xc0\x06\xed\x07\x02\x07\x2c\x07\x44\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff
\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
    3435        array(
    3536                // AElig;[Æ] AElig[Æ].
    3637                "\x04lig;\x02Æ\x03lig\x02Æ",
    $html5_named_character_references = WP_Token_Map::from_precomputed_table( 
    12941295                "\x03nj;\x03‌\x02j;\x03‍",
    12951296        ),
    12961297        "GT\x00LT\x00gt\x00lt\x00",
     1298        "\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\x00\x03\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x00\x06\xff\xff\xff\xff\xff\xff\xff\xff\x00\x09\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\x
ff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff\xff",
    12971299        array(
    12981300                ">",
    12991301                "<",

</details>

This did not introduce a measurable speedup or slowdown on the dataset of 300k HTML pages. While I believe that the table lookup could speed up certain workloads that are heavy with named character references, it does not justify itself on realistic data and so I'm leaving the patch out.

#### Metrics on character references.

From the same set of 296k webpages I counted the frequency of each character reference. The counts include the full syntax, so had we come across &#x00000000039 it would appear in the list exactly as written. The linked file contains ANSI terminal codes, so view it through cat or less -R.

all-text-and-ref-counts.txt
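A count like this could be produced roughly as follows. This is a minimal sketch, not the script behind the linked file: `sketch_count_character_references()` is a hypothetical name, and the regular expression is a loose approximation of the character reference syntax rather than the HTML tokenizer's rules.

```php
<?php
/**
 * Counts how often each character reference occurs in a chunk of HTML.
 *
 * Hypothetical helper for illustration; the pattern is an approximation
 * and does not validate names against the HTML5 list.
 */
function sketch_count_character_references( string $html ): array {
	$counts = array();

	// Named (&amp;), decimal (&#38;), and hexadecimal (&#x26;) forms; the trailing ";" is optional.
	$pattern = '/&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);?/';

	if ( preg_match_all( $pattern, $html, $matches ) ) {
		foreach ( $matches[0] as $reference ) {
			// Key by the full matched syntax so every spelling is counted separately.
			$counts[ $reference ] = ( $counts[ $reference ] ?? 0 ) + 1;
		}
	}

	// Most frequent references first.
	arsort( $counts );

	return $counts;
}
```

Because the counts are keyed by the full matched text, `&amp;` and `&amp` are tallied separately, which matches the "full syntax" counting described above.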

Based on this data I added a special-case for &quot;, &nbsp;, and &amp; before calling into the WP_Token_Map but it didn't have a measurable impact on performance. I'm led to conclude from this that it's not those common character references slowing things down. Possibly it's the numeric character references.

In another experiment I replaced my custom code_point_to_utf8_bytes() function with a call to mb_chr(), and again the impact wasn't significant. That function performs the same computation inside PHP that the application-level function performs, so this is not surprising.
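For readers unfamiliar with the computation in question, here is a minimal sketch of encoding a code point as UTF-8 in plain PHP. `sketch_code_point_to_utf8()` is a hypothetical stand-in and is not the code_point_to_utf8_bytes() from the patch; substituting U+FFFD for invalid input is an assumption of this sketch.

```php
<?php
/**
 * Encodes a Unicode code point as a UTF-8 byte string.
 *
 * Hypothetical stand-in for illustration only.
 */
function sketch_code_point_to_utf8( int $code_point ): string {
	// Surrogates and out-of-range values have no valid UTF-8 encoding; emit U+FFFD instead.
	if ( $code_point < 0 || $code_point > 0x10FFFF || ( $code_point >= 0xD800 && $code_point <= 0xDFFF ) ) {
		return "\u{FFFD}";
	}

	// One byte: 0xxxxxxx.
	if ( $code_point <= 0x7F ) {
		return chr( $code_point );
	}

	// Two bytes: 110xxxxx 10xxxxxx.
	if ( $code_point <= 0x7FF ) {
		return chr( 0xC0 | ( $code_point >> 6 ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	// Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
	if ( $code_point <= 0xFFFF ) {
		return chr( 0xE0 | ( $code_point >> 12 ) )
			. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $code_point & 0x3F ) );
	}

	// Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
	return chr( 0xF0 | ( $code_point >> 18 ) )
		. chr( 0x80 | ( ( $code_point >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $code_point >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $code_point & 0x3F ) );
}
```

Where the mbstring extension is available, `mb_chr( 0x20AC, 'UTF-8' )` should return the same bytes as `sketch_code_point_to_utf8( 0x20AC )` for valid code points.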

For clearer performance direction it's probably most helpful to profile a run of decoding and see where the CPU is spending its time. It appears to be fairly quick as it is in this patch.

### Attempted alternatives

  • Tried to use an associative array whose keys are the character reference names and whose values are the translated UTF-8 strings. This operated more than 8x slower than the WP_Token_Map implementation.
  • Based on the frequency of named character references I tried short-circuiting &quot;, &amp;, and &nbsp;, as they might account for up to 70-80% of all named character references in practice. This didn't impact the runtime. Runtime is likely dominated by numeric character reference decoding.
  • In 0351b78bb9 I tried to micro-optimize by eliminating if checks, rearranging code according to a frequency analysis of code points, and replacing the Windows-1252 remapping with direct replacement. In a test of 10 million randomly generated numeric character references, this performed around 3-5% faster than the code in the branch, but in real tests I could not measure any impact. The micro-optimizations are likely inert in a real context.

In my benchmark of decoding 10 million randomly-generated numeric character references about half the time is spent exclusively inside read_character_reference() and the other half is spent in code_point_to_utf8_bytes().

  • I tried replacing substr() + intval() with an unrolled table-lookup custom string-to-integer decoder. While that decoder performed significantly better than a native pure-PHP decoder, it was still noticeably slower than intval().

I'm led to believe that this is nearly optimal for a pure PHP solution.
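To make the substr() + intval() approach concrete, here is a minimal sketch of reading one numeric character reference. `sketch_read_numeric_reference()` is a hypothetical helper, not the read_character_reference() from the patch. It assumes a simplified rule set: zero, surrogates, and values above U+10FFFF become U+FFFD, and the Windows-1252 remapping of the 0x80 to 0x9F range is left out.

```php
<?php
/**
 * Reads a numeric character reference starting at byte offset $at.
 *
 * Returns array( code point, byte length of the reference ), or null when
 * no numeric character reference starts at that offset. Hypothetical helper.
 */
function sketch_read_numeric_reference( string $text, int $at = 0 ): ?array {
	if ( '&#' !== substr( $text, $at, 2 ) ) {
		return null;
	}

	$digits_at = $at + 2;
	$is_hex    = isset( $text[ $digits_at ] ) && ( 'x' === $text[ $digits_at ] || 'X' === $text[ $digits_at ] );
	if ( $is_hex ) {
		++$digits_at;
	}

	$digit_chars = $is_hex ? '0123456789abcdefABCDEF' : '0123456789';
	$digit_count = strspn( $text, $digit_chars, $digits_at );
	if ( 0 === $digit_count ) {
		return null;
	}

	// Leading zeros may repeat without limit, so skip them before checking the length.
	$zero_count        = strspn( $text, '0', $digits_at, $digit_count );
	$significant_count = $digit_count - $zero_count;

	// 0x10FFFF has six hex digits; 1114111 has seven decimal digits.
	$max_digits = $is_hex ? 6 : 7;

	if ( $significant_count > $max_digits ) {
		// Too many digits to be a valid code point; avoids integer overflow in intval().
		$code_point = 0xFFFD;
	} else {
		$digits     = substr( $text, $digits_at + $zero_count, $significant_count );
		$code_point = intval( $digits, $is_hex ? 16 : 10 );

		$is_surrogate = $code_point >= 0xD800 && $code_point <= 0xDFFF;
		if ( 0 === $code_point || $code_point > 0x10FFFF || $is_surrogate ) {
			$code_point = 0xFFFD;
		}
	}

	// The trailing semicolon is consumed when present but is not required here.
	$end = $digits_at + $digit_count;
	if ( isset( $text[ $end ] ) && ';' === $text[ $end ] ) {
		++$end;
	}

	return array( $code_point, $end - $at );
}
```

The returned byte length lets a caller advance past the reference, and the code point can then be passed to whichever UTF-8 encoder is in use.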

### Character-set detections.

The following CSV file is the result of surveying the / path of popular domains. It includes detections of whether the HTML found at that path is valid UTF-8, valid Windows-1252, valid ASCII, and whether it's valid in each of its self-reported character sets.

A 1 indicates that the HTML passes mb_check_encoding() for the encoding of the given column. A 0 indicates that it doesn't. A missing value indicates that the site did not self-report that encoding.

Note that a site might self-report being encoded in multiple simultaneous and mutually-exclusive encodings.

charset-detections.csv
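One row of such a survey could be computed along these lines. This is a minimal sketch that assumes the mbstring extension is loaded; `sketch_encoding_row()` and the `self:` column prefix are hypothetical and do not reflect the actual layout of the CSV.

```php
<?php
/**
 * Builds one survey row: 1 or 0 for each fixed encoding, plus one entry for
 * each encoding the site reported for itself. Hypothetical helper.
 *
 * @param string   $html          Raw bytes fetched from the site.
 * @param string[] $self_reported Encoding labels the site declared, if any.
 */
function sketch_encoding_row( string $html, array $self_reported ): array {
	$row = array();

	foreach ( array( 'UTF-8', 'Windows-1252', 'ASCII' ) as $encoding ) {
		$row[ $encoding ] = mb_check_encoding( $html, $encoding ) ? 1 : 0;
	}

	// Encodings the site did not report get no entry: the "missing value" case.
	foreach ( $self_reported as $encoding ) {
		$key = 'self:' . strtolower( $encoding );
		try {
			$row[ $key ] = mb_check_encoding( $html, $encoding ) ? 1 : 0;
		} catch ( ValueError $e ) {
			// PHP 8 throws for labels mbstring does not recognize; record no verdict.
			$row[ $key ] = null;
		}
	}

	return $row;
}
```

A site that declares two mutually exclusive encodings simply gets two `self:` entries, each checked independently.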

### html5lib tests

| Results | Output |
|---------|--------|
| Before | Tests: 609, Assertions: 172, Failures: 63, Skipped: 435. |
| After | Tests: 607, Assertions: 172, Skipped: 435. |

<details><summary>Tests that are now possible to run that previously weren't.</summary>
https://github.com/WordPress/wordpress-develop/assets/5431237/8f63c5c9-aec2-487e-98b0-b2b723f38982
</details>

## Differences from html_entity_decode()

<details><summary>PHP misses 720 character references</summary>
&amp;AElig &amp;AMP &amp;AMP; &amp;Aacute &amp;Acirc &amp;Agrave &amp;ApplyFunction; &amp;Aring &amp;Assign; &amp;Atilde &amp;Auml &amp;Backslash; &amp;Barwed; &amp;Bernoullis; &amp;Bumpeq; &amp;COPY &amp;COPY; &amp;Cayleys; &amp;Ccedil &amp;CircleMinus; &amp;ClockwiseContourIntegral; &amp;CloseCurlyDoubleQuote; &amp;CloseCurlyQuote; &amp;Conint; &amp;Copf; &amp;CounterClockwiseContourIntegral; &amp;DD; &amp;Del; &amp;DiacriticalDot; &amp;DiacriticalGrave; &amp;Diamond; &amp;Dot; &amp;DotEqual; &amp;DoubleDownArrow; &amp;DoubleLeftRightArrow; &amp;DoubleLeftTee; &amp;DoubleLongLeftArrow; &amp;DoubleLongLeftRightArrow; &amp;DoubleRightArrow; &amp;DoubleUpDownArrow; &amp;DoubleVerticalBar; &amp;DownArrow; &amp;DownLeftVector; &amp;DownRightVector; &amp;ETH &amp;Eacute &amp;Ecirc &amp;Egrave &amp;Element; &amp;EqualTilde; &amp;Equilibrium; &amp;Escr; &amp;Euml &amp;ExponentialE; &amp;FilledVerySmallSquare; &amp;ForAll; &amp;Fscr; &amp;GT &amp;GT; &amp;GreaterEqual; &amp;GreaterEqualLess; &amp;GreaterFullEqual; &amp;GreaterLess; &amp;GreaterSlantEqual; &amp;Gt; &amp;Hscr; &amp;HumpDownHump; &amp;Iacute &amp;Icirc &amp;Igrave &amp;Im; &amp;Intersection; &amp;InvisibleComma; &amp;Iscr; &amp;Iuml &amp;LT &amp;LT; &amp;Laplacetrf; &amp;LeftAngleBracket; &amp;LeftArrow; &amp;LeftArrowRightArrow; &amp;LeftCeiling; &amp;LeftDownVector; &amp;LeftRightArrow; &amp;LeftTee; &amp;LeftTriangle; &amp;LeftUpVector; &amp;LeftVector; &amp;Leftarrow; &amp;Leftrightarrow; &amp;LessEqualGreater; &amp;LessFullEqual; &amp;LessGreater; &amp;LessSlantEqual; &amp;Lleftarrow; &amp;LongLeftArrow; &amp;Longleftarrow; &amp;Longleftrightarrow; &amp;Longrightarrow; &amp;LowerLeftArrow; &amp;Lscr; &amp;Lt; &amp;Mscr; &amp;NegativeMediumSpace; &amp;NegativeThickSpace; &amp;NegativeThinSpace; &amp;NegativeVeryThinSpace; &amp;NestedGreaterGreater; &amp;NestedLessLess; &amp;NonBreakingSpace; &amp;Nopf; &amp;NotDoubleVerticalBar; &amp;NotElement; &amp;NotEqualTilde; &amp;NotExists; &amp;NotGreater; 
&amp;NotGreaterEqual; &amp;NotGreaterSlantEqual; &amp;NotGreaterTilde; &amp;NotHumpDownHump; &amp;NotHumpEqual; &amp;NotLeftTriangle; &amp;NotLeftTriangleEqual; &amp;NotLessGreater; &amp;NotLessLess; &amp;NotLessSlantEqual; &amp;NotLessTilde; &amp;NotPrecedes; &amp;NotReverseElement; &amp;NotRightTriangle; &amp;NotSubset; &amp;NotSuperset; &amp;NotTildeEqual; &amp;NotTildeFullEqual; &amp;NotTildeTilde; &amp;NotVerticalBar; &amp;Ntilde &amp;Oacute &amp;Ocirc &amp;Ograve &amp;Oslash &amp;Otilde &amp;Ouml &amp;OverBar; &amp;PartialD; &amp;PlusMinus; &amp;Poincareplane; &amp;Popf; &amp;Precedes; &amp;PrecedesEqual; &amp;PrecedesTilde; &amp;Product; &amp;Proportion; &amp;Proportional; &amp;QUOT &amp;QUOT; &amp;Qopf; &amp;RBarr; &amp;REG &amp;REG; &amp;Rarr; &amp;Re; &amp;ReverseEquilibrium; &amp;RightArrow; &amp;RightArrowLeftArrow; &amp;RightTee; &amp;RightTeeArrow; &amp;RightTriangle; &amp;RightVector; &amp;Rightarrow; &amp;Rrightarrow; &amp;Rscr; &amp;Rsh; &amp;ShortDownArrow; &amp;ShortLeftArrow; &amp;ShortRightArrow; &amp;ShortUpArrow; &amp;SmallCircle; &amp;SquareIntersection; &amp;SquareSubset; &amp;SquareSuperset; &amp;SquareUnion; &amp;Subset; &amp;Succeeds; &amp;SucceedsSlantEqual; &amp;SuchThat; &amp;Sum; &amp;Sup; &amp;Superset; &amp;SupersetEqual; &amp;THORN &amp;TRADE; &amp;Therefore; &amp;Tilde; &amp;TildeEqual; &amp;TildeTilde; &amp;Uacute &amp;Ucirc &amp;Ugrave &amp;UnderBar; &amp;UnderBracket; &amp;Union; &amp;UpArrow; &amp;UpArrowDownArrow; &amp;UpEquilibrium; &amp;UpTee; &amp;Uparrow; &amp;UpperLeftArrow; &amp;Upsi; &amp;Uuml &amp;Vee; &amp;Vert; &amp;VerticalBar; &amp;VerticalLine; &amp;VerticalTilde; &amp;VeryThinSpace; &amp;Wedge; &amp;Yacute &amp;aacute &amp;acirc &amp;acute &amp;acute; &amp;aelig &amp;agrave &amp;alefsym; &amp;amp &amp;ang; &amp;angst; &amp;ap; &amp;approxeq; &amp;aring &amp;asymp; &amp;asympeq; &amp;atilde &amp;auml &amp;backcong; &amp;backsim; &amp;barwedge; &amp;becaus; &amp;because; &amp;bepsi; &amp;bernou; &amp;bigodot; 
&amp;bigstar; &amp;bigvee; &amp;bigwedge; &amp;blacklozenge; &amp;blacksquare; &amp;bot; &amp;bottom; &amp;boxh; &amp;boxtimes; &amp;bprime; &amp;breve; &amp;brvbar &amp;bsime; &amp;bullet; &amp;bumpe; &amp;bumpeq; &amp;caron; &amp;ccedil &amp;cedil &amp;cedil; &amp;cent &amp;centerdot; &amp;checkmark; &amp;circlearrowleft; &amp;circlearrowright; &amp;circledR; &amp;circledS; &amp;circledast; &amp;circledcirc; &amp;circleddash; &amp;cire; &amp;clubsuit; &amp;colone; &amp;complement; &amp;cong; &amp;conint; &amp;coprod; &amp;copy &amp;cuepr; &amp;cularr; &amp;curlyeqsucc; &amp;curren &amp;curvearrowright; &amp;cuvee; &amp;cuwed; &amp;dArr; &amp;dash; &amp;dblac; &amp;dd; &amp;ddagger; &amp;ddarr; &amp;deg &amp;dharr; &amp;diam; &amp;diams; &amp;die; &amp;digamma; &amp;div; &amp;divide &amp;divideontimes; &amp;dlcorn; &amp;doteq; &amp;dotminus; &amp;dotplus; &amp;dotsquare; &amp;downarrow; &amp;downharpoonleft; &amp;downharpoonright; &amp;dtri; &amp;dtrif; &amp;duarr; &amp;duhar; &amp;eDDot; &amp;eDot; &amp;eacute &amp;ecirc &amp;ecolon; &amp;ee; &amp;efDot; &amp;egrave &amp;emptyset; &amp;emptyv; &amp;epsilon; &amp;epsiv; &amp;eqcirc; &amp;eqsim; &amp;eqslantgtr; &amp;eqslantless; &amp;equiv; &amp;erDot; &amp;eth &amp;euml &amp;exist; &amp;fork; &amp;frac12 &amp;frac12; &amp;frac14 &amp;frac34 &amp;gE; &amp;gel; &amp;geq; &amp;geqslant; &amp;ggg; &amp;gnE; &amp;gnapprox; &amp;gneq; &amp;gsim; &amp;gt &amp;gtdot; &amp;gtrapprox; &amp;gtreqqless; &amp;gtrless; &amp;gtrsim; &amp;gvnE; &amp;hamilt; &amp;hbar; &amp;heartsuit; &amp;hksearow; &amp;hkswarow; &amp;hookleftarrow; &amp;hookrightarrow; &amp;hslash; &amp;iacute &amp;icirc &amp;iexcl &amp;iff; &amp;igrave &amp;ii; &amp;iiint; &amp;image; &amp;imagpart; &amp;imath; &amp;in; &amp;int; &amp;integers; &amp;intercal; &amp;intprod; &amp;iquest &amp;isin; &amp;it; &amp;iuml &amp;kappav; &amp;lArr; &amp;lEg; &amp;lang; &amp;laquo &amp;larrb; &amp;lcub; &amp;ldquo; &amp;ldquor; &amp;le; &amp;leftarrow; &amp;leftarrowtail; 
&amp;leftleftarrows; &amp;leftrightarrow; &amp;leftrightarrows; &amp;leftrightsquigarrow; &amp;leftthreetimes; &amp;leg; &amp;leqq; &amp;leqslant; &amp;lessapprox; &amp;lesssim; &amp;lfloor; &amp;lg; &amp;lhard; &amp;lharu; &amp;lmoustache; &amp;lnE; &amp;lnapprox; &amp;lneq; &amp;lobrk; &amp;longleftrightarrow; &amp;longmapsto; &amp;longrightarrow; &amp;looparrowleft; &amp;loz; &amp;lrcorner; &amp;lrhar; &amp;lsh; &amp;lsim; &amp;lsqb; &amp;lsquo; &amp;lsquor; &amp;lt &amp;ltdot; &amp;ltrie; &amp;ltrif; &amp;lvnE; &amp;macr &amp;malt; &amp;mapsto; &amp;mapstodown; &amp;mapstoleft; &amp;mapstoup; &amp;measuredangle; &amp;micro &amp;midast; &amp;middot &amp;middot; &amp;minusb; &amp;mldr; &amp;mnplus; &amp;mp; &amp;mstpos; &amp;multimap; &amp;nGtv; &amp;nLeftrightarrow; &amp;nRightarrow; &amp;nap; &amp;natural; &amp;nbsp &amp;ne; &amp;nearr; &amp;nearrow; &amp;nequiv; &amp;nesear; &amp;nexists; &amp;ngE; &amp;nge; &amp;ngeqq; &amp;ngeqslant; &amp;ngt; &amp;nharr; &amp;ni; &amp;niv; &amp;nlArr; &amp;nlarr; &amp;nle; &amp;nleq; &amp;nleqq; &amp;nleqslant; &amp;nless; &amp;nlt; &amp;nmid; &amp;not &amp;notinva; &amp;notni; &amp;npar; &amp;nprcue; &amp;npre; &amp;nprec; &amp;npreceq; &amp;nrightarrow; &amp;nrtri; &amp;nrtrie; &amp;nsc; &amp;nsccue; &amp;nsce; &amp;nshortparallel; &amp;nsim; &amp;nsimeq; &amp;nsmid; &amp;nspar; &amp;nsqsube; &amp;nsqsupe; &amp;nsube; &amp;nsubset; &amp;nsubseteq; &amp;nsubseteqq; &amp;nsucc; &amp;nsucceq; &amp;nsupE; &amp;nsupe; &amp;nsupseteq; &amp;ntgl; &amp;ntilde &amp;ntriangleleft; &amp;ntrianglelefteq; &amp;ntrianglerighteq; &amp;nwarr; &amp;oacute &amp;ocirc &amp;odot; &amp;ograve &amp;ohm; &amp;oint; &amp;oplus; &amp;order; &amp;ordf &amp;ordm &amp;oscr; &amp;oslash &amp;otilde &amp;otimes; &amp;ouml &amp;par; &amp;para &amp;parallel; &amp;phiv; &amp;phmmat; &amp;plankv; &amp;plusb; &amp;plusmn &amp;pm; &amp;pound &amp;pr; &amp;prap; &amp;prcue; &amp;pre; &amp;preccurlyeq; &amp;prnE; &amp;prnap; &amp;prnsim; &amp;propto; 
&amp;prsim; &amp;qint; &amp;quaternions; &amp;questeq; &amp;quot &amp;rArr; &amp;rBarr; &amp;radic; &amp;rang; &amp;rangle; &amp;raquo &amp;rarr; &amp;rarrb; &amp;rarrlp; &amp;rbarr; &amp;rbrace; &amp;rbrack; &amp;rceil; &amp;rdquor; &amp;real; &amp;realpart; &amp;reals; &amp;reg &amp;rfloor; &amp;rightarrow; &amp;rightarrowtail; &amp;rightharpoondown; &amp;rightharpoonup; &amp;rightrightarrows; &amp;rightsquigarrow; &amp;rightthreetimes; &amp;rlarr; &amp;rlhar; &amp;rmoustache; &amp;robrk; &amp;rsquor; &amp;rtrie; &amp;rtrif; &amp;sc; &amp;scap; &amp;sccue; &amp;sce; &amp;scnap; &amp;scsim; &amp;searr; &amp;searrow; &amp;sect &amp;setminus; &amp;setmn; &amp;sfrown; &amp;shortmid; &amp;shy &amp;sigmaf; &amp;sime; &amp;slarr; &amp;smallsetminus; &amp;smid; &amp;spades; &amp;spar; &amp;sqsube; &amp;sqsubset; &amp;sqsubseteq; &amp;sqsup; &amp;sqsupe; &amp;sqsupseteq; &amp;squ; &amp;square; &amp;squf; &amp;ssmile; &amp;sstarf; &amp;strns; &amp;sube; &amp;subnE; &amp;subne; &amp;subset; &amp;subseteq; &amp;subseteqq; &amp;succeq; &amp;succneqq; &amp;succnsim; &amp;succsim; &amp;sup1 &amp;sup2 &amp;sup3 &amp;supE; &amp;supne; &amp;supset; &amp;supseteq; &amp;supsetneqq; &amp;swarrow; &amp;szlig &amp;tbrk; &amp;tdot; &amp;therefore; &amp;thetav; &amp;thickapprox; &amp;thicksim; &amp;thinsp; &amp;thkap; &amp;thksim; &amp;thorn &amp;tilde; &amp;times &amp;top; &amp;tosa; &amp;triangleleft; &amp;trianglelefteq; &amp;triangleright; &amp;trianglerighteq; &amp;trie; &amp;twixt; &amp;twoheadleftarrow; &amp;uArr; &amp;uacute &amp;ucirc &amp;ugrave &amp;uharr; &amp;ulcorn; &amp;uml &amp;uml; &amp;uparrow; &amp;updownarrow; &amp;upharpoonleft; &amp;upharpoonright; &amp;uplus; &amp;upsilon; &amp;urcorn; &amp;utri; &amp;utrif; &amp;uuarr; &amp;uuml &amp;vArr; &amp;vDash; &amp;varepsilon; &amp;varnothing; &amp;varphi; &amp;varpi; &amp;varpropto; &amp;varr; &amp;varrho; &amp;varsigma; &amp;varsubsetneq; &amp;varsubsetneqq; &amp;varsupsetneq; &amp;vartheta; &amp;vartriangleright; 
&amp;vee; &amp;verbar; &amp;vltri; &amp;vnsup; &amp;vprop; &amp;vsupnE; &amp;wedge; &amp;weierp; &amp;wreath; &amp;xcap; &amp;xcirc; &amp;xcup; &amp;xdtri; &amp;xharr; &amp;xlarr; &amp;xoplus; &amp;xotime; &amp;xrArr; &amp;xrarr; &amp;xsqcup; &amp;xuplus; &amp;xutri; &amp;yacute &amp;yen &amp;yuml &amp;zeetrf;
</details>

In this list are many named character references without a trailing ;. This is because HTML does not require one in all cases. There's another behavior concerning numeric character references where the trailing ; isn't required at certain boundaries.

Further, whether or not the trailing ; is required is subject to the ambiguous ampersand rule, which guards a legacy behavior for certain query args in URL attributes which weren't properly encoded.

---

Outputs from this PR

The FromPHP column shows how html_entity_decode( $input, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5 ) would decode the input.

The Data and Attribute columns show how the HTML API decodes the text in the context of markup (data) and of an attribute value (attribute). These are different in HTML, and unfortunately PHP does not provide a way to differentiate them. The main difference is in the so-called "ambiguous ampersand" rule which allows many "entities" to be written without the terminating semicolon ; (though not all of the named character references may do this). In attributes, however, some of these can look like URL query params. E.g. is &not=dogs supposed to be ¬=dogs or a query arg named not whose value is dogs? HTML chose to ensure the safety of URLs and forbid decoding character references in these ambiguous cases.

https://github.com/WordPress/wordpress-develop/assets/5431237/d290dc9d-7ea1-42c1-911e-9ff59eb880d6
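The data/attribute difference can be illustrated with a minimal sketch. `sketch_decode_leading_reference()` is hypothetical and is not the HTML API's decoder: it handles only a reference at the very start of the string and uses a toy five-entry table in place of the full HTML5 list. The attribute rule it applies is the one from the HTML specification: a match without a trailing ; is left alone when the next character is = or an ASCII alphanumeric.

```php
<?php
/**
 * Decodes one named character reference at the start of $text.
 *
 * @param string $text    Input beginning with a possible character reference.
 * @param string $context Either 'data' or 'attribute'.
 */
function sketch_decode_leading_reference( string $text, string $context ): string {
	// Toy table, longest names first so the first hit is the longest match.
	// Names without ";" are legacy forms that may omit the semicolon.
	$lookup = array(
		'notin;' => "\u{2209}",
		'not;'   => "\u{AC}",
		'not'    => "\u{AC}",
		'amp;'   => '&',
		'amp'    => '&',
	);

	if ( '' === $text || '&' !== $text[0] ) {
		return $text;
	}

	foreach ( $lookup as $name => $replacement ) {
		$after = strlen( $name ) + 1;
		if ( 0 !== strncmp( $text, '&' . $name, $after ) ) {
			continue;
		}

		$has_semicolon = ';' === substr( $name, -1 );
		$is_ambiguous  = 1 === preg_match( '/^[=0-9A-Za-z]/', substr( $text, $after, 1 ) );

		// In attributes, an unterminated match followed by "=" or an alphanumeric stays as-is.
		if ( 'attribute' === $context && ! $has_semicolon && $is_ambiguous ) {
			return $text;
		}

		return $replacement . substr( $text, $after );
	}

	return $text;
}
```

With this sketch, `&not=dogs` becomes `¬=dogs` in the `data` context but is returned unchanged in the `attribute` context, mirroring the Data and Attribute columns described above.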

Outputs from a browser

I've compared Firefox and Safari. The middle column shows the data value and the right column has extracted the title attribute of the input and set it as the innerHTML of the TD.

https://github.com/WordPress/wordpress-develop/assets/5431237/9f99b895-1fb5-46be-a145-df0fd20d451b

The empty boxes represent unrendered Unicode characters. While some characters, like the null byte, are replaced with a Replacement Character (U+FFFD), "non-characters" are passed through, even though they are parse errors.

Trac ticket:

@jonsurrell commented on PR #6387:


15 months ago
#5

I've pushed a few changes. I think there was some inconsistency between null and false in the read_character_reference return type, which resulted in a call to strlen( null ) showing up in tests as:

1) Tests_HtmlApi_WpHtmlDecoder::test_detects_ascii_case_insensitive_attribute_prefixes with data set "&#X6A;avascript&colon" ('&#X6A;avascript&colon', 'javascript:')
strlen(): Passing null to parameter #1 ($string) of type string is deprecated

/var/www/src/wp-includes/html-api/class-wp-html-decoder.php:63
/var/www/tests/phpunit/tests/html-api/wpHtmlDecoder.php:27
/var/www/vendor/bin/phpunit:122
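
The failure mode can be reproduced in isolation. The following is a minimal sketch with hypothetical function names, assuming a reader that returns null when nothing decodes; it is not the code from class-wp-html-decoder.php. Since PHP 8.1, passing null to strlen() raises the deprecation notice shown above, so the caller checks for null first.

```php
<?php
/**
 * Hypothetical reader: returns the decoded text, or null when $text does not
 * start with a character reference this toy version recognizes.
 */
function sketch_read_reference( string $text ): ?string {
	// "&colon;" needs its semicolon; "&colon" alone is not a reference.
	return 0 === strncmp( $text, '&colon;', 7 ) ? ':' : null;
}

/**
 * Returns the byte length of the decoded reference, or 0 when none was found.
 */
function sketch_decoded_length( string $text ): int {
	$decoded = sketch_read_reference( $text );

	// Guard against null so strlen() only ever receives a string.
	return null === $decoded ? 0 : strlen( $decoded );
}
```

Settling on a single "nothing found" value (null here) and checking it explicitly keeps callers from passing it to string functions by accident.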