Make WordPress Core

Changeset 57211


Timestamp:
12/20/2023 05:50:04 PM
Author:
Bernhard Reiter
Message:

HTML API: Avoid processing incomplete tokens.

Currently the Tag Processor assumes that an input document is a full HTML document. Because of this, if there's lingering content after the last tag match it will treat that content as plaintext and skip over it. This is fine for the Tag Processor because if there is lingering content that isn't a valid tag then there's nothing for next_tag() to match.

However, in order to support a number of feature expansions it is important to recognize that the remaining content may involve partial syntax elements, such as incomplete tags, attributes, or comments.

In this patch we're adding a mode inside the Tag Processor which will flip when we start parsing HTML syntax but the document finishes before the token does. This will provide the ability to:

  • extend the input document,
  • avoid misinterpreting syntax as text, and
  • guess whether we have a complete document, and know when we have an incomplete one.

In the process of building this patch a few bugs were identified and fixed in the Tag Processor, namely in the handling of incomplete syntax elements.

Props dmsnell, jonsurrell.
Fixes #60122, #60108.
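
The pause-and-rewind behavior described in this message can be illustrated with a small standalone sketch. This is not the Tag Processor's implementation: `sketch_next_token()` and the `SKETCH_*` constants are hypothetical names invented for illustration, and the scan is deliberately simplified (it ignores comments, DOCTYPE declarations, a quoted `>` inside attribute values, and special elements such as SCRIPT). It only shows the idea of remembering where a token started, rewinding to that point when the input ends early, and resuming once more input arrives.

```php
<?php
// Illustrative sketch only: these names are hypothetical, not WordPress APIs.
const SKETCH_COMPLETE    = 'complete';    // Only text (or nothing) remains.
const SKETCH_INCOMPLETE  = 'incomplete';  // Input ended inside a tag.
const SKETCH_MATCHED_TAG = 'matched-tag'; // A whole tag was found.

/**
 * Scans forward from $offset for the next complete tag.
 *
 * Simplifications (assumptions): comments, DOCTYPE, a quoted ">" inside
 * attribute values, and special elements such as SCRIPT are not handled.
 */
function sketch_next_token( string $html, int &$offset ): string {
	$was_at = $offset;
	$length = strlen( $html );
	$names  = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/';

	while ( $offset < $length ) {
		$at = strpos( $html, '<', $offset );
		if ( false === $at ) {
			// No more syntax: the rest is plain text.
			$offset = $length;
			return SKETCH_COMPLETE;
		}

		if ( $at + 1 >= $length ) {
			// A lone "<" ends the input; it may be the start of a tag.
			$offset = $was_at;
			return SKETCH_INCOMPLETE;
		}

		if ( 1 !== strspn( $html, $names, $at + 1, 1 ) ) {
			// "<" not followed by a letter or "/" is treated as text here.
			$offset = $at + 1;
			continue;
		}

		$closer = strpos( $html, '>', $at + 1 );
		if ( false === $closer ) {
			// The tag never closes: rewind so parsing can resume later.
			$offset = $was_at;
			return SKETCH_INCOMPLETE;
		}

		$offset = $closer + 1;
		return SKETCH_MATCHED_TAG;
	}

	return SKETCH_COMPLETE;
}

$html   = '<p>Hi <b';
$offset = 0;
$first  = sketch_next_token( $html, $offset ); // Finds "<p>".
$second = sketch_next_token( $html, $offset ); // Pauses at the partial "<b".

// Extending the input lets the scan resume from the same offset.
$html .= '>there</b>';
$third = sketch_next_token( $html, $offset );
```

With WordPress loaded, the corresponding check against the real class, as documented in this changeset, is to call `next_tag()` and, when it returns `false`, ask `paused_at_incomplete_token()` whether the input ended mid-token.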

Location:
trunk
Files:
2 edited

  • trunk/src/wp-includes/html-api/class-wp-html-tag-processor.php

    r57179 r57211  
    1616 *    This would increase the size of the changes for some operations but leave more
    1717 *    natural-looking output HTML.
    18  *  - Decode HTML character references within class names when matching. E.g. match having
    19  *    class `1<"2` needs to recognize `class="1&lt;&quot;2"`. Currently the Tag Processor
    20  *    will fail to find the right tag if the class name is encoded as such.
    2118 *  - Properly decode HTML character references in `get_attribute()`. PHP's
    2219 *    `html_entity_decode()` is wrong in a couple ways: it doesn't account for the
     
    107104 * given, it will return `true` (the only way to set `false` for an
    108105 * attribute is to remove it).
     106 *
     107 * #### When matching fails
     108 *
     109 * When `next_tag()` returns `false` it could mean different things:
     110 *
     111 *  - The requested tag wasn't found in the input document.
     112 *  - The input document ended in the middle of an HTML syntax element.
     113 *
     114 * When a document ends in the middle of a syntax element it will pause
     115 * the processor. This is to make it possible in the future to extend the
     116 * input document and proceed - an important requirement for chunked
     117 * streaming parsing of a document.
     118 *
     119 * Example:
     120 *
     121 *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
     122 *     false === $processor->next_tag();
     123 *
     124 * If a special element (see next section) is encountered but no closing tag
     125 * is found it will count as an incomplete tag. The parser will pause as if
     126 * the opening tag were incomplete.
     127 *
     128 * Example:
     129 *
     130 *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
     131 *     false === $processor->next_tag();
     132 *
     133 *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
     134 *     true === $processor->next_tag( 'DIV' );
     135 *
     136 * #### Special elements
     137 *
     138 * Some HTML elements are handled in a special way; their start and end tags
     139 * act like a void tag. These are special because their contents can't contain
     140 * HTML markup. Everything inside these elements is handled in a special way
     141 * and content that _appears_ like HTML tags inside of them isn't. There can
     142 * be no nesting in these elements.
     143 *
     144 * In the following list, "raw text" means that all of the content in the HTML
     145 * until the matching closing tag is treated verbatim without any replacements
     146 * and without any parsing.
     147 *
     148 *  - IFRAME allows no content but requires a closing tag.
     149 *  - NOEMBED (deprecated) content is raw text.
     150 *  - NOFRAMES (deprecated) content is raw text.
     151 *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
     152 *  - STYLE content is raw text.
     153 *  - TITLE content is plain text but character references are decoded.
     154 *  - TEXTAREA content is plain text but character references are decoded.
     155 *  - XMP (deprecated) content is raw text.
    109156 *
    110157 * ### Modifying HTML attributes for a found tag
     
    242289 * unquoted values will appear in the output with double-quotes.
    243290 *
     291 * ### Scripting Flag
     292 *
     293 * The Tag Processor parses HTML with the "scripting flag" disabled. This means
     294 * that it doesn't run any scripts while parsing the page. In a browser with
     295 * JavaScript enabled, for example, the script can change the parse of the
     296 * document as it loads. On the server, however, evaluating JavaScript is not
     297 * only impractical, but also unwanted.
     298 *
     299 * Practically this means that the Tag Processor will descend into NOSCRIPT
     300 * elements and process their child tags. Were the scripting flag enabled, such
     301 * as in a typical browser, the contents of NOSCRIPT would be skipped entirely.
     302 *
     303 * This allows the HTML API to process the content that will be presented in
     304 * a browser when scripting is disabled, but it offers a different view of a
     305 * page than most browser sessions will experience. E.g. in a browser with
     306 * scripting enabled the tags inside the NOSCRIPT disappear.
     307 *
     308 * ### Text Encoding
     309 *
     310 * The Tag Processor assumes that the input HTML document is encoded with a
     311 * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
     312 * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
     313 * carriage-return, newline, and form-feed.
     314 *
     315 * In practice, this includes almost every single-byte encoding as well as
     316 * UTF-8. Notably, however, it does not include UTF-16. If providing input
     317 * that's incompatible, then convert the encoding beforehand.
     318 *
    244319 * @since 6.2.0
    245320 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
    246321 * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
     322 * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
     323 *              Introduces "special" elements which act like void elements, e.g. STYLE.
    247324 */
    248325class WP_HTML_Tag_Processor {
     
    316393     */
    317394    private $stop_on_tag_closers;
     395
     396    /**
     397     * Specifies mode of operation of the parser at any given time.
     398     *
     399     * | State         | Meaning                                                              |
     400     * | --------------|----------------------------------------------------------------------|
     401     * | *Ready*       | The parser is ready to run.                                          |
     402     * | *Complete*    | There is nothing left to parse.                                      |
     403     * | *Incomplete*  | The HTML ended in the middle of a token; nothing more can be parsed. |
     404     * | *Matched tag* | Found an HTML tag; it's possible to modify its attributes.           |
     405     *
     406     * @since 6.5.0
     407     *
     408     * @see WP_HTML_Tag_Processor::STATE_READY
     409     * @see WP_HTML_Tag_Processor::STATE_COMPLETE
     410     * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE
     411     * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
     412     *
     413     * @var string
     414     */
     415    private $parser_state = self::STATE_READY;
    318416
    319417    /**
     
    545643     *
    546644     * @since 6.2.0
     645     * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token.
    547646     *
    548647     * @param array|string|null $query {
     
    563662
    564663        do {
    565             if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     664            if ( false === $this->next_token() ) {
    566665                return false;
    567666            }
    568667
    569             // Find the next tag if it exists.
    570             if ( false === $this->parse_next_tag() ) {
    571                 $this->bytes_already_parsed = strlen( $this->html );
    572 
    573                 return false;
    574             }
    575 
    576             // Parse all of its attributes.
    577             while ( $this->parse_next_attribute() ) {
     668            if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    578669                continue;
    579670            }
    580671
    581             // Ensure that the tag closes before the end of the document.
    582             if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
    583                 return false;
    584             }
    585 
    586             $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
    587             if ( false === $tag_ends_at ) {
    588                 return false;
    589             }
    590             $this->token_length         = $tag_ends_at - $this->token_starts_at;
    591             $this->bytes_already_parsed = $tag_ends_at;
    592 
    593             // Finally, check if the parsed tag and its attributes match the search query.
    594672            if ( $this->matches() ) {
    595673                ++$already_found;
    596674            }
    597 
    598             /*
    599              * For non-DATA sections which might contain text that looks like HTML tags but
    600              * isn't, scan with the appropriate alternative mode. Looking at the first letter
    601              * of the tag name as a pre-check avoids a string allocation when it's not needed.
    602              */
    603             $t = $this->html[ $this->tag_name_starts_at ];
    604             if (
    605                 ! $this->is_closing_tag &&
     675        } while ( $already_found < $this->sought_match_offset );
     676
     677        return true;
     678    }
     679
     680    /**
     681     * Finds the next token in the HTML document.
     682     *
     683     * An HTML document can be viewed as a stream of tokens,
     684     * where tokens are things like HTML tags, HTML comments,
     685     * text nodes, etc. This method finds the next token in
     686     * the HTML document and returns whether it found one.
     687     *
     688     * If it starts parsing a token and reaches the end of the
     689     * document then it will seek to the start of the last
     690     * token and pause, returning `false` to indicate that it
     691     * failed to find a complete token.
     692     *
     693     * Possible token types, based on the HTML specification:
     694     *
     695     *  - an HTML tag, whether opening, closing, or void.
     696     *  - a text node - the plaintext inside tags.
     697     *  - an HTML comment.
     698     *  - a DOCTYPE declaration.
     699     *  - a processing instruction, e.g. `<?xml version="1.0" ?>`.
     700     *
     701     * The Tag Processor currently only supports the tag token.
     702     *
     703     * @since 6.5.0
     704     *
     705     * @return bool Whether a token was parsed.
     706     */
     707    public function next_token() {
     708        $this->get_updated_html();
     709        $was_at = $this->bytes_already_parsed;
     710
     711        // Don't proceed if there's nothing more to scan.
     712        if (
     713            self::STATE_COMPLETE === $this->parser_state ||
     714            self::STATE_INCOMPLETE === $this->parser_state
     715        ) {
     716            return false;
     717        }
     718
     719        /*
     720         * The next step in the parsing loop determines the parsing state;
     721         * clear it so that state doesn't linger from the previous step.
     722         */
     723        $this->parser_state = self::STATE_READY;
     724
     725        if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     726            $this->parser_state = self::STATE_COMPLETE;
     727            return false;
     728        }
     729
     730        // Find the next tag if it exists.
     731        if ( false === $this->parse_next_tag() ) {
     732            if ( self::STATE_INCOMPLETE === $this->parser_state ) {
     733                $this->bytes_already_parsed = $was_at;
     734            }
     735
     736            return false;
     737        }
     738
     739        // Parse all of its attributes.
     740        while ( $this->parse_next_attribute() ) {
     741            continue;
     742        }
     743
     744        // Ensure that the tag closes before the end of the document.
     745        if (
     746            self::STATE_INCOMPLETE === $this->parser_state ||
     747            $this->bytes_already_parsed >= strlen( $this->html )
     748        ) {
     749            // Does this appropriately clear state (parsed attributes)?
     750            $this->parser_state         = self::STATE_INCOMPLETE;
     751            $this->bytes_already_parsed = $was_at;
     752
     753            return false;
     754        }
     755
     756        $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
     757        if ( false === $tag_ends_at ) {
     758            $this->parser_state         = self::STATE_INCOMPLETE;
     759            $this->bytes_already_parsed = $was_at;
     760
     761            return false;
     762        }
     763        $this->parser_state         = self::STATE_MATCHED_TAG;
     764        $this->token_length         = $tag_ends_at - $this->token_starts_at;
     765        $this->bytes_already_parsed = $tag_ends_at;
     766
     767        /*
     768         * For non-DATA sections which might contain text that looks like HTML tags but
     769         * isn't, scan with the appropriate alternative mode. Looking at the first letter
     770         * of the tag name as a pre-check avoids a string allocation when it's not needed.
     771         */
     772        $t = $this->html[ $this->tag_name_starts_at ];
     773        if (
     774            ! $this->is_closing_tag &&
     775            (
     776                'i' === $t || 'I' === $t ||
     777                'n' === $t || 'N' === $t ||
     778                's' === $t || 'S' === $t ||
     779                't' === $t || 'T' === $t ||
     780                'x' === $t || 'X' === $t
     781            )
     782        ) {
     783            $tag_name = $this->get_tag();
     784
     785            if ( 'SCRIPT' === $tag_name && ! $this->skip_script_data() ) {
     786                $this->parser_state         = self::STATE_INCOMPLETE;
     787                $this->bytes_already_parsed = $was_at;
     788
     789                return false;
     790            } elseif (
     791                ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) &&
     792                ! $this->skip_rcdata( $tag_name )
     793            ) {
     794                $this->parser_state         = self::STATE_INCOMPLETE;
     795                $this->bytes_already_parsed = $was_at;
     796
     797                return false;
     798            } elseif (
    606799                (
    607                     'i' === $t || 'I' === $t ||
    608                     'n' === $t || 'N' === $t ||
    609                     's' === $t || 'S' === $t ||
    610                     't' === $t || 'T' === $t
    611                 ) ) {
    612                 $tag_name = $this->get_tag();
    613 
    614                 if ( 'SCRIPT' === $tag_name && ! $this->skip_script_data() ) {
    615                     $this->bytes_already_parsed = strlen( $this->html );
    616                     return false;
    617                 } elseif (
    618                     ( 'TEXTAREA' === $tag_name || 'TITLE' === $tag_name ) &&
    619                     ! $this->skip_rcdata( $tag_name )
    620                 ) {
    621                     $this->bytes_already_parsed = strlen( $this->html );
    622                     return false;
    623                 } elseif (
    624                     (
    625                         'IFRAME' === $tag_name ||
    626                         'NOEMBED' === $tag_name ||
    627                         'NOFRAMES' === $tag_name ||
    628                         'NOSCRIPT' === $tag_name ||
    629                         'STYLE' === $tag_name
    630                     ) &&
    631                     ! $this->skip_rawtext( $tag_name )
    632                 ) {
    633                     /*
    634                      * "XMP" should be here too but its rules are more complicated and require the
    635                      * complexity of the HTML Processor (it needs to close out any open P element,
    636                      * meaning it can't be skipped here or else the HTML Processor will lose its
    637                      * place). For now, it can be ignored as it's a rare HTML tag in practice and
    638                      * any normative HTML should be using PRE instead.
    639                      */
    640                     $this->bytes_already_parsed = strlen( $this->html );
    641                     return false;
    642                 }
    643             }
    644         } while ( $already_found < $this->sought_match_offset );
     800                    'IFRAME' === $tag_name ||
     801                    'NOEMBED' === $tag_name ||
     802                    'NOFRAMES' === $tag_name ||
     803                    'STYLE' === $tag_name ||
     804                    'XMP' === $tag_name
     805                ) &&
     806                ! $this->skip_rawtext( $tag_name )
     807            ) {
     808                $this->parser_state         = self::STATE_INCOMPLETE;
     809                $this->bytes_already_parsed = $was_at;
     810
     811                return false;
     812            }
     813        }
    645814
    646815        return true;
    647816    }
    648817
     818    /**
     819     * Whether the processor paused because the input HTML document ended
     820     * in the middle of a syntax element, such as in the middle of a tag.
     821     *
     822     * Example:
     823     *
     824     *     $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
     825     *     false      === $processor->next_tag();
     826     *     true       === $processor->paused_at_incomplete_token();
     827     *
     828     * @since 6.5.0
     829     *
     830     * @return bool Whether the parse paused at the start of an incomplete token.
     831     */
     832    public function paused_at_incomplete_token() {
     833        return self::STATE_INCOMPLETE === $this->parser_state;
     834    }
    649835
    650836    /**
     
    665851     */
    666852    public function class_list() {
     853        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
     854            return;
     855        }
     856
    667857        /** @var string $class contains the string value of the class attribute, with character references decoded. */
    668858        $class = $this->get_attribute( 'class' );
     
    720910     */
    721911    public function has_class( $wanted_class ) {
    722         if ( ! $this->tag_name_starts_at ) {
     912        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    723913            return null;
    724914        }
     
    8171007     */
    8181008    public function set_bookmark( $name ) {
    819         if ( null === $this->tag_name_starts_at ) {
     1009        // It only makes sense to set a bookmark if the parser has paused on a concrete token.
     1010        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    8201011            return false;
    8211012        }
     
    8961087            // Fail if there is no possible tag closer.
    8971088            if ( false === $at || ( $at + $tag_length ) >= $doc_length ) {
    898                 $this->bytes_already_parsed = $doc_length;
    8991089                return false;
    9001090            }
     
    9231113            $at                        += $tag_length;
    9241114            $this->bytes_already_parsed = $at;
     1115
     1116            if ( $at >= strlen( $html ) ) {
     1117                return false;
     1118            }
    9251119
    9261120            /*
     
    10741268                }
    10751269
     1270                if ( $this->bytes_already_parsed >= $doc_length ) {
     1271                    $this->parser_state = self::STATE_INCOMPLETE;
     1272
     1273                    return false;
     1274                }
     1275
    10761276                if ( '>' === $html[ $this->bytes_already_parsed ] ) {
    10771277                    $this->bytes_already_parsed = $closer_potentially_starts_at;
     
    11081308        while ( false !== $at && $at < $doc_length ) {
    11091309            $at = strpos( $html, '<', $at );
     1310
     1311            /*
     1312             * This does not imply an incomplete parse; it indicates that there
     1313             * can be nothing left in the document other than a #text node.
     1314             */
    11101315            if ( false === $at ) {
    11111316                return false;
     
    11141319            $this->token_starts_at = $at;
    11151320
    1116             if ( '/' === $this->html[ $at + 1 ] ) {
     1321            if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) {
    11171322                $this->is_closing_tag = true;
    11181323                ++$at;
     
    11481353             * the document. There is nothing left to parse.
    11491354             */
    1150             if ( $at + 1 >= strlen( $html ) ) {
     1355            if ( $at + 1 >= $doc_length ) {
     1356                $this->parser_state = self::STATE_INCOMPLETE;
     1357
    11511358                return false;
    11521359            }
     
    11621369                 */
    11631370                if (
    1164                     strlen( $html ) > $at + 3 &&
     1371                    $doc_length > $at + 3 &&
    11651372                    '-' === $html[ $at + 2 ] &&
    11661373                    '-' === $html[ $at + 3 ]
     
    11681375                    $closer_at = $at + 4;
    11691376                    // If it's not possible to close the comment then there is nothing more to scan.
    1170                     if ( strlen( $html ) <= $closer_at ) {
     1377                    if ( $doc_length <= $closer_at ) {
     1378                        $this->parser_state = self::STATE_INCOMPLETE;
     1379
    11711380                        return false;
    11721381                    }
     
    11861395                     */
    11871396                    --$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
    1188                     while ( ++$closer_at < strlen( $html ) ) {
     1397                    while ( ++$closer_at < $doc_length ) {
    11891398                        $closer_at = strpos( $html, '--', $closer_at );
    11901399                        if ( false === $closer_at ) {
     1400                            $this->parser_state = self::STATE_INCOMPLETE;
     1401
    11911402                            return false;
    11921403                        }
    11931404
    1194                         if ( $closer_at + 2 < strlen( $html ) && '>' === $html[ $closer_at + 2 ] ) {
     1405                        if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
    11951406                            $at = $closer_at + 3;
    11961407                            continue 2;
    11971408                        }
    11981409
    1199                         if ( $closer_at + 3 < strlen( $html ) && '!' === $html[ $closer_at + 2 ] && '>' === $html[ $closer_at + 3 ] ) {
     1410                        if ( $closer_at + 3 < $doc_length && '!' === $html[ $closer_at + 2 ] && '>' === $html[ $closer_at + 3 ] ) {
    12001411                            $at = $closer_at + 4;
    12011412                            continue 2;
     
    12101421                 */
    12111422                if (
    1212                     strlen( $html ) > $at + 8 &&
     1423                    $doc_length > $at + 8 &&
    12131424                    '[' === $html[ $at + 2 ] &&
    12141425                    'C' === $html[ $at + 3 ] &&
     
    12211432                    $closer_at = strpos( $html, ']]>', $at + 9 );
    12221433                    if ( false === $closer_at ) {
     1434                        $this->parser_state = self::STATE_INCOMPLETE;
     1435
    12231436                        return false;
    12241437                    }
     
    12341447                 */
    12351448                if (
    1236                     strlen( $html ) > $at + 8 &&
     1449                    $doc_length > $at + 8 &&
    12371450                    ( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
    12381451                    ( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
     
    12451458                    $closer_at = strpos( $html, '>', $at + 9 );
    12461459                    if ( false === $closer_at ) {
     1460                        $this->parser_state = self::STATE_INCOMPLETE;
     1461
    12471462                        return false;
    12481463                    }
     
    12541469                /*
    12551470                 * Anything else here is an incorrectly-opened comment and transitions
    1256                  * to the bogus comment state - skip to the nearest >.
     1471                 * to the bogus comment state - skip to the nearest >. If no closer is
     1472                 * found then the HTML was truncated inside the markup declaration.
    12571473                 */
    12581474                $at = strpos( $html, '>', $at + 1 );
     1475                if ( false === $at ) {
     1476                    $this->parser_state = self::STATE_INCOMPLETE;
     1477
     1478                    return false;
     1479                }
     1480
    12591481                continue;
    12601482            }
     
    12621484            /*
    12631485             * </> is a missing end tag name, which is ignored.
     1486             *
     1487             * This was also known as the "presumptuous empty tag"
     1488             * in early discussions as it was proposed to close
     1489             * the nearest previous opening tag.
    12641490             *
    12651491             * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
     
    12771503                $closer_at = strpos( $html, '>', $at + 2 );
    12781504                if ( false === $closer_at ) {
     1505                    $this->parser_state = self::STATE_INCOMPLETE;
     1506
    12791507                    return false;
    12801508                }
     
    12911519             */
    12921520            if ( $this->is_closing_tag ) {
     1521                // No chance of finding a closer.
     1522                if ( $at + 3 > $doc_length ) {
     1523                    return false;
     1524                }
     1525
    12931526                $closer_at = strpos( $html, '>', $at + 3 );
    12941527                if ( false === $closer_at ) {
     1528                    $this->parser_state = self::STATE_INCOMPLETE;
     1529
    12951530                    return false;
    12961531                }
     
    13171552        $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
    13181553        if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     1554            $this->parser_state = self::STATE_INCOMPLETE;
     1555
    13191556            return false;
    13201557        }
     
    13391576        $this->bytes_already_parsed += $name_length;
    13401577        if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     1578            $this->parser_state = self::STATE_INCOMPLETE;
     1579
    13411580            return false;
    13421581        }
     
    13441583        $this->skip_whitespace();
    13451584        if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     1585            $this->parser_state = self::STATE_INCOMPLETE;
     1586
    13461587            return false;
    13471588        }
     
    13521593            $this->skip_whitespace();
    13531594            if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
     1595                $this->parser_state = self::STATE_INCOMPLETE;
     1596
    13541597                return false;
    13551598            }
     
    13781621
    13791622        if ( $attribute_end >= strlen( $this->html ) ) {
     1623            $this->parser_state = self::STATE_INCOMPLETE;
     1624
    13801625            return false;
    13811626        }
     
    14441689     */
    14451690    private function after_tag() {
    1446         $this->get_updated_html();
    14471691        $this->token_starts_at      = null;
    14481692        $this->token_length         = null;
     
    17872031     */
    17882032    private function get_enqueued_attribute_value( $comparable_name ) {
     2033        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
     2034            return false;
     2035        }
     2036
    17892037        if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
    17902038            return false;
     
    18542102     */
    18552103    public function get_attribute( $name ) {
    1856         if ( null === $this->tag_name_starts_at ) {
     2104        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    18572105            return null;
    18582106        }
     
    19342182     */
    19352183    public function get_attribute_names_with_prefix( $prefix ) {
    1936         if ( $this->is_closing_tag || null === $this->tag_name_starts_at ) {
     2184        if (
     2185            self::STATE_MATCHED_TAG !== $this->parser_state ||
     2186            $this->is_closing_tag
     2187        ) {
    19372188            return null;
    19382189        }
     
    19662217     */
    19672218    public function get_tag() {
    1968         if ( null === $this->tag_name_starts_at ) {
     2219        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    19692220            return null;
    19702221        }
     
    19932244     */
    19942245    public function has_self_closing_flag() {
    1995         if ( ! $this->tag_name_starts_at ) {
     2246        if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
    19962247            return false;
    19972248        }
     
    20252276     */
    20262277    public function is_tag_closer() {
    2027         return $this->is_closing_tag;
     2278        return (
     2279            self::STATE_MATCHED_TAG === $this->parser_state &&
     2280            $this->is_closing_tag
     2281        );
    20282282    }
    20292283
     
    20452299     */
    20462300    public function set_attribute( $name, $value ) {
    2047         if ( $this->is_closing_tag || null === $this->tag_name_starts_at ) {
     2301        if (
     2302            self::STATE_MATCHED_TAG !== $this->parser_state ||
     2303            $this->is_closing_tag
     2304        ) {
    20482305            return false;
    20492306        }
     
    21782435     */
    21792436    public function remove_attribute( $name ) {
    2180         if ( $this->is_closing_tag ) {
     2437        if (
     2438            self::STATE_MATCHED_TAG !== $this->parser_state ||
     2439            $this->is_closing_tag
     2440        ) {
    21812441            return false;
    21822442        }
     
    22552515     */
    22562516    public function add_class( $class_name ) {
    2257         if ( $this->is_closing_tag ) {
     2517        if (
     2518            self::STATE_MATCHED_TAG !== $this->parser_state ||
     2519            $this->is_closing_tag
     2520        ) {
    22582521            return false;
    22592522        }
    22602523
    2261         if ( null !== $this->tag_name_starts_at ) {
    2262             $this->classname_updates[ $class_name ] = self::ADD_CLASS;
    2263         }
     2524        $this->classname_updates[ $class_name ] = self::ADD_CLASS;
    22642525
    22652526        return true;
     
    22752536     */
    22762537    public function remove_class( $class_name ) {
    2277         if ( $this->is_closing_tag ) {
     2538        if (
     2539            self::STATE_MATCHED_TAG !== $this->parser_state ||
     2540            $this->is_closing_tag
     2541        ) {
    22782542            return false;
    22792543        }
     
    24812745        return true;
    24822746    }
     2747
     2748    /**
     2749     * Parser Ready State
     2750     *
     2751     * Indicates that the parser is ready to run and waiting for a state transition.
     2752     * It may not have started yet, or it may have just finished parsing a token and
     2753     * is ready to find the next one.
     2754     *
     2755     * @since 6.5.0
     2756     *
     2757     * @access private
     2758     */
     2759    const STATE_READY = 'STATE_READY';
     2760
     2761    /**
     2762     * Parser Complete State
     2763     *
     2764     * Indicates that the parser has reached the end of the document and there is
     2765     * nothing left to scan. It finished parsing the last token completely.
     2766     *
     2767     * @since 6.5.0
     2768     *
     2769     * @access private
     2770     */
     2771    const STATE_COMPLETE = 'STATE_COMPLETE';
     2772
     2773    /**
     2774     * Parser Incomplete State
     2775     *
     2776     * Indicates that the parser has reached the end of the document before finishing
     2777     * a token. It started parsing a token but there is a possibility that the input
     2778     * HTML document was truncated in the middle of a token.
     2779     *
     2780     * The parser is reset at the start of the incomplete token and has paused. There
     2781     * is nothing more that can be scanned unless provided a more complete document.
     2782     *
     2783     * @since 6.5.0
     2784     *
     2785     * @access private
     2786     */
     2787    const STATE_INCOMPLETE = 'STATE_INCOMPLETE';
     2788
     2789    /**
     2790     * Parser Matched Tag State
     2791     *
     2792     * Indicates that the parser has found an HTML tag and it's possible to get
     2793     * the tag name and read or modify its attributes (if it's not a closing tag).
     2794     *
     2795     * @since 6.5.0
     2796     *
     2797     * @access private
     2798     */
     2799    const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
    24832800}
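The guard pattern that recurs in the hunks above (return early unless `parser_state` is `STATE_MATCHED_TAG`) can be illustrated with a small standalone sketch. This is not the Tag Processor's implementation: `Toy_Tag_Scanner` is a hypothetical class, it only recognizes simple opening tags such as `<p>`, and its tokenizing rules are assumptions made for illustration. Only the four state names and the `next_tag()`, `get_tag()`, and `paused_at_incomplete_token()` method names are taken from the changeset.

```php
<?php
// Hypothetical toy scanner; a sketch of the state guard, not WordPress code.
final class Toy_Tag_Scanner {
    const STATE_READY       = 'STATE_READY';
    const STATE_COMPLETE    = 'STATE_COMPLETE';
    const STATE_INCOMPLETE  = 'STATE_INCOMPLETE';
    const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';

    private $html;
    private $at       = 0;
    private $state    = self::STATE_READY;
    private $tag_name = null;

    public function __construct( $html ) {
        $this->html = $html;
    }

    public function next_tag() {
        // Nothing more can be scanned once the document ended or was cut short.
        if ( self::STATE_COMPLETE === $this->state || self::STATE_INCOMPLETE === $this->state ) {
            return false;
        }
        $this->state = self::STATE_READY;

        while ( true ) {
            $lt = strpos( $this->html, '<', $this->at );
            if ( false === $lt ) {
                // Only text remains: the document finished cleanly.
                $this->state = self::STATE_COMPLETE;
                return false;
            }

            $name_start = $lt + 1;
            if ( $name_start >= strlen( $this->html ) ) {
                // A lone "<" at the end could still become a tag with more input.
                $this->at    = $lt;
                $this->state = self::STATE_INCOMPLETE;
                return false;
            }

            $letters     = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
            $name_length = strspn( $this->html, $letters, $name_start );
            if ( 0 === $name_length ) {
                // Toy simplification: "<" not followed by a letter is treated as text.
                $this->at = $name_start;
                continue;
            }

            $gt = strpos( $this->html, '>', $name_start + $name_length );
            if ( false === $gt ) {
                // The tag started but never closed: pause at the start of the token.
                $this->at    = $lt;
                $this->state = self::STATE_INCOMPLETE;
                return false;
            }

            $this->tag_name = strtoupper( substr( $this->html, $name_start, $name_length ) );
            $this->at       = $gt + 1;
            $this->state    = self::STATE_MATCHED_TAG;
            return true;
        }
    }

    public function get_tag() {
        // Same guard as in the changeset: answer only while a tag is matched.
        if ( self::STATE_MATCHED_TAG !== $this->state ) {
            return null;
        }
        return $this->tag_name;
    }

    public function paused_at_incomplete_token() {
        return self::STATE_INCOMPLETE === $this->state;
    }
}

$scanner = new Toy_Tag_Scanner( '<p>Hello <em' );
var_dump( $scanner->next_tag() );                   // bool(true)
var_dump( $scanner->get_tag() );                    // string(1) "P"
var_dump( $scanner->next_tag() );                   // bool(false)
var_dump( $scanner->paused_at_incomplete_token() ); // bool(true)
```

Checking a single state value, rather than a nullable offset such as `tag_name_starts_at`, means the accessor cannot report a stale tag name after the scanner has paused or finished.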
  • trunk/tests/phpunit/tests/html-api/wpHtmlTagProcessor.php

    r56703 r57211  
    17571757     *
    17581758     * @covers WP_HTML_Tag_Processor::next_tag
     1759     * @covers WP_HTML_Tag_Processor::paused_at_incomplete_token
    17591760     */
    17601761    public function test_unclosed_script_tag_should_not_cause_an_infinite_loop() {
    1761         $p = new WP_HTML_Tag_Processor( '<script>' );
    1762         $p->next_tag();
    1763         $this->assertSame( 'SCRIPT', $p->get_tag(), 'Did not find script tag' );
     1762        $p = new WP_HTML_Tag_Processor( '<script><div>' );
     1763        $this->assertFalse(
     1764            $p->next_tag(),
     1765            'Should not have stopped on an opening SCRIPT tag without a proper closing tag in the document.'
     1766        );
     1767        $this->assertTrue(
     1768            $p->paused_at_incomplete_token(),
     1769            "Should have paused the parser because of the incomplete SCRIPT tag but didn't."
     1770        );
     1771
     1772        // Run this to ensure that the test ends (not in an infinite loop).
    17641773        $p->next_tag();
    17651774    }
     
    19331942
    19341943    /**
     1944     * Ensures matching elements inside NOSCRIPT elements.
     1945     *
     1946     * In a browser when the scripting flag is enabled, everything inside
     1947     * the NOSCRIPT element will be ignored and treated as RAW TEXT. This
     1948     * means that it's valid to send what looks like incomplete or partial
     1949     * HTML syntax without impacting a rendered page. The Tag Processor is
     1950     * a parser with the scripting flag disabled, however, and needs to
     1951     * expose all the potential content that some code might want to modify.
     1952     *
     1953     * Were it not for this then the NOSCRIPT tag would be handled like the
     1954     * other tags in the RAW TEXT special group, e.g. NOEMBED or STYLE.
     1955     *
     1956     * @ticket 60122
     1957     *
     1958     * @covers WP_HTML_Tag_Processor::next_tag
     1959     */
     1960    public function test_processes_inside_of_noscript_elements() {
     1961        $p = new WP_HTML_Tag_Processor( '<noscript><input type="submit"></noscript><div>' );
     1962
     1963        $this->assertTrue( $p->next_tag( 'INPUT' ), 'Failed to find INPUT element inside NOSCRIPT element.' );
     1964        $this->assertTrue( $p->next_tag( 'DIV' ), 'Failed to find DIV element after NOSCRIPT element.' );
     1965    }
     1966
     1967    /**
    19351968     * @ticket 59292
    19361969     *
     
    19631996            'NOEMBED'          => array( '<noembed><p></p></noembed><div target>' ),
    19641997            'NOFRAMES'         => array( '<noframes><p>Check the rules here.</p></noframes><div target>' ),
    1965             'NOSCRIPT'         => array( '<noscript><span>This assumes that scripting mode is enabled.</span></noscript><p target>' ),
    19661998            'STYLE'            => array( '<style>* { margin: 0 }</style><div target>' ),
    19671999            'STYLE hiding DIV' => array( '<style>li::before { content: "<div non-target>" }</style><div target>' ),
     
    21402172     *
    21412173     * @covers WP_HTML_Tag_Processor::next_tag
     2174     * @covers WP_HTML_Tag_Processor::paused_at_incomplete_token
    21422175     *
    21432176     * @dataProvider data_html_with_unclosed_comments
    21442177     *
    2145      * @param string $html_ending_before_comment_close HTML with opened comments that aren't closed
     2178     * @param string $html_ending_before_comment_close HTML with opened comments that aren't closed.
    21462179     */
    21472180    public function test_documents_may_end_with_unclosed_comment( $html_ending_before_comment_close ) {
    21482181        $p = new WP_HTML_Tag_Processor( $html_ending_before_comment_close );
    21492182
    2150         $this->assertFalse( $p->next_tag() );
     2183        $this->assertFalse(
     2184            $p->next_tag(),
     2185            "Should not have found any tag, but found {$p->get_tag()}."
     2186        );
     2187
     2188        $this->assertTrue(
     2189            $p->paused_at_incomplete_token(),
     2190            "Should have indicated that the parser found an incomplete token but didn't."
     2191        );
    21512192    }
    21522193
     
    22812322
    22822323    /**
     2324     * Ensures that no tags are matched in a document containing only non-tag content.
     2325     *
     2326     * @ticket 60122
     2327     *
     2328     * @covers WP_HTML_Tag_Processor::next_tag
     2329     * @covers WP_HTML_Tag_Processor::paused_at_incomplete_token
     2330     *
     2331     * @dataProvider data_html_without_tags
     2332     *
     2333     * @param string $html_without_tags HTML without any tags in it.
     2334     */
     2335    public function test_next_tag_returns_false_when_there_are_no_tags( $html_without_tags ) {
     2336        $processor = new WP_HTML_Tag_Processor( $html_without_tags );
     2337
     2338        $this->assertFalse(
     2339            $processor->next_tag(),
     2340            "Shouldn't have found any tags but found {$processor->get_tag()}."
     2341        );
     2342
     2343        $this->assertFalse(
     2344            $processor->paused_at_incomplete_token(),
     2345            'Should have indicated that end of document was reached without evidence that elements were truncated.'
     2346        );
     2347    }
     2348
     2349    /**
     2350     * Data provider.
     2351     *
     2352     * @return array[]
     2353     */
     2354    public function data_html_without_tags() {
     2355        return array(
     2356            'DOCTYPE declaration'    => array( '<!DOCTYPE html>Just some HTML' ),
     2357            'No tags'                => array( 'this is nothing more than a text node' ),
     2358            'Text with comments'     => array( 'One <!-- sneaky --> comment.' ),
     2359            'Empty tag closer'       => array( '</>' ),
     2360            'Processing instruction' => array( '<?xml version="1.0"?>' ),
     2361            'Combination XML-like'   => array( '<!DOCTYPE xml><?xml version=""?><!-- this is not a real document. --><![CDATA[it only serves as a test]]>' ),
     2362        );
     2363    }
     2364
     2365    /**
     2366     * Ensures that the processor doesn't attempt to match an incomplete token.
     2367     *
    22832368     * @ticket 58637
    22842369     *
    22852370     * @covers WP_HTML_Tag_Processor::next_tag
     2371     * @covers WP_HTML_Tag_Processor::paused_at_incomplete_token
    22862372     *
    22872373     * @dataProvider data_incomplete_syntax_elements
     
    22892375     * @param string $incomplete_html HTML text containing some kind of incomplete syntax.
    22902376     */
    2291     public function test_returns_false_for_incomplete_syntax_elements( $incomplete_html ) {
     2377    public function test_next_tag_returns_false_for_incomplete_syntax_elements( $incomplete_html ) {
    22922378        $p = new WP_HTML_Tag_Processor( $incomplete_html );
    2293         $this->assertFalse( $p->next_tag() );
     2379
     2380        $this->assertFalse(
     2381            $p->next_tag(),
     2382            "Shouldn't have found any tags but found {$p->get_tag()}."
     2383        );
     2384
     2385        $this->assertTrue(
     2386            $p->paused_at_incomplete_token(),
     2387            "Should have indicated that the parser found an incomplete token but didn't."
     2388        );
    22942389    }
    22952390
     
    23012396    public function data_incomplete_syntax_elements() {
    23022397        return array(
    2303             'No tags'                              => array( 'this is nothing more than a text node' ),
    23042398            'Incomplete tag name'                  => array( '<swit' ),
    23052399            'Incomplete tag (no attributes)'       => array( '<div' ),
     
    23142408            'Incomplete DOCTYPE'                   => array( '<!DOCTYPE html' ),
    23152409            'Partial DOCTYPE'                      => array( '<!DOCTY' ),
    2316             'Incomplete CDATA'                     => array( '<[CDATA[something inside of here needs to get out' ),
    2317             'Partial CDATA'                        => array( '<[CDA' ),
    2318             'Partially closed CDATA]'              => array( '<[CDATA[cannot escape]' ),
    2319             'Partially closed CDATA]>'             => array( '<[CDATA[cannot escape]>' ),
     2410            'Incomplete CDATA'                     => array( '<![CDATA[something inside of here needs to get out' ),
     2411            'Partial CDATA'                        => array( '<![CDA' ),
     2412            'Partially closed CDATA]'              => array( '<![CDATA[cannot escape]' ),
     2413            'Partially closed CDATA]>'             => array( '<![CDATA[cannot escape]>' ),
     2414            'Unclosed IFRAME'                      => array( '<iframe><div>' ),
     2415            'Unclosed NOEMBED'                     => array( '<noembed><div>' ),
     2416            'Unclosed NOFRAMES'                    => array( '<noframes><div>' ),
     2417            'Unclosed SCRIPT'                      => array( '<script><div>' ),
     2418            'Unclosed STYLE'                       => array( '<style><div>' ),
     2419            'Unclosed TEXTAREA'                    => array( '<textarea><div>' ),
     2420            'Unclosed TITLE'                       => array( '<title><div>' ),
     2421            'Unclosed XMP'                         => array( '<xmp><div>' ),
     2422            'Partially closed IFRAME'              => array( '<iframe><div></iframe' ),
     2423            'Partially closed NOEMBED'             => array( '<noembed><div></noembed' ),
     2424            'Partially closed NOFRAMES'            => array( '<noframes><div></noframes' ),
     2425            'Partially closed SCRIPT'              => array( '<script><div></script' ),
     2426            'Partially closed STYLE'               => array( '<style><div></style' ),
     2427            'Partially closed TEXTAREA'            => array( '<textarea><div></textarea' ),
     2428            'Partially closed TITLE'               => array( '<title><div></title' ),
     2429            'Partially closed XMP'                 => array( '<xmp><div></xmp' ),
    23202430        );
    23212431    }
     
    24162526    public function test_updating_attributes_in_malformed_html( $html, $expected ) {
    24172527        $p = new WP_HTML_Tag_Processor( $html );
    2418         $p->next_tag();
     2528        $this->assertTrue( $p->next_tag(), 'Could not find first tag.' );
    24192529        $p->set_attribute( 'foo', 'bar' );
    24202530        $p->add_class( 'firstTag' );
     
    24352545     */
    24362546    public function data_updating_attributes_in_malformed_html() {
    2437         $null_byte = chr( 0 );
    2438 
    24392547        return array(
    24402548            'Invalid entity inside attribute value'        => array(
     
    24952603            ),
    24962604            'id without double quotation marks around null byte' => array(
    2497                 'input'    => '<hr id' . $null_byte . 'zero="test"><span>test</span>',
    2498                 'expected' => '<hr class="firstTag" foo="bar" id' . $null_byte . 'zero="test"><span class="secondTag">test</span>',
     2605                'input'    => "<hr id\x00zero=\"test\"><span>test</span>",
     2606                'expected' => "<hr class=\"firstTag\" foo=\"bar\" id\x00zero=\"test\"><span class=\"secondTag\">test</span>",
    24992607            ),
    25002608            'Unexpected > before an attribute'             => array(
     
    25842692        );
    25852693    }
     2694
     2695    /**
     2696     * @covers WP_HTML_Tag_Processor::next_tag
     2697     */
     2698    public function test_handles_malformed_taglike_open_short_html() {
     2699        $p      = new WP_HTML_Tag_Processor( '<' );
     2700        $result = $p->next_tag();
     2701        $this->assertFalse( $result, 'Did not handle "<" html properly.' );
     2702    }
     2703
     2704    /**
     2705     * @covers WP_HTML_Tag_Processor::next_tag
     2706     */
     2707    public function test_handles_malformed_taglike_close_short_html() {
     2708        $p      = new WP_HTML_Tag_Processor( '</ ' );
     2709        $result = $p->next_tag();
     2710        $this->assertFalse( $result, 'Did not handle "</ " html properly.' );
     2711    }
    25862712}