WordPress.org

Make WordPress Core

Opened 18 months ago

Last modified 3 months ago

#39153 new defect (bug)

Bug in wp_html_split with unclosed PHP tag (or HTML tag <)

Reported by: crosp Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 4.6.1
Component: Formatting Keywords: needs-patch needs-unit-tests
Focuses: administration, template Cc:

Description

The problem is in the shortcodes.php file, but exact problem is function wp_html_split in formatting.php

This bug is completely described in this question forum thread.

https://wordpress.org/support/topic/bug-in-wp_html_split-with-unclosed-php-tag/

Consider following post code.

Some amount of useless text <!--more-->

[code-highlight line-numbers="table" linenostart="53" highlight-lines="1,3,8" style="native" lang="html+php" pyg-id="1" ]
<?php
//This callback registers our plug-in
function wpse72394_register_tinymce_plugin($plugin_array) {
    $plugin_array['wpse72394_button'] = 'path/to/shortcode.js';
    return $plugin_array;
}

//This callback adds our button to the toolbar
function wpse72394_add_tinymce_button($buttons) {
            //Add the button ID to the $button array
    $buttons[] = "wpse72394_button";
    return $buttons;
}
?
[/code-highlight]

Some amount of useless text <strong>checkstyle</strong>

[code-highlight style="native" lang="perl" pyg-id="2" ]
(?:s+)(?:(/*([^*]|[rn]|(*+([^*/]|[rn])))**+/)|(//(?!.*(CHECKSTYLE)).*))
[/code-highlight]

Here dump after this line

$textarr = wp_html_split( $content );
    var_dump($textarr);
    exit;
                array(25) {
  [0]=>
  string(0) ""
  [1]=>
  string(3) "<p>"
  [2]=>
  string(28) "Some amount of useless text "
  [3]=>
  string(11) "<!--more-->"
  [4]=>
  string(0) ""
  [5]=>
  string(4) "</p>"
  [6]=>
  string(1) "
"
  [7]=>
  string(3) "<p>"
  [8]=>
  string(121) "[code-highlight line-numbers="table" linenostart="53" highlight-lines="1,3,8" style="native" lang="html+php" pyg-id="1" ]"
  [9]=>
  string(6) "<br />"
  [10]=>
  string(1) "
"
  [11]=>
  string(464) "<?php
//This callback registers our plug-in
function wpse72394_register_tinymce_plugin($plugin_array) {
    $plugin_array['wpse72394_button'] = 'path/to/shortcode.js';
    return $plugin_array;
}

//This callback adds our button to the toolbar
function wpse72394_add_tinymce_button($buttons) {
            //Add the button ID to the $button array
    $buttons[] = "wpse72394_button";
    return $buttons;
}
?
[/code-highlight]

Some amount of useless text <strong>"
  [12]=>
  string(10) "checkstyle"
  [13]=>
  string(9) "</strong>"
  [14]=>
  string(0) ""
  [15]=>
  string(4) "</p>"
  [16]=>
  string(56) "
[code-highlight style="native" lang="perl" pyg-id="2" ]"
  [17]=>
  string(6) "<br />"
  [18]=>
  string(72) "
(?:s+)(?:(/*([^*]|[rn]|(*+([^*/]|[rn])))**+/)|(//(?!.*(CHECKSTYLE)).*))"
  [19]=>
  string(6) "<br />"
  [20]=>
  string(19) "
[/code-highlight]
"
  [21]=>
  string(3) "<p>"
  [22]=>
  string(15) "Some Text Again"
  [23]=>
  string(4) "</p>"
  [24]=>
  string(1) "
"
}

As you can see one shortcode was not splitted, and here the problem. If php closing tag is present (?>) than everything works fine.

Problematic regex provider

<?php
function get_html_split_regex() {
        static $regex;

        if ( ! isset( $regex ) ) {
                $comments =
                          '!'           // Start of comment, after the <.
                        . '(?:'         // Unroll the loop: Consume everything until --> is found.
                        .     '-(?!->)' // Dash not followed by end of comment.
                        .     '[^\-]*+' // Consume non-dashes.
                        . ')*+'         // Loop possessively.
                        . '(?:-->)?';   // End of comment. If not found, match all input.

                $cdata =
                          '!\[CDATA\['  // Start of comment, after the <.
                        . '[^\]]*+'     // Consume non-].
                        . '(?:'         // Unroll the loop: Consume everything until ]]> is found.
                        .     '](?!]>)' // One ] not followed by end of comment.
                        .     '[^\]]*+' // Consume non-].
                        . ')*+'         // Loop possessively.
                        . '(?:]]>)?';   // End of comment. If not found, match all input.

                $escaped =
                          '(?='           // Is the element escaped?
                        .    '!--'
                        . '|'
                        .    '!\[CDATA\['
                        . ')'
                        . '(?(?=!-)'      // If yes, which type?
                        .     $comments
                        . '|'
                        .     $cdata
                        . ')';

                $regex =
                          '/('              // Capture the entire match.
                        .     '<'           // Find start of element.
                        .     '(?'          // Conditional expression follows.
                        .         $escaped  // Find end of escaped element.
                        .     '|'           // ... else ...
                        .         '[^>]*>?' // Find end of normal element.
                        .     ')'
                        . ')/';
        }

        return $regex;
}

Without any doubts this case should be included in regex.

Change History (10)

#1 @crosp
18 months ago

I've explored code, but don't understand what the purpose of splitting input, why not just split text by shortcodes ?

I have also tried to fix regex by myself, but this leads to other bugs. Can someone help to find the right way to fix this problem ?

#2 @crosp
18 months ago

In my case simple commenting this processing of shortcodes in HTML tags

<?php
//      $content = do_shortcodes_in_html_tags( $content, $ignore_html, $tagnames );

in

function do_shortcode

solves the problem, could someone provide an example of case where shortcodes are located inside HTML tags ?

#3 @crosp
17 months ago

Please review this issue

#4 @crosp
16 months ago

I still have to change the source code of the Wordpress distr after any update. Any thoughts about this problem ?

#5 @SergeyBiryukov
16 months ago

  • Keywords needs-patch needs-unit-tests added

#6 @crosp
12 months ago

The same problem with the new Wordpress.

<?php
    //$content = do_shortcodes_in_html_tags( $content, $ignore_html, $tagnames );

Is the only solution that works for me. Could someone investigate this ?

Last edited 12 months ago by crosp (previous) (diff)

#7 @soulseekah
3 months ago

get_html_split_regex doesn't have any unit tests, writing some would be a great start. Without unit tests touching that regex is a bad, bad, bad idea. Probably even worse than trying to parse HTML with regex to begin with :)

Try to compile a list of inputs and expected outputs. Each case should be really tiny, demonstrating a specific scenario that regular expression goes through.

It is, however, tested via functions and methods that call it, so I'm pretty sure your patch will fail some of the shortcodes tests.

Last edited 3 months ago by soulseekah (previous) (diff)

#8 @soulseekah
3 months ago

Your example is not valid HTML. wp_html_split assumes valid HTML, one sanitized via wp_kses for example (see docblock for do_shortcodes_in_html_tags):

* Assumes $content processed by KSES already.  Users with unfiltered_html
* capability may get unexpected output if angle braces are nested in tags.

You can leave the "tag" open, if you use HTML entities, which is the proper way to do it, I think.

#10 @soulseekah
3 months ago

Related valid cases #43457, #43456

Note: See TracTickets for help on using tickets.