WordPress.org

Make WordPress Core

Opened 2 years ago

Closed 2 years ago

#19855 closed defect (bug) (fixed)

Too much text in <pre> tag causes no output

Reported by: helenyhou Owned by:
Milestone: 3.4 Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

Posting this on behalf of Sirisian in IRC who sent the question to wp-hackers on my suggestion and didn't get anywhere.

I was writing an article on my blog with a huge code dump at the end inside of a pre tag. I noticed that this breaks wordpress. This is 100% reproducible by pasting http://pastebin.com/6m3RVGbr into a post then hitting preview (You can use blank pre tags. I just had stuff in them for a syntax highlighter plugin. It doesn't generate a page even with all plugins disabled anyway).

I believe it's caused because there's over 64 KB of data in the pre tags. Talking to someone in the IRC and they said:

looks like there's a preg_replace_callback called clean_pre, probably that's what's failing due to size
basically, probably wpautop()'s fault

Attachments (4)

19855.patch (1.7 KB) - added by kurtpayne 2 years ago.
Replace pre blocks with tokens at the start of wpautop, then reverse the process at the end
19855-2.patch (4.6 KB) - added by azaozz 2 years ago.
19855-3.patch (4.6 KB) - added by kurtpayne 2 years ago.
Adding newlines before pre tags
19855-4.patch (4.5 KB) - added by azaozz 2 years ago.

Download all attachments as: .zip

Change History (24)

comment:1 solarissmoke2 years ago

Sounds similar to #15600

comment:2 AdamBackstrom2 years ago

  • Cc adam@… added

This is affected by the pcre.backtrack_limit in php.ini. The default in PHP < 5.3.7 was 100000, and in PHP 5.3.7 they increased limit 10x, which is still 10x lower than PCRE internal defaults.

You observe the limit with this bit of code:

function clean_pre($matches) {
	return $matches[1] . $matches[2] . "</pre>";
}

$pee = '<pre>' . str_repeat( 'a', 99995 ) . '</pre>';

ini_set( 'pcre.backtrack_limit', 100000 ); // php < 5.3.7 default
$pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee );
var_dump( null == $pee );

ini_set( 'pcre.backtrack_limit', 99999 );
$pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee );
var_dump( null == $pee );

See also: PHP Request #40846 pcre.backtrack_limit too restrictive. Both #8553 and #15600 reference the backtrack limit in relation to shortcodes.

comment:3 follow-ups: azaozz2 years ago

Generally wpautop should be leaving the content of the <pre> tags untouched. How about we replace any <pre> ... </pre> with placeholders at the beginning and put them back at the end? That would not only remove this preg_replace_callback() (and the backtracking) but also speed up wpautop as it wouldn't backtrack through these parts of the post_content.

Coding this wouldn't be hard: preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE), then go through the array and replace strings starting with <pre with incremental placeholder, keeping them for later in an array('placeholder_#' => 'string', ...).

kurtpayne2 years ago

Replace pre blocks with tokens at the start of wpautop, then reverse the process at the end

comment:4 in reply to: ↑ 3 kurtpayne2 years ago

  • Cc kpayne@… added
  • Keywords has-patch added

Replying to azaozz:

Coding this wouldn't be hard: preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE), then go through the array and replace strings starting with <pre with incremental placeholder, keeping them for later in an array('placeholder_#' => 'string', ...).

I used preg_match_all instead of preg_split. I didn't see a performance a significant performance difference either way. Let me know if you see any problems with the patch. The sample content did not work before the patch, and worked fine with the patch.

Last edited 2 years ago by kurtpayne (previous) (diff)

comment:5 SergeyBiryukov2 years ago

  • Keywords needs-unit-tests added

comment:6 kurtpayne2 years ago

  • Keywords needs-unit-tests removed

Unit test changesets 558 and 559 cover this test case.

Last edited 2 years ago by kurtpayne (previous) (diff)

azaozz2 years ago

comment:7 azaozz2 years ago

Added alternate patch. Thinking we should replace the non-greedy match all bit in the regex which will remove the backtracking, i.e. we don't want the

.*?

and are better off with

.*

To make that possible can split post_content in chunks each containing only one <pre> tag, then replace \n with a placeholder using greedy regex and put post_content back together.

Did some performance testing and in all cases the patched wpautop() is quite faster. That's not a big improvement as relatively very small number of posts have <pre> tags in the content, but still speeds things up a little.

Last edited 2 years ago by azaozz (previous) (diff)

comment:8 in reply to: ↑ 3 ; follow-up: kurtpayne2 years ago

azaozz, it looks like 19855-2.patch still runs the contents of <pre></pre> blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:

Input:

<p><pre>o = "<div>test</div>";</pre></p>

Output (after wpautop() with 19855-2.patch):

<pre>o = "
<div>test</div>
<p>";</pre>

This contradicts your earlier comment here:

azaozz:

Generally wpautop should be leaving the content of the <pre> tags untouched.

Is this okay?

comment:9 in reply to: ↑ 8 ; follow-up: azaozz2 years ago

Replying to kurtpayne:

azaozz, it looks like 19855-2.patch still runs the contents of <pre></pre> blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:

Input:

<p><pre>o = "<div>test</div>";</pre></p>

Keep in mind that inside <pre> tags the htmlspecialchars are encoded. So that input should be:

<pre>o = "&lt;div&gt;test&lt;/div&gt;";</pre>

Then it won't be affected by the other regex.

Yes, 19855-2.patch still runs the content of <pre> tags through the rest of wpautop() similar to how it works now. The (eventual) problem with 19855.patch is that it still uses the non-greedy match-all (.*?) that causes the problem in this ticket. Thinking if we are going to modernize/refactor the exclusion of <pre> tags from wpautop, we could avoid that.

Tried the approach from wptexturize() with preg_split('|(</?pre( [^>]*)?>)|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE);. This seems to work well but the code is more complex. Can add a patch later.

Last edited 2 years ago by azaozz (previous) (diff)

kurtpayne2 years ago

Adding newlines before pre tags

comment:10 in reply to: ↑ 9 ; follow-ups: kurtpayne2 years ago

Replying to azaozz:

Keep in mind that inside <pre> tags the htmlspecialchars are encoded

Thanks, I've updated the unit test appropriately. 562

19855-2.patch still has a slight bug:

Input

Here's some cool code <pre>my cool code</pre> Wasn't that cool?

Output:

<p>Here's some cool code
<pre>my cool code</pre>
<p> Wasn't that cool?</p>

The top line is missing a closing </p> tag.

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

comment:11 in reply to: ↑ 10 kurtpayne2 years ago

Replying to kurtpayne:

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

On second thought ... is there a reason we couldn't pass $pee through force_balance_tags() before returning it? Seems less hacky.

comment:12 in reply to: ↑ 10 ; follow-up: azaozz2 years ago

Replying to kurtpayne:

...
The top line is missing a closing </p> tag.

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too. There are more edge cases with other block tags similar to this one, perhaps they can be fixed all together.

On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag. That's the whole point of the HTML editor: it lets you use HTML code but still has an easy to read, "natural" layout.

Replying to kurtpayne:

On second thought ... is there a reason we couldn't pass $pee through force_balance_tags()...

Not a good idea imho, force_balance_tags() is (mostly) for the comments where only limited number of tags are allowed and can't handle more complex HTML.

Last edited 2 years ago by azaozz (previous) (diff)

comment:13 in reply to: ↑ 12 kurtpayne2 years ago

Replying to azaozz:

Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too.

On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag.

Updated unit test to more closely match real world usage 575.

azaozz2 years ago

comment:14 follow-up: azaozz2 years ago

The 19855-4.patch combines "the best" of the previous three patches. As the <pre> tags cannot be nested, there's no need to use regex at all. Splitting on the closing tag and then on the opening tag in each substring only needs strpos() and substr(). Also uses a dummy <pre...></pre> as placeholder so the rest of wpautop() can handle it correctly.

The only thing left seems to be to try and find some old post_content that includes <pre> and was created with WP 2.0 - 2.3. Think then there were some inconsistencies inside the <pre> tags.

Last edited 2 years ago by azaozz (previous) (diff)

comment:15 in reply to: ↑ 14 ; follow-ups: kurtpayne2 years ago

Replying to azaozz:

there's no need to use regex at all.

Playing devil's advocate here ... if an author is using the HTML editor and uses upper case tags, this patch won't work.

Suggestions:

  1. Nothing. If people are using upper case tags in the HTML editor, then it's user error
  2. Force pre tags to lower case before splitting. This can be done without a regex, but it won't work work for mixed case
    $pee = str_replace(array('<pre', '<PRE'), '<pre', $pee);
    $pee = str_replace(array('</pre>', '</PRE>'), '</pre>', $pee);
    
  3. Revert to a regex split

comment:16 in reply to: ↑ 15 azaozz2 years ago

Replying to kurtpayne:

... if an author is using the HTML editor and uses upper case tags, this patch won't work.

Right. However the editor is (still) set to output XHTML 1.0 compliant code so upper case tags or attributes are not valid. Expecting users that would type code in the HTML editor (and manually add the htmlspecialchars) to be aware of that :)

On the other hand nearly all of the autop related functions/filters allow for upper case, so we should probably normalize the <pre> tags to lower case (using regex for the splits is a bit slower when there are several <pre> tags).

Not sure how this is going to be handled in the future, HTML 5.0 allows upper case and also some of the old (HTML 3?) syntax like no closing tags for <p>, no self-closing, etc. Probably we will stick to all lower case tag names and attributes as coding standard, but follow the rest of the HTML 5 changes.

comment:17 azaozz2 years ago

In [20307]:

Do not process <pre> tags with wpautop, replace them with placeholders, process the rest of the content and then put them back. Part props kurtpayne, see #19855

comment:18 in reply to: ↑ 15 azaozz2 years ago

Replying to kurtpayne:

On second thought: we were already checking only for the lower case <pre in if (strpos($pee, '<pre') !== false) so there's no change in behavior.

comment:19 nacin2 years ago

  • Milestone changed from Awaiting Review to 3.4

comment:20 markjaquith2 years ago

  • Resolution set to fixed
  • Status changed from new to closed

Closing as fixed, per azaozz and kurtpayne.

Note: See TracTickets for help on using tickets.