Make WordPress Core

Opened 12 years ago

Closed 12 years ago

Last modified 4 years ago

#19855 closed defect (bug) (fixed)

Too much text in <pre> tag causes no output

Reported by: helenyhou's profile helenyhou Owned by:
Milestone: 3.4 Priority: normal
Severity: normal Version: 1.2
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

Posting this on behalf of Sirisian in IRC who sent the question to wp-hackers on my suggestion and didn't get anywhere.

I was writing an article on my blog with a huge code dump at the end inside of a pre tag. I noticed that this breaks wordpress. This is 100% reproducible by pasting http://pastebin.com/6m3RVGbr into a post then hitting preview (You can use blank pre tags. I just had stuff in them for a syntax highlighter plugin. It doesn't generate a page even with all plugins disabled anyway).

I believe it's caused because there's over 64 KB of data in the pre tags. Talking to someone in the IRC and they said:

looks like there's a preg_replace_callback called clean_pre, probably that's what's failing due to size
basically, probably wpautop()'s fault

Attachments (4)

19855.patch (1.7 KB) - added by kurtpayne 12 years ago.
Replace pre blocks with tokens at the start of wpautop, then reverse the process at the end
19855-2.patch (4.6 KB) - added by azaozz 12 years ago.
19855-3.patch (4.6 KB) - added by kurtpayne 12 years ago.
Adding newlines before pre tags
19855-4.patch (4.5 KB) - added by azaozz 12 years ago.

Download all attachments as: .zip

Change History (25)

#1 @solarissmoke
12 years ago

Sounds similar to #15600

#2 @AdamBackstrom
12 years ago

  • Cc adam@… added

This is affected by the pcre.backtrack_limit in php.ini. The default in PHP < 5.3.7 was 100000, and in PHP 5.3.7 they increased limit 10x, which is still 10x lower than PCRE internal defaults.

You observe the limit with this bit of code:

function clean_pre($matches) {
	return $matches[1] . $matches[2] . "</pre>";
}

$pee = '<pre>' . str_repeat( 'a', 99995 ) . '</pre>';

ini_set( 'pcre.backtrack_limit', 100000 ); // php < 5.3.7 default
$pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee );
var_dump( null == $pee );

ini_set( 'pcre.backtrack_limit', 99999 );
$pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee );
var_dump( null == $pee );

See also: PHP Request #40846 pcre.backtrack_limit too restrictive. Both #8553 and #15600 reference the backtrack limit in relation to shortcodes.

#3 follow-ups: @azaozz
12 years ago

Generally wpautop should be leaving the content of the <pre> tags untouched. How about we replace any <pre> ... </pre> with placeholders at the beginning and put them back at the end? That would not only remove this preg_replace_callback() (and the backtracking) but also speed up wpautop as it wouldn't backtrack through these parts of the post_content.

Coding this wouldn't be hard: preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE), then go through the array and replace strings starting with <pre with incremental placeholder, keeping them for later in an array('placeholder_#' => 'string', ...).

@kurtpayne
12 years ago

Replace pre blocks with tokens at the start of wpautop, then reverse the process at the end

#4 in reply to: ↑ 3 @kurtpayne
12 years ago

  • Cc kpayne@… added
  • Keywords has-patch added

Replying to azaozz:

Coding this wouldn't be hard: preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE), then go through the array and replace strings starting with <pre with incremental placeholder, keeping them for later in an array('placeholder_#' => 'string', ...).

I used preg_match_all instead of preg_split. I didn't see a performance a significant performance difference either way. Let me know if you see any problems with the patch. The sample content did not work before the patch, and worked fine with the patch.

Last edited 12 years ago by kurtpayne (previous) (diff)

#5 @SergeyBiryukov
12 years ago

  • Keywords needs-unit-tests added

#6 @kurtpayne
12 years ago

  • Keywords needs-unit-tests removed

Unit test changesets 558 and 559 cover this test case.

Last edited 12 years ago by kurtpayne (previous) (diff)

@azaozz
12 years ago

#7 @azaozz
12 years ago

Added alternate patch. Thinking we should replace the non-greedy match all bit in the regex which will remove the backtracking, i.e. we don't want the

.*?

and are better off with

.*

To make that possible can split post_content in chunks each containing only one <pre> tag, then replace \n with a placeholder using greedy regex and put post_content back together.

Did some performance testing and in all cases the patched wpautop() is quite faster. That's not a big improvement as relatively very small number of posts have <pre> tags in the content, but still speeds things up a little.

Last edited 12 years ago by azaozz (previous) (diff)

#8 in reply to: ↑ 3 ; follow-up: @kurtpayne
12 years ago

azaozz, it looks like 19855-2.patch still runs the contents of <pre></pre> blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:

Input:

<p><pre>o = "<div>test</div>";</pre></p>

Output (after wpautop() with 19855-2.patch):

<pre>o = "
<div>test</div>
<p>";</pre>

This contradicts your earlier comment here:

azaozz:

Generally wpautop should be leaving the content of the <pre> tags untouched.

Is this okay?

#9 in reply to: ↑ 8 ; follow-up: @azaozz
12 years ago

Replying to kurtpayne:

azaozz, it looks like 19855-2.patch still runs the contents of <pre></pre> blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:

Input:

<p><pre>o = "<div>test</div>";</pre></p>

Keep in mind that inside <pre> tags the htmlspecialchars are encoded. So that input should be:

<pre>o = "&lt;div&gt;test&lt;/div&gt;";</pre>
Version 2, edited 12 years ago by azaozz (previous) (next) (diff)

@kurtpayne
12 years ago

Adding newlines before pre tags

#10 in reply to: ↑ 9 ; follow-ups: @kurtpayne
12 years ago

Replying to azaozz:

Keep in mind that inside <pre> tags the htmlspecialchars are encoded

Thanks, I've updated the unit test appropriately. 562

19855-2.patch still has a slight bug:

Input

Here's some cool code <pre>my cool code</pre> Wasn't that cool?

Output:

<p>Here's some cool code
<pre>my cool code</pre>
<p> Wasn't that cool?</p>

The top line is missing a closing </p> tag.

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

#11 in reply to: ↑ 10 @kurtpayne
12 years ago

Replying to kurtpayne:

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

On second thought ... is there a reason we couldn't pass $pee through force_balance_tags() before returning it? Seems less hacky.

#12 in reply to: ↑ 10 ; follow-up: @azaozz
12 years ago

Replying to kurtpayne:

...
The top line is missing a closing </p> tag.

In 19855-3.patch I added this:

$pee = str_ireplace( '<pre', "\n\n<pre", $pee );

to ensure that all <pre> tags have sufficient whitespace to be processed normally.

Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too. There are more edge cases with other block tags similar to this one, perhaps they can be fixed all together.

On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag. That's the whole point of the HTML editor: it lets you use HTML code but still has an easy to read, "natural" layout.

Replying to kurtpayne:

On second thought ... is there a reason we couldn't pass $pee through force_balance_tags()...

Not a good idea imho, force_balance_tags() is (mostly) for the comments where only limited number of tags are allowed and can't handle more complex HTML.

Last edited 12 years ago by azaozz (previous) (diff)

#13 in reply to: ↑ 12 @kurtpayne
12 years ago

Replying to azaozz:

Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too.

On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag.

Updated unit test to more closely match real world usage 575.

@azaozz
12 years ago

#14 follow-up: @azaozz
12 years ago

The 19855-4.patch combines "the best" of the previous three patches. As the <pre> tags cannot be nested, there's no need to use regex at all. Splitting on the closing tag and then on the opening tag in each substring only needs strpos() and substr(). Also uses a dummy <pre...></pre> as placeholder so the rest of wpautop() can handle it correctly.

The only thing left seems to be to try and find some old post_content that includes <pre> and was created with WP 2.0 - 2.3. Think then there were some inconsistencies inside the <pre> tags.

Last edited 12 years ago by azaozz (previous) (diff)

#15 in reply to: ↑ 14 ; follow-ups: @kurtpayne
12 years ago

Replying to azaozz:

there's no need to use regex at all.

Playing devil's advocate here ... if an author is using the HTML editor and uses upper case tags, this patch won't work.

Suggestions:

  1. Nothing. If people are using upper case tags in the HTML editor, then it's user error
  2. Force pre tags to lower case before splitting. This can be done without a regex, but it won't work work for mixed case
    $pee = str_replace(array('<pre', '<PRE'), '<pre', $pee);
    $pee = str_replace(array('</pre>', '</PRE>'), '</pre>', $pee);
    
  3. Revert to a regex split

#16 in reply to: ↑ 15 @azaozz
12 years ago

Replying to kurtpayne:

... if an author is using the HTML editor and uses upper case tags, this patch won't work.

Right. However the editor is (still) set to output XHTML 1.0 compliant code so upper case tags or attributes are not valid. Expecting users that would type code in the HTML editor (and manually add the htmlspecialchars) to be aware of that :)

On the other hand nearly all of the autop related functions/filters allow for upper case, so we should probably normalize the <pre> tags to lower case (using regex for the splits is a bit slower when there are several <pre> tags).

Not sure how this is going to be handled in the future, HTML 5.0 allows upper case and also some of the old (HTML 3?) syntax like no closing tags for <p>, no self-closing, etc. Probably we will stick to all lower case tag names and attributes as coding standard, but follow the rest of the HTML 5 changes.

#17 @azaozz
12 years ago

In [20307]:

Do not process <pre> tags with wpautop, replace them with placeholders, process the rest of the content and then put them back. Part props kurtpayne, see #19855

#18 in reply to: ↑ 15 @azaozz
12 years ago

Replying to kurtpayne:

On second thought: we were already checking only for the lower case <pre in if (strpos($pee, '<pre') !== false) so there's no change in behavior.

#19 @nacin
12 years ago

  • Milestone changed from Awaiting Review to 3.4

#20 @markjaquith
12 years ago

  • Resolution set to fixed
  • Status changed from new to closed

Closing as fixed, per azaozz and kurtpayne.

This ticket was mentioned in Slack in #themereview by ilovewpcom. View the logs.


4 years ago

Note: See TracTickets for help on using tickets.