#19855 closed defect (bug) (fixed)
Too much text in <pre> tag causes no output
Reported by: | helenyhou | Owned by: | |
---|---|---|---|
Milestone: | 3.4 | Priority: | normal |
Severity: | normal | Version: | 1.2 |
Component: | Formatting | Keywords: | has-patch |
Focuses: | Cc: |
Description
Posting this on behalf of Sirisian in IRC who sent the question to wp-hackers on my suggestion and didn't get anywhere.
I was writing an article on my blog with a huge code dump at the end inside of a pre tag. I noticed that this breaks wordpress. This is 100% reproducible by pasting http://pastebin.com/6m3RVGbr into a post then hitting preview (You can use blank pre tags. I just had stuff in them for a syntax highlighter plugin. It doesn't generate a page even with all plugins disabled anyway).
I believe it's caused because there's over 64 KB of data in the pre tags. Talking to someone in the IRC and they said:
looks like there's a preg_replace_callback called clean_pre, probably that's what's failing due to size
basically, probably wpautop()'s fault
Attachments (4)
Change History (25)
#2
@
13 years ago
- Cc adam@… added
This is affected by the pcre.backtrack_limit
in php.ini. The default in PHP < 5.3.7 was 100000, and in PHP 5.3.7 they increased limit 10x, which is still 10x lower than PCRE internal defaults.
You observe the limit with this bit of code:
function clean_pre($matches) { return $matches[1] . $matches[2] . "</pre>"; } $pee = '<pre>' . str_repeat( 'a', 99995 ) . '</pre>'; ini_set( 'pcre.backtrack_limit', 100000 ); // php < 5.3.7 default $pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee ); var_dump( null == $pee ); ini_set( 'pcre.backtrack_limit', 99999 ); $pee = preg_replace_callback('!(<pre[^>]*>)(.*?)</pre>!is', 'clean_pre', $pee ); var_dump( null == $pee );
See also: PHP Request #40846 pcre.backtrack_limit too restrictive. Both #8553 and #15600 reference the backtrack limit in relation to shortcodes.
#3
follow-ups:
↓ 4
↓ 8
@
13 years ago
Generally wpautop should be leaving the content of the <pre>
tags untouched. How about we replace any <pre> ... </pre>
with placeholders at the beginning and put them back at the end? That would not only remove this preg_replace_callback() (and the backtracking) but also speed up wpautop as it wouldn't backtrack through these parts of the post_content.
Coding this wouldn't be hard: preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE)
, then go through the array and replace strings starting with <pre
with incremental placeholder, keeping them for later in an array('placeholder_#' => 'string', ...)
.
@
13 years ago
Replace pre blocks with tokens at the start of wpautop, then reverse the process at the end
#4
in reply to:
↑ 3
@
13 years ago
- Cc kpayne@… added
- Keywords has-patch added
Replying to azaozz:
Coding this wouldn't be hard:
preg_split('|</?pre[ >]|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE)
, then go through the array and replace strings starting with<pre
with incremental placeholder, keeping them for later in anarray('placeholder_#' => 'string', ...)
.
I used preg_match instead of preg_split. I didn't see a performance a significant performance difference either way. Let me know if you see any problems with the patch. The sample content did not work before the patch, and worked fine with the patch.
#7
@
13 years ago
Added alternate patch. Thinking we should replace the non-greedy match all bit in the regex which will remove the backtracking, i.e. we don't want the
.*?
and are better off with
.*
To make that possible can split post_content in chunks each containing only one <pre> tag, then replace \n
with a placeholder using greedy regex and put post_content back together.
Did some performance testing and in all cases the patched wpautop() is quite faster. That's not a big improvement as relatively very small number of posts have <pre> tags in the content, but still speeds things up a little.
#8
in reply to:
↑ 3
;
follow-up:
↓ 9
@
13 years ago
azaozz, it looks like 19855-2.patch still runs the contents of <pre></pre>
blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:
Input:
<p><pre>o = "<div>test</div>";</pre></p>
Output (after wpautop()
with 19855-2.patch):
<pre>o = " <div>test</div> <p>";</pre>
This contradicts your earlier comment here:
Generally wpautop should be leaving the content of the
<pre>
tags untouched.
Is this okay?
#9
in reply to:
↑ 8
;
follow-up:
↓ 10
@
13 years ago
Replying to kurtpayne:
azaozz, it looks like 19855-2.patch still runs the contents of
<pre></pre>
blocks through the string replacements. It doesn't time out as described in the original issue, but the text still has newlines inserted. For example:
Input:
<p><pre>o = "<div>test</div>";</pre></p>
Keep in mind that inside <pre> tags the htmlspecialchars are encoded. So that input should be:
<pre>o = "<div>test</div>";</pre>
Then it won't be affected by the other regex.
Yes, 19855-2.patch still runs the content of <pre> tags through the rest of wpautop() similar to how it works now. The (eventual) problem with 19855.patch is that it still uses the non-greedy match-all (.*?)
that causes the problem in this ticket. Thinking if we are going to modernize/refactor the exclusion of <pre> tags from wpautop, we could avoid that.
Tried the approach from wptexturize() with preg_split('|(</?pre( [^>]*)?>)|', $pee, -1, PREG_SPLIT_DELIM_CAPTURE);
. This seems to work well but the code is more complex. Can add a patch later.
#10
in reply to:
↑ 9
;
follow-ups:
↓ 11
↓ 12
@
13 years ago
Replying to azaozz:
Keep in mind that inside <pre> tags the htmlspecialchars are encoded
Thanks, I've updated the unit test appropriately. 562
19855-2.patch still has a slight bug:
Input
Here's some cool code <pre>my cool code</pre> Wasn't that cool?
Output:
<p>Here's some cool code <pre>my cool code</pre> <p> Wasn't that cool?</p>
The top line is missing a closing </p>
tag.
In 19855-3.patch I added this:
$pee = str_ireplace( '<pre', "\n\n<pre", $pee );
to ensure that all <pre>
tags have sufficient whitespace to be processed normally.
#11
in reply to:
↑ 10
@
13 years ago
Replying to kurtpayne:
In 19855-3.patch I added this:
$pee = str_ireplace( '<pre', "\n\n<pre", $pee );to ensure that all
<pre>
tags have sufficient whitespace to be processed normally.
On second thought ... is there a reason we couldn't pass $pee
through force_balance_tags()
before returning it? Seems less hacky.
#12
in reply to:
↑ 10
;
follow-up:
↓ 13
@
13 years ago
Replying to kurtpayne:
...
The top line is missing a closing</p>
tag.
In 19855-3.patch I added this:
$pee = str_ireplace( '<pre', "\n\n<pre", $pee );to ensure that all
<pre>
tags have sufficient whitespace to be processed normally.
Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too. There are more edge cases with other block tags similar to this one, perhaps they can be fixed all together.
On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag. That's the whole point of the HTML editor: it lets you use HTML code but still has an easy to read, "natural" layout.
Replying to kurtpayne:
On second thought ... is there a reason we couldn't pass
$pee
throughforce_balance_tags()
...
Not a good idea imho, force_balance_tags()
is (mostly) for the comments where only limited number of tags are allowed and can't handle more complex HTML.
#13
in reply to:
↑ 12
@
13 years ago
Replying to azaozz:
Yes, <pre> is a block tag so when typing in the HTML editor it has to have a line break before opening and after closing. That's how it comes out of the Visual editor too.
On the other hand when using the HTML editor it assumes the user is aware that block tags have to be treated as blocks of text, i.e. "\n\n" before the opening tag and after the closing tag.
Updated unit test to more closely match real world usage 575.
#14
follow-up:
↓ 15
@
12 years ago
The 19855-4.patch combines "the best" of the previous three patches. As the <pre> tags cannot be nested, there's no need to use regex at all. Splitting on the closing tag and then on the opening tag in each substring only needs strpos() and substr(). Also uses a dummy <pre...></pre>
as placeholder so the rest of wpautop() can handle it correctly.
The only thing left seems to be to try and find some old post_content that includes <pre> and was created with WP 2.0 - 2.3. Think then there were some inconsistencies inside the <pre> tags.
#15
in reply to:
↑ 14
;
follow-ups:
↓ 16
↓ 18
@
12 years ago
Replying to azaozz:
there's no need to use regex at all.
Playing devil's advocate here ... if an author is using the HTML editor and uses upper case tags, this patch won't work.
Suggestions:
- Nothing. If people are using upper case tags in the HTML editor, then it's user error
- Force pre tags to lower case before splitting. This can be done without a regex, but it won't work work for mixed case
$pee = str_replace(array('<pre', '<PRE'), '<pre', $pee); $pee = str_replace(array('</pre>', '</PRE>'), '</pre>', $pee);
- Revert to a regex split
#16
in reply to:
↑ 15
@
12 years ago
Replying to kurtpayne:
... if an author is using the HTML editor and uses upper case tags, this patch won't work.
Right. However the editor is (still) set to output XHTML 1.0 compliant code so upper case tags or attributes are not valid. Expecting users that would type code in the HTML editor (and manually add the htmlspecialchars) to be aware of that :)
On the other hand nearly all of the autop related functions/filters allow for upper case, so we should probably normalize the <pre> tags to lower case (using regex for the splits is a bit slower when there are several <pre> tags).
Not sure how this is going to be handled in the future, HTML 5.0 allows upper case and also some of the old (HTML 3?) syntax like no closing tags for <p>, no self-closing, etc. Probably we will stick to all lower case tag names and attributes as coding standard, but follow the rest of the HTML 5 changes.
#18
in reply to:
↑ 15
@
12 years ago
Replying to kurtpayne:
On second thought: we were already checking only for the lower case <pre
in if (strpos($pee, '<pre') !== false)
so there's no change in behavior.
#20
@
12 years ago
- Resolution set to fixed
- Status changed from new to closed
Closing as fixed, per azaozz and kurtpayne.
Sounds similar to #15600