Make WordPress Core

Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#29557 closed defect (bug) (fixed)

PHP ≤ 5.4.8 Crashes on '[' Character in Posts

Reported by: mrbobdobolina's profile MrBobDobolina Owned by: nacin's profile nacin
Milestone: 4.0.1 Priority: highest omg bbq
Severity: blocker Version: 4.0
Component: Formatting Keywords: wptexturize has-patch commit fixed-major
Focuses: Cc:

Description

After updating to WP 4, one of my longer blog posts is failing to work. But it's only the one post, everything else works fine.

The post page doesn’t render anything, it doesn’t serve anything, all you get in inspector is a blank HTML document.

The post can still be edited in the admin interface, but neither previewing or posting the article allows it to be seen.

A Wordpress 3.9.2 installation with the Twenty Twelve theme works as intended. There are no errors in the Apache or MySQL logs.

The post is 1030 lines long with paragraphs and css declarations and images on almost every line. (I have included it for testing purposes.) If I shorten it to ~793 lines it works as intended. (I have tried shortening it from both ends, with the same result. It does not appear to be related to the specific post content.)

Attachments (24)

ChatTranscript.txt (89.0 KB) - added by MrBobDobolina 10 years ago.
this is the text and html of the post that doesn't work
miqro-29557.patch (5.1 KB) - added by miqrogroove 10 years ago.
miqro-29557.2.patch (9.5 KB) - added by miqrogroove 10 years ago.
miqro-29557.3.patch (9.5 KB) - added by miqrogroove 10 years ago.
miqro-29557.4.patch (9.5 KB) - added by miqrogroove 10 years ago.
Content1.txt (11.5 KB) - added by ArsenyKovalchuk 10 years ago.
textarr_after_pregsplit.txt (15.7 KB) - added by ArsenyKovalchuk 10 years ago.
shortcode_tags_keys.txt (2.8 KB) - added by ArsenyKovalchuk 10 years ago.
miqro-29557.5.patch (9.6 KB) - added by miqrogroove 10 years ago.
Attempt to fix #29608 and further simplify regex to avoid crashes.
miqro-29557.6.patch (4.6 KB) - added by miqrogroove 10 years ago.
miqro-29557.7.patch (5.4 KB) - added by miqrogroove 10 years ago.
Don't match when whitespace appears before shortcode name.
miqro-29557.8.patch (7.3 KB) - added by miqrogroove 10 years ago.
Reduce backtracking in HTML comments.
miqro-29557.9.patch (7.5 KB) - added by miqrogroove 10 years ago.
Per kitchin, no spaces in -->
miqro-29557.10.patch (7.9 KB) - added by miqrogroove 10 years ago.
Per kitchin, allow <!---> and add tests.
beautify-29557.diff (2.6 KB) - added by miqrogroove 10 years ago.
beautify-29557.2.diff (2.7 KB) - added by miqrogroove 10 years ago.
poc_pos_29557.diff (9.7 KB) - added by kitchin 10 years ago.
POC, insufficient, do not use. Less regex, more strpos.
29557-idea-1.diff (2.8 KB) - added by miqrogroove 10 years ago.
Idea #1: Go back to excluding HTML from shortcodes, just don't crash PHP this time.
29557-idea-2.diff (3.0 KB) - added by miqrogroove 10 years ago.
Idea 2: Break Jetpack. Go back to filtering shortcode names.
29557-idea-1.2.diff (2.8 KB) - added by miqrogroove 10 years ago.
Unit test typo.
29557-idea-3.diff (15.9 KB) - added by miqrogroove 10 years ago.
Texturize hat trick: Full shortcode regexp, HTML parsed separately, no regexp inner loops.
29557-idea-3.2.diff (16.6 KB) - added by miqrogroove 10 years ago.
First strpos was buggy, added extra if block.
29557-idea-3.3.diff (17.3 KB) - added by miqrogroove 10 years ago.
Parser function refactored for performance.
29557-unittests.diff (887 bytes) - added by MikeHansenMe 10 years ago.
see #30284

Download all attachments as: .zip

Change History (155)

@MrBobDobolina
10 years ago

this is the text and html of the post that doesn't work

#1 @miqrogroove
10 years ago

  • Component changed from Posts, Post Types to Formatting
  • Keywords wptexturize added
  • Severity changed from normal to major

Neato. php-cgi.exe crashes on my Windows server when I try to run the new HTML regexp for wptexturize() against the attachment.

#2 @miqrogroove
10 years ago

  • Summary changed from No Data Received - long post rendering issue to PHP Crash on Large Block of HTML

#3 @Denis-de-Bernardy
10 years ago

  • Summary changed from PHP Crash on Large Block of HTML to No Data Received - long post rendering issue

Try adding this to your wp-config file:

        if (ini_get('pcre.backtrack_limit') < 1000000) {
            ini_set('pcre.backtrack_limit', 1000000);
        }

#4 follow-up: @miqrogroove
10 years ago

  • Summary changed from No Data Received - long post rendering issue to PHP Crash on Large Block of HTML

MrBobDobolina, can you confirm you are running version 4.0 and not one of the beta versions? I was able to reproduce the crash with a slightly older version that we fixed at [28852]. I'm not sure if this is the same problem.

#5 in reply to: ↑ 4 @MrBobDobolina
10 years ago

Replying to miqrogroove:

MrBobDobolina, can you confirm you are running version 4.0 and not one of the beta versions? I was able to reproduce the crash with a slightly older version that we fixed at [28852]. I'm not sure if this is the same problem.

Can confirm that I am running 4.0. Not a beta.

#6 follow-up: @SergeyBiryukov
10 years ago

Confirmed on 4.0. Fails on this line in wptexturize().

Increasing pcre.backtrack_limit to 1000000 didn't help.

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

#7 in reply to: ↑ 6 @miqrogroove
10 years ago

Replying to SergeyBiryukov:

Confirmed on 4.0. Fails on this line in wptexturize().

Increasing pcre.backtrack_limit to 1000000 didn't help.

Increasing will actually make this worse. I can get a crash on the beta version when the limit is greater than 50000. We thought this was fixed, and I'm getting no errors in the 4.0 version so far. Still looking.

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

#8 @miqrogroove
10 years ago

Getting a backtrack limit on the order of 16000 for the attached file, and recursion limit less than 10.

Sergey, can you confirm you do not get a crash with the backtrack limit set to 10000? This would cause the preg_split() statement to fail silently.

#9 @miqrogroove
10 years ago

Half of the backtracking can be eliminated by changing '[^\[\]<>]' to '[^\[\]<>]++'. The remainder is being caused by the nested alternation following that part. Still looking.

#10 follow-up: @SergeyBiryukov
10 years ago

On PHP 5.2.17 (PCRE 8.02 2010-03-19), I get a crash with any backtrack limit value larger than 709.

On PHP 5.4.29 (PCRE 8.32 2012-11-30), I don't get a crash even with a much larger value, 1,000,000,000.

#11 @miqrogroove
10 years ago

MrBobDobolina may have found the Achilles heel of this regexp. There is only one square brace in the entire post. Remove that one character and try to reproduce the crash again.

#12 @miqrogroove
10 years ago

We may have a poor solution for #28564. What is happening internally is PCRE is looking at every possible combination of HTML and shortcode nesting before checking if there is actually a closing brace for the shortcode. This is a wonderful argument for disallowing HTML nesting within the shortcode feature.

#13 in reply to: ↑ 10 @Denis-de-Bernardy
10 years ago

Replying to SergeyBiryukov:

On PHP 5.2.17 (PCRE 8.02 2010-03-19), I get a crash with any backtrack limit value larger than 709.

On PHP 5.4.29 (PCRE 8.32 2012-11-30), I don't get a crash even with a much larger value, 1,000,000,000.

My hunch is this is the underlying issue. PHP 5.3.7 fixes a few regex related things, and -- best I'm aware -- #8553 et al would not occur when running WP on top of that PHP version or later.

#14 @karpstrucking
10 years ago

crashing on wp4.0 and php5.3.3 until/unless sole [ is removed

#15 @miqrogroove
10 years ago

I'll get started on a patch to import the list of registered shortcodes into wptexturize(). This may be the only way to eliminate backtracking without changing the behavior of the regexp.

#16 @miqrogroove
10 years ago

In miqro-29557.patch:

  • Expand the wptexturize regexp to include the list of registered shortcodes.
  • This will cause escaped shortcodes to become texturized if not registered.
  • Registered shortcodes will not be texturized, even when escaped.
  • Remove all tests involving unregistered shortcodes.
  • Benchmark result: No measurable impact on performance.

For more consistency, we could remove the check on escaped shortcodes altogether and just always texturize them. They are escaped, so they have no special meaning, and they can always be wrapped in <code> blocks just like any other text. I only hesitate to do this because it would represent an additional change in behavior.

#17 @SergeyBiryukov
10 years ago

  • Milestone changed from Awaiting Review to 4.1

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#19 @SergeyBiryukov
10 years ago

  • Milestone changed from 4.1 to 4.0.1

Moving to 4.0.1 for review.

#20 @miqrogroove
10 years ago

There's more work to do on those unit tests to make sure the patch is legit. I'll get a second draft up here tomorrow.

#21 @miqrogroove
10 years ago

miqro-29557.3.patch attached. Made sure it was working as described above. With the shortcode name matching, it is also necessary to allow for the forward slash before the name, and then the shortcode contents after the name become optional.

Last edited 10 years ago by miqrogroove (previous) (diff)

#22 @miqrogroove
10 years ago

  • Keywords has-patch added

Typos and copypasta all fixed now.

#23 @miqrogroove
10 years ago

MrBobDobolina, the attached file miqro-29557.4.patch should fix the problem if you want to hack the formatting.php file on your website. Otherwise, the workaround appears to be removing that one [ character from the blog post until the fixed version is released. Let us know if that helps.

#24 @MrBobDobolina
10 years ago

Thanks miqrogroove

I took out the loose parenthetical the other day when I saw it might cause the problem. It seems to be working now.

You guys rocks!

#25 follow-up: @ArsenyKovalchuk
10 years ago

Hi guys. I've faced with the same issue on the line

$textarr = preg_split( $regex, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

I attached a file Content1.txt. This is a dump of the $text variable just before preg_split. OS Windows 7 Pro, PHP version is 5.3.13, PCRE Library Version 8.12 2011-01-15, pcre.backtrack_limit = 1000000.

I applied the patch miqro-29557.4.patch, it helped to pass preg_split, but it crashes somewhere after. I'll try to locate the line. Could you please take a look as well. Thanks.

Last edited 10 years ago by ArsenyKovalchuk (previous) (diff)

#27 in reply to: ↑ 25 ; follow-up: @ArsenyKovalchuk
10 years ago

Replying to ArsenyKovalchuk:

Hi guys. I've faced with the same issue on the line

$textarr = preg_split( $regex, $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

I attached a file Content1.txt. This is a dump of the $text variable just before preg_split. OS Windows 7 Pro, PHP version is 5.3.13, PCRE Library Version 8.12 2011-01-15, pcre.backtrack_limit = 1000000.

I applied the patch miqro-29557.4.patch, it helped to pass preg_split, but it crashes somewhere after. I'll try to locate the line. Could you please take a look as well. Thanks.

Now it crashes in the loop on the first line of the text (line 250 of the pathced code)

} elseif ( '[' === $first && 1 === preg_match( '/^\[\[?\/?' . $tagregexp . '(?:[^\[\]<>]|<[^>]+>)*+\]\]?$/', $curl ) ) { 

I get following regexp code which crashes

preg_match('/^\[\[?\/?(?:embed|wp_caption|caption|gallery|playlist|audio|video|layerslider|adrotate|acf|go_pricing|go_pricing_youtube|go_pricing_vimeo|go_pricing_screenr|go_pricing_dailymotion|go_pricing_metacafe|go_pricing_html5_video|go_pricing_soundcloud|go_pricing_mixcloud|go_pricing_beatport|go_pricing_audio|go_pricing_map|go_pricing_custom_iframe|free_adv_form|ahm\-pricing\-table|rev_slider|page_polls|poll|ratings|p2p_connected|p2p_related|contact\-form\-7|contact\-form|vc_container_anchor|dt_cell|dt_box|dt_gap|dt_divider|dt_stripe|dt_fancy_image|dt_list_item|dt_list|dt_button|dt_tooltip|dt_highlight|dt_code|dt_code_final|dt_tab|dt_tabs|dt_item|dt_accordion|dt_toggle|dt_quote|dt_call_to_action|dt_teaser|dt_banner|dt_benefits|dt_benefit|dt_progress_bars|dt_progress_bar|dt_contact_form|dt_social_icons|dt_social_icon|dt_map|dt_blog_posts_small|dt_blog_posts|dt_portfolio|dt_portfolio_jgrid|dt_portfolio_slider|dt_small_photos|dt_slideshow|dt_team|dt_testimonials|dt_logos|dt_text|dt_vc_list_item|dt_vc_list|dt_benefits_vc|dt_fancy_video_vc|dt_fancy_title|dt_fancy_separator|types|vc_row|vc_row_inner|vc_column|vc_column_text|vc_separator|vc_text_separator|vc_message|vc_facebook|vc_tweetmeme|vc_googleplus|vc_pinterest|vc_toggle|vc_single_image|vc_tabs|vc_tour|vc_tab|vc_accordion|vc_accordion_tab|vc_carousel|vc_posts_slider|vc_widget_sidebar|vc_button2|vc_cta_button2|vc_video|vc_gmaps|vc_raw_html|vc_raw_js|vc_flickr|vc_progress_bar|vc_pie|layerslider_vc|rev_slider_vc|vc_column_inner)(?![\w-])(?:[^\[\]<>]|<[^>]+>)*+\]\]?$/','[vc_row margin_top="35" margin_bottom="45" padding_left="40" padding_right="40" bg_position="top" bg_repeat="no-repeat" bg_cover="false" bg_attachment="false" padding_top="40" padding_bottom="40" parallax_speed="0.1"][vc_column width="3/4"][dt_benefits_vc columns="4" style="1" dividers="true" image_background="true" target_blank="false" header_size="h5" content_size="normal" number="4" orderby="date" order="desc" animation="scale" category="no-captions,no-buttons" image_background_color="#ffffff"]')

#28 in reply to: ↑ 27 ; follow-up: @miqrogroove
10 years ago

Replying to ArsenyKovalchuk:

I get following regexp code which crashes

preg_match('/^\[\[?\/?(?:embed|wp_caption|caption|gallery|playlist|audio|video|layerslider|adrotate|acf|go_pricing|go_pricing_youtube|go_pricing_vimeo|go_pricing_screenr|go_pricing_dailymotion|go_pricing_metacafe|go_pricing_html5_video|go_pricing_soundcloud|go_pricing_mixcloud|go_pricing_beatport|go_pricing_audio|go_pricing_map|go_pricing_custom_iframe|free_adv_form|ahm\-pricing\-table|rev_slider|page_polls|poll|ratings|p2p_connected|p2p_related|contact\-form\-7|contact\-form|vc_container_anchor|dt_cell|dt_box|dt_gap|dt_divider|dt_stripe|dt_fancy_image|dt_list_item|dt_list|dt_button|dt_tooltip|dt_highlight|dt_code|dt_code_final|dt_tab|dt_tabs|dt_item|dt_accordion|dt_toggle|dt_quote|dt_call_to_action|dt_teaser|dt_banner|dt_benefits|dt_benefit|dt_progress_bars|dt_progress_bar|dt_contact_form|dt_social_icons|dt_social_icon|dt_map|dt_blog_posts_small|dt_blog_posts|dt_portfolio|dt_portfolio_jgrid|dt_portfolio_slider|dt_small_photos|dt_slideshow|dt_team|dt_testimonials|dt_logos|dt_text|dt_vc_list_item|dt_vc_list|dt_benefits_vc|dt_fancy_video_vc|dt_fancy_title|dt_fancy_separator|types|vc_row|vc_row_inner|vc_column|vc_column_text|vc_separator|vc_text_separator|vc_message|vc_facebook|vc_tweetmeme|vc_googleplus|vc_pinterest|vc_toggle|vc_single_image|vc_tabs|vc_tour|vc_tab|vc_accordion|vc_accordion_tab|vc_carousel|vc_posts_slider|vc_widget_sidebar|vc_button2|vc_cta_button2|vc_video|vc_gmaps|vc_raw_html|vc_raw_js|vc_flickr|vc_progress_bar|vc_pie|layerslider_vc|rev_slider_vc|vc_column_inner)(?![\w-])(?:[^\[\]<>]|<[^>]+>)*+\]\]?$/','[vc_row margin_top="35" margin_bottom="45" padding_left="40" padding_right="40" bg_position="top" bg_repeat="no-repeat" bg_cover="false" bg_attachment="false" padding_top="40" padding_bottom="40" parallax_speed="0.1"][vc_column width="3/4"][dt_benefits_vc columns="4" style="1" dividers="true" image_background="true" target_blank="false" header_size="h5" content_size="normal" number="4" orderby="date" order="desc" animation="scale" category="no-captions,no-buttons" image_background_color="#ffffff"]')

So that one command crashes your server if you put it in a php file? I want to be specific, because the backtrack count for that command is only on the order of 400. This should be less than the pattern before the loop.

Last edited 10 years ago by miqrogroove (previous) (diff)

#29 @miqrogroove
10 years ago

It looks like my patch broke part of the shortcodes pattern. I'll look into getting you a better patch and then let's see what happens after that.

#30 in reply to: ↑ 28 @ArsenyKovalchuk
10 years ago

Replying to miqrogroove:

Replying to ArsenyKovalchuk:

I get following regexp code which crashes

preg_match('/^\[\[?\/?(?:embed|wp_caption|caption|gallery|playlist|audio|video|layerslider|adrotate|acf|go_pricing|go_pricing_youtube|go_pricing_vimeo|go_pricing_screenr|go_pricing_dailymotion|go_pricing_metacafe|go_pricing_html5_video|go_pricing_soundcloud|go_pricing_mixcloud|go_pricing_beatport|go_pricing_audio|go_pricing_map|go_pricing_custom_iframe|free_adv_form|ahm\-pricing\-table|rev_slider|page_polls|poll|ratings|p2p_connected|p2p_related|contact\-form\-7|contact\-form|vc_container_anchor|dt_cell|dt_box|dt_gap|dt_divider|dt_stripe|dt_fancy_image|dt_list_item|dt_list|dt_button|dt_tooltip|dt_highlight|dt_code|dt_code_final|dt_tab|dt_tabs|dt_item|dt_accordion|dt_toggle|dt_quote|dt_call_to_action|dt_teaser|dt_banner|dt_benefits|dt_benefit|dt_progress_bars|dt_progress_bar|dt_contact_form|dt_social_icons|dt_social_icon|dt_map|dt_blog_posts_small|dt_blog_posts|dt_portfolio|dt_portfolio_jgrid|dt_portfolio_slider|dt_small_photos|dt_slideshow|dt_team|dt_testimonials|dt_logos|dt_text|dt_vc_list_item|dt_vc_list|dt_benefits_vc|dt_fancy_video_vc|dt_fancy_title|dt_fancy_separator|types|vc_row|vc_row_inner|vc_column|vc_column_text|vc_separator|vc_text_separator|vc_message|vc_facebook|vc_tweetmeme|vc_googleplus|vc_pinterest|vc_toggle|vc_single_image|vc_tabs|vc_tour|vc_tab|vc_accordion|vc_accordion_tab|vc_carousel|vc_posts_slider|vc_widget_sidebar|vc_button2|vc_cta_button2|vc_video|vc_gmaps|vc_raw_html|vc_raw_js|vc_flickr|vc_progress_bar|vc_pie|layerslider_vc|rev_slider_vc|vc_column_inner)(?![\w-])(?:[^\[\]<>]|<[^>]+>)*+\]\]?$/','[vc_row margin_top="35" margin_bottom="45" padding_left="40" padding_right="40" bg_position="top" bg_repeat="no-repeat" bg_cover="false" bg_attachment="false" padding_top="40" padding_bottom="40" parallax_speed="0.1"][vc_column width="3/4"][dt_benefits_vc columns="4" style="1" dividers="true" image_background="true" target_blank="false" header_size="h5" content_size="normal" number="4" orderby="date" order="desc" animation="scale" category="no-captions,no-buttons" image_background_color="#ffffff"]')

So that one command crashes your server if you put it in a php file? I want to be specific, because the trackback count for that command is only on the order of 400. This should be less than the pattern before the loop.

Hi, as you may expect I got that command by logging following inside the loop

$str = '/^\[\[?\/?' . $tagregexp . '(?:[^\[\]<>]|<[^>]+>)*+\]\]?$/';
error_log("preg_match('$str','$curl')");

And, yes, the code crashes the server if I put it in a separate file.

#31 @miqrogroove
10 years ago

One thing I'm not understanding is why there are 3 different shortcodes in your value for $curl there.

#32 follow-up: @miqrogroove
10 years ago

ArsenyKovalchuk, if you could message me through IRC or gmail or my website, I think we could look at more details.

#33 in reply to: ↑ 32 ; follow-ups: @ArsenyKovalchuk
10 years ago

Replying to miqrogroove:

ArsenyKovalchuk, if you could message me through IRC or gmail or my website, I think we could look at more details.

It would be more convenient to contact via gmail, didn't find your email though. Mine is ars.kov@… just send me a message.

I attached a dump of the $testarr just after preg_split. The $text passed to preg_split is in the Content1.txt

#34 in reply to: ↑ 33 @ArsenyKovalchuk
10 years ago

Replying to ArsenyKovalchuk:

Replying to miqrogroove:

ArsenyKovalchuk, if you could message me through IRC or gmail or my website, I think we could look at more details.

It would be more convenient to contact via gmail, didn't find your email though. Mine is ars.kov@… just send me a message.

I attached a dump of the $testarr just after preg_split. The $text passed to preg_split is in the Content1.txt

Also I attached dump of $shortcode_tags keys inside the function call.

#35 in reply to: ↑ 33 ; follow-up: @miqrogroove
10 years ago

Replying to ArsenyKovalchuk:

I attached a dump of the $testarr just after preg_split. The $text passed to preg_split is in the Content1.txt

Yeah I'm not understanding how we arrived at that result. First I'd like to see the formatting.php file, which you could send on gmail or attach here. I need to see that the patch was applied correctly before we look for other bugs.

@miqrogroove
10 years ago

Attempt to fix #29608 and further simplify regex to avoid crashes.

#36 in reply to: ↑ 35 @ArsenyKovalchuk
10 years ago

Replying to miqrogroove:

Replying to ArsenyKovalchuk:

I attached a dump of the $testarr just after preg_split. The $text passed to preg_split is in the Content1.txt

Yeah I'm not understanding how we arrived at that result. First I'd like to see the formatting.php file, which you could send on gmail or attach here. I need to see that the patch was applied correctly before we look for other bugs.

miqro-29557.5.patch works for me. I've also noticed that WP renders pages much faster now. I also had another couple of pages that crashed the server, but they also render ok with the patch.

#37 @miqrogroove
10 years ago

#29450 and #29608 are duplicates at this point. I'm hoping the same patch resolves all three tickets.

#39 @miqrogroove
10 years ago

#29450 was marked as a duplicate.

#40 @miqrogroove
10 years ago

#29608 was marked as a duplicate.

#41 @DrewAPicture
10 years ago

In working on a project for the wp.org Code Reference and I just across this problem -- not being able pass things like less- and greater-than characters via shortcode attributes.

I tested miqro-29557.5.patch and it solves the issues I'm seeing with passing values like 'WP_Query->have_posts()', etc.

+1 for miqro-29557.5.patch.

#42 @miqrogroove
10 years ago

Updated summary of miqro-29557.5.patch:

  • Expand the wptexturize regexp to include the list of registered shortcodes.
  • Avoid backtracking after [ chars by not filtering params in registered shortcodes.
  • This will cause escaped shortcodes and their params to become texturized if not registered.
  • Registered shortcode params will never be texturized, even when escaped.
  • Move all tests involving unregistered shortcodes to a new and improved unit.
  • Update one test involving HTML within shortcode params.

Tests for dangling angle braces within shortcode params have not been added. This feature is not even documented AFAIK. In the Codex, HTML is only mentioned as allowed in between shortcode tags.

Decided not to remove the check on escaped shortcodes. There are too many potential corner cases involving the different regexps for texturize and shortcodes. Instead, I simply nested the if statement to help speed up shortcode processing.

#43 @nacin
10 years ago

  • Priority changed from normal to highest omg bbq
  • Severity changed from major to blocker

#44 @ocean90
10 years ago

#29649 was marked as a duplicate.

#45 follow-up: @miqrogroove
10 years ago

Discovered in follow up to #29649 that PHP versions 5.3.19+ and 5.4.9+ are not affected by this bug.

Currently looking for an old server I could abuse.

#46 @miqrogroove
10 years ago

  • Summary changed from PHP Crash on Large Block of HTML to PHP ≤ 5.4.8 Crashes on '[' Character in Posts

#47 in reply to: ↑ 45 @ihomefinder
10 years ago

Hi miqrogroove,

I'm using this XAMPP (PHP 5.4.7) bundle to test:

XAMPP Version 1.8.1

Replying to miqrogroove:

Discovered in follow up to #29649 that PHP versions 5.3.19+ and 5.4.9+ are not affected by this bug.

Currently looking for an old server I could abuse.

#48 follow-up: @miqrogroove
10 years ago

Results of testing PHP 5.2.4 and PHP 5.2.13.

Using 4.0-beta code.

Test case caused about 60000 backtracks in 5.2.4. It did not crash, but note this would be close to the default backtrack limit for that version.
v5.2.13 reached about 24700 backtracks and then crashed.

Using 4.0.0 code.

Test case caused about 30000 backtracks in 5.2.4.
v5.2.13 reached about 24700 backtracks and then crashed.
Note kovshenin's comment appears to be incorrect https://core.trac.wordpress.org/ticket/12690#comment:37 What we thought was a solution only decreased the backtrack count by half, meaning a post twice as long would still crash PHP.

Using 4.0.0 code with second extra possessive as discussed.

Test case caused about 8000 backtracks in 5.2.4 as well as 5.2.13.

Using the .4 patch code.

Test case caused about 10 backtracks for unregistered shortcodes and 20 backtracks for registered shortcodes.

Using the .5 patch code.

Test case caused about 10 backtracks for unregistered shortcodes and 10 backtracks for registered shortcodes.

I can't yet explain why some people seem to be having better luck with the .5 patch than the .4 patch. My comparison of the two is inconclusive, but I'm comfortable using the .5 patch based on the results so far.

Last edited 10 years ago by miqrogroove (previous) (diff)

#49 in reply to: ↑ 48 @miqrogroove
10 years ago

Replying to miqrogroove:

I can't yet explain why some people seem to be having better luck with the .5 patch than the .4 patch. My comparison of the two is inconclusive, but I'm comfortable using the .5 patch based on the results so far.

Found the answer to that now. The .4 patch did not include the second possessive quantifier, and without it the number of backtracks skyrockets to the order of 1000 in my test case. So if we need an alternative patch for some reason, that would be a possible tweak.

#50 @miqrogroove
10 years ago

#29658 was marked as a duplicate.

#51 @miqrogroove
10 years ago

#29535 was marked as a duplicate.

#52 @brianfeister
10 years ago

Of note, this bug has made posting from my wp-admin impossible, preventing login to the site. Testing @miqrogroove's patches had mixed results, both .4 and .5 patch fix the issue on a local instance I have running but the issue remains when I test on staging.

#53 @miqrogroove
10 years ago

Hi brianfeister, this bug is not related to the wp-admin system and so the patch would not resolve login problems.

#54 @miqrogroove
10 years ago

I can't attend the developer meeting today. I am in favor of committing miqro-29557.5.patch as soon as possible for release with 4.0.1. I can answer any questions about this tonight or tomorrow.

#55 @wonderboymusic
10 years ago

In 29748:

wptexturize() improvements:

  • Expand the wptexturize() RegEx to include the list of registered shortcodes.
  • Avoid backtracking after [ chars by not filtering params in registered shortcodes. This will cause escaped shortcodes and their params to become texturized if not registered.
  • Registered shortcode params will never be texturized, even when escaped.
  • Move all tests involving unregistered shortcodes to a new and improved unit.
  • Update one test involving HTML within shortcode params.

Props miqrogroove.
See #29557.

This ticket was mentioned in IRC in #wordpress-dev by nacin. View the logs.


10 years ago

#57 @nacin
10 years ago

  • Keywords commit fixed-major added

I'd like this to be reviewed by a few others before it gets merged to the branch. In particular, I'm seeking reviews from kovshenin and mdawaffe.

#58 follow-up: @miqrogroove
10 years ago

From IRC:

I think "Shortcodes do not contain other shortcodes." is wrong

To be completely precise, the ']' character is never allowed inside of a shortcode tag, so by extrapolation there is no way to put a raw tag within a tag. Between tags, almost anything is allowed.

If you'd like me to research the technical implications of allowing the '[' character inside of a shortcode, I could entertain that thought. But that sort of thing could take a couple weeks given my schedule.

#59 in reply to: ↑ 58 ; follow-up: @bobbingwide
10 years ago

Replying to miqrogroove:

If you'd like me to research the technical implications of allowing the '[' character inside of a shortcode, I could entertain that thought.

Up to WordPress 3.9, left square brackets in shortcode parameters were accepted without a problem.
Right square brackets could not be used; more often than not they simply ended the shortcode.

IMHO: There's probably little point in re-enabling support for left square brackets without allowing right square brackets as well.

BTW: Nice codex updates.

#60 in reply to: ↑ 59 @miqrogroove
10 years ago

Replying to bobbingwide:

Up to WordPress 3.9, left square brackets in shortcode parameters were accepted without a problem.

I think some combinations of HTML and left brackets might cause problems in 3.9. Much more testing would be needed to know for sure.

#61 follow-up: @jeherve
10 years ago

r29748 creates issues with plugins using shortcodes wrapped in another shortcode, e.g. Jetpack's Contact Form shortcode (multiple contact-field shortcodes wrapped into a contact-form shortcode).

#62 in reply to: ↑ 61 @miqrogroove
10 years ago

Replying to jeherve:

r29748 creates issues with plugins using shortcodes wrapped in another shortcode, e.g. Jetpack's Contact Form shortcode (multiple contact-field shortcodes wrapped into a contact-form shortcode).

There are no issues with the example code on the Jetpack site

[contact-form submit_button_text='YOUR CUSTOM SUBMIT BUTTON TEXT HERE'][contact-field label='Name' type='name' required='1'/][contact-field label='Email' type='email' required='1'/][contact-field label='Website' type='url'/][contact-field label='Comment' type='textarea' required='1'/][/contact-form]

#63 @miqrogroove
10 years ago

There does appear to be a problem with the plugin itself, which does not register the contact-field shortcode. In that case, the contact-field shortcode is always invalid.

#64 @miqrogroove
10 years ago

The Jetpack plugin also fails to add the contact-form shortcode to the no_texturize list. So, it will be up to the plugin developers to fix those problems one way or the other.

#65 @miqrogroove
10 years ago

Added more explanations at http://codex.wordpress.org/Shortcode_API#Nested_Shortcodes

Although the patch isn't yet official in 4.0.1, it would be good/best practice to expect unregistered shortcodes to become texturized.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#67 @miqrogroove
10 years ago

Per IRC discussion, we will not be able to resolve the crash by breaking non-registered shortcodes. :(

In miqro-29557.6.patch:

  • Revert parts of miqro-29557.5.patch.
  • Revert [28773] and remove 'attribute filter' from [28727].
  • Make the shortcode pattern quantifier possessive to avoid backtracks.
  • Security regression: HTML filter avoidance may be possible again per #12690.
  • Does not break unregistered shortcodes.
  • Does not break HTML comments or performance gains from 4.0.
  • Tested PHP 5.2.4, 5.2.13, 5.4.32, and 5.5.8.
  • No crashes or large backtrack counts found in any version.
  • Unit tests updated to show new outputs.

#68 @miqrogroove
10 years ago

Extra thoughts on the security impact:

From a larger perspective, neither patch would resolve potential HTML corruption. With miqro-29557.5.patch a user who is allowed the necessary contexts could do this:

[caption - Is it wise to <a title="allow user content ] here? hmm"> maybe </a> ]

Now assume the context does not support shortcodes, but is texturized. Currently wptexturize() is unaware of the shortcode filter status and will always avoid shortcodes. If shortcodes are in fact disabled, then wptexturize() has avoided the wrong code, resulting in texturized HTML elements.

I'm mentioning this here for two reasons: It is a known bug, and I think miqro-29557.6.patch does not represent a significant regression as I did earlier.

@miqrogroove
10 years ago

Don't match when whitespace appears before shortcode name.

#69 @miqrogroove
10 years ago

Found a small refinement and added more tests based on ideas above.

In miqro-29557.7.patch:

  • Revert parts of miqro-29557.5.patch.
  • Revert parts of [28773] and [28727].
  • Do not crash PHP. Make the pattern quantifier possessive to avoid backtracks.
  • Do not break unregistered shortcodes, e.g. [hello attr="value"].
  • Do not break HTML in shortcode attributes, e.g. [hello attr="<"].
  • Do not match for shortcodes when there is extra whitespace, e.g. [ hello ].
  • Add unit tests to show #12690 was not fully resolved.
  • Tested PHP 5.2.4, 5.2.13, 5.4.32, and 5.5.8.

#70 @miqrogroove
10 years ago

  • Keywords has-patch commit fixed-major removed

Need to test performance with long HTML comments. I'm not seeing any crashes but getting paranoid about this whole backtracking idea.

@miqrogroove
10 years ago

Reduce backtracking in HTML comments.

#71 @miqrogroove
10 years ago

  • Keywords has-patch added

I experimented with several more solutions and this one seems to be the most optimal so far.

In miqro-29557.8.patch:

  • Revert parts of [28773] and [28727] and [29748].
  • Do not crash PHP. Make the shortcode quantifier possessive to avoid backtracks.
  • Reduce backtracking in long HTML comments by 100x.
  • Do not ignore unclosed HTML comments.
  • Do not break unregistered shortcodes, e.g. [hello attr="value"].
  • Do not break HTML in shortcode attributes, e.g. [hello attr="<"].
  • Do not match for shortcodes when there is extra whitespace, e.g. [ hello ].
  • Add unit tests to show #12690 was not fully resolved.
  • Tested PHP 5.2.4, 5.2.13, 5.4.32, and 5.5.8.

#72 @kitchin
10 years ago

The comment parsing could be simplified. Gecko, Webkit and Trident all (apparently) use this:

(<!--.+?-->|<!--+>)

even in exotic doctypes like xml and 4.01 strict. Despite the W3C, \s*> is not allowed in 4.01 strict. And embedded -- are ignored, since sgml legacy is not as important as easily commenting out html (or xml). There's probably some scenario I missed, but this regex is much simpler than the one in the latest patch. I separated the clauses completely with | to avoid any issues combining the greedy and non-greedy parts.

Test strings:

Shows 'ab':

a<!-->b
a<!--->b
a<!---->b
a<!----->b
a<!-- c --->b
a<!-- c -- d -->b

Does not show 'ab':

a<!-- c -- >b<!-- close -->
a<!-- >b<!-- close -->
a<!---- >b<!-- close -->
a<!-- <!-- c --> -->b<!-- close -->

The regex would be a change from current WP because it disallows whitespace before closing >. But that's how the browser renders it anyway.

(Trident junk like <!--[if IE ]> doesn't matter in this context of course.)

Last edited 10 years ago by kitchin (previous) (diff)

#73 @miqrogroove
10 years ago

Can't use .+? due to the backtracking limitations but I should be able to drop the \s*

Some of those test strings should not be valid anyway. <!---> does not seem to have any useful purpose and is expressly forbidden in HTML 5.

@miqrogroove
10 years ago

Per kitchin, no spaces in -->

@miqrogroove
10 years ago

Per kitchin, allow <!---> and add tests.

#74 @miqrogroove
10 years ago

Hi kitchin, thank you for the feedback. Check out miqro-29557.10.patch which should match the browser behavior and give a lower backtrack count than the suggested pattern.

#75 @wonderboymusic
10 years ago

In 29781:

The joys of wptexturize():

  • Revert parts of [28773] and [28727] and [29748].
  • Do not crash PHP. Make the shortcode quantifier possessive to avoid backtracks.
  • Reduce backtracking in long HTML comments by 100x.
  • Do not ignore unclosed HTML comments.
  • Do not break unregistered shortcodes, e.g. [hello attr="value"].
  • Do not break HTML in shortcode attributes, e.g. [hello attr="<"].
  • Do not match for shortcodes when there is extra whitespace, e.g. [ hello ].
  • Add unit tests to show #12690 was not fully resolved.
  • Tested PHP 5.2.4, 5.2.13, 5.4.32, and 5.5.8.

Adds/modifies unit tests.

Props miqrogroove.
See #29557.

#76 follow-up: @kitchin
10 years ago

Miqrogroove, I did test patch 10 and all was fine. I compared it to an algorithm that first did the shortcode regex and then used strpos() to do the tag/comment parsing. The only tests that failed were the expected ones by doing it that way: shortcode-like text within html tags. Also tested it the other way (tags then shortcodes) and got expected failures. So you have the right tests as far as I see.

I thought skipping internal -- in comments would simplify things, but your algorithm does skip them and does not backtrack. Like you said!

#77 @helen
10 years ago

[29781] has some whitespace issues - should use spaces instead of tabs for alignment, and there's a little bit of trailing in there.

#78 @miqrogroove
10 years ago

Whitespace attached.

Codex page expanded and updated.

#79 @miqrogroove
10 years ago

#29810 has been spun off as a separate issue. I'm certain now there will be no way to fix that particular problem until 4.1 or 4.2.

#80 follow-up: @azaozz
10 years ago

Looking at the patches here and wondering: would it be better to do a double preg_split()? Something like:

if ( strpos( $text, '[' ) !== false && strpos( $text, ']' ) !== false ) {
  $has_shortcodes = true;
  $textarr = preg_split( '/generic shortcode regexp/', $text, ... );
} else {
  $textarr = array( $text );
}

foreach ( $textarr as &$part ) {
  if ( ! $has_shortcodes || not a shortcode ) {
    $split = preg_split( '/<.[^>]*>/', $part, etc... );
    foreach ( $split as &$chunk ) {
      // Process the text and tags
    }
  }
}

This will make the regexp simpler, remove a lot of backtracking, prevent cases like in comment 68, and bypass tags in shortcode attributes.

#81 in reply to: ↑ 80 @miqrogroove
10 years ago

Replying to azaozz:

This will make the regexp simpler, remove a lot of backtracking, prevent cases like in comment 68, and bypass tags in shortcode attributes.

Actually, I think it would make most of those problems worse. Let's chat on IRC.

#82 @miqrogroove
10 years ago

IRC Summary:

  • We can try some more strpos() tricks in patches for #29810.
  • A few dry runs showed a 10% speed boost in posts that don't have any shortcodes.
  • Adding preg_split() inside a loop does not seem like a good direction for performance.
  • There are no new concerns for 4.0.1 at this time.

#83 follow-up: @kitchin
10 years ago

azaozz, I tested a patch that does essentially that, and it fails that following tests:

14.Test.. '<br [gallery ...] ... />'
14.Actual '<br [gallery ...] &#8230; />'
14.Expect '<br [gallery ...] ... />'
24.Test.. '<br [[gallery ...]] ... />'
24.Actual '<br [[gallery ...]] &#8230; />'
24.Expect '<br [[gallery ...]] ... />'
39.Test.. '<!-- <br /> [gallery] ... -->'
39.Actual '<!-- <br /> [gallery] &#8230; &#8211;>'
39.Expect '<!-- <br /> [gallery] ... -->'
45.Test.. '[ regex catches this <a href="[quote]">here</a> ]'
45.Actual '[ regex catches this <a href=&#8221;[quote]&#8220;>here</a> ]'
45.Expect '[ regex catches this <a href="[quote]">here</a> ]'

My code uses better tag logic in the second part of your suggestion to catch comments (all done with strpos()), but it's the same idea. The reverse idea, tags first then shortcodes, also fails tests.

Miqro's patch is ideal for the current tests, with one caveat: the shortcode regex is executed twice for each match (once in the implicit preg_split loop and once in the foreach loop). To avoid running the regex twice I have some code that saves data on each part of the split match, but it's execution expensive.

Last edited 10 years ago by kitchin (previous) (diff)

#84 in reply to: ↑ 83 ; follow-up: @miqrogroove
10 years ago

Replying to kitchin:

Miqro's patch is ideal for the current tests, with one caveat: the tag regex is executed twice for each match (once in the implicit preg_split loop and once in the foreach loop).

preg_match_all() might be able to structure the results better than preg_split(). I'm not sure what the speed differences would be. Also, I'm not sure there would be any elegant way to capture the parts between the tags.

#85 @kitchin
10 years ago

I meant to say shortcode regex not tag regex.

#86 @miqrogroove
10 years ago

You know, the best way to avoid that second run of the shortcode regexp would be to empty the value of $default_no_texturize_shortcodes. The one default item in there isn't even a valid shortcode. An empty list would mean there is no reason to run _wptexturize_pushpop_element() for any shortcodes. If anyone would like to second that idea, we could spin that off as another ticket.

#87 follow-ups: @kovshenin
10 years ago

Per @miqrogroove's comment my assumption in #12690 was indeed wrong. The greediness did apply to the matched group, but the alteration within the group still allowed unnecessary backtracking, and without some of the newer optimizations in older PHP versions we get a segfault. Making [^\[\]<>] greedy in all three expressions should solve this specific backtracking issue.

However, I think we need to take a step back, revert everything to 3.9 behavior in 4.0.1 and revisit some of these ideas in 4.1:

  • If we're going to match shortcodes, why aren't we using the existing cached shortcode regex?
  • "Adding preg_split() inside a loop does not seem like a good direction for performance." - why not? How is it different from using preg_match() inside a loop?
  • I like @azaozz's idea of splitting the regex into smaller managable, consistent and possibly reusable chunks

#88 in reply to: ↑ 87 @kitchin
10 years ago

Replying to kovshenin:

  • If we're going to match shortcodes, why aren't we using the existing cached shortcode regex?

See #comment62 through #comment67. Unregistered shortcodes are allowed in order not to break Jetpack Contact.

  • "Adding preg_split() inside a loop does not seem like a good direction for performance." - why not? How is it different from using preg_match() inside a loop?

Don't know, maybe due to the larger arrays created. It's more like preg_match_all. The heuristic is to get the next match.

  • I like @azaozz's idea of splitting the regex into smaller managable, consistent and possibly reusable chunks

Would have to change the spec. The [] and <> parsing must occur at the same time, as it is now, as far as I know.

I do have a simplified algorithm if you want it:

  1. parse for html comments
  2. parse for shortcodes
  3. parse for html tags

It fails three tests now in trunk:

Test:   '<br [gallery ...] ... />'
Expect: '<br [gallery ...] ... />'
Actual: '<br [gallery ...] &#8230; />'

Test:   '<br [[gallery ...]] ... />'
Expect: '<br [[gallery ...]] ... />'
Actual: '<br [[gallery ...]] &#8230; />'

Test:   '[ regex catches this <a href="[quote]">here</a> ]'
Expect: '[ regex catches this <a href="[quote]">here</a> ]'
Actual: '[ regex catches this <a href=&#8221;[quote]&#8220;>here</a> ]'

I don't consider those tests important, by the way. But [] are common in PHP links, and my patch would break:

<a href="http://example.com/?action[string]=1">

A compromise would be to restrict what can go in that string to be considered a possible shortcode.

There is an active ticket on allowing shortcodes in more places, 24990, which my 3-way algorithm would also break I think.

Other re-arrangements of parsing comments, shortcodes and tags break more important tests.

#89 in reply to: ↑ 87 @miqrogroove
10 years ago

Replying to kovshenin:

However, I think we need to take a step back, revert everything to 3.9 behavior in 4.0.1 and revisit some of these ideas in 4.1:

What is your reason for a revert? The trunk version is patched and fixes all concerns raised in this ticket. If you have new concerns, speak now. :)

If we're going to match shortcodes, why aren't we using the existing cached shortcode regex?

  1. As kitchin pointed out, the cached shortcode regex does not match wptexturize() 3.9 behavior and breaks pre-trunk versions of Jetpack.
  2. The cached shortcode regex is incompatible with the HTML parsing strategy.
  3. The cached shortcode regex requires a recursive algorithm.
  4. Having a simplified pattern here is not a reason to revert all the patches for 4.0.

"Adding preg_split() inside a loop does not seem like a good direction for performance." - why not? How is it different from using preg_match() inside a loop?

You are welcome to run some benchmarks of using alternation one time vs. using preg_split against every result of a first split. It is unlikely to improve performance, and it is a red herring in terms of texturization behavior. We have no algorithmic need to do that.

I like @azaozz's idea of splitting the regex into smaller managable, consistent and possibly reusable chunks

This is happening somewhat in #29810 and does not seem relevant to 4.0.1.

#90 follow-ups: @kovshenin
10 years ago

To clarify, I'm not proposing a simplified pattern or algorithm for 4.0.1. I'm proposing to restore 3.9 behavior to stop breaking people's installs, while we take a bit more time to figure things out. The main issue in this ticket is fairly unlikely, but things like #29608, #28564 and possibly others seem pretty critical. Not sure why they were closed as duplicates of this backtracking/segfaults issue.

My concern is that r29748 and r29781 seem a bit radical for a point release. We removed matching <[^>]+> inside the shortcodes regex, allowing < and > alone to match, while we explicitly prohibited that in r28727 and then allowed only "well-formed" HTML in r28773. I guess this was done to fix #29608 but I feel like there may be side-effects. Haven't seen an actual report but a [ in a shortcode attribute also worked pre-4.0 and now it doesn't, and if we remove the rest of [^\[\]] we're pretty much back at square one :)

Re. Jetpack and other plugins that register shortcodes "just in time", obviously not in a point release, but can we look into extending the original shortcode to "support" them? Or maybe we can add a generic shortcode handler in texturize and issue a _doing_it_wrong()? The idea here is that if we're dealing with a shortcode, it's best to handle it in a consistent way across the whole codebase.

@kitchin would love to see the code for your simplified algorithm.

#91 in reply to: ↑ 90 @miqrogroove
10 years ago

Replying to kovshenin:

My concern is that r29748 and r29781 seem a bit radical for a point release.

r29748 was reverted already. What is the concern here?

r29781 is the correct solution based on all feedback so far.

Haven't seen an actual report but a [ in a shortcode attribute also worked pre-4.0 and now it doesn't

We have clear documentation that square braces are not supported in shortcodes. Solarissmoke added that warning to the Codex on 16 March 2011. Now if we want to officially support the left braces and not the right braces that's one thing, but we can't do it without breaking Jetpack or breaking the unit tests for #12690.

#92 in reply to: ↑ 90 ; follow-up: @kitchin
10 years ago

Replying to kovshenin:

@kitchin would love to see the code for your simplified algorithm.

POC patch coming up. It might be good for some future work, but the changes in the unit tests show why this patch is insufficient.

One minor thing: current code treats unclosed comments differently from other unclosed html tags. Unclosed comments already have a unit test: they are treated as comments. Unclosed html tags are treated as non-html text, and I added a unit test documenting that. Not worth worrying about, it just affects whether the broken html tag is texturized or not. The browser will hide it in either case. I just added the test to show the POC can do the same thing.

@kitchin
10 years ago

POC, insufficient, do not use. Less regex, more strpos.

#93 in reply to: ↑ 92 ; follow-up: @miqrogroove
10 years ago

Replying to kitchin:

Unclosed html tags are treated as non-html text

My intuition says we should make a new ticket for that, mark it a defect in v1.5 and have a separate discussion. Would you like to?

#94 @DrewAPicture
10 years ago

#29658 was marked as a duplicate.

#95 in reply to: ↑ 93 @kitchin
10 years ago

Replying to miqrogroove:

Replying to kitchin:

Unclosed html tags are treated as non-html text

My intuition says we should make a new ticket for that, mark it a defect in v1.5 and have a separate discussion. Would you like to?

ticket:29913

Just to be clear, the 10.patch that landed was right for performance and matching specs and is pretty well covered by tests.

#96 follow-up: @miqrogroove
10 years ago

Summary of recent comments:

kitchin In favor of trunk, experimenting to find extra optimizations.
kovshenin Opposed to trunk because of large changes that didn't unify do_shortcode and wptexturize.
azaozz In favor of trunk, plus brainstorming for 4.1.
helen Wants better whitespace (see beautify-29557.2.diff)
bobbingwide Pointed out change in brackets behavior, but not concerned about it.
nacin Seeking feedback from mdawaffe.

#97 in reply to: ↑ 96 ; follow-up: @azaozz
10 years ago

As we are excluding strings from texturize:

  • It seems we don't need to match [[shortcode]], it's enough to match [shortcode]. The extra [ ] are unaffected by texturize.
  • Following the same logic we don't need to match [/shortcode] and </tag>. There is nothing that needs excluding from texturize in there.

#98 follow-up: @SergeyBiryukov
10 years ago

In 29873:

Fix whitespace issues in [29781]. Remove a redundant comment.

props miqrogroove.
see #29557.

#99 in reply to: ↑ 97 @miqrogroove
10 years ago

Replying to azaozz:

As we are excluding strings from texturize:

Escaped and closing tags both change the behavior of texturize. This is consistent with all previous versions.

#100 in reply to: ↑ 98 @miqrogroove
10 years ago

  • Keywords commit fixed-major added

Replying to SergeyBiryukov:

In 29873:
Fix whitespace issues in [29781]. Remove a redundant comment.

Thank you Sergey and wonderboymusic for the commits. I'm putting nacin's keywords back on now.

#101 @miqrogroove
10 years ago

I'm going to work on offering one or two alternative patches that take slightly different strategies like just breaking old versions of Jetpack (3.2 was released today) or just breaking the plugins that use HTML inside of attributes. We are ultimately headed in the direction of harmonizing shortcodes and texturization, but I doubt that can happen in the 4.0 branch. There is also still some uncertainty about exactly which patch is the best fit at this point, and I want guys like nacin to be able to see the options exactly.

@miqrogroove
10 years ago

Idea #1: Go back to excluding HTML from shortcodes, just don't crash PHP this time.

@miqrogroove
10 years ago

Idea 2: Break Jetpack. Go back to filtering shortcode names.

@miqrogroove
10 years ago

Unit test typo.

#102 in reply to: ↑ 84 @miqrogroove
10 years ago

Replying to miqrogroove:

Replying to kitchin:

Miqro's patch is ideal for the current tests, with one caveat: the tag regex is executed twice for each match (once in the implicit preg_split loop and once in the foreach loop).

preg_match_all() might be able to structure the results better than preg_split(). I'm not sure what the speed differences would be. Also, I'm not sure there would be any elegant way to capture the parts between the tags.

Looks like PREG_OFFSET_CAPTURE gives us the data we need, and all that's needed is a tweak to the capturing groups. There might be benefits also from not splitting the string. I'm wondering if we can do a similar match_all using the full shortcode regexp and then loop through the results with some magical algebra. I'll play with that idea more as time allows.

#103 in reply to: ↑ 76 @miqrogroove
10 years ago

Replying to kitchin:

Miqrogroove, I did test patch 10 and all was fine. I compared it to an algorithm that first did the shortcode regex and then used strpos() to do the tag/comment parsing. The only tests that failed were the expected ones by doing it that way: shortcode-like text within html tags. Also tested it the other way (tags then shortcodes) and got expected failures. So you have the right tests as far as I see.

For the shortcodes-first strategy to work, we would have to remove the shortcodes before looking for HTML. (Assuming we continue to allow HTML in shortcodes as well). This would work only on a very limited basis, otherwise the function would become recursive and lead to DoS problems. I'll keep working on this at least to see if it runs faster or slower than the trunk code.

#104 follow-up: @kitchin
10 years ago

I guess it's worth looking at my patch (more strpos, less regex) https://core.trac.wordpress.org/attachment/ticket/29557/poc_pos_29557.diff if this is the algorithm people want:

Look for 1. HTML comments, 2. then possible shortcodes, 3. then HTML tags (not comments).

It proceeds left to right and applies the three steps as it goes. Pretty much the same as preg_split() and preg_match_all(), but with a different data structure. May need to replace (2) with actual shortcodes instead of possible shortcodes.

#105 in reply to: ↑ 104 @miqrogroove
10 years ago

Replying to kitchin:

I guess it's worth looking at my patch (more strpos, less regex) https://core.trac.wordpress.org/attachment/ticket/29557/poc_pos_29557.diff if this is the algorithm people want:

Look for 1. HTML comments, 2. then possible shortcodes, 3. then HTML tags (not comments).

Unfortunately we have not yet reached a unanimous decision on whether to allow HTML inside of shortcodes, and if we do, we cannot process HTML comment syntax first. :( I'll look at that diff again but most likely we will work with what I've written so far. And if you have time to help, there are some parts that I could delegate.

@miqrogroove
10 years ago

Texturize hat trick: Full shortcode regexp, HTML parsed separately, no regexp inner loops.

#106 @miqrogroove
10 years ago

Was able to finish up a draft tonight. See 29557-idea-3.diff Shiny new parser in only 300 lines of code. Needs performance testing to find out if it's going to work as advertised. Needs review by core team to decide if that's the direction we want to take in 4.0.1.

@miqrogroove
10 years ago

First strpos was buggy, added extra if block.

This ticket was mentioned in Slack in #core by toscho. View the logs.


10 years ago

#108 @miqrogroove
10 years ago

That latest patch idea runs about 30x slower than trunk code :( Frustrating but not unexpected.

@miqrogroove
10 years ago

Parser function refactored for performance.

#109 @miqrogroove
10 years ago

In 29557-idea-3.3.diff I refactored the bottom half of the new parser function to eliminate all uses of the array_splice() function. The new output strategy is to build a third array from the first two arrays rather than splicing them together. It runs about 3x slower than the trunk version of wptexturize(), which is much more reasonable.

This ticket was mentioned in Slack in #core by mikengarrett. View the logs.


10 years ago

#111 @miqrogroove
10 years ago

If there is any more feedback about shortcodes and wptexturize, let us know now. It looks like we are going to choose one of these three options:

  1. Block HTML when it appears in "nested" shortcodes. My latest patch is able to do this. Yay!
  2. Just block all HTML in shortcodes. It may be unpopular, but it runs fast and solves the remaining problems.
  3. Allow HTML in all shortcodes. We do not have a full patch for this yet. At minimum, we would need a plugin-friendly way to enable recursive texturization. /wince

#112 follow-ups: @MikeNGarrett
10 years ago

I definitely agree with the first approach, although I would rather go with the second. I'm afraid the second would alienate a lot of content managers and plugin authors.

I ran through phpunit tests put forward in 29557-idea-3.3 and can comfirm they passed. I also did my own testing with one of the main shortcode gallery plugins, Shortcodes Ultimate. The only feature that didn't work with the patch is their own version of allowing shortcodes to be attributes of other shortcodes, for example
[su_button url="{su_spoiler title='Spoiler title'}"] Button [/su_button]

The output in this case is broken, which is to be expected. The problem is that it's outputting invalid code to the front-end, so anyone that's currently using this format would have broken content.

On the other hand, shortcode nesting still worked, for example

[su_accordion]
[_su_spoiler title="Spoiler title"] Spoiler content [_/su_spoiler]
[_su_spoiler title="Spoiler title"] Spoiler content [_/su_spoiler]
[_su_spoiler title="Spoiler title"] Spoiler content [_/su_spoiler]
[/su_accordion]

All other examples I tested worked perfectly.

#113 @bobbingwide
10 years ago

I'm following this with interest, albeit from a bit of a distance, since I believe I may have a slightly different set of requirements to the average user.

My requirements are:

  1. Allow HTML in parameters in shortcodes e.g. [bw_blockquote text="A <b>bold</b> statement"]
  2. Allow stuff that might look like HTML in parameters in shortcodes. See #29608
  3. Allow shortcodes to pull in other content which may also contain shortcodes
  4. One day, allow shortcodes in parameters

In order to achieve the first three I've had to resequence the filter hooks for 'the_content'.

My current solution, when I need to do the above is:

  remove_filter( 'the_content', 'wpautop' );
  remove_filter( 'the_content', 'shortcode_unautop' );
  if ( $autop ) {
    add_filter( 'the_content', 'bw_wpautop', 99);
  } else {
    remove_filter( 'the_content', 'bw_wpautop', 99 );
  } 

where bw_wpautop which can be toggled on or off as required.

My preference for texturize and wpautop would be to leave them right to the end, or not run them at all.

I'm just wondering if anyone has tried a similar approach and seen what sort of results you get from the unit test cases.

BTW. I've currently got 199 shortcodes in my development environment ( $shortcode_tags).
Slightly more than the 7 shortcodes provided by core :-)

#114 in reply to: ↑ 112 @miqrogroove
10 years ago

Replying to MikeNGarrett:

The only feature that didn't work with the patch is their own version of allowing shortcodes to be attributes of other shortcodes, for example
[su_button url="{su_spoiler title='Spoiler title'}"] Button [/su_button]

This is broken? Or is the plugin doing something weird internally?

#115 @nacin
10 years ago

  • Owner set to nacin
  • Status changed from new to assigned

This ticket was mentioned in Slack in #core by bobbingwide. View the logs.


10 years ago

#117 in reply to: ↑ 112 ; follow-up: @miqrogroove
10 years ago

Replying to MikeNGarrett:

The only feature that didn't work with the patch is their own version of allowing shortcodes to be attributes of other shortcodes

I couldn't figure out if your example was valid. I was able to get this to work with the same plugin:

[su_button url="{su_meta key='url_for_testing'}"]Button text[/su_button]

I added a custom field named url_for_testing and set the value to http://wordpress.org/ and then the plugin created a button that linked me to the WordPress site as expected.

#118 in reply to: ↑ 117 @MikeNGarrett
10 years ago

Replying to miqrogroove:

Replying to MikeNGarrett:

The only feature that didn't work with the patch is their own version of allowing shortcodes to be attributes of other shortcodes

I couldn't figure out if your example was valid. I was able to get this to work with the same plugin:

[su_button url="{su_meta key='url_for_testing'}"]Button text[/su_button]

I added a custom field named url_for_testing and set the value to http://wordpress.org/ and then the plugin created a button that linked me to the WordPress site as expected.

Looks like I had my test wrong. I've been continuing to test, but haven't been able to find any issues.

#119 @nacin
10 years ago

In 30449:

Anchor texturize to shortcodes to improve regex efficiency.

props miqrogroove.
see #29557 for segfault issues.

#120 @nacin
10 years ago

In 30450:

Anchor texturize to shortcodes to improve regex efficiency.

For the 4.0 branch; see [30449] for trunk.

props miqrogroove.
see #29557 for segfault issues.

#121 @nacin
10 years ago

In 30452:

Anchor texturize to shortcodes to improve regex efficiency.

For the 3.9 branch; see [30449] for trunk.

props miqrogroove.
see #29557 for segfault issues.

#122 @nacin
10 years ago

In 30455:

Anchor texturize to shortcodes to improve regex efficiency.

Merges [30452] to the 3.8 branch.

props miqrogroove.
see #29557 for segfault issues.

#123 @nacin
10 years ago

In 30456:

Anchor texturize to shortcodes to improve regex efficiency.

Merges [30452] to the 3.7 branch.

props miqrogroove.
see #29557 for segfault issues.

#124 follow-up: @nacin
10 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

I'll be opening a new ticket for 4.0.2 with remaining issues. If you have/know of remaining issues, leave a comment here.

#125 @miqrogroove
10 years ago

Thank you nacin for the commits. I will update Codex and things relevant to #29661. Let me know if I can help write the core post as needed.

p.s. :) http://halfelf.org/2014/resilient-responses-in-reviews/alot-of-patches/

This ticket was mentioned in Slack in #core by johnbillion. View the logs.


10 years ago

This ticket was mentioned in Slack in #forums by ipstenu. View the logs.


10 years ago

This ticket was mentioned in Slack in #forums by ipstenu. View the logs.


10 years ago

This ticket was mentioned in Slack in #forums by ipstenu. View the logs.


10 years ago

This ticket was mentioned in Slack in #forums by ipstenu. View the logs.


10 years ago

#131 in reply to: ↑ 124 @speedpartner
10 years ago

Replying to nacin:
[... removed, not related ...]

Last edited 10 years ago by speedpartner (previous) (diff)
Note: See TracTickets for help on using tickets.