Make WordPress Core

Opened 15 years ago

Closed 10 years ago

Last modified 10 years ago

#12690 closed defect (bug) (fixed)

Square brackets breaking links that contain square brackets

Reported by: iandstewart's profile iandstewart Owned by: wonderboymusic's profile wonderboymusic
Milestone: 4.0 Priority: normal
Severity: normal Version:
Component: Formatting Keywords: wptexturize has-patch
Focuses: Cc:

Description

In the editor, wrapping square brackets around an anchor with a URL that contains square brackets, like so

[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&amp;ref=sr_gallery_7&amp;&amp;ga_search_query=keep+calm+and+carry+on&amp;ga_search_type=handmade&amp;ga_page=13&amp;includes[]=tags&amp;includes[]=title">KeepCalmPosters</a> ]

converts the last double prime in the href attribute to the character code for the right double quotation mark, like so

[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&amp;ref=sr_gallery_7&amp;&amp;ga_search_query=keep+calm+and+carry+on&amp;ga_search_type=handmade&amp;ga_page=13&amp;includes[]=tags&amp;includes[]=title&#8221;>KeepCalmPosters</a> ]

Attachments (11)

bmb_0003_t12690.patch (711 bytes) - added by bumbu 15 years ago.
12690.2.diff (529 bytes) - added by MikeHansenMe 12 years ago.
12690.3.patch (529 bytes) - added by kurtpayne 12 years ago.
Use plus instead of star plus
12690.unit-test.patch (781 bytes) - added by kurtpayne 12 years ago.
miqro-12690.patch (4.0 KB) - added by miqrogroove 11 years ago.
Fix shortcode matching and also #27602
miqro-12690.2.patch (4.8 KB) - added by miqrogroove 11 years ago.
Also fix #8912 using same regexp.
miqro-12690.3.patch (5.5 KB) - added by miqrogroove 10 years ago.
Tweaks suggested by mdawaffe.
miqro-12690.4.patch (6.9 KB) - added by miqrogroove 10 years ago.
More tweaks to ensure shortcode compatibility.
miqro-12690.5.patch (7.0 KB) - added by miqrogroove 10 years ago.
Revert one of the previous regexp edits not needed.
12690.diff (1.2 KB) - added by kovshenin 10 years ago.
12690.3.diff (1.1 KB) - added by kovshenin 10 years ago.

Download all attachments as: .zip

Change History (52)

#1 @nacin
15 years ago

  • Milestone changed from Unassigned to 3.0

#2 @bumbu
15 years ago

  • Cc bmbalex@… added
  • Keywords texturize added
  • Owner set to bumbu
  • Status changed from new to reviewing

problem was in dividing link into smaller parts by preg_split, and function was transforming single quotes into html format

with these patch function 'wptexturize' don't take in account sqare empty parentheses, and link with empty parantheses will not be divided into smaller parts => no unwanted quotes transformation

#3 @bumbu
15 years ago

  • Keywords has-patch added; texturize removed
  • Owner bumbu deleted

#4 @nacin
14 years ago

  • Component changed from General to Formatting
  • Keywords needs-testing needs-unit-tests added
  • Milestone changed from 3.0 to Future Release

#5 @MikeHansenMe
12 years ago

I was unable to apply the patch to my install however I looked at the code change. I was able to get the code working on my install. I will include my patch in case anyone else has trouble with it. Fixes problem for me.

#6 follow-up: @MikeHansenMe
12 years ago

another problem with this line of code is that it does not show properly in the feed.

item rdf:about="http://localhost/wp/?p=1">
	<title>Hello world!</title>
	<link>http://localhost/wp/?p=1</link>
	 <dc:date>2012-09-10T20:58:44Z</dc:date>
	<dc:creator>admin</dc:creator>
			<dc:subject><![CDATA[Uncategorized]]></dc:subject>
	<description>[photos by</description>
	<content:encoded><![CDATA[<p>[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&amp;ref=sr_gallery_7&amp;&amp;ga_search_query=keep+calm+and+carry+on&amp;ga_search_type=handmade&amp;ga_page=13&amp;includes[]=tags&amp;includes[]=title&#8221;>KeepCalmPosters</a> ]</p>
]]></content:encoded>
	</item>

@kurtpayne
12 years ago

Use plus instead of star plus

#7 in reply to: ↑ 6 @kurtpayne
12 years ago

  • Cc kurtpayne added

New patch ... almost identical to 12690.2.diff except .*+ has been simplified to .+.

Replying to MikeHansenMe:

another problem with this line of code is that it does not show properly in the feed.

	<content:encoded><![CDATA[<p>[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&amp;ref=sr_gallery_7&amp;&amp;ga_search_query=keep+calm+and+carry+on&amp;ga_search_type=handmade&amp;ga_page=13&amp;includes[]=tags&amp;includes[]=title&#8221;>KeepCalmPosters</a> ]</p>
]]></content:encoded>

The patch fixes the missing closing quote on the feed, too.

#8 @MikeHansenMe
12 years ago

12690.3.patch works great for me.

#9 @kurtpayne
12 years ago

  • Keywords dev-feedback added; needs-unit-tests removed

Unit test attached. Still needs dev feedback since it's touching wptexturize. The patch doesn't break existing tests.

#10 @MikeHansenMe
12 years ago

  • Cc mdhansen@… added

#11 @nacin
11 years ago

This looks like a real-world example of #18575. Nice to have.

I'm OK with this, at a glance. Will get someone else to look at it.

#12 @miqrogroove
11 years ago

  • Keywords wptexturize needs-patch added; has-patch needs-testing dev-feedback removed

Some good ideas here, but these patches don't work as advertised.

\[.*+\] simply breaks shortcode detection.

\[.+\] prevents matches on [] but still matches [ ... < ... [] and [] ... > ... ]

#13 @miqrogroove
11 years ago

I'm looking if there's a way to make variable length expressions atomic so that the engine won't backtrack inside an element. But given the complexity of that sort of thing, I'm tempted to just rewrite the entire expression instead, using conditions instead of alternation.

#14 @miqrogroove
11 years ago

Here's a simple strategy:

/(<.*>|\[[^<]*\])/

I'll put that in a new patch and then do some benchmarks against a conditional strategy. My intuition tells me the alternation pattern is not optimal.

@miqrogroove
11 years ago

Fix shortcode matching and also #27602

#15 follow-up: @miqrogroove
11 years ago

  • Keywords has-patch added; needs-patch removed

In miqro-12690.patch:

  • Assume shortcodes never contain nested shortcodes or HTML elements.
  • Do not match escaped shortcodes like [[gallery]].
  • Don't waste time checking empty( $curl ) inside the loop.
  • Also fixes #27602. No reason to patch this twice!
  • Includes unit test by kurtpayne.
  • Added several more unit tests.

@miqrogroove
11 years ago

Also fix #8912 using same regexp.

#16 in reply to: ↑ 15 ; follow-ups: @mdawaffe
10 years ago

Replying to miqrogroove:

In miqro-12690.patch:

It seems like we should not texturize things inside escaped shortcodes. The main use case for escaped shortcodes is to talk about shortcodes. You don't want people trying to copy and paste your examples and getting stuck with curly quotes.

So [[gallery ...]] should be preserved and [[gallery ...]...[gallery ...]] should either be preserved or be transformed to [[gallery ...]&#8230;[gallery ...]], I'm not sure.

Aside: there's a parse error in the test file :)

#17 follow-up: @miqrogroove
10 years ago

Then we would have to decide what to do with [[caption]]Hello World[[/caption]]

#18 in reply to: ↑ 16 ; follow-up: @miqrogroove
10 years ago

Replying to mdawaffe:

[[gallery ...]...[gallery ...]] should either be preserved or be transformed to [[gallery ...]&#8230;[gallery ...]], I'm not sure.

Sorry, but that needs to be a separate ticket. The shortcode system doesn't support that syntax at all.

@miqrogroove
10 years ago

Tweaks suggested by mdawaffe.

#19 follow-up: @miqrogroove
10 years ago

In miqro-12690.3.patch:

  • Assume shortcodes never contain nested shortcodes or HTML elements.
  • Allow escaped shortcodes like [[gallery]] and do not texturize them either.
  • Do texturize between escaped shortcodes like [[code]]Hello World[[/code]]
  • Don't waste time checking empty( $curl ) inside the loop.
  • Also fixes #8912.
  • Also fixes #27602.
  • Includes unit test by kurtpayne.
  • Added several more unit tests.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#21 in reply to: ↑ 17 @miqrogroove
10 years ago

Replying to miqrogroove:

Then we would have to decide what to do with [[caption]]Hello World[[/caption]]

fwiw, this gives weird output in 3.8 and 3.9:

[caption]]Hello World[[/caption]

I'll take a closer look at shortcodes.php to figure out why it does that.

#22 in reply to: ↑ 18 @mdawaffe
10 years ago

Replying to miqrogroove:

Replying to mdawaffe:

[[gallery ...]...[gallery ...]] should either be preserved or be transformed to [[gallery ...]&#8230;[gallery ...]], I'm not sure.

Sorry, but that needs to be a separate ticket. The shortcode system doesn't support that syntax at all.

Sorry - I meant [[code]foo[/code]].

#23 in reply to: ↑ 19 @mdawaffe
10 years ago

Replying to miqrogroove:

In miqro-12690.3.patch:

  • Do texturize between escaped shortcodes like [[code]]Hello World[[/code]]

I haven't looked at the patch to see if it matters, but the correct way to escape [code]foo[/code] is [[code]foo[/code]] not [[code]]foo[[/code]].

#24 @miqrogroove
10 years ago

Yes, there is an unfortunate bug in shortcodes.php. If we want to write something like this ...

The proper way to end the caption is with the [[/caption]] tag.

... the "escaped" tag is ignored by the shortcode engine and the output is not unescaped as expected. As a result, the input is required to be unescaped by the user ...

The proper way to end the caption is with the [/caption] tag.

... to get the correct output, whereas the syntax changes again with tag pairs ...

The caption code is [[caption] like this [/caption]].

I think I can accommodate this by adjusting the unit tests and tweaking one or two conditions. The regexp should still work because it is not checking the symmetry of extra braces.

#25 @miqrogroove
10 years ago

Also consider:

The caption code is [caption] like this [/caption]].

The second tag is not considered escaped by the shortcode system. :(

There are going to be several highly ambiguous test cases for texturize. Can't be avoided.

@miqrogroove
10 years ago

More tweaks to ensure shortcode compatibility.

#26 @miqrogroove
10 years ago

In miqro-12690.4.patch:

  • Assume shortcodes never contain nested shortcodes or HTML elements.
  • Allow escaped shortcodes like [[tag]] or [[tag] or [tag]] and do not texturize them either.
  • Do texturize between escaped shortcodes like [[code]Hello World[/code]]
  • Don't waste time checking empty( $curl ) inside the loop.
  • Also fixes #8912.
  • Also fixes #27602.
  • Includes unit test by kurtpayne.
  • Added several more unit tests.

#27 in reply to: ↑ 16 ; follow-up: @mdawaffe
10 years ago

Replying to mdawaffe:

It seems like we should not texturize things inside escaped shortcodes.

I just realized that my argument wasn't very good. People talking about shortcodes will probably also enclose examples in <code> blocks, which we already don't texturize. In conclusion: I'm not sure what the expected behavior is.

@miqrogroove
10 years ago

Revert one of the previous regexp edits not needed.

#28 in reply to: ↑ 27 @miqrogroove
10 years ago

Replying to mdawaffe:

Replying to mdawaffe:

It seems like we should not texturize things inside escaped shortcodes.

I just realized that my argument wasn't very good. People talking about shortcodes will probably also enclose examples in <code> blocks, which we already don't texturize. In conclusion: I'm not sure what the expected behavior is.

Works for me because:

  • The old texturize treated escaped and un-escaped shortcodes the same.
  • shortcodes.php parses escaped and un-escaped shortcodes the same way.
  • miqro-12690.2.patch would have altered the shortcode input by texturizing escaped codes.
  • Now we can make sure escaped codes get treated just like HTML comments.

Patch updated again. The previous was a bit rushed.

#29 @wonderboymusic
10 years ago

  • Owner set to wonderboymusic
  • Resolution set to fixed
  • Status changed from reviewing to closed

In 28727:

In wptexturize(), ensure that texturization does not corrupt contents of HTML elements, HTML comments, and smartcode attributes.

Adds a variety of unit tests/assertions.

Props miqrogroove.
Fixes #12690, #8912, #27602.

#30 @wonderboymusic
10 years ago

  • Milestone changed from Future Release to 4.0

#31 follow-ups: @kovshenin
10 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

r28727 has too much backtracking which can cause segfaults in libpcre.

$text = '';
for ( $i = 0; $i <= 300; $i++ )
    $text .= 'lorem ipsum dolor sit amet ';

$text = '[' . trim( $text ) . ']';
echo wptexturize( $text );

// /var/log/syslog: Jul 23 13:40:44 debian kernel: [56656.209599] php5-fpm[6990]: segfault at 7fff4db6cf30 ip 00007f9a269d17ed sp 00007fff4db6cef0 error 6 in libpcre.so.3.13.1[7f9a269be000+3c000]

A simple workaround would be to not backtrack to try and match an empty [] shortcode, see 12690.diff. Here's a little perl script if you'd like to compare the amount of backtracking on both variants.

That said, we should probably be looking into using the data in $shortcode_tags in that regex, or even get_shortcode_regex() entirely, that would limit the search to the actual shortcodes that are available, and it will probably break so many unit tests which rely on non-existent shortcodes such as [code] or [...].

I'm running PHP 5.6.0beta3 on Debian, PCRE version 8.30 2012-02-04 with the default backtrack_limit of 1M. If you're having trouble reproducing, try increasing the number of loops.

Last edited 10 years ago by kovshenin (previous) (diff)

#32 in reply to: ↑ 31 @soulseekah
10 years ago

Replying to kovshenin:

r28727 has too much backtracking which can cause segfaults in libpcre.

To add to the discussion on segfaults in libpcre: https://lists.exim.org/lurker/message/20100430.100815.4f7be374.en.html (especially point number 4)

Basically it's a stack overflow due to too much recursion.

Setting ulimit -s 65535 will increase the stack size on the machine. But still, a large enough string will eventually overflow. Setting a good backtrack_limit to a reasonable value would prevent PHP from faulting at least, although it's probably difficult to find a good limit since it's not the count that matters but rather the amount of memory on the stack.

Another option is to compile libpcre from source and configure it with the --disable-stack-for-recursion flag. This will disable stack-based recursion and enable heap-based recursion.

Last edited 10 years ago by soulseekah (previous) (diff)

@kovshenin
10 years ago

#33 follow-up: @miqrogroove
10 years ago

http://www.regular-expressions.info/possessive.html

Reading...

kovshenin, does your patch solve the problem? Or should I continue looking for solutions? I'd like to know your results as I follow up to this.

#34 in reply to: ↑ 33 @kovshenin
10 years ago

Replying to miqrogroove: Yeah, the patch uses a greedy ++ and according to Perl it's way more efficient. In the examples I tried it stopped segfaulting. I'll be able to tell more after I run this patch on WordPress.com. I'd also like to hear some opinion regarding reusing all or part of get_shortcode_regex().

#35 @miqrogroove
10 years ago

As far as I understand that article so far, it means the existing three greedy quantifiers should also be made possessive because the proceeding chars are mutually exclusive and there should be no chance of finding matches through backtracking. That needs to be tested though. In other words...

/(<(?(?=!--).+?--\s*>|[^>]+>)|\[\[?(?:[^\[\]<>]|<[^>]+>)+\]\]?)/s

becomes

/(<(?(?=!--).+?--\s*>|[^>]++>)|\[\[?(?:[^\[\]<>]++|<[^>]++>)++\]\]?)/s

#36 in reply to: ↑ 31 @miqrogroove
10 years ago

Replying to kovshenin:

That said, we should probably be looking into using the data in $shortcode_tags in that regex, or even get_shortcode_regex() entirely, that would limit the search to the actual shortcodes that are available, and it will probably break so many unit tests

Something to consider for a later version. However, I assumed the longer form of alternation would be massively slower for the inner loop, redundant, and it would significantly alter the behavior of wptexturize. Those sorts of things tend to get bogged down with review and lack of traction. Besides, I'm not sure there is a need to make that change just for the sake of our curly quotes.

@kovshenin
10 years ago

#37 @kovshenin
10 years ago

It looks like a single possessive quantifier for the shortcode group is sufficient to stop any backtracking after anything has been matched in that group. [^\[\]<>] doesn't really have to be greedy because it gets its greediness from the + outside of the group. See 12690.3.diff.

We've been running this on WordPress.com for a few hours now and everything looks good so far.

#38 @miqrogroove
10 years ago

I should be able to test this some more by Sunday.

#39 @miqrogroove
10 years ago

I am in favor of 12690.3.diff.

Performance results:

Adding the one possessive quantifier outside the shortcode subgroup gives about 2% faster CPU time for handling that one regexp only.

When also adding possessives in other places, there is no additional benefit. Placing a new possessive inside the subgroup could create a "nested indefinite repeat" which would take more CPU time as explained in the PHP manual. Also, patterns such as [^>]+> might be optimized internally, which would explain why there is no difference from [^>]++>.

12690.3.diff appears to be optimal.

#40 @wonderboymusic
10 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

[29431] missed the ticket.

#41 @miqrogroove
10 years ago

kovshenin and soulseekah, any insights you may have on #29450 would be appreciated. I've made two substantial attempts to further reduce backtracking in that same regexp. If the latest patch works I'll be happy. If not, I'm wondering why everyone is getting different results with the same test cases?

Note: See TracTickets for help on using tickets.