#12690 closed defect (bug) (fixed)
Square brackets breaking links that contain square brackets
Reported by: | iandstewart | Owned by: | wonderboymusic |
---|---|---|---|
Milestone: | 4.0 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Formatting | Keywords: | wptexturize has-patch |
Focuses: | Cc: |
Description
In the editor, wrapping square brackets around an anchor with a URL that contains square brackets, like so
[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&ref=sr_gallery_7&&ga_search_query=keep+calm+and+carry+on&ga_search_type=handmade&ga_page=13&includes[]=tags&includes[]=title">KeepCalmPosters</a> ]
converts the last double prime in the href attribute to the character code for the right double quotation mark, like so
[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&ref=sr_gallery_7&&ga_search_query=keep+calm+and+carry+on&ga_search_type=handmade&ga_page=13&includes[]=tags&includes[]=title”>KeepCalmPosters</a> ]
Attachments (11)
Change History (52)
#2
@
15 years ago
- Cc bmbalex@… added
- Keywords texturize added
- Owner set to bumbu
- Status changed from new to reviewing
#4
@
14 years ago
- Component changed from General to Formatting
- Keywords needs-testing needs-unit-tests added
- Milestone changed from 3.0 to Future Release
#5
@
12 years ago
I was unable to apply the patch to my install however I looked at the code change. I was able to get the code working on my install. I will include my patch in case anyone else has trouble with it. Fixes problem for me.
#6
follow-up:
↓ 7
@
12 years ago
another problem with this line of code is that it does not show properly in the feed.
item rdf:about="http://localhost/wp/?p=1"> <title>Hello world!</title> <link>http://localhost/wp/?p=1</link> <dc:date>2012-09-10T20:58:44Z</dc:date> <dc:creator>admin</dc:creator> <dc:subject><![CDATA[Uncategorized]]></dc:subject> <description>[photos by</description> <content:encoded><![CDATA[<p>[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&ref=sr_gallery_7&&ga_search_query=keep+calm+and+carry+on&ga_search_type=handmade&ga_page=13&includes[]=tags&includes[]=title”>KeepCalmPosters</a> ]</p> ]]></content:encoded> </item>
#7
in reply to:
↑ 6
@
12 years ago
- Cc kurtpayne added
New patch ... almost identical to 12690.2.diff except .*+
has been simplified to .+
.
Replying to MikeHansenMe:
another problem with this line of code is that it does not show properly in the feed.
<content:encoded><![CDATA[<p>[photos by <a href="http://www.etsy.com/view_listing.php?listing_id=42936748&ref=sr_gallery_7&&ga_search_query=keep+calm+and+carry+on&ga_search_type=handmade&ga_page=13&includes[]=tags&includes[]=title”>KeepCalmPosters</a> ]</p> ]]></content:encoded>
The patch fixes the missing closing quote on the feed, too.
#9
@
12 years ago
- Keywords dev-feedback added; needs-unit-tests removed
Unit test attached. Still needs dev feedback since it's touching wptexturize
. The patch doesn't break existing tests.
#11
@
11 years ago
This looks like a real-world example of #18575. Nice to have.
I'm OK with this, at a glance. Will get someone else to look at it.
#12
@
11 years ago
- Keywords wptexturize needs-patch added; has-patch needs-testing dev-feedback removed
Some good ideas here, but these patches don't work as advertised.
\[.*+\]
simply breaks shortcode detection.
\[.+\]
prevents matches on []
but still matches [ ... < ... []
and [] ... > ... ]
#13
@
11 years ago
I'm looking if there's a way to make variable length expressions atomic so that the engine won't backtrack inside an element. But given the complexity of that sort of thing, I'm tempted to just rewrite the entire expression instead, using conditions instead of alternation.
#14
@
11 years ago
Here's a simple strategy:
/(<.*>|\[[^<]*\])/
I'll put that in a new patch and then do some benchmarks against a conditional strategy. My intuition tells me the alternation pattern is not optimal.
#15
follow-up:
↓ 16
@
11 years ago
- Keywords has-patch added; needs-patch removed
In miqro-12690.patch:
- Assume shortcodes never contain nested shortcodes or HTML elements.
- Do not match escaped shortcodes like
[[gallery]]
. - Don't waste time checking empty( $curl ) inside the loop.
- Also fixes #27602. No reason to patch this twice!
- Includes unit test by kurtpayne.
- Added several more unit tests.
#16
in reply to:
↑ 15
;
follow-ups:
↓ 18
↓ 27
@
10 years ago
Replying to miqrogroove:
In miqro-12690.patch:
It seems like we should not texturize things inside escaped shortcodes. The main use case for escaped shortcodes is to talk about shortcodes. You don't want people trying to copy and paste your examples and getting stuck with curly quotes.
So [[gallery ...]]
should be preserved and [[gallery ...]...[gallery ...]]
should either be preserved or be transformed to [[gallery ...]…[gallery ...]]
, I'm not sure.
Aside: there's a parse error in the test file :)
#17
follow-up:
↓ 21
@
10 years ago
Then we would have to decide what to do with [[caption]]Hello World[[/caption]]
#18
in reply to:
↑ 16
;
follow-up:
↓ 22
@
10 years ago
Replying to mdawaffe:
[[gallery ...]...[gallery ...]]
should either be preserved or be transformed to[[gallery ...]…[gallery ...]]
, I'm not sure.
Sorry, but that needs to be a separate ticket. The shortcode system doesn't support that syntax at all.
#19
follow-up:
↓ 23
@
10 years ago
In miqro-12690.3.patch:
- Assume shortcodes never contain nested shortcodes or HTML elements.
- Allow escaped shortcodes like
[[gallery]]
and do not texturize them either. - Do texturize between escaped shortcodes like
[[code]]Hello World[[/code]]
- Don't waste time checking empty( $curl ) inside the loop.
- Also fixes #8912.
- Also fixes #27602.
- Includes unit test by kurtpayne.
- Added several more unit tests.
This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.
10 years ago
#21
in reply to:
↑ 17
@
10 years ago
Replying to miqrogroove:
Then we would have to decide what to do with
[[caption]]Hello World[[/caption]]
fwiw, this gives weird output in 3.8 and 3.9:
[caption]]Hello World[[/caption]
I'll take a closer look at shortcodes.php to figure out why it does that.
#22
in reply to:
↑ 18
@
10 years ago
Replying to miqrogroove:
Replying to mdawaffe:
[[gallery ...]...[gallery ...]]
should either be preserved or be transformed to[[gallery ...]…[gallery ...]]
, I'm not sure.
Sorry, but that needs to be a separate ticket. The shortcode system doesn't support that syntax at all.
Sorry - I meant [[code]foo[/code]]
.
#23
in reply to:
↑ 19
@
10 years ago
Replying to miqrogroove:
In miqro-12690.3.patch:
- Do texturize between escaped shortcodes like
[[code]]Hello World[[/code]]
I haven't looked at the patch to see if it matters, but the correct way to escape [code]foo[/code]
is [[code]foo[/code]]
not [[code]]foo[[/code]]
.
#24
@
10 years ago
Yes, there is an unfortunate bug in shortcodes.php. If we want to write something like this ...
The proper way to end the caption is with the [[/caption]] tag.
... the "escaped" tag is ignored by the shortcode engine and the output is not unescaped as expected. As a result, the input is required to be unescaped by the user ...
The proper way to end the caption is with the [/caption] tag.
... to get the correct output, whereas the syntax changes again with tag pairs ...
The caption code is [[caption] like this [/caption]].
I think I can accommodate this by adjusting the unit tests and tweaking one or two conditions. The regexp should still work because it is not checking the symmetry of extra braces.
#25
@
10 years ago
Also consider:
The caption code is [caption] like this [/caption]].
The second tag is not considered escaped by the shortcode system. :(
There are going to be several highly ambiguous test cases for texturize. Can't be avoided.
#26
@
10 years ago
In miqro-12690.4.patch:
- Assume shortcodes never contain nested shortcodes or HTML elements.
- Allow escaped shortcodes like
[[tag]] or [[tag] or [tag]]
and do not texturize them either. - Do texturize between escaped shortcodes like
[[code]Hello World[/code]]
- Don't waste time checking empty( $curl ) inside the loop.
- Also fixes #8912.
- Also fixes #27602.
- Includes unit test by kurtpayne.
- Added several more unit tests.
#27
in reply to:
↑ 16
;
follow-up:
↓ 28
@
10 years ago
Replying to mdawaffe:
It seems like we should not texturize things inside escaped shortcodes.
I just realized that my argument wasn't very good. People talking about shortcodes will probably also enclose examples in <code>
blocks, which we already don't texturize. In conclusion: I'm not sure what the expected behavior is.
#28
in reply to:
↑ 27
@
10 years ago
Replying to mdawaffe:
Replying to mdawaffe:
It seems like we should not texturize things inside escaped shortcodes.
I just realized that my argument wasn't very good. People talking about shortcodes will probably also enclose examples in
<code>
blocks, which we already don't texturize. In conclusion: I'm not sure what the expected behavior is.
Works for me because:
- The old texturize treated escaped and un-escaped shortcodes the same.
- shortcodes.php parses escaped and un-escaped shortcodes the same way.
- miqro-12690.2.patch would have altered the shortcode input by texturizing escaped codes.
- Now we can make sure escaped codes get treated just like HTML comments.
Patch updated again. The previous was a bit rushed.
#29
@
10 years ago
- Owner set to wonderboymusic
- Resolution set to fixed
- Status changed from reviewing to closed
In 28727:
#31
follow-ups:
↓ 32
↓ 36
@
10 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
r28727 has too much backtracking which can cause segfaults in libpcre.
$text = ''; for ( $i = 0; $i <= 300; $i++ ) $text .= 'lorem ipsum dolor sit amet '; $text = '[' . trim( $text ) . ']'; echo wptexturize( $text ); // /var/log/syslog: Jul 23 13:40:44 debian kernel: [56656.209599] php5-fpm[6990]: segfault at 7fff4db6cf30 ip 00007f9a269d17ed sp 00007fff4db6cef0 error 6 in libpcre.so.3.13.1[7f9a269be000+3c000]
A simple workaround would be to not backtrack to try and match an empty []
shortcode, see 12690.diff. Here's a little perl script if you'd like to compare the amount of backtracking on both variants.
That said, we should probably be looking into using the data in $shortcode_tags in that regex, or even get_shortcode_regex() entirely, that would limit the search to the actual shortcodes that are available, and it will probably break so many unit tests which rely on non-existent shortcodes such as [code]
or [...]
.
I'm running PHP 5.6.0beta3 on Debian, PCRE version 8.30 2012-02-04 with the default backtrack_limit of 1M. If you're having trouble reproducing, try increasing the number of loops.
#32
in reply to:
↑ 31
@
10 years ago
Replying to kovshenin:
r28727 has too much backtracking which can cause segfaults in libpcre.
To add to the discussion on segfaults in libpcre: https://lists.exim.org/lurker/message/20100430.100815.4f7be374.en.html (especially point number 4)
Basically it's a stack overflow due to too much recursion.
Setting ulimit -s 65535
will increase the stack size on the machine. But still, a large enough string will eventually overflow. Setting a good backtrack_limit
to a reasonable value would prevent PHP from faulting at least, although it's probably difficult to find a good limit since it's not the count that matters but rather the amount of memory on the stack.
Another option is to compile libpcre from source and configure it with the --disable-stack-for-recursion
flag. This will disable stack-based recursion and enable heap-based recursion.
#33
follow-up:
↓ 34
@
10 years ago
http://www.regular-expressions.info/possessive.html
Reading...
kovshenin, does your patch solve the problem? Or should I continue looking for solutions? I'd like to know your results as I follow up to this.
#34
in reply to:
↑ 33
@
10 years ago
Replying to miqrogroove: Yeah, the patch uses a greedy ++ and according to Perl it's way more efficient. In the examples I tried it stopped segfaulting. I'll be able to tell more after I run this patch on WordPress.com. I'd also like to hear some opinion regarding reusing all or part of get_shortcode_regex().
#35
@
10 years ago
As far as I understand that article so far, it means the existing three greedy quantifiers should also be made possessive because the proceeding chars are mutually exclusive and there should be no chance of finding matches through backtracking. That needs to be tested though. In other words...
/(<(?(?=!--).+?--\s*>|[^>]+>)|\[\[?(?:[^\[\]<>]|<[^>]+>)+\]\]?)/s
becomes
/(<(?(?=!--).+?--\s*>|[^>]++>)|\[\[?(?:[^\[\]<>]++|<[^>]++>)++\]\]?)/s
#36
in reply to:
↑ 31
@
10 years ago
Replying to kovshenin:
That said, we should probably be looking into using the data in $shortcode_tags in that regex, or even get_shortcode_regex() entirely, that would limit the search to the actual shortcodes that are available, and it will probably break so many unit tests
Something to consider for a later version. However, I assumed the longer form of alternation would be massively slower for the inner loop, redundant, and it would significantly alter the behavior of wptexturize. Those sorts of things tend to get bogged down with review and lack of traction. Besides, I'm not sure there is a need to make that change just for the sake of our curly quotes.
#37
@
10 years ago
It looks like a single possessive quantifier for the shortcode group is sufficient to stop any backtracking after anything has been matched in that group. [^\[\]<>]
doesn't really have to be greedy because it gets its greediness from the +
outside of the group. See 12690.3.diff.
We've been running this on WordPress.com for a few hours now and everything looks good so far.
#39
@
10 years ago
I am in favor of 12690.3.diff.
Performance results:
Adding the one possessive quantifier outside the shortcode subgroup gives about 2% faster CPU time for handling that one regexp only.
When also adding possessives in other places, there is no additional benefit. Placing a new possessive inside the subgroup could create a "nested indefinite repeat" which would take more CPU time as explained in the PHP manual. Also, patterns such as [^>]+>
might be optimized internally, which would explain why there is no difference from [^>]++>
.
12690.3.diff appears to be optimal.
#40
@
10 years ago
- Resolution set to fixed
- Status changed from reopened to closed
[29431] missed the ticket.
#41
@
10 years ago
kovshenin and soulseekah, any insights you may have on #29450 would be appreciated. I've made two substantial attempts to further reduce backtracking in that same regexp. If the latest patch works I'll be happy. If not, I'm wondering why everyone is getting different results with the same test cases?
problem was in dividing link into smaller parts by preg_split, and function was transforming single quotes into html format
with these patch function 'wptexturize' don't take in account sqare empty parentheses, and link with empty parantheses will not be divided into smaller parts => no unwanted quotes transformation