WordPress.org

Make WordPress Core

Opened 3 weeks ago

Last modified 3 weeks ago

#50683 new enhancement

Parse content for shortcodes instead of using regex

Reported by: cfinke Owned by:
Milestone: 5.6 Priority: normal
Severity: normal Version:
Component: Shortcodes Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Shortcodes are currently "parsed" out of content using a regular expression and a call to preg_replace_callback(). This causes some issues like the inability to use a square bracket in a shortcode attribute.

I've written and attached to this ticket a lexer/parser that steps through the content character by character, finding shortcodes and calling do_shortcode_tag() for each one as soon as it's completely parsed. It passes every existing shortcode unit test, as well as six additional tests I've added. It fixes tickets #49955 and #43725 and may possibly fix others that deal with shortcode edge cases.

This method has the advantage of being able to deal with content that wouldn't be properly extracted by a regular expression. Take this string for example:

[shortcode1][shortcode2 att="[/shortcode1]" /][/shortcode1]

The current implementation would parse this as a single shortcode:

  • [shortcode1] with inner content [shortcode2 att="

The parse I've attached properly recognizes it as two shortcodes:

  • [shortcode1] with inner content [shortcode2 att="[/shortcode1]" /]
  • [shortcode2] with attribute att="[/shortcode1]"

Attachments (7)

shortcode-parser.diff (20.3 KB) - added by cfinke 3 weeks ago.
shortcode-parser.1.diff (21.3 KB) - added by cfinke 3 weeks ago.
Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.
shortcode-parser.2.diff (21.5 KB) - added by cfinke 3 weeks ago.
Fixes a bug when parsing non-shortcode content that contains backslashes.
shortcode-parser.3.diff (21.4 KB) - added by cfinke 3 weeks ago.
Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.
shortcode-parser.4.diff (22.5 KB) - added by cfinke 3 weeks ago.
Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.
shortcode-parser.5.diff (22.7 KB) - added by cfinke 3 weeks ago.
Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.
shortcode-parser.6.diff (22.1 KB) - added by cfinke 3 weeks ago.
Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.

Download all attachments as: .zip

Change History (17)

#1 @SergeyBiryukov
3 weeks ago

  • Keywords has-patch has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

#2 @cfinke
3 weeks ago

"The parse I've attached" should read "The patch I've attached". :/

#3 @johnbillion
3 weeks ago

  • Milestone changed from Future Release to 5.6
  • Version trunk deleted

Thanks for working on this @cfinke.

Are you able to provide some performance comparisons of the current implementation versus your parser please? That way we can assess whether there is any change in performance when parsing shortcodes from post content.

For example it would be great to see before-and-after figures for how long it takes to process:

  • A long post containing just one shortcode
  • A long post with zero shortcodes
  • A long post containing many shortcodes
  • A short post consisting of mainly nested shortcodes
  • etc
Last edited 3 weeks ago by johnbillion (previous) (diff)

#4 @cfinke
3 weeks ago

@johnbillion I will work on getting a suite of performance tests set up. I had been watching the total time to run the shortcode unit test suite, but it seems to be pretty variable both before and after the patch (at least on my machine).

@cfinke
3 weeks ago

Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.

@cfinke
3 weeks ago

Fixes a bug when parsing non-shortcode content that contains backslashes.

#5 @johnbillion
3 weeks ago

Thinking out loud: the parse() method could return early when there's no [ character in the content. Something like:

if ( false === strpos( $this->content, '[' ) ) {
    return $this->content;
}

#6 @johnbillion
3 weeks ago

Ah! No need. do_shortcode() handles this already.

@cfinke
3 weeks ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.

#7 @cfinke
3 weeks ago

I've done some performance testing. I created 15 sample shortcodes in different formats: a simple self-closing shortcode, a set of 10-deep nested shortcodes, a shortcode with lots of attributes, etc. Then, I created three sets of random content (short, medium, and long), composed of random selections of the ASCII keyboard characters except opening and closing brackets and then added each sample shortcode to each set of content in four different places: the beginning, middle, end, and at both the beginning and end. This process creates 195 variations of content.

Then, I tested how long it takes both the current and new do_shortcode() functions to run on all of these content strings.

The total time for the current do_shortcode() function was 0.0052 seconds.

The total time for the updated do_shortcode() that calls the parser I wrote was 0.0879 seconds.

The new code is ~17x slower than the current code on this specific dataset. It looks like the tests that had the worst performance were ones where a shortcode had sizeable inner content (~4KB), because the new parser doesn't skip ahead at all when looking at inner content.

I'll take a look at performance improvements and testing the individual variations between the old and new code.

Note: I'm not sure how this is done normally, so I kind of hacked it by putting the speed-testing code in a testcase so I could run phpunit --group shortcode --filter test_speed from the command line.

@cfinke
3 weeks ago

Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.

@cfinke
3 weeks ago

Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.

@cfinke
3 weeks ago

Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.

#8 @SergeyBiryukov
3 weeks ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary.

Just noting this should ideally be tested with mbstring.func_overload = 2 PHP setting, as it is known to have caused some issues in the past: #16057, #18007, #25061, #25259, #28162, #38226, #46050, etc.

It is deprecated as of PHP 7.2, but may still occur in the wild.

#9 @cfinke
3 weeks ago

Good call; I'll do some testing with that setting as well.

I also realized that by commenting out the calls to $this->debug(), the benchmark speed is improved to about 0.013 seconds. This puts the speed of the new parsing code at about 40% of the current implementation rather than 6%, which it was when I first tested them against each other.

#10 @cfinke
3 weeks ago

The new parsing code passes all of the unit tests when mbstring.func_overload=2 (apart from warnings about some tests not being run because getID3() can't be run with that setting).

Note: See TracTickets for help on using tickets.