WordPress.org

Make WordPress Core

Opened 4 months ago

Last modified 5 weeks ago

#50683 reviewing enhancement

Parse content for shortcodes instead of using regex

Reported by: cfinke Owned by: SergeyBiryukov
Milestone: 5.7 Priority: normal
Severity: normal Version:
Component: Shortcodes Keywords: has-patch has-unit-tests early needs-testing
Focuses: Cc:

Description

Shortcodes are currently "parsed" out of content using a regular expression and a call to preg_replace_callback(). This causes some issues like the inability to use a square bracket in a shortcode attribute.

I've written and attached to this ticket a lexer/parser that steps through the content character by character, finding shortcodes and calling do_shortcode_tag() for each one as soon as it's completely parsed. It passes every existing shortcode unit test, as well as six additional tests I've added. It fixes tickets #49955 and #43725 and may possibly fix others that deal with shortcode edge cases.

This method has the advantage of being able to deal with content that wouldn't be properly extracted by a regular expression. Take this string for example:

[shortcode1][shortcode2 att="[/shortcode1]" /][/shortcode1]

The current implementation would parse this as a single shortcode:

  • [shortcode1] with inner content [shortcode2 att="

The parse I've attached properly recognizes it as two shortcodes:

  • [shortcode1] with inner content [shortcode2 att="[/shortcode1]" /]
  • [shortcode2] with attribute att="[/shortcode1]"

Attachments (9)

shortcode-parser.diff (20.3 KB) - added by cfinke 4 months ago.
shortcode-parser.1.diff (21.3 KB) - added by cfinke 4 months ago.
Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.
shortcode-parser.2.diff (21.5 KB) - added by cfinke 4 months ago.
Fixes a bug when parsing non-shortcode content that contains backslashes.
shortcode-parser.3.diff (21.4 KB) - added by cfinke 4 months ago.
Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.
shortcode-parser.4.diff (22.5 KB) - added by cfinke 4 months ago.
Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.
shortcode-parser.5.diff (22.7 KB) - added by cfinke 4 months ago.
Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.
shortcode-parser.6.diff (22.1 KB) - added by cfinke 4 months ago.
Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.
shortcode-parser.7.diff (39.4 KB) - added by cfinke 8 weeks ago.
No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.
shortcode-parser.8.diff (40.1 KB) - added by cfinke 5 weeks ago.
Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.

Download all attachments as: .zip

Change History (28)

#1 @SergeyBiryukov
4 months ago

  • Keywords has-patch has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

#2 @cfinke
4 months ago

"The parse I've attached" should read "The patch I've attached". :/

#3 @johnbillion
4 months ago

  • Milestone changed from Future Release to 5.6
  • Version trunk deleted

Thanks for working on this @cfinke.

Are you able to provide some performance comparisons of the current implementation versus your parser please? That way we can assess whether there is any change in performance when parsing shortcodes from post content.

For example it would be great to see before-and-after figures for how long it takes to process:

  • A long post containing just one shortcode
  • A long post with zero shortcodes
  • A long post containing many shortcodes
  • A short post consisting of mainly nested shortcodes
  • etc
Last edited 4 months ago by johnbillion (previous) (diff)

#4 @cfinke
4 months ago

@johnbillion I will work on getting a suite of performance tests set up. I had been watching the total time to run the shortcode unit test suite, but it seems to be pretty variable both before and after the patch (at least on my machine).

@cfinke
4 months ago

Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.

@cfinke
4 months ago

Fixes a bug when parsing non-shortcode content that contains backslashes.

#5 @johnbillion
4 months ago

Thinking out loud: the parse() method could return early when there's no [ character in the content. Something like:

if ( false === strpos( $this->content, '[' ) ) {
    return $this->content;
}

#6 @johnbillion
4 months ago

Ah! No need. do_shortcode() handles this already.

@cfinke
4 months ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.

#7 @cfinke
4 months ago

I've done some performance testing. I created 15 sample shortcodes in different formats: a simple self-closing shortcode, a set of 10-deep nested shortcodes, a shortcode with lots of attributes, etc. Then, I created three sets of random content (short, medium, and long), composed of random selections of the ASCII keyboard characters except opening and closing brackets and then added each sample shortcode to each set of content in four different places: the beginning, middle, end, and at both the beginning and end. This process creates 195 variations of content.

Then, I tested how long it takes both the current and new do_shortcode() functions to run on all of these content strings.

The total time for the current do_shortcode() function was 0.0052 seconds.

The total time for the updated do_shortcode() that calls the parser I wrote was 0.0879 seconds.

The new code is ~17x slower than the current code on this specific dataset. It looks like the tests that had the worst performance were ones where a shortcode had sizeable inner content (~4KB), because the new parser doesn't skip ahead at all when looking at inner content.

I'll take a look at performance improvements and testing the individual variations between the old and new code.

Note: I'm not sure how this is done normally, so I kind of hacked it by putting the speed-testing code in a testcase so I could run phpunit --group shortcode --filter test_speed from the command line.

@cfinke
4 months ago

Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.

@cfinke
4 months ago

Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.

@cfinke
4 months ago

Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.

#8 @SergeyBiryukov
4 months ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary.

Just noting this should ideally be tested with mbstring.func_overload = 2 PHP setting, as it is known to have caused some issues in the past: #16057, #18007, #25061, #25259, #28162, #38226, #46050, etc.

It is deprecated as of PHP 7.2, but may still occur in the wild.

#9 @cfinke
4 months ago

Good call; I'll do some testing with that setting as well.

I also realized that by commenting out the calls to $this->debug(), the benchmark speed is improved to about 0.013 seconds. This puts the speed of the new parsing code at about 40% of the current implementation rather than 6%, which it was when I first tested them against each other.

#10 @cfinke
4 months ago

The new parsing code passes all of the unit tests when mbstring.func_overload=2 (apart from warnings about some tests not being run because getID3() can't be run with that setting).

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


8 weeks ago

#12 @johnbillion
8 weeks ago

@cfinke This looks like a great change but I don't want this ticket to lose momentum. Do you have time in the next few days to:

  • Provide a sample file (or tests) that you were using for your benchmarking? That way others can run performance tests if necessary
  • Remove the debug code from the class to clean it up a little
  • Check the inline docs for coding standards

Let us know, and if not then I think we'll punt this to 5.7 so it gets some thorough testing. Thanks!

#13 @SergeyBiryukov
8 weeks ago

  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#14 @cfinke
8 weeks ago

Thanks for the nudge @johnbillion. I'll upload a new patch now. I've removed the debug code from the class, double-checked the inline docs (although I didn't add @since, since I'm not sure what it should be), and included the test files I used for my speed benchmarking in the patch.

To just run the speed test, run phpunit --group shortcode --filter test_speed. It's not actually a test case, but it was the easiest way I knew to run the speed tests from the command line. To compare with the current implementation, you'd need to run that command on both the patched and unpatched WordPress code and compare the times that get output.

@cfinke
8 weeks ago

No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.

This ticket was mentioned in Slack in #core by helen. View the logs.


5 weeks ago

#16 @mukesh27
5 weeks ago

Hi there!

I have review shortcode-parser.7.diff and found that there are missing @since 5.6.0 in Class, functions and variables.

#17 @helen
5 weeks ago

  • Keywords early needs-testing added
  • Milestone changed from 5.6 to 5.7

I'm very sorry we didn't get to this in a timely manner - marking with the early keyword and milestoning for 5.7 so hopefully we don't drop the ball again. There's some specific testing that needs to be done I think (not on you @cfinke).

#18 @SergeyBiryukov
5 weeks ago

Just noting that this looks great, but needs some minor cleanup to adjust documentation per the standards, move the test examples to the tests/phpunit/data/ directory, remove debugging, etc.

Additionally, the Shortcode_Parser class should probably be WP_Shortcode_Parser to avoid conflicts, and should be located in its own file, wp-includes/class-wp-shortcode-parser.php, per the coding standards.

Last edited 5 weeks ago by SergeyBiryukov (previous) (diff)

@cfinke
5 weeks ago

Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.

#19 @cfinke
5 weeks ago

Since the patch eliminates the use of get_shortcode_regex() in do_shortcode(), other places in the code that call get_shortcode_regex() could potentially be updated to use WP_Shortcode_Parser as well:

  • has_shortcode()
  • strip_shortcodes()
  • do_shortcodes_in_html_tags()
  • get_post_galleries()
  • wp_ajax_parse_embed()

This wouldn't have to happen simultaneously with switching do_shortcode() to use WP_Shortcode_Parser, but it might not be a bad idea so that everything that parses shortcodes does it consistently.

Note: See TracTickets for help on using tickets.