WordPress.org

Make WordPress Core

Opened 12 months ago

Last modified 2 hours ago

#50683 reviewing enhancement

Parse content for shortcodes instead of using regex

Reported by: cfinke Owned by: johnbillion
Milestone: 5.9 Priority: normal
Severity: normal Version:
Component: Shortcodes Keywords: has-patch has-unit-tests early needs-testing early-like-actually-early needs-testing-info
Focuses: Cc:

Description

Shortcodes are currently "parsed" out of content using a regular expression and a call to preg_replace_callback(). This causes some issues like the inability to use a square bracket in a shortcode attribute.

I've written and attached to this ticket a lexer/parser that steps through the content character by character, finding shortcodes and calling do_shortcode_tag() for each one as soon as it's completely parsed. It passes every existing shortcode unit test, as well as six additional tests I've added. It fixes tickets #49955 and #43725 and may possibly fix others that deal with shortcode edge cases.

This method has the advantage of being able to deal with content that wouldn't be properly extracted by a regular expression. Take this string for example:

[shortcode1][shortcode2 att="[/shortcode1]" /][/shortcode1]

The current implementation would parse this as a single shortcode:

  • [shortcode1] with inner content [shortcode2 att="

The parse I've attached properly recognizes it as two shortcodes:

  • [shortcode1] with inner content [shortcode2 att="[/shortcode1]" /]
  • [shortcode2] with attribute att="[/shortcode1]"

Attachments (10)

shortcode-parser.diff (20.3 KB) - added by cfinke 12 months ago.
shortcode-parser.1.diff (21.3 KB) - added by cfinke 12 months ago.
Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.
shortcode-parser.2.diff (21.5 KB) - added by cfinke 12 months ago.
Fixes a bug when parsing non-shortcode content that contains backslashes.
shortcode-parser.3.diff (21.4 KB) - added by cfinke 12 months ago.
Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.
shortcode-parser.4.diff (22.5 KB) - added by cfinke 12 months ago.
Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.
shortcode-parser.5.diff (22.7 KB) - added by cfinke 12 months ago.
Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.
shortcode-parser.6.diff (22.1 KB) - added by cfinke 12 months ago.
Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.
shortcode-parser.7.diff (39.4 KB) - added by cfinke 10 months ago.
No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.
shortcode-parser.8.diff (40.1 KB) - added by cfinke 9 months ago.
Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.
50683.diff (40.1 KB) - added by johnbillion 2 hours ago.

Download all attachments as: .zip

Change History (41)

#1 @SergeyBiryukov
12 months ago

  • Keywords has-patch has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

#2 @cfinke
12 months ago

"The parse I've attached" should read "The patch I've attached". :/

#3 @johnbillion
12 months ago

  • Milestone changed from Future Release to 5.6
  • Version trunk deleted

Thanks for working on this @cfinke.

Are you able to provide some performance comparisons of the current implementation versus your parser please? That way we can assess whether there is any change in performance when parsing shortcodes from post content.

For example it would be great to see before-and-after figures for how long it takes to process:

  • A long post containing just one shortcode
  • A long post with zero shortcodes
  • A long post containing many shortcodes
  • A short post consisting of mainly nested shortcodes
  • etc
Last edited 12 months ago by johnbillion (previous) (diff)

#4 @cfinke
12 months ago

@johnbillion I will work on getting a suite of performance tests set up. I had been watching the total time to run the shortcode unit test suite, but it seems to be pretty variable both before and after the patch (at least on my machine).

@cfinke
12 months ago

Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.

@cfinke
12 months ago

Fixes a bug when parsing non-shortcode content that contains backslashes.

#5 @johnbillion
12 months ago

Thinking out loud: the parse() method could return early when there's no [ character in the content. Something like:

if ( false === strpos( $this->content, '[' ) ) {
    return $this->content;
}

#6 @johnbillion
12 months ago

Ah! No need. do_shortcode() handles this already.

@cfinke
12 months ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.

#7 @cfinke
12 months ago

I've done some performance testing. I created 15 sample shortcodes in different formats: a simple self-closing shortcode, a set of 10-deep nested shortcodes, a shortcode with lots of attributes, etc. Then, I created three sets of random content (short, medium, and long), composed of random selections of the ASCII keyboard characters except opening and closing brackets and then added each sample shortcode to each set of content in four different places: the beginning, middle, end, and at both the beginning and end. This process creates 195 variations of content.

Then, I tested how long it takes both the current and new do_shortcode() functions to run on all of these content strings.

The total time for the current do_shortcode() function was 0.0052 seconds.

The total time for the updated do_shortcode() that calls the parser I wrote was 0.0879 seconds.

The new code is ~17x slower than the current code on this specific dataset. It looks like the tests that had the worst performance were ones where a shortcode had sizeable inner content (~4KB), because the new parser doesn't skip ahead at all when looking at inner content.

I'll take a look at performance improvements and testing the individual variations between the old and new code.

Note: I'm not sure how this is done normally, so I kind of hacked it by putting the speed-testing code in a testcase so I could run phpunit --group shortcode --filter test_speed from the command line.

@cfinke
12 months ago

Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.

@cfinke
12 months ago

Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.

@cfinke
12 months ago

Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.

#8 @SergeyBiryukov
12 months ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary.

Just noting this should ideally be tested with mbstring.func_overload = 2 PHP setting, as it is known to have caused some issues in the past: #16057, #18007, #25061, #25259, #28162, #38226, #46050, etc.

It is deprecated as of PHP 7.2, but may still occur in the wild.

#9 @cfinke
12 months ago

Good call; I'll do some testing with that setting as well.

I also realized that by commenting out the calls to $this->debug(), the benchmark speed is improved to about 0.013 seconds. This puts the speed of the new parsing code at about 40% of the current implementation rather than 6%, which it was when I first tested them against each other.

#10 @cfinke
12 months ago

The new parsing code passes all of the unit tests when mbstring.func_overload=2 (apart from warnings about some tests not being run because getID3() can't be run with that setting).

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


10 months ago

#12 @johnbillion
10 months ago

@cfinke This looks like a great change but I don't want this ticket to lose momentum. Do you have time in the next few days to:

  • Provide a sample file (or tests) that you were using for your benchmarking? That way others can run performance tests if necessary
  • Remove the debug code from the class to clean it up a little
  • Check the inline docs for coding standards

Let us know, and if not then I think we'll punt this to 5.7 so it gets some thorough testing. Thanks!

#13 @SergeyBiryukov
10 months ago

  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#14 @cfinke
10 months ago

Thanks for the nudge @johnbillion. I'll upload a new patch now. I've removed the debug code from the class, double-checked the inline docs (although I didn't add @since, since I'm not sure what it should be), and included the test files I used for my speed benchmarking in the patch.

To just run the speed test, run phpunit --group shortcode --filter test_speed. It's not actually a test case, but it was the easiest way I knew to run the speed tests from the command line. To compare with the current implementation, you'd need to run that command on both the patched and unpatched WordPress code and compare the times that get output.

@cfinke
10 months ago

No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.

This ticket was mentioned in Slack in #core by helen. View the logs.


9 months ago

#16 @mukesh27
9 months ago

Hi there!

I have review shortcode-parser.7.diff and found that there are missing @since 5.6.0 in Class, functions and variables.

#17 @helen
9 months ago

  • Keywords early needs-testing added
  • Milestone changed from 5.6 to 5.7

I'm very sorry we didn't get to this in a timely manner - marking with the early keyword and milestoning for 5.7 so hopefully we don't drop the ball again. There's some specific testing that needs to be done I think (not on you @cfinke).

#18 @SergeyBiryukov
9 months ago

Just noting that this looks great, but needs some minor cleanup to adjust documentation per the standards, move the test examples to the tests/phpunit/data/ directory, remove debugging, etc.

Additionally, the Shortcode_Parser class should probably be WP_Shortcode_Parser to avoid conflicts, and should be located in its own file, wp-includes/class-wp-shortcode-parser.php, per the coding standards.

Last edited 9 months ago by SergeyBiryukov (previous) (diff)

@cfinke
9 months ago

Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.

#19 @cfinke
9 months ago

Since the patch eliminates the use of get_shortcode_regex() in do_shortcode(), other places in the code that call get_shortcode_regex() could potentially be updated to use WP_Shortcode_Parser as well:

  • has_shortcode()
  • strip_shortcodes()
  • do_shortcodes_in_html_tags()
  • get_post_galleries()
  • wp_ajax_parse_embed()

This wouldn't have to happen simultaneously with switching do_shortcode() to use WP_Shortcode_Parser, but it might not be a bad idea so that everything that parses shortcodes does it consistently.

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


7 months ago

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


7 months ago

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


6 months ago

#23 @hellofromTonya
6 months ago

  • Milestone changed from 5.7 to Future Release

With 5.7 Beta 1 tomorrow and no activity for several months, punting this ticket to Future Release. Let's see if we can get this one early in 5.8.

If any maintainer or committer feels this can be resolved in time, or wishes to assume ownership during a specific cycle, feel free to update the milestone accordingly.

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


6 months ago

#25 @johnbillion
6 months ago

  • Keywords early-like-actually-early added
  • Milestone changed from Future Release to 5.8

This ticket was mentioned in Slack in #core-test by hellofromtonya. View the logs.


2 months ago

#27 @Boniu91
2 months ago

  • Keywords needs-testing-info added

@cfinke Hello!

Please provide more information to help testers manually test the patch (including at the Test Scrubs):
– What are the steps to test?
– Are there any testing dependencies, such as a plugin or script?
– What is the expected behavior after applying the patch?

#28 @SergeyBiryukov
2 months ago

  • Milestone changed from 5.8 to 5.9

#29 @cfinke
7 weeks ago

@Boniu91 There are extensive unit tests for this patch that can be run via phpunit --group shortcode.

For manual testing, anything involving shortcodes would be suitable. Apart from the fixes mentioned in the comments above, all shortcode-related functionality after this patch has been applied should behave the same as it did before.

This ticket was mentioned in Slack in #core-test by hellofromtonya. View the logs.


3 hours ago

@johnbillion
2 hours ago

#31 @johnbillion
2 hours ago

  • Owner changed from SergeyBiryukov to johnbillion

50683.diff is a refresh with PHPCS fixes. I'll be reviewing this over the next few days ready for an early 5.9.

Note: See TracTickets for help on using tickets.