Make WordPress Core

Opened 4 years ago

Last modified 6 months ago

#50683 reviewing enhancement

Parse content for shortcodes instead of using regex

Reported by: cfinke's profile cfinke Owned by: johnbillion's profile johnbillion
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Shortcodes Keywords: has-patch has-unit-tests early needs-testing early-like-actually-early needs-testing-info
Focuses: Cc:

Description

Shortcodes are currently "parsed" out of content using a regular expression and a call to preg_replace_callback(). This causes some issues like the inability to use a square bracket in a shortcode attribute.

I've written and attached to this ticket a lexer/parser that steps through the content character by character, finding shortcodes and calling do_shortcode_tag() for each one as soon as it's completely parsed. It passes every existing shortcode unit test, as well as six additional tests I've added. It fixes tickets #49955 and #43725 and may possibly fix others that deal with shortcode edge cases.

This method has the advantage of being able to deal with content that wouldn't be properly extracted by a regular expression. Take this string for example:

[shortcode1][shortcode2 att="[/shortcode1]" /][/shortcode1]

The current implementation would parse this as a single shortcode:

  • [shortcode1] with inner content [shortcode2 att="

The parse I've attached properly recognizes it as two shortcodes:

  • [shortcode1] with inner content [shortcode2 att="[/shortcode1]" /]
  • [shortcode2] with attribute att="[/shortcode1]"

Attachments (10)

shortcode-parser.diff (20.3 KB) - added by cfinke 4 years ago.
shortcode-parser.1.diff (21.3 KB) - added by cfinke 4 years ago.
Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.
shortcode-parser.2.diff (21.5 KB) - added by cfinke 4 years ago.
Fixes a bug when parsing non-shortcode content that contains backslashes.
shortcode-parser.3.diff (21.4 KB) - added by cfinke 4 years ago.
Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.
shortcode-parser.4.diff (22.5 KB) - added by cfinke 4 years ago.
Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.
shortcode-parser.5.diff (22.7 KB) - added by cfinke 4 years ago.
Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.
shortcode-parser.6.diff (22.1 KB) - added by cfinke 4 years ago.
Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.
shortcode-parser.7.diff (39.4 KB) - added by cfinke 4 years ago.
No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.
shortcode-parser.8.diff (40.1 KB) - added by cfinke 4 years ago.
Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.
50683.diff (40.1 KB) - added by johnbillion 3 years ago.

Download all attachments as: .zip

Change History (53)

#1 @SergeyBiryukov
4 years ago

  • Keywords has-patch has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

#2 @cfinke
4 years ago

"The parse I've attached" should read "The patch I've attached". :/

#3 @johnbillion
4 years ago

  • Milestone changed from Future Release to 5.6
  • Version trunk deleted

Thanks for working on this @cfinke.

Are you able to provide some performance comparisons of the current implementation versus your parser please? That way we can assess whether there is any change in performance when parsing shortcodes from post content.

For example it would be great to see before-and-after figures for how long it takes to process:

  • A long post containing just one shortcode
  • A long post with zero shortcodes
  • A long post containing many shortcodes
  • A short post consisting of mainly nested shortcodes
  • etc
Last edited 4 years ago by johnbillion (previous) (diff)

#4 @cfinke
4 years ago

@johnbillion I will work on getting a suite of performance tests set up. I had been watching the total time to run the shortcode unit test suite, but it seems to be pretty variable both before and after the patch (at least on my machine).

@cfinke
4 years ago

Updated diff that vastly improves performance on long text by skipping sections of text that don't have any opening brackets.

@cfinke
4 years ago

Fixes a bug when parsing non-shortcode content that contains backslashes.

#5 @johnbillion
4 years ago

Thinking out loud: the parse() method could return early when there's no [ character in the content. Something like:

if ( false === strpos( $this->content, '[' ) ) {
    return $this->content;
}

#6 @johnbillion
4 years ago

Ah! No need. do_shortcode() handles this already.

@cfinke
4 years ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary. This also speeds up the new parser code significantly.

#7 @cfinke
4 years ago

I've done some performance testing. I created 15 sample shortcodes in different formats: a simple self-closing shortcode, a set of 10-deep nested shortcodes, a shortcode with lots of attributes, etc. Then, I created three sets of random content (short, medium, and long), composed of random selections of the ASCII keyboard characters except opening and closing brackets and then added each sample shortcode to each set of content in four different places: the beginning, middle, end, and at both the beginning and end. This process creates 195 variations of content.

Then, I tested how long it takes both the current and new do_shortcode() functions to run on all of these content strings.

The total time for the current do_shortcode() function was 0.0052 seconds.

The total time for the updated do_shortcode() that calls the parser I wrote was 0.0879 seconds.

The new code is ~17x slower than the current code on this specific dataset. It looks like the tests that had the worst performance were ones where a shortcode had sizeable inner content (~4KB), because the new parser doesn't skip ahead at all when looking at inner content.

I'll take a look at performance improvements and testing the individual variations between the old and new code.

Note: I'm not sure how this is done normally, so I kind of hacked it by putting the speed-testing code in a testcase so I could run phpunit --group shortcode --filter test_speed from the command line.

@cfinke
4 years ago

Optimizes inner content parsing by skipping ahead to the next bracket inside the content. This improves this parser's performance on the above-mentioned speed test suite to about 0.05 seconds total.

@cfinke
4 years ago

Improves speed again by fast-forwarding to the end of the content if there are no more brackets after a shortcode is found. Benchmark time is about 0.027 seconds.

@cfinke
4 years ago

Remove code that handles the concept of escaping something with a backslash when not in a quoted string. Shortcodes themselves are escaped with double brackets, and outside of strings, there's no other escaping needed. Minor improvement to the speed testing benchmark -- about 0.024 seconds.

#8 @SergeyBiryukov
4 years ago

Switching to non-mb_* string functions. The rest of the shortcode code doesn't use them, so maybe they aren't necessary.

Just noting this should ideally be tested with mbstring.func_overload = 2 PHP setting, as it is known to have caused some issues in the past: #16057, #18007, #25061, #25259, #28162, #38226, #46050, etc.

It is deprecated as of PHP 7.2, but may still occur in the wild.

#9 @cfinke
4 years ago

Good call; I'll do some testing with that setting as well.

I also realized that by commenting out the calls to $this->debug(), the benchmark speed is improved to about 0.013 seconds. This puts the speed of the new parsing code at about 40% of the current implementation rather than 6%, which it was when I first tested them against each other.

#10 @cfinke
4 years ago

The new parsing code passes all of the unit tests when mbstring.func_overload=2 (apart from warnings about some tests not being run because getID3() can't be run with that setting).

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


4 years ago

#12 @johnbillion
4 years ago

@cfinke This looks like a great change but I don't want this ticket to lose momentum. Do you have time in the next few days to:

  • Provide a sample file (or tests) that you were using for your benchmarking? That way others can run performance tests if necessary
  • Remove the debug code from the class to clean it up a little
  • Check the inline docs for coding standards

Let us know, and if not then I think we'll punt this to 5.7 so it gets some thorough testing. Thanks!

#13 @SergeyBiryukov
4 years ago

  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#14 @cfinke
4 years ago

Thanks for the nudge @johnbillion. I'll upload a new patch now. I've removed the debug code from the class, double-checked the inline docs (although I didn't add @since, since I'm not sure what it should be), and included the test files I used for my speed benchmarking in the patch.

To just run the speed test, run phpunit --group shortcode --filter test_speed. It's not actually a test case, but it was the easiest way I knew to run the speed tests from the command line. To compare with the current implementation, you'd need to run that command on both the patched and unpatched WordPress code and compare the times that get output.

@cfinke
4 years ago

No functionality changes, but I've updated documentation to match coding standards and I've included the sample shortcode files I used for speed testing.

This ticket was mentioned in Slack in #core by helen. View the logs.


4 years ago

#16 @mukesh27
4 years ago

Hi there!

I have review shortcode-parser.7.diff and found that there are missing @since 5.6.0 in Class, functions and variables.

#17 @helen
4 years ago

  • Keywords early needs-testing added
  • Milestone changed from 5.6 to 5.7

I'm very sorry we didn't get to this in a timely manner - marking with the early keyword and milestoning for 5.7 so hopefully we don't drop the ball again. There's some specific testing that needs to be done I think (not on you @cfinke).

#18 @SergeyBiryukov
4 years ago

Just noting that this looks great, but needs some minor cleanup to adjust documentation per the standards, move the test examples to the tests/phpunit/data/ directory, remove debugging, etc.

Additionally, the Shortcode_Parser should probably be WP_Shortcode_Parser to avoid conflicts, and should be located in its own file, wp-includes/class-wp-shortcode-parser.php, per the coding standards.

Version 0, edited 4 years ago by SergeyBiryukov (next)

@cfinke
4 years ago

Renamed Shortcode_Parser class to WP_Shortcode_Parser and moved it to its own file; added "@since 5.7.0" to all of the docblocks; moved shortcode examples to the phpunit/data/ folder.

#19 @cfinke
4 years ago

Since the patch eliminates the use of get_shortcode_regex() in do_shortcode(), other places in the code that call get_shortcode_regex() could potentially be updated to use WP_Shortcode_Parser as well:

  • has_shortcode()
  • strip_shortcodes()
  • do_shortcodes_in_html_tags()
  • get_post_galleries()
  • wp_ajax_parse_embed()

This wouldn't have to happen simultaneously with switching do_shortcode() to use WP_Shortcode_Parser, but it might not be a bad idea so that everything that parses shortcodes does it consistently.

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


4 years ago

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


4 years ago

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


4 years ago

#23 @hellofromTonya
4 years ago

  • Milestone changed from 5.7 to Future Release

With 5.7 Beta 1 tomorrow and no activity for several months, punting this ticket to Future Release. Let's see if we can get this one early in 5.8.

If any maintainer or committer feels this can be resolved in time, or wishes to assume ownership during a specific cycle, feel free to update the milestone accordingly.

This ticket was mentioned in Slack in #core by hellofromtonya. View the logs.


4 years ago

#25 @johnbillion
4 years ago

  • Keywords early-like-actually-early added
  • Milestone changed from Future Release to 5.8

This ticket was mentioned in Slack in #core-test by hellofromtonya. View the logs.


4 years ago

#27 @Boniu91
4 years ago

  • Keywords needs-testing-info added

@cfinke Hello!

Please provide more information to help testers manually test the patch (including at the Test Scrubs):
– What are the steps to test?
– Are there any testing dependencies, such as a plugin or script?
– What is the expected behavior after applying the patch?

#28 @SergeyBiryukov
4 years ago

  • Milestone changed from 5.8 to 5.9

#29 @cfinke
4 years ago

@Boniu91 There are extensive unit tests for this patch that can be run via phpunit --group shortcode.

For manual testing, anything involving shortcodes would be suitable. Apart from the fixes mentioned in the comments above, all shortcode-related functionality after this patch has been applied should behave the same as it did before.

This ticket was mentioned in Slack in #core-test by hellofromtonya. View the logs.


3 years ago

@johnbillion
3 years ago

#31 @johnbillion
3 years ago

  • Owner changed from SergeyBiryukov to johnbillion

50683.diff is a refresh with PHPCS fixes. I'll be reviewing this over the next few days ready for an early 5.9.

#33 @johnbillion
3 years ago

I've moved this into a PR at https://github.com/WordPress/wordpress-develop/pull/1554 so we can assess the performance impact across PHP versions on the GitHub Actions CI.

I've increased the size of the content added to the performance test payloads so they stress test the parser more and reduce the variance when re-running the tests. The "long content" data is now 100,000 bytes (instead of 1,000) which is roughly equal to 15,000 words in English.

I'll let the tests run without the parser in place to get a baseline, then add the parser and run the tests again.

On my local installation I'm seeing a slight performance degradation from a total of around 0.03s to around 0.15s, which is slower but not something I would worry about too much. The advantage of fixing bugs outweighs a minimal performance impact at the extremities of content size.

#34 @johnbillion
3 years ago

Performance change (in seconds) is below.

This has quite a big performance impact on PHP 5.6 and 7.0 which is concerning. PHP 7.2 is an outlier and I'm going to re-run the test suite to see if it's consistent.

PHPBeforeAfterChange
5.6 0.3158 2.8549 +2.5391
7.0 0.1396 1.0648 +0.9252
7.1 0.1789 0.3655 +0.1866
7.2 0.1460 1.2098 +1.0638
7.3 0.1547 0.4010 +0.2463
7.4 0.1146 0.7769 +0.6623
8.0 0.1091 0.2948 +0.1857

cc @cfinke

Last edited 3 years ago by johnbillion (previous) (diff)

#35 @johnbillion
3 years ago

I re-ran the suite. Seems there's quite a lot of variance with the test speeds.

PHPAfter
5.6 3.2700
7.0 1.2237
7.1 0.4394
7.2 0.3922
7.3 1.0296
7.4 0.8874
8.0 0.8836

#36 @johnbillion
3 years ago

I updated the tests so the speed test runs 20 times during each run (https://github.com/WordPress/wordpress-develop/actions/runs/1108267266). Loads of variance between the 20 test runs on all PHP versions.

5.6 7.0 7.1 7.2 7.3 7.4 8.0
1.39231.21121.16170.89130.71880.96360.8124
3.10341.12880.36790.33810.22230.2980.2479
3.20340.44660.40790.83330.21170.29160.23
3.30631.09010.41820.83970.70360.93890.2161
1.24821.09041.12260.35530.71090.30780.2496
1.27950.47890.38960.86610.71310.26520.7677
3.09931.11120.41610.34240.67490.26180.7822
3.22990.43660.41140.32650.25340.23430.7902
1.1960.45020.39360.88510.67710.25460.781
3.19390.46680.38790.88770.24750.27880.6941
3.13951.1070.39470.84120.2220.26110.2523
1.2811.1380.42320.83880.73630.2560.7009
3.09940.40051.05180.88090.64580.25340.2308
3.05141.12610.41630.3480.2330.80050.7244
1.29021.12650.3890.3270.26420.25040.2551
1.15391.14551.12090.32620.24570.8610.254
3.05680.48350.44010.33250.250.90830.6867
1.24160.4430.4220.340.66770.32830.2406
1.24861.17351.14310.33590.25810.34440.7076
3.07330.45561.08760.34520.66880.32770.2211

@cfinke Each of these 20 runs for each PHP version uses different random strings. Can you think of something about the content of the strings that would cause the performance of the parser to vary so much?

Last edited 3 years ago by johnbillion (previous) (diff)

#37 @cfinke
3 years ago

I think the variance is not as great as it seems -- within each version there are really two groupings of times. With PHP 7.1, for example, there are six times between 1.05 and 1.16 and 14 times between 0.36 and 0.44 (and nothing in-between). PHP 7.2 is split between ~0.3 and ~0.8 times.

So there must be some condition that forks the performance test onto either the fast path or the slow path. I'll see if I can reproduce these results on my machine and investigate what might be responsible.

#38 @cfinke
3 years ago

It has something to do with the new entries in $variations that call str_repeat(). Without them, the disparate duration groups disappear.

xdebug isn't working on my machine, so I'll continue investigating after I get it fixed.

#39 @cfinke
3 years ago

It appears to be reliant on whether a > character follows the last < character in the content being passed to do_shortcode(). When there is no closing >, then shortcode_parse_atts() takes much less time to complete, probably because a lot of the string gets stripped or ignored.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


3 years ago

#41 @audrasjb
3 years ago

  • Milestone changed from 5.9 to Future Release

As seen in today's bug scrub, we're unfortunately past the early-like-actually-early phase in the cycle :)

Seriously, given the performance issue underlined in comment 34, let's investigate a bit more on this before adding a milestone :)

Last edited 3 years ago by audrasjb (previous) (diff)

This ticket was mentioned in Slack in #core-performance by johnbillion. View the logs.


13 months ago

#43 @dmsnell
6 months ago

I've been thinking about the proposed patch quite a bit, and I keep going back to some of the fundamental issues with Shortcodes.

It reminds me that I think we could clarify many of our problems, parsing included, if we think about a new breaking change:

  • Shortcodes can only appear inside HTML attributes and text nodes.

That is, we exclude the ability to use Shortcodes to give a tag its name, or add an attribute to an existing tag. If we do this, we could make a couple of other changes:

  • Use the HTML API to find them.
  • Re-use the HTML attribute parsing and apply the rules to shortcode attributes.

These are behavioral changes, but I think there's compelling reason to consider them apart from this particular issue. They would fit nicely with this issue.

Note: See TracTickets for help on using tickets.