WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 5 years ago

Last modified 5 years ago

#28575 closed defect (bug) (fixed)

Eliminate redundant preg_match() to improve wptexturize() performance

Reported by: dllh Owned by: SergeyBiryukov
Milestone: 4.0 Priority: normal
Severity: normal Version: 3.9
Component: Formatting Keywords: has-patch commit
Focuses: performance Cc:

Description

wptexturize() performs very poorly for large posts. In trying to investigate ways of speeding it up, I noticed that when looping over $textarr to actually do the replacements, we do the preg_match() to check for strings like 9x9 for every iteration of the loop. It seems to me that we should do this check only once, since for large chunks of $text, we're doing a preg_match() against all of $text for each item in $textarr, which will be a lot of items for large text.

I did some rough profiling of the core code against the attached patch. Methodology:

  • Put a chunk of text of approaching 1MB in a file. The text I used came from an actual large post I encountered this problem with.
  • Put the wptexturized text in a second file.
  • For 10 iterations, call wptexturize() on the original file.
  • Compare the value to the value in the second file to validate that nothing in the way wptexturize() actually processes text has been changed by my modifications.
  • Measure the total time (I used the time command on my linux box).

With no modifications, core consistently took about 1m9s to run through my little test. With the attached patch, it consistently took about 6s. For just 1 iteration, the difference was more like 8s to 1.4s, so an appreciable amount of time even for a single run of the function.

Arguably, users shouldn't be making posts of 1MB, but it happens. On pages like search results that will generate an excerpt and thus fire wptexturize() potentially many times, the performance increase here stands to be fairly significant provided there are large posts among the results.

I'm attaching both a proposed patch and the script I used for the rough profiling. The post I tested isn't mine, so I can't share it, but that should be easy enough to generate.

Attachments (3)

wptexturize.patch (1014 bytes) - added by dllh 5 years ago.
wptexturize.php (450 bytes) - added by dllh 5 years ago.
Simple script for profiling.
28575.patch (964 bytes) - added by SergeyBiryukov 5 years ago.

Download all attachments as: .zip

Change History (16)

@dllh
5 years ago

@dllh
5 years ago

Simple script for profiling.

#1 @dllh
5 years ago

  • Keywords has-patch added

#2 @SergeyBiryukov
5 years ago

  • Component changed from General to Formatting
  • Keywords commit added
  • Milestone changed from Awaiting Review to 4.0

Looks pretty straightforward.

#3 @SergeyBiryukov
5 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 28776:

Move preg_match() that should only run once out of the loop.

props dllh.
fixes #28575.

#4 @SergeyBiryukov
5 years ago

  • Version set to 3.9

Introduced in [27839].

#5 @miqrogroove
5 years ago

  • Keywords needs-testing added
  • Resolution fixed deleted
  • Status changed from closed to reopened
  • Type changed from enhancement to defect (bug)

Resolution appears to be incorrect. This needs more testing.

#6 @miqrogroove
5 years ago

  • Keywords has-patch commit removed

Please revert [28776] to prevent corrupting HTML output. I can work on testing/patching by next week. In the meantime, I would suggest that the bug was a simple typo, not a redundant pattern. Unsure without a lot more testing. Thank you for the bug report.

#7 follow-ups: @SergeyBiryukov
5 years ago

  • Keywords has-patch added

Indeed, looks like 28575.patch would be a correct solution here.

But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass, including the multiplication test.

#8 in reply to: ↑ 7 @dllh
5 years ago

Replying to SergeyBiryukov:

Indeed, looks like 28575.patch would be a correct solution here.

But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass, including the multiplication test.

I had just come up with the same patch after some more testing. I hadn't tested with markup that included a multiplication string. Interestingly, with the old code, the slowness reproduced for me only when the post did not include such a string. That is, when I test the code from before my patch with html that includes a 9x9 string, it's fast; when the html does not include such a string, it's slow.So in testing [27839], I suppose the tested code included the multiplication strings and the slowness was never manifest.

#9 in reply to: ↑ 7 @miqrogroove
5 years ago

Replying to SergeyBiryukov:

Indeed, looks like 28575.patch would be a correct solution here.

That's what I had in mind.

But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass

I was thinking back to why I originally moved that code inside the loop. The replacement was matching and modifying links to image files, which is not what we want. I'm not even sure I wrote unit tests for that scenario.

#10 @SergeyBiryukov
5 years ago

  • Keywords commit added; needs-testing removed

#11 @SergeyBiryukov
5 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 28802:

Revert [28776] and use a correct variable instead.

props miqrogroove.
fixes #28575.

This ticket was mentioned in IRC in #wordpress-dev by Clorith. View the logs.


5 years ago

#13 @miqrogroove
5 years ago

My performance results:

Looking strictly at the regex performance only, patching 3.9.1 for the one variable name gives a speed increase of about 10% for a sample post.

Substituting the $dynamic patterns used in trunk now, about 50% more time is needed.

Substituting the new HTML/shortcode avoidance logic, about 4% more time is needed.

Patching trunk for the one variable name, about 8% less time is needed.

With the patch from #28483, another 4% less time is needed.

Here is a rough breakdown of the time being spent on regex:

17% - Dashes and spaces (4 patterns)
8% - '99' and '99" (2 patterns)
8% - "42" or '42.00' (2 patterns)
8% - 9" and 9' (2 patterns)
5% - HTML/shortcode avoidance
5% - '99
5% - Single quote at start
5% - Apostrophe in a word.
5% - Double quote at start
4% - Any remaining double quotes.
4% - Single quotes followed by spaces

Conclusions:

The difference between running with and without the affected if statement altogether is on the order of 1%. So, the patch for this ticket was only slightly better than deleting the bugged code.

Each of the $dynamic patterns is optimal, using almost exactly the same amount of time. Reducing the number of patterns is paramount for improving performance.

For other than the HTML/shortcode avoidance pattern, about 95% of the time consumed is devoted to looping through $textarr and re-executing the patterns versus running each pattern one time on the entire post.

The calls to _wptexturize_pushpop_element() are not trivial, consuming 7% of the total time.

Last edited 5 years ago by miqrogroove (previous) (diff)
Note: See TracTickets for help on using tickets.