#28575 closed defect (bug) (fixed)
Eliminate redundant preg_match() to improve wptexturize() performance
Reported by: | dllh | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | 4.0 | Priority: | normal |
Severity: | normal | Version: | 3.9 |
Component: | Formatting | Keywords: | has-patch commit |
Focuses: | performance | Cc: |
Description
wptexturize()
performs very poorly for large posts. In trying to investigate ways of speeding it up, I noticed that when looping over $textarr
to actually do the replacements, we do the preg_match()
to check for strings like 9x9
for every iteration of the loop. It seems to me that we should do this check only once, since for large chunks of $text
, we're doing a preg_match()
against all of $text
for each item in $textarr
, which will be a lot of items for large text.
I did some rough profiling of the core code against the attached patch. Methodology:
- Put a chunk of text of approaching 1MB in a file. The text I used came from an actual large post I encountered this problem with.
- Put the wptexturized text in a second file.
- For 10 iterations, call
wptexturize()
on the original file. - Compare the value to the value in the second file to validate that nothing in the way
wptexturize()
actually processes text has been changed by my modifications. - Measure the total time (I used the
time
command on my linux box).
With no modifications, core consistently took about 1m9s to run through my little test. With the attached patch, it consistently took about 6s. For just 1 iteration, the difference was more like 8s to 1.4s, so an appreciable amount of time even for a single run of the function.
Arguably, users shouldn't be making posts of 1MB, but it happens. On pages like search results that will generate an excerpt and thus fire wptexturize()
potentially many times, the performance increase here stands to be fairly significant provided there are large posts among the results.
I'm attaching both a proposed patch and the script I used for the rough profiling. The post I tested isn't mine, so I can't share it, but that should be easy enough to generate.
Attachments (3)
Change History (16)
#2
@
10 years ago
- Component changed from General to Formatting
- Keywords commit added
- Milestone changed from Awaiting Review to 4.0
Looks pretty straightforward.
#3
@
10 years ago
- Owner set to SergeyBiryukov
- Resolution set to fixed
- Status changed from new to closed
In 28776:
#5
@
10 years ago
- Keywords needs-testing added
- Resolution fixed deleted
- Status changed from closed to reopened
- Type changed from enhancement to defect (bug)
Resolution appears to be incorrect. This needs more testing.
#6
@
10 years ago
- Keywords has-patch commit removed
Please revert [28776] to prevent corrupting HTML output. I can work on testing/patching by next week. In the meantime, I would suggest that the bug was a simple typo, not a redundant pattern. Unsure without a lot more testing. Thank you for the bug report.
#7
follow-ups:
↓ 8
↓ 9
@
10 years ago
- Keywords has-patch added
Indeed, looks like 28575.patch would be a correct solution here.
But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass, including the multiplication test.
#8
in reply to:
↑ 7
@
10 years ago
Replying to SergeyBiryukov:
Indeed, looks like 28575.patch would be a correct solution here.
But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass, including the multiplication test.
I had just come up with the same patch after some more testing. I hadn't tested with markup that included a multiplication string. Interestingly, with the old code, the slowness reproduced for me only when the post did not include such a string. That is, when I test the code from before my patch with html that includes a 9x9 string, it's fast; when the html does not include such a string, it's slow.So in testing [27839], I suppose the tested code included the multiplication strings and the slowness was never manifest.
#9
in reply to:
↑ 7
@
10 years ago
Replying to SergeyBiryukov:
Indeed, looks like 28575.patch would be a correct solution here.
That's what I had in mind.
But even if it was a typo, how does [28776] corrupt HTML output? All texturize unit tests still pass
I was thinking back to why I originally moved that code inside the loop. The replacement was matching and modifying links to image files, which is not what we want. I'm not even sure I wrote unit tests for that scenario.
This ticket was mentioned in IRC in #wordpress-dev by Clorith. View the logs.
10 years ago
#13
@
10 years ago
My performance results:
Looking strictly at the regex performance only, patching 3.9.1 for the one variable name gives a speed increase of about 10% for a sample post.
Substituting the $dynamic patterns used in trunk now, about 50% more time is needed.
Substituting the new HTML/shortcode avoidance logic, about 4% more time is needed.
Patching trunk for the one variable name, about 8% less time is needed.
With the patch from #28483, another 4% less time is needed.
Here is a rough breakdown of the time being spent on regex:
17% - Dashes and spaces (4 patterns)
8% - '99' and '99" (2 patterns)
8% - "42" or '42.00' (2 patterns)
8% - 9" and 9' (2 patterns)
5% - HTML/shortcode avoidance
5% - '99
5% - Single quote at start
5% - Apostrophe in a word.
5% - Double quote at start
4% - Any remaining double quotes.
4% - Single quotes followed by spaces
Conclusions:
The difference between running with and without the affected if statement altogether is on the order of 1%. So, the patch for this ticket was only slightly better than deleting the bugged code.
Each of the $dynamic patterns is optimal, using almost exactly the same amount of time. Reducing the number of patterns is paramount for improving performance.
For other than the HTML/shortcode avoidance pattern, about 95% of the time consumed is devoted to looping through $textarr and re-executing the patterns versus running each pattern one time on the entire post.
The calls to _wptexturize_pushpop_element() are not trivial, consuming 7% of the total time.
Simple script for profiling.