Make WordPress Core

Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#28724 closed enhancement (fixed)

Optimize Regexp Usage in wptexturize()

Reported by: miqrogroove's profile miqrogroove Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 4.0 Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

The effort to consolidate patterns in #28623 didn't pan out.

I think the next logical step in reducing CPU time in wptexturize is to structure the various patterns into groups by target and run them only if the target char(s) exist within the subject string. As most of the patterns involve quotes or apostrophes, it should be faster to run strpos within the inner loop before running multiple regexp replacements. If those replacements will do nothing, then they do not need to run during that one pass through the loop.

Example:

This is a <i>test</i> post.

Currently, all of the patterns would run 3 times each for that post. With the proposed changes, we could run strpos about a dozen times and then exit wptexturize with the same results in less time.

Attachments (4)

miqro-28724.patch (4.6 KB) - added by miqrogroove 10 years ago.
3x Performance Boost
miqro-28724.2.patch (4.6 KB) - added by miqrogroove 10 years ago.
Style tweak.
miqro-28724-part2.patch (600 bytes) - added by miqrogroove 10 years ago.
Additional 2% to 25% improvement depending on content.
miqro-28724-part2.2.patch (600 bytes) - added by miqrogroove 10 years ago.
Fix version compatibility.

Download all attachments as: .zip

Change History (20)

@miqrogroove
10 years ago

3x Performance Boost

#1 @SergeyBiryukov
10 years ago

FALSE should be in lower case, see #16302.

#2 @SergeyBiryukov
10 years ago

  • Milestone changed from Awaiting Review to 4.0

@miqrogroove
10 years ago

Style tweak.

#3 @miqrogroove
10 years ago

  • Keywords has-patch added

First draft tested better than expected. I'll continue brainstorming, meanwhile the patch can be used as-is.

#4 @miqrogroove
10 years ago

The next thing I'd like to study is the performance implication of having whitespace between HTML tags. For example, when I insert a table in the editor it tends to get unwrapped like

<table>
<tbody>
<tr>
<td>
...

Does this mean texturize is looping through a non-empty string after each element whose content is nothing more than a line feed? This might be another situation where it is possible to avoid running regexp altogether.

#5 @wonderboymusic
10 years ago

In 28986:

Optimize regexp usage in wptexturize() for a "3x Performance Boost."

Props miqrogroove.
See #28724.

#6 @miqrogroove
10 years ago

I got another 25% improvement just by stubbing this line after the "do texturize" comment:

if (empty(trim($curl))) continue;

This is definitely worth looking into.

@miqrogroove
10 years ago

Additional 2% to 25% improvement depending on content.

#7 @miqrogroove
10 years ago

In miqro-28724-part2.patch:

  • Optimize handling of whitespace in wptexturize().

What I found was that HTML-heavy posts tend to have a large amount of whitespace between elements, due to the formatting imposed by TinyMCE. Lists and tables in particular tend to have a newline after every element. These cases improve by ~25% less time spent texturizing whitespace.

A moderately long post with little HTML still tends to contain whitespace around commonly used shortcodes, the "more" tag, and the inline formatting tags. There is a minimal speed improvement in these cases.

Short posts such as "hello world" will have a fixed cost and tend not to benefit from optimizations.

Let's commit that and call it fixed. :)

#8 @miqrogroove
10 years ago

In a side-by-side comparison of v3.8 and my working copy, results of running wptexturize 10 times on a sample post:

v3.8 - 0.260 seconds
V4.0 - 0.061 seconds

This includes the impact of bug fixes, added patterns, HTML avoidance, calls to _wptexturize_pushpop_element(), and running the inner loop.

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


10 years ago

#10 follow-up: @SergeyBiryukov
10 years ago

miqro-28724-part2.patch gives me a fatal error:

Fatal error: Can't use function return value in write context in wp-includes/formatting.php on line 237

It can probably be replaced with:

if ( '' === trim( $curl ) ) {
	continue;
}

Is there a reason not to make this check the first in the loop? Can the regex can be modified to avoid newlines instead?

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

#11 in reply to: ↑ 10 ; follow-up: @miqrogroove
10 years ago

Replying to SergeyBiryukov:

Is there a reason not to make this check the first in the loop? Can the regex can be modified to avoid newlines instead?

The position was based on performance testing. This particular spot gives slightly better results depending on the type of input.

Regexp changes were considered but would make the stack logic very much more complex. The stack helper currently assumes the input begins and ends with the tag braces.

Can you tell me which version of PHP is giving a fatal error? I've never seen that before.

@miqrogroove
10 years ago

Fix version compatibility.

#12 @miqrogroove
10 years ago

Error fixed.

#13 in reply to: ↑ 11 @SergeyBiryukov
10 years ago

Replying to miqrogroove:

Can you tell me which version of PHP is giving a fatal error? I've never seen that before.

A note from http://www.php.net/manual/en/function.empty.php:

Prior to PHP 5.5, empty() only supports variables; anything else will result in a parse error.

#14 @SergeyBiryukov
10 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 29008:

Exclude a newline between delimeters from further checks in wptexturize() for better performance.

props miqrogroove.
fixes #28724.

#15 @miqrogroove
10 years ago

Sergey, the function runs slightly slower with the continue command. The difference is less than 1% but I wanted to point that out. Adding a continue command in the middle of an if block doesn't change the flow of control.

#16 @SergeyBiryukov
10 years ago

Thanks for pointing that out. I know it's not necessary here, but I've decided to add it for clarity anyway.

Note: See TracTickets for help on using tickets.