Make WordPress Core

Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#14429 closed enhancement (fixed)

Faster is_serialized

Reported by: sambauers's profile sambauers Owned by:
Milestone: 3.1 Priority: normal
Severity: minor Version: 3.0
Component: Optimization Keywords: has-patch dev-feedback
Focuses: Cc:

Description

is_serialized runs a preg_match every time it is called which is slow.

This patch replaces the preg_match conditional with two much cheaper conditionals.

Estimated speedup for WordPress on a reasonably fresh install is about 3%. Would be more on installs with lots of options set I imagine.

Attachments (14)

faster-is_serialized.diff (1.0 KB) - added by sambauers 14 years ago.
14429.diff (759 bytes) - added by ryan 14 years ago.
Conditionally use strpbrk
14429.patch (1.4 KB) - added by hakre 14 years ago.
no need for strpbrk checks
14429.2.patch (1.4 KB) - added by hakre 14 years ago.
no need for strpbrk checks; fix style/ws
14429-strpbrk.patch (620 bytes) - added by hakre 14 years ago.
Based on [15636] fixes coding style / WS and removes unneeded array() expression.
14429-strpbrk.2.patch (907 bytes) - added by hakre 14 years ago.
Based on [15636], as previous but caches function_exists (as this is for performance)
14429-strpbrk.3.patch (930 bytes) - added by hakre 14 years ago.
Based on [15636], as previous and uses strlen check for both (w and w/o strpbrk()) cases.
14429-strpbrk.4.patch (1.5 KB) - added by hakre 14 years ago.
Based on [15636], as previous and there is no need for the (badly named) $badions array at all: removed.
14429-strpbrk.5.patch (1.5 KB) - added by hakre 14 years ago.
Based on [15636], as previous but runtime optimized for 99% of all non serialized strings larger than 3 chars
14429-strpbrk.6.patch (1.5 KB) - added by hakre 14 years ago.
Based on [15636], as previous but there is no need for the regular expression to check for the first two chars in case strpbrk does not exists
14429-strpbrk.7.patch (1.5 KB) - added by hakre 14 years ago.
Based on [15636], as previous and runtime optimized to check on second char prior to check anything else.
14429-strpbrk.8.patch (1.2 KB) - added by hakre 14 years ago.
Based on [15636], as previous and with strpbrk constraints removed (might not be needed any longer)
14429-strpbrk.9.patch (1.2 KB) - added by hakre 14 years ago.
Based on [15636], as previous and snappier return of preg_match results
14429.10.diff (1.5 KB) - added by duck_ 13 years ago.

Download all attachments as: .zip

Change History (40)

#1 @sambauers
14 years ago

  • Keywords has-patch added

#2 follow-up: @Denis-de-Bernardy
14 years ago

See also #9930 for simple optimizations and fixes that never got committed.

#3 in reply to: ↑ 2 ; follow-up: @sambauers
14 years ago

Replying to Denis-de-Bernardy:

See also #9930 for simple optimizations and fixes that never got committed.

The proposal in #9930 is a hammer, is_serialized() is not a nail. There's a serious case of regex fever going on there. "Elegance" over performance. Although I note your concerns about performance in the comments there.

It's not necessary to check for serialised doubles on every call of is_serialized(), only if it passes the rudimentary tests for a serialised string. On a standard install is_serialzed() runs about 1700 times per page (from memory). Maybe not even half of those calls are examining serialised strings at all.

Where possible, commonly used functions should exit or return as quickly and as cheaply as possible. That's what I'm doing here. The serialised doubles problem should be a discreet case after that, not tied into the rudimentary checks.

#4 @nacin
14 years ago

  • Milestone changed from Awaiting Review to 3.1

Strong +1 to the approach.

#5 in reply to: ↑ 3 @sambauers
14 years ago

Replying to sambauers:

On a standard install is_serialzed() runs about 1700 times per page (from memory).

My memory is bad. It's more like 180 times.

#6 @sambauers
14 years ago

My code is also bad. This isn't faster at all after doing some reduction testing. In which case it's probably best to stick with what is there.

#7 @ryan
14 years ago

Once we go PHP5, we can look into this optimization from Rasmus (scroll down).

http://talks.php.net/show/confoo10/11

@ryan
14 years ago

Conditionally use strpbrk

#8 @ryan
14 years ago

With a default scratch install with the Links widget added, this reduces the number of calls to preg_match from 233 to 45. strpbrk is called 164 times and function_exists is called 188 times. is_serialized went from spending 6-10% of its time running preg_match to 2% of its time running preg_match and a fraction of a percent running strpbrk and function_exists. strlen() > 1 eliminated 24 strings from further processing (probably 0|1|Y|N flags). An empty() check didn't eliminate any further strings, so I left it out.

This could be further optimized by moving the strlen check up and caching the function_exists results. The function_exists check will disappear in 3.2, however, so worrying too much about it now is probably not worthwhile.

#9 @ryan
14 years ago

(In [15636]) Speed up is_serialized() with strpbrk(). Props Rasmus. see #14429

#10 follow-ups: @Denis-de-Bernardy
14 years ago

r15636 honestly doesn't seem to make any sense. Shouldn't we be using something more like this?

if ( isset($data[1]) && $data[1] == ':' && strpbrk($data[0],'adObis') == $data[0] ) {

or even:

if ( isset($data[1]) && $data[1] == ':' && in_array($data[0], array('a', 'd', 'O', 'b', 'i', 's') ) {

It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set, b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.

Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...

My $.02...

#11 in reply to: ↑ 10 @hakre
14 years ago

I suggest an overload strategy to re-define the function based on function_exists() to reduce it to one call per request instead of 188 - if you're really in for the speed.


Replying to Denis-de-Bernardy:

It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set,

PHP does not need to calculate the length of strings, it always know it, so this is not bold.

b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.

That's a very valid point.

Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...

Sure.

As I understood Rasums talk he just wanted to show how fast things can be made more easy (e.g. by replacing the whole function with an existing function). It's not making them perfect. So it's Rasmus, it's Feedback, but it should not be blindly adopted. I think that wouldn't be in Rasmus interest as well.

My 2 cents.

#12 @hakre
14 years ago

Related: #14816

#13 in reply to: ↑ 10 @hakre
14 years ago

Replying to Denis-de-Bernardy:
I like the direction you point to, strpbrk() is probably doing too much for our serialized case here.

I replace the in_array call with strpos of which I assume (not know) it's faster.

if ( 2 > strlen($data[1]) && ':' === $data[1] && false !== strpos('adObis', $data[0])) {

Using that approach does save us from function_exists and strpbk overheads.

if ( isset($data[1]) && $data[1] == ':' && in_array($data[0], array('a', 'd', 'O', 'b', 'i', 's') )

r15636 honestly doesn't seem to make any sense. Shouldn't we be using something more like this?

if ( isset($data[1]) && $data[1] == ':' && strpbrk($data[0],'adObis') == $data[0] ) {

or even:

It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set, b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.

Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...

My $.02...

#14 @hakre
14 years ago

Added a patch that removes the need to check for strpbrk() as it is not using it. Additionally I added (optional) angel brackets for most of the single line if clauses to improve readability.

@hakre
14 years ago

no need for strpbrk checks

@hakre
14 years ago

no need for strpbrk checks; fix style/ws

@hakre
14 years ago

Based on [15636] fixes coding style / WS and removes unneeded array() expression.

@hakre
14 years ago

Based on [15636], as previous but caches function_exists (as this is for performance)

@hakre
14 years ago

Based on [15636], as previous and uses strlen check for both (w and w/o strpbrk()) cases.

@hakre
14 years ago

Based on [15636], as previous and there is no need for the (badly named) $badions array at all: removed.

@hakre
14 years ago

Based on [15636], as previous but runtime optimized for 99% of all non serialized strings larger than 3 chars

@hakre
14 years ago

Based on [15636], as previous but there is no need for the regular expression to check for the first two chars in case strpbrk does not exists

@hakre
14 years ago

Based on [15636], as previous and runtime optimized to check on second char prior to check anything else.

@hakre
14 years ago

Based on [15636], as previous and with strpbrk constraints removed (might not be needed any longer)

@hakre
14 years ago

Based on [15636], as previous and snappier return of preg_match results

#15 @hakre
14 years ago

I now provided a new lineup of patches that come with slight changes between each other making them perfect to run measurements against those implementations.

This contains a cache as proposed by ryan as well as doing simpler check upfront until the complete need to rely on strpbrk at all.

I'm intersted to see the performance of 14429-strpbrk.9.patch relative to [15636] in ryans testbed.

#16 @hakre
13 years ago

  • Keywords dev-feedback added

#17 @ryan
13 years ago

Nice. I'll profile the latest patch soon.

#18 @Denis-de-Bernardy
13 years ago

I believe an even faster approach would be to split ghe regexp in two. Chech if the string ends with ; on the one hand side using substr, and then call preg_match without the expensive (memory intensive) .+ wildcard.

#19 follow-up: @hakre
13 years ago

You mean something like

$len = strlen( $data );
$last_index = $len - 1;
$last_char = $data[$last_index];
$last_char_match = ( ';' === $last_char || '}' === $last_char )

?

#20 in reply to: ↑ 19 @Denis-de-Bernardy
13 years ago

Replying to hakre:
Yeah, something like that basically. To avoid the huge backtrace.

@duck_
13 years ago

#21 @duck_
13 years ago

Attached new patch with some improvements over 14429-strpbrk.9.patch:

  • Incorporated Denis' suggestion to remove .* from regex for a, s, O
  • Removed strpos( 'adObis', $token ) since that will be caught by not matching any of the cases in the switch statement
  • Doesn't false-positive against s:4:test; (missing quotes around test)


Results from some quick profiling using xdebug across 10 front page loads of a test site (measuring the 8610 calls to is_serialized):

14429.10.diff: 181555 microseconds
14429-strpbrk.9.patch: 189115 microseconds
trunk (strpbrk): 280283 microseconds

Though there are still some false-negatives (e.g. serialize(2.1E+200), #9930) and false-positives (e.g. b:5;).

The only real way around this (as far as I can see) is by actually making use of unserialize in the test. An interesting approach I've seen which allows you to do this and not have to run unserialize twice (once for the test and again if it's serialized) is here. Profiling a modified version of the code from that gist gave the best results in this particular test, 147125 microseconds, but it would be much slower over, for example, a million iterations of a small serialized object. However, the first test is probably a better real world example and if used right that doesn't matter since you're getting the unserialized result back too.

#22 @ryan
13 years ago

(In [16300]) Faster is_serialized(). Props duck_, hakre, Denis-de-Bernardy. see #14429

#23 @hakre
13 years ago

Thanks for the detailed feedback and the additional measurements. I think sites can benefit from that, as is_serialized is somehow a weak point.

for the false positives as in #9930 I've added some more info there, too. It's another regex that is checking for proper float/double php serializations, the bools are easy to check being 0 or 1, but I have not provided any feedback to that.

I dunno if these false positives are a real bummer or not. It's somehow specific.

As duck_ suggested to do the actual serialization as an additional check at the end, this implies to change the code where is_serialized() is actually used to have a profit from that. I don't see an easy way to do that. Probably making use of php exceptions is one way to do so, but this is not an option right now because of PHP 4 compability. And even if we switch to PHP 5.2 I think there is no coding practice how to deal with exceptions if not at all in the project to easily decide on this. So this is in the end a design issue which tackles some general decisions.

#24 @ryan
13 years ago

  • Resolution set to fixed
  • Status changed from new to closed

#25 @hakre
13 years ago

Related: #16504

#26 @hakre
13 years ago

Related: #17375, #17129

Last edited 13 years ago by hakre (previous) (diff)
Note: See TracTickets for help on using tickets.