#14429 closed enhancement (fixed)
Faster is_serialized
Reported by: |
|
Owned by: | |
---|---|---|---|
Milestone: | 3.1 | Priority: | normal |
Severity: | minor | Version: | 3.0 |
Component: | Optimization | Keywords: | has-patch dev-feedback |
Focuses: | Cc: |
Description
is_serialized runs a preg_match every time it is called which is slow.
This patch replaces the preg_match conditional with two much cheaper conditionals.
Estimated speedup for WordPress on a reasonably fresh install is about 3%. Would be more on installs with lots of options set I imagine.
Attachments (14)
Change History (40)
#3
in reply to:
↑ 2
;
follow-up:
↓ 5
@
15 years ago
Replying to Denis-de-Bernardy:
See also #9930 for simple optimizations and fixes that never got committed.
The proposal in #9930 is a hammer, is_serialized() is not a nail. There's a serious case of regex fever going on there. "Elegance" over performance. Although I note your concerns about performance in the comments there.
It's not necessary to check for serialised doubles on every call of is_serialized(), only if it passes the rudimentary tests for a serialised string. On a standard install is_serialzed() runs about 1700 times per page (from memory). Maybe not even half of those calls are examining serialised strings at all.
Where possible, commonly used functions should exit or return as quickly and as cheaply as possible. That's what I'm doing here. The serialised doubles problem should be a discreet case after that, not tied into the rudimentary checks.
#5
in reply to:
↑ 3
@
15 years ago
Replying to sambauers:
On a standard install is_serialzed() runs about 1700 times per page (from memory).
My memory is bad. It's more like 180 times.
#6
@
15 years ago
My code is also bad. This isn't faster at all after doing some reduction testing. In which case it's probably best to stick with what is there.
#8
@
14 years ago
With a default scratch install with the Links widget added, this reduces the number of calls to preg_match from 233 to 45. strpbrk is called 164 times and function_exists is called 188 times. is_serialized went from spending 6-10% of its time running preg_match to 2% of its time running preg_match and a fraction of a percent running strpbrk and function_exists. strlen() > 1 eliminated 24 strings from further processing (probably 0|1|Y|N flags). An empty() check didn't eliminate any further strings, so I left it out.
This could be further optimized by moving the strlen check up and caching the function_exists results. The function_exists check will disappear in 3.2, however, so worrying too much about it now is probably not worthwhile.
#10
follow-ups:
↓ 11
↓ 13
@
14 years ago
r15636 honestly doesn't seem to make any sense. Shouldn't we be using something more like this?
if ( isset($data[1]) && $data[1] == ':' && strpbrk($data[0],'adObis') == $data[0] ) {
or even:
if ( isset($data[1]) && $data[1] == ':' && in_array($data[0], array('a', 'd', 'O', 'b', 'i', 's') ) {
It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set, b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.
Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...
My $.02...
#11
in reply to:
↑ 10
@
14 years ago
I suggest an overload strategy to re-define the function based on function_exists() to reduce it to one call per request instead of 188 - if you're really in for the speed.
Replying to Denis-de-Bernardy:
It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set,
PHP does not need to calculate the length of strings, it always know it, so this is not bold.
b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.
That's a very valid point.
Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...
Sure.
As I understood Rasums talk he just wanted to show how fast things can be made more easy (e.g. by replacing the whole function with an existing function). It's not making them perfect. So it's Rasmus, it's Feedback, but it should not be blindly adopted. I think that wouldn't be in Rasmus interest as well.
My 2 cents.
#13
in reply to:
↑ 10
@
14 years ago
Replying to Denis-de-Bernardy:
I like the direction you point to, strpbrk() is probably doing too much for our serialized case here.
I replace the in_array call with strpos of which I assume (not know) it's faster.
if ( 2 > strlen($data[1]) && ':' === $data[1] && false !== strpos('adObis', $data[0])) {
Using that approach does save us from function_exists and strpbk overheads.
if ( isset($data[1]) && $data[1] == ':' && in_array($data[0], array('a', 'd', 'O', 'b', 'i', 's') )
r15636 honestly doesn't seem to make any sense. Shouldn't we be using something more like this?
if ( isset($data[1]) && $data[1] == ':' && strpbrk($data[0],'adObis') == $data[0] ) {or even:
It's like... there's absolutely no point in a) calculating the length of a 1MB string to verifying that its second char is set, b) doing some complex comparison involving strpbrk() (which could potentially amount to finding the first of these chars at the very end of a 1MB string) and c) only *then* verifying that the second character works for our sake.
Better: a) check that the second character exists at all, b) verify that it works, and c) finally look at the first chars to see if it works...
My $.02...
#14
@
14 years ago
Added a patch that removes the need to check for strpbrk() as it is not using it. Additionally I added (optional) angel brackets for most of the single line if clauses to improve readability.
@
14 years ago
Based on [15636], as previous but caches function_exists (as this is for performance)
@
14 years ago
Based on [15636], as previous and uses strlen check for both (w and w/o strpbrk()) cases.
@
14 years ago
Based on [15636], as previous and there is no need for the (badly named) $badions array at all: removed.
@
14 years ago
Based on [15636], as previous but runtime optimized for 99% of all non serialized strings larger than 3 chars
@
14 years ago
Based on [15636], as previous but there is no need for the regular expression to check for the first two chars in case strpbrk does not exists
@
14 years ago
Based on [15636], as previous and runtime optimized to check on second char prior to check anything else.
@
14 years ago
Based on [15636], as previous and with strpbrk constraints removed (might not be needed any longer)
#15
@
14 years ago
I now provided a new lineup of patches that come with slight changes between each other making them perfect to run measurements against those implementations.
This contains a cache as proposed by ryan as well as doing simpler check upfront until the complete need to rely on strpbrk at all.
I'm intersted to see the performance of 14429-strpbrk.9.patch relative to [15636] in ryans testbed.
#18
@
14 years ago
I believe an even faster approach would be to split ghe regexp in two. Chech if the string ends with ; on the one hand side using substr, and then call preg_match without the expensive (memory intensive) .+ wildcard.
#19
follow-up:
↓ 20
@
14 years ago
You mean something like
$len = strlen( $data ); $last_index = $len - 1; $last_char = $data[$last_index]; $last_char_match = ( ';' === $last_char || '}' === $last_char )
?
#20
in reply to:
↑ 19
@
14 years ago
Replying to hakre:
Yeah, something like that basically. To avoid the huge backtrace.
#21
@
14 years ago
Attached new patch with some improvements over 14429-strpbrk.9.patch:
- Incorporated Denis' suggestion to remove
.*
from regex for a, s, O - Removed
strpos( 'adObis', $token )
since that will be caught by not matching any of the cases in the switch statement - Doesn't false-positive against
s:4:test;
(missing quotes around test)
Results from some quick profiling using xdebug across 10 front page loads of a test site (measuring the 8610 calls to is_serialized):
14429.10.diff: 181555 microseconds 14429-strpbrk.9.patch: 189115 microseconds trunk (strpbrk): 280283 microseconds
Though there are still some false-negatives (e.g. serialize(2.1E+200), #9930) and false-positives (e.g. b:5;
).
The only real way around this (as far as I can see) is by actually making use of unserialize in the test. An interesting approach I've seen which allows you to do this and not have to run unserialize twice (once for the test and again if it's serialized) is here. Profiling a modified version of the code from that gist gave the best results in this particular test, 147125 microseconds, but it would be much slower over, for example, a million iterations of a small serialized object. However, the first test is probably a better real world example and if used right that doesn't matter since you're getting the unserialized result back too.
#23
@
14 years ago
Thanks for the detailed feedback and the additional measurements. I think sites can benefit from that, as is_serialized is somehow a weak point.
for the false positives as in #9930 I've added some more info there, too. It's another regex that is checking for proper float/double php serializations, the bools are easy to check being 0 or 1, but I have not provided any feedback to that.
I dunno if these false positives are a real bummer or not. It's somehow specific.
As duck_ suggested to do the actual serialization as an additional check at the end, this implies to change the code where is_serialized() is actually used to have a profit from that. I don't see an easy way to do that. Probably making use of php exceptions is one way to do so, but this is not an option right now because of PHP 4 compability. And even if we switch to PHP 5.2 I think there is no coding practice how to deal with exceptions if not at all in the project to easily decide on this. So this is in the end a design issue which tackles some general decisions.
See also #9930 for simple optimizations and fixes that never got committed.