Opened 13 years ago
Closed 10 years ago
#18007 closed defect (bug) (fixed)
is_serialized trouble for multibytes encoding
Reported by: | challet | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.9 | Priority: | normal |
Severity: | normal | Version: | 3.2 |
Component: | Charset | Keywords: | has-patch |
Focuses: | Cc: |
Description
Possible bug in the is_serialized function, in a multibytes environment :
Context
- utf-8 encoded database fields
- auto overlaoding singlebyte functions from conf (see http://www.php.net/manual/en/mbstring.overload.php)
Retrieving an option from the database, the get_option function try to deserialize it.
Bug scenario
- the strlen call is actually a mb_strlen one
- the $length var is the number of characters (not bytes)
- using
$data[$length-1]
don't retrieve the last char but one before, it's actual position is undefined, depending on the other content of the string - this $lastc is not ; or }
- the string is understood has not serialized
Patch proposal
here is a fix i made to make it work with multibytes string :
Not tested it for single bytes, but there is no reason for it to not work.
--- a/wp-includes/functions.php +++ b/wp-includes/functions.php @@ -256,7 +256,7 @@ function is_serialized( $data ) { return false; if ( ':' !== $data[1] ) return false; - $lastc = $data[$length-1]; + $lastc = substr($data, -1); if ( ';' !== $lastc && '}' !== $lastc ) return false; $token = $data[0];
Attachments (12)
Change History (30)
#2
@
13 years ago
After applying this fix on a multi-network installation (thousands of sites), the problem we encountered with multibyte characters in text widgets (among others) was resolved. It has been running fine for about a month now.
#5
follow-up:
↓ 6
@
12 years ago
- Cc kurtpayne added
I'm trying to come up with a test for this, but I'm hitting some roadblocks.
I can't reproduce the issue. Is it still present in php 5.3?
I am doing the following:
ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
INSERT INTO
wp_options(
option_name,
option_value,
autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
get_option( 'test_18007_option' );
This returns the correct value ... an empty array.
Also, mbstring.func_overload is a PHP_INI_PERDIR | PHP_INI_SYSTEM level configuration item for php 4.3 through php 5.2.6. This is a little tricky since this has to be set with php -d
or in php.ini (or skip the test).
#6
in reply to:
↑ 5
@
12 years ago
Replying to kurtpayne:
I am doing the following:
ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
INSERT INTO
wp_options(
option_name,
option_value,
autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
get_option( 'test_18007_option' );
This returns the correct value ... an empty array.
You should try something like :
`INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('test_18007_option', 'a:0:{"text": "aéùp"}', 'yes');`
The shift between the length (in bytes) and the length (in charcters number) happens when there is indeed a character using multiple bytes.
This is a little tricky
It is :)
The problem is by using $length = strlen( $data );
and $lastc = $data[$length-1];
here,
we're not getting the last char because $data[$anynumber]
will retrieve the $anynumber byte, not char.
While the purpose to use $lastc = substr($data, -1);
is to have exactly the last char, even whith multibytes overloading on.
#7
follow-up:
↓ 8
@
12 years ago
I still can't recreate this. I've downgraded to php 5.2.17, 5.2.9, and 5.2.6 and tried this, but I can't recreate the original bug.
I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:
php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml
#8
in reply to:
↑ 7
@
12 years ago
I can reproduce the original bug.
18007.2.patch is a more complete patch using substr()
.
18007.2.alt.patch is another take using count( str_split() )
instead of strlen()
.
Unfortunately, a quick test with 3 000 000 iterations of is_serialized()
showed that substr()
is about 2 times slower than direct access, and count( str_split() )
is 3 times slower than strlen()
:
- 11,617s — unpatched
- 24,617s — 18007.2.patch
- 33,553s — 18007.2.alt.patch
18007.3.patch is an attempt to only use substr()
if mbstring.func_overload
is enabled, and to keep decent performance otherwise:
- 12,503s — 18007.3.patch,
mbstring.func_overload
is off - 26,384s — 18007.3.patch,
mbstring.func_overload
is on
We have a similar check in streams.php
:
http://core.trac.wordpress.org/browser/tags/3.4.2/wp-includes/pomo/streams.php#L17
Replying to kurtpayne:
I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:
php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml
Note that mbstring.func_overload=4
only overloads regular expression functions. For string functions, you need mbstring.func_overload=2
.
18007.unit-test.patch is the shortened and revised unit test.
#9
@
12 years ago
- Keywords 3.6-early added; needs-unit-tests removed
- Milestone changed from Awaiting Review to Future Release
Original bug confirmed using mbstring.func_overload = 2
18007.3.patch looks good and in line with the precedent set in pomo/streams.php. I did not run my own benchmarks; I trust the numbers provided.
18007.unit-test.patch fails without patch, passes with patch.
#10
@
12 years ago
More takes in 18007.4.patch, 18007.5.patch and 18007.5.alt.patch.
I didn't test all the cases yesterday, here's the full table. Increased the number of iterations to 10,000,000 to make the results more accurate.
mbstring.func_overload off | mbstring.func_overload on
|
62,783s — unpatched | 36,596s — unpatched |
63,268s — 18007.5.patch | 67,948s — 18007.5.patch |
64,873s — 18007.2.patch | 75,975s — 18007.2.patch |
71,176s — 18007.4.patch | 76,906s — 18007.5.alt.patch |
73,531s — 18007.3.patch | 86,326s — 18007.4.patch |
76,352s — 18007.5.alt.patch | 88,115s — 18007.3.patch |
96,569s — 18007.6.patch | 100,632s — 18007.2.alt.patch |
99,088s — 18007.2.alt.patch | 453,753s — 18007.6.patch |
It turned out that 18007.2.patch is actually faster than 18007.3.patch in both cases. 18007.5.patch shows the best performance, but probably looks less obvious.
Not sure why the unpatched results differ by 75%, but the difference is consistent after several takes.
#13
follow-up:
↓ 15
@
11 years ago
[17592] offers another approach using mb_internal_encoding()
, need to include it in the performance tests.
#14
@
11 years ago
- Milestone changed from Future Release to 3.9
A simple way to reproduce the issue:
- Set
mbstring.func_overload = 2
inphp.ini
. - Enter a widget title in Cyrillic and save it.
- The widget shows a notice (see tags/3.8.1/src/wp-includes/widgets.php#L262):
Notice: Undefined offset: 2 in wp-includes/widgets.php on line 262
- The widget disappears on front-end. After a page refresh, it disappears in the admin too.
18007.2.patch seems like the way to go here. Let's refresh the unit tests if necessary and get this in.
#15
in reply to:
↑ 13
@
11 years ago
Replying to SergeyBiryukov:
[17592] offers another approach using
mb_internal_encoding()
, need to include it in the performance tests.
Implemented in 18007.6.patch, see the updated table in comment:10.
It turned out to be too slow, so 18007.2.patch is still the way to go.
Turned challet's proposal into a patch.