Opened 23 months ago

Last modified 6 weeks ago

#18007 new defect (bug)

is_serialized trouble for multibytes encoding

Reported by: challet Owned by:
Priority: normal Milestone: Future Release
Component: Charset Version: 3.2
Severity: normal Keywords: has-patch 3.6-early
Cc: kurtpayne

Description

Possible bug in the is_serialized function, in a multibytes environment :

Context

Retrieving an option from the database, the get_option function try to deserialize it.

Bug scenario

  • the strlen call is actually a mb_strlen one
  • the $length var is the number of characters (not bytes)
  • using $data[$length-1] don't retrieve the last char but one before, it's actual position is undefined, depending on the other content of the string
  • this $lastc is not ; or }
  • the string is understood has not serialized

Patch proposal

here is a fix i made to make it work with multibytes string :
Not tested it for single bytes, but there is no reason for it to not work.

--- a/wp-includes/functions.php
+++ b/wp-includes/functions.php
@@ -256,7 +256,7 @@ function is_serialized( $data ) {
                return false;
        if ( ':' !== $data[1] )
                return false;
-       $lastc = $data[$length-1];
+       $lastc = substr($data, -1);
        if ( ';' !== $lastc && '}' !== $lastc )
                return false;
        $token = $data[0];

Attachments (10)

18007.patch (417 bytes) - added by SergeyBiryukov 22 months ago.
18007_incomplete_unit_test.patch (3.1 KB) - added by kurtpayne 8 months ago.
18007.2.patch (1.2 KB) - added by SergeyBiryukov 8 months ago.
18007.2.alt.patch (647 bytes) - added by SergeyBiryukov 8 months ago.
18007.3.patch (1.9 KB) - added by SergeyBiryukov 8 months ago.
18007.unit-test.patch (2.0 KB) - added by SergeyBiryukov 8 months ago.
18007.unit-test.2.patch (2.0 KB) - added by SergeyBiryukov 8 months ago.
More accurate check for mbstring.func_overload
18007.4.patch (1.4 KB) - added by SergeyBiryukov 8 months ago.
18007.5.patch (835 bytes) - added by SergeyBiryukov 8 months ago.
18007.5.alt.patch (921 bytes) - added by SergeyBiryukov 8 months ago.

Download all attachments as: .zip

Change History (21)

  • Keywords has-patch added

Turned challet's proposal into a patch.

After applying this fix on a multi-network installation (thousands of sites), the problem we encountered with multibyte characters in text widgets (among others) was resolved. It has been running fine for about a month now.

This bug still reproduced. WordPress 3.3.1

  • Keywords needs-unit-tests added

comment:5 follow-up: ↓ 6   kurtpayne8 months ago

  • Cc kurtpayne added

I'm trying to come up with a test for this, but I'm hitting some roadblocks.

I can't reproduce the issue. Is it still present in php 5.3?

I am doing the following:

  • ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • INSERT INTO wp_options (option_name, option_value, autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
  • get_option( 'test_18007_option' );

This returns the correct value ... an empty array.

Also, mbstring.func_overload is a PHP_INI_PERDIR | PHP_INI_SYSTEM level configuration item for php 4.3 through php 5.2.6. This is a little tricky since this has to be set with php -d or in php.ini (or skip the test).

comment:6 in reply to: ↑ 5   challet8 months ago

Replying to kurtpayne:

I am doing the following:

  • ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • INSERT INTO wp_options (option_name, option_value, autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
  • get_option( 'test_18007_option' );

This returns the correct value ... an empty array.

You should try something like :

`INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('test_18007_option', 'a:0:{"text": "aéùp"}', 'yes');`

The shift between the length (in bytes) and the length (in charcters number) happens when there is indeed a character using multiple bytes.

This is a little tricky

It is :)

The problem is by using $length = strlen( $data ); and $lastc = $data[$length-1]; here,
we're not getting the last char because $data[$anynumber] will retrieve the $anynumber byte, not char.
While the purpose to use $lastc = substr($data, -1); is to have exactly the last char, even whith multibytes overloading on.

Last edited 8 months ago by challet (previous) (diff)

comment:7 follow-up: ↓ 8   kurtpayne8 months ago

I still can't recreate this. I've downgraded to php 5.2.17, 5.2.9, and 5.2.6 and tried this, but I can't recreate the original bug.

I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:

php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml

comment:8 in reply to: ↑ 7   SergeyBiryukov8 months ago

I can reproduce the original bug.

18007.2.patch is a more complete patch using substr().

18007.2.alt.patch is another take using count( str_split() ) instead of strlen().

Unfortunately, a quick test with 3 000 000 iterations of is_serialized() showed that substr() is about 2 times slower than direct access, and count( str_split() ) is 3 times slower than strlen():

18007.3.patch is an attempt to only use substr() if mbstring.func_overload is enabled, and to keep decent performance otherwise:

We have a similar check in streams.php:
http://core.trac.wordpress.org/browser/tags/3.4.2/wp-includes/pomo/streams.php#L17

Replying to kurtpayne:

I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:

php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml

Note that mbstring.func_overload=4 only overloads regular expression functions. For string functions, you need mbstring.func_overload=2.

18007.unit-test.patch is the shortened and revised unit test.

Last edited 8 months ago by SergeyBiryukov (previous) (diff)
  • Keywords 3.6-early added; needs-unit-tests removed
  • Milestone changed from Awaiting Review to Future Release

Original bug confirmed using mbstring.func_overload = 2

18007.3.patch looks good and in line with the precedent set in pomo/streams.php. I did not run my own benchmarks; I trust the numbers provided.

18007.unit-test.patch fails without patch, passes with patch.

More accurate check for mbstring.func_overload

More takes in 18007.4.patch, 18007.5.patch and 18007.5.alt.patch.

I didn't test all the cases yesterday, here's the full table. Increased the number of iterations to 10,000,000 to make the results more accurate.

mbstring.func_overload offmbstring.func_overload on
62,783s — unpatched36,596s — unpatched
63,268s — 18007.5.patch67,948s — 18007.5.patch
64,873s — 18007.2.patch75,975s — 18007.2.patch
71,176s — 18007.4.patch76,906s — 18007.5.alt.patch
73,531s — 18007.3.patch86,326s — 18007.4.patch
76,352s — 18007.5.alt.patch88,115s — 18007.3.patch
99,088s — 18007.2.alt.patch100,632s — 18007.2.alt.patch

It turned out that 18007.2.patch is actually faster than 18007.3.patch in both cases. 18007.5.patch shows the best performance, but probably looks less obvious.

Not sure why the unpatched results differ by 75%, but the difference is consistent after several takes.

Note: See TracTickets for help on using tickets.