WordPress.org

Make WordPress Core

Opened 4 years ago

Closed 14 months ago

#18007 closed defect (bug) (fixed)

is_serialized trouble for multibytes encoding

Reported by: challet Owned by: nacin
Milestone: 3.9 Priority: normal
Severity: normal Version: 3.2
Component: Charset Keywords: has-patch
Focuses: Cc:

Description

Possible bug in the is_serialized function, in a multibytes environment :

Context

Retrieving an option from the database, the get_option function try to deserialize it.

Bug scenario

  • the strlen call is actually a mb_strlen one
  • the $length var is the number of characters (not bytes)
  • using $data[$length-1] don't retrieve the last char but one before, it's actual position is undefined, depending on the other content of the string
  • this $lastc is not ; or }
  • the string is understood has not serialized

Patch proposal

here is a fix i made to make it work with multibytes string :
Not tested it for single bytes, but there is no reason for it to not work.

--- a/wp-includes/functions.php
+++ b/wp-includes/functions.php
@@ -256,7 +256,7 @@ function is_serialized( $data ) {
                return false;
        if ( ':' !== $data[1] )
                return false;
-       $lastc = $data[$length-1];
+       $lastc = substr($data, -1);
        if ( ';' !== $lastc && '}' !== $lastc )
                return false;
        $token = $data[0];

Attachments (12)

18007.patch (417 bytes) - added by SergeyBiryukov 4 years ago.
18007_incomplete_unit_test.patch (3.1 KB) - added by kurtpayne 3 years ago.
18007.2.patch (1.2 KB) - added by SergeyBiryukov 3 years ago.
18007.2.alt.patch (647 bytes) - added by SergeyBiryukov 3 years ago.
18007.3.patch (1.9 KB) - added by SergeyBiryukov 3 years ago.
18007.unit-test.patch (2.0 KB) - added by SergeyBiryukov 3 years ago.
18007.unit-test.2.patch (2.0 KB) - added by SergeyBiryukov 3 years ago.
More accurate check for mbstring.func_overload
18007.4.patch (1.4 KB) - added by SergeyBiryukov 3 years ago.
18007.5.patch (835 bytes) - added by SergeyBiryukov 3 years ago.
18007.5.alt.patch (921 bytes) - added by SergeyBiryukov 3 years ago.
18007.6.patch (461 bytes) - added by SergeyBiryukov 14 months ago.
18007.2.refreshed.patch (1.9 KB) - added by SergeyBiryukov 14 months ago.

Download all attachments as: .zip

Change History (30)

@SergeyBiryukov4 years ago

comment:1 @SergeyBiryukov4 years ago

  • Keywords has-patch added

Turned challet's proposal into a patch.

comment:2 @puffythepirateboy4 years ago

After applying this fix on a multi-network installation (thousands of sites), the problem we encountered with multibyte characters in text widgets (among others) was resolved. It has been running fine for about a month now.

comment:3 @Londeren3 years ago

This bug still reproduced. WordPress 3.3.1

comment:4 @SergeyBiryukov3 years ago

  • Keywords needs-unit-tests added

comment:5 follow-up: @kurtpayne3 years ago

  • Cc kurtpayne added

I'm trying to come up with a test for this, but I'm hitting some roadblocks.

I can't reproduce the issue. Is it still present in php 5.3?

I am doing the following:

  • ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • INSERT INTO wp_options (option_name, option_value, autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
  • get_option( 'test_18007_option' );

This returns the correct value ... an empty array.

Also, mbstring.func_overload is a PHP_INI_PERDIR | PHP_INI_SYSTEM level configuration item for php 4.3 through php 5.2.6. This is a little tricky since this has to be set with php -d or in php.ini (or skip the test).

comment:6 in reply to: ↑ 5 @challet3 years ago

Replying to kurtpayne:

I am doing the following:

  • ALTER TABLE wp_options MODIFY option_value LONGTEXT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
  • INSERT INTO wp_options (option_name, option_value, autoload) VALUES ('test_18007_option', 'a:0:{}', 'yes');
  • get_option( 'test_18007_option' );

This returns the correct value ... an empty array.

You should try something like :

`INSERT INTO `wp_options` (`option_name`, `option_value`, `autoload`) VALUES ('test_18007_option', 'a:0:{"text": "aéùp"}', 'yes');`

The shift between the length (in bytes) and the length (in charcters number) happens when there is indeed a character using multiple bytes.

This is a little tricky

It is :)

The problem is by using $length = strlen( $data ); and $lastc = $data[$length-1]; here,
we're not getting the last char because $data[$anynumber] will retrieve the $anynumber byte, not char.
While the purpose to use $lastc = substr($data, -1); is to have exactly the last char, even whith multibytes overloading on.

Last edited 3 years ago by challet (previous) (diff)

comment:7 follow-up: @kurtpayne3 years ago

I still can't recreate this. I've downgraded to php 5.2.17, 5.2.9, and 5.2.6 and tried this, but I can't recreate the original bug.

I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:

php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml

@SergeyBiryukov3 years ago

@SergeyBiryukov3 years ago

comment:8 in reply to: ↑ 7 @SergeyBiryukov3 years ago

I can reproduce the original bug.

18007.2.patch is a more complete patch using substr().

18007.2.alt.patch is another take using count( str_split() ) instead of strlen().

Unfortunately, a quick test with 3 000 000 iterations of is_serialized() showed that substr() is about 2 times slower than direct access, and count( str_split() ) is 3 times slower than strlen():

18007.3.patch is an attempt to only use substr() if mbstring.func_overload is enabled, and to keep decent performance otherwise:

We have a similar check in streams.php:
http://core.trac.wordpress.org/browser/tags/3.4.2/wp-includes/pomo/streams.php#L17

Replying to kurtpayne:

I'm attaching the unit test I'm working on, but I'm stuck until I get some feedback. You have to apply it then run it as:

php -d mbstring.func_overload=4 /usr/bin/phpunit -c mbstring.xml

Note that mbstring.func_overload=4 only overloads regular expression functions. For string functions, you need mbstring.func_overload=2.

18007.unit-test.patch is the shortened and revised unit test.

Last edited 3 years ago by SergeyBiryukov (previous) (diff)

comment:9 @kurtpayne3 years ago

  • Keywords 3.6-early added; needs-unit-tests removed
  • Milestone changed from Awaiting Review to Future Release

Original bug confirmed using mbstring.func_overload = 2

18007.3.patch looks good and in line with the precedent set in pomo/streams.php. I did not run my own benchmarks; I trust the numbers provided.

18007.unit-test.patch fails without patch, passes with patch.

@SergeyBiryukov3 years ago

More accurate check for mbstring.func_overload

@SergeyBiryukov3 years ago

@SergeyBiryukov3 years ago

comment:10 @SergeyBiryukov3 years ago

More takes in 18007.4.patch, 18007.5.patch and 18007.5.alt.patch.

I didn't test all the cases yesterday, here's the full table. Increased the number of iterations to 10,000,000 to make the results more accurate.

mbstring.func_overload offmbstring.func_overload on
62,783s — unpatched36,596s — unpatched
63,268s — 18007.5.patch67,948s — 18007.5.patch
64,873s — 18007.2.patch75,975s — 18007.2.patch
71,176s — 18007.4.patch76,906s — 18007.5.alt.patch
73,531s — 18007.3.patch86,326s — 18007.4.patch
76,352s — 18007.5.alt.patch88,115s — 18007.3.patch
96,569s — 18007.6.patch100,632s — 18007.2.alt.patch
99,088s — 18007.2.alt.patch453,753s — 18007.6.patch

It turned out that 18007.2.patch is actually faster than 18007.3.patch in both cases. 18007.5.patch shows the best performance, but probably looks less obvious.

Not sure why the unpatched results differ by 75%, but the difference is consistent after several takes.

Last edited 14 months ago by SergeyBiryukov (previous) (diff)

comment:13 follow-up: @SergeyBiryukov21 months ago

[17592] offers another approach using mb_internal_encoding(), need to include it in the performance tests.

comment:14 @SergeyBiryukov14 months ago

  • Milestone changed from Future Release to 3.9

A simple way to reproduce the issue:

  1. Set mbstring.func_overload = 2 in php.ini.
  2. Enter a widget title in Cyrillic and save it.
  3. The widget shows a notice (see tags/3.8.1/src/wp-includes/widgets.php#L262):
    Notice: Undefined offset: 2 in wp-includes/widgets.php on line 262
    
  4. The widget disappears on front-end. After a page refresh, it disappears in the admin too.

18007.2.patch seems like the way to go here. Let's refresh the unit tests if necessary and get this in.

@SergeyBiryukov14 months ago

comment:15 in reply to: ↑ 13 @SergeyBiryukov14 months ago

Replying to SergeyBiryukov:

[17592] offers another approach using mb_internal_encoding(), need to include it in the performance tests.

Implemented in 18007.6.patch, see the updated table in comment:10.

It turned out to be too slow, so 18007.2.patch is still the way to go.

comment:16 @nacin14 months ago

  • Keywords needs-refresh added; 3.6-early removed

Patch needs refreshing.

comment:17 @SergeyBiryukov14 months ago

  • Keywords needs-refresh removed

Refreshed.

comment:18 @nacin14 months ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In 27565:

In is_serialized(), use substr() rather than array access, for compatibility with multibyte overloading.

props SergeyBiryukov.
fixes #18007.

Note: See TracTickets for help on using tickets.