#32165 closed defect (bug) (fixed)
wp-db.php destructs all the multibyte characters
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | 4.2.3 | Priority: | high |
Severity: | blocker | Version: | 4.1.2 |
Component: | Database | Keywords: | fixed-major |
Focuses: | Cc: |
Description
Many of the users in Japan reported that they lost their post data from the malfunction of WordPress 4.1.2 or later (4.1.3, 4.2 and 4.2.1 included). Some of them are forced to refrain from upgrading WordPress.
I was informed of MySQL settings from one of reporters, reproduced the environment and made sure of the cause, which might cause the data destruction even in the environment other than Japanese language. I'd like you to fix it and strongly recommend you to release a patched version as soon as possible.
This is what's really going on. As you know, WordPress works in a UTF-8 enviroment. It gets the text input data encoded in UTF-8, such as posts, comments and all. Our reporters' settings are the same. But their MySQL server settings are not. Theirs are like below:
MySQL server setings:
- character_set_client: utf8
- character_set_connection: utf8
- character_set_server: utf8
MySQL database and table settings:
- WordPress default database character set: ujis
- WordPress tables character set: ujis
- Collation of both: ujis_japanese_ci
This enviromnet appears to grow historically. That is to say, they built their sites when MySQL didn't satisfactedly support utf8. They adopted ujis to database character set, which was the only option available. While repeadedly updated, MySQL has become a modern database and begun to fully support utf8. Most of the hosting companies updated MySQL with the newly written my.cnf. Server settings are changed, but old veterans' databases and tables filled with data left untouched.
Under this given condition, wp-db.php in newly released WordPress tries to manipulate text data like below:
If character set of the table is ujis, it fires mb_convert_encoding() with this character set. This is the line.
$value['value'] = mb_convert_encoding($value['value'], $mb_charsets[$charset], $mb_charset[$charset]);
This destructs users' posts data into unrecoverable garbage. $mb_charsets[$charset]
comes from database settings, But $value['value']
is encoded with UTF-8. This line does the same thing as below to UTF-8 encoded string.
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'EUC-JP');
As a result, all the characters except ascii will suffer irreversible conversion. I think this is not what you want to do. You might think we'd better change the line to this.
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP', 'UTF-8');
or
$value['value'] = mb_convert_encoding($value['value'], 'EUC-JP');
It appears to be better. But NO. These change don't work and cause the same destruction. The reason is that the connection between WordPress and MySQL uses UTF-8 character set and input/outpu stream line is encoded in UTF-8. In this case, WordPress forces EUC-JP encoded data into the UTF-8 stream, which data MySQL can't interpret correctly (it changes all the multibyte characters to '?').
If database/table character set is different from the one used in the server-client connection, you've got to use the latter one, which is required by TCP/IP and UNIX socket connection. Otherwise, you'll put the data integration in danger and prevent MySQL database engine from converting them appropriately.
So here is my patch.
--- wp-db_orig.php 2015-04-28 12:16:52.037000000 +0900 +++ wp-db.php 2015-04-28 22:13:20.465000000 +0900 @@ -2590,7 +2590,11 @@ $db_check_string = false; foreach ( $data as &$value ) { - $charset = $value['charset']; + if ($value['charset'] !== $this->charset) { + $charset = $this->charset; + } else { + $charset = $value['charset']; + } // Column isn't a string, or is latin1, which will will happily store anything. if ( false === $charset || 'latin1' === $charset ) {
Attachments (5)
Change History (57)
#3
@
10 years ago
Haven't tested the patch, but wouldn't a single $charset = $this->charset;
instead of $charset = $value['charset'];
do the same? Or am I missing something?
edit: sorry for the git prefix in the patch.. I thought I had disabled that.
This ticket was mentioned in Slack in #core by voilamedia. View the logs.
10 years ago
#6
@
10 years ago
@kjmtsh thanks for the detailed bug report.
If I understand it right, the problem is in strip_invalid_text()
failing to consider that the database connection encoding is different than the table/column encoding. This results in wrong "from" and "to" args in mb_convert_encoding()
as the string is currently encoded same as the database connection encoding rather than the table encoding.
I'm not sure if the proposed patch is an acceptable fix. We need to detect these cases better, @pento would know more :)
#7
@
10 years ago
@azaozz Thank you for commenting.
To test this case, you have to newly create database whose character set is different from MySQL server settings. Simply to convert utf8 character set table to ujis character set may not work and destroy the stored data.
If a user's database connectin character set is different from database or table character set, we have no options which to take. Data flow is like below.
input: WordPress -(utf8)-> database engine -(ujis)-> each table output: each table -(ujis)-> database engine -(utf8)-> WordPress
If the constant DB_CHARSET defined in wp-config.php is set to utf8, the connection from WordPress to database engine is encoded in UTF-8. But when database engine writes to the table, it uses table character set (ujis in our case). WordPress must not give EUC-JP encoded string to database engine. When this happens, users' data will be destroyed at this stage.
Outputs are no problem. But there might be a problem for inputs. UTF-8 has much more characters than EUC-JP. So if input data contains some characters imcompatible with EUC-JP, MySQL change those characters to '?' (question mark). But we can do nothing for that.
strip_invalid_text()
tries to remove the incompatible characters with table/column character set. It will succeed when and only when database connection character set and table/column character set are the same. Otherwise fails. In this situation, we can only use database connection character set. We must not use table/column character set.
If we convert the data from UTF-8 to EUC-JP, and reconvert them from EUC-JP to UTF-8? No. For example, take some UTF-8 encoded Korean characters incompatible with EUC-JP Japanese character set.
$input = mb_convert_encoding("한글", 'EUC-JP', 'UTF-8'); $input = mb_convert_encoding($input, 'UTF-8', 'EUC-JP'); echo $input;
We only get '??' (double question marks). The first conversion change the characters to '??' and the second one does nothing. What is worse, two Korean characters are not removed. This is not the result this function tries to get. I don't think it's not a good idea to use mb_convert_encoding() here, because this function is just for conversion, not for removal of the characters.
We can't have users set DB_CHARSET to ujis, because MySQL server settings requires utf8 (many of them don't have the previlege to change server settings, either).
Possible solutions:
- Under such condition, we give up using
strip_invalid_text()
function. My patch does this. - Remove local conversion methods from wp-db.php.
- Do nothing.
Possible problems:
- Even if we don't use
strip_invalid_text()
, there remains possible danger that the characters incompatible with EUC-JP will send to MySQL. Most of the users in Japanese language probably use only Japanese. This is why they have few problems until now. The change in wp-db.php only brings the virtual problem to the real one.
It appears to me that if we want to avoid possible problem, we have no options except to recommend such users to convert their database/table/column character set to utf8. I know this operation is not easy, especially for non-programmers or programmers not aquainted with MySQL.
I think some day WordPress will inevitably stop supporting other character set than UTF-8. Now is the time? Or it is the beginning of the end?
- Even if users' database character set is utf8,
strip_invalid_text()
has another possible problem I didn't mention in the above ticket. It checks UTF-8 character with the PHP's regular expression, which means that posts, comments, pingback etc go through this function. In this case, what if WordPress get pingback in some European language (like German or French) encoded in ISO-8859-1? German umlaut or French accent ague is silently removed. This will always happen when WordPress gets non UTF-8 encoded characters.
This is the same as we say WordPress doesn't support any other character set than utf8.
- If we take this solution, we have to let all the users know about it that WordPress will only support UTF-8 character set for the database, because otherwise we can't ensure the safety of the input data. This is the easiest way. We don't have to care about other character sets. But users are in difficult situation.
I recommend to take number 1 solution for the time being, and start to prepare to stop supporting any other character set than utf8.
Thank you.
#10
@
10 years ago
- Owner set to pento
- Status changed from new to assigned
Could @kjmtsh, and anyone else seeing this problem, please confirm if trunk or the 4.2 branch fixes this for them?
#32
@
10 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
I tested wp-db.php in trunk checked out today.
No, I'm afraid the problem is not yet resolved. The same malfunction. Can't post a new article and update of the post causes data loss.
I'm reading the code and thinking... Probably, character set given to CONVERT()
statement may not be good, which is different from the one WordPress expects. And database connection character set change is not good, either. I'm going to report again as soon as possible. Give me a little more time, please. Thank you for your help, all.
#36
follow-up:
↓ 40
@
10 years ago
I think I detected the cause of some malfunction on some specific environment.
1. Integer overflow
$wpdb::prepare()
method uses vsprintf()
function. We are using this method in strip_invalid_text()
to create a SQL statement to truncate the input data according to the target character fields. When this method checks wp_posts.post_content data, it sets the length 4294967295 which comes from the fact that the data type of the post_content field is LONGTEXT.
On 32 bit PHP system, the maximum integer bytes (PHP_INT_MAX) are 2147483647. This causes integer overflow. And, as a result, we get a queer SQL statement. Example:
$this->prepare( "CONVERT( LEFT( CONVERT( %s USING binary ), %d ) USING {$value['charset']} )", $value['value'], $value['length']['length'] );
This code creats the statement like below for the post_content field input.
CONVERT( LEFT( CONVERT( 'Some post content data' USING binary ), -1 ) USING ujis)
-1 is the same as 0. This statement removes all the content data. It doesn't happen on the 64 bit PHP implementation, which has the maximum integer 9223372036854775807.
PHP official document says: if integer value overflows, PHP will change it to float type. But it doesn't work, at least on my environment (Windows 32 bit and Ubuntu Linux 32 bit). I don't know if all the 32 bit system behaves the same or not. Anyway, we should change this format placeholder from %d to %f. I couldn't find in MySQL documents if LEFT() function can take float number. So I experimented it and made sure it does work fine.
2. ujis character problem
The ujis database/table/column has another problem. In strip_invalid_text()
method defined in wp-db.php, $value['db']
and $db_check_string
variables are set to true. And then we go into the if ($db_check_string)
block. This block causes a few problem.
A. SQL statement
When we try to write a new post, WordPress tries to make an auto draft entitled 'auto draft.' Our environment is in UTF-8, and all the WordPress PHP and Javascript files are encoded in UTF-8. So all of our input data are encoded in UTF-8, never EUC-JP. I checked SELECT CONVERT etc input on the commandline. In Japanese, auto draft is translated into '自動下書き'.
mysql> charset utf8 SELECT HEX( '自動下書き' ); => E887AAE58B95E4B88BE69BB8E3818D SELECT CONVERT( '自動下書き' USING binary ); => OK. We get E887AAE58B95E4B88BE69BB8E3818D as expected. SELECT LEFT( CONVERT( '自動下書き' USING binary ), 65535 ); => OK. We still get E887AAE58B95E4B88BE69BB8E3818D as expected. SELECT CONVERT( LEFT( CONVERT( '自動下書き' USING binary ), 65535 ) USING ujis ); => NG. We get nothing unexpectedly. SELECT CONVERT( LEFT( CONVERT( '自動下書き' USING binary ), 65535 ) USING utf8 ); => OK. We get E887AAE58B95E4B88BE69BB8E3818D again as expected.
This is our environment. WordPress always fails to instantiate an object, and PHP complains 'Trying to get property of non-object' from all the subsequent methods. And we couldn't use post new admin page. So we must not use
USING {$value['charset']}
We must use this instead.
USING {$this->charset}
For comparison, ujis example.
mysql> charset ujis SELECT HEX('自動下書き'); => BCABC6B0B2BCBDF1A4AD SELECT CONVERT( LEFT( CONVERT( '自動下書き' USING binary), 65535) USING ujis); => OK. We get BCABC6B0B2BCBDF1A4AD as expected. SELECT CONVERT( LEFT( CONVERT( '自動下書き' USING binary), 65535) USING utf8); => NG. We get nothing at all.
This is trivial from the experiment above. BUT this is NOT our environment.
B. Database connection character set
As I explained in the above report, we have ujis table/column but the data transmission is in UTF-8. This is from users' settings, namely 'DB_CHARSET' value in wp-config.php. I think we should not change users' decision to another without notice. If we don't inform them the change, it will make the problem detection very difficult or impossible, just like our case. Even if they ask in the forum reporting their settings precisely with all the plugins deactivated and default template used, no one ever knows from their information where the problem is. So the next lines in strip_invalid_text()
method.
if ( $charset !== $connection_charset ) { $connection_charset = $charset; $this->set_charset( $this->dbh, $charset ); }
Changing connection character set is always wrong unless users explicitly want to change it. This code block makes the connection from WordPress to MySQL broken. Nothing returns from MySQL and WordPress fails to make the object etc...
Is this character set change necessary on some environments? I don't know the situation that requires this change. My suggestion is this.
We should remove above if block, and at the same time another if block below, which is required from the above block, and leave the database connection character set to users' choice according to their server settings.
// Don't forget to change the charset back! if ( $connection_charset !== $this->charset ) { $this->set_charset( $this->dbh ); }
Number 1 and Number 2 fixes make WordPress work fine on both ujis environment and on other environments. There's no patch, because fixes are simple enough. I hope this will help you. Thanks.
#40
in reply to:
↑ 36
@
10 years ago
32165.3.diff is the first pass at solving the second problem described in comment:36.
To provide some context - the functionality to switch character sets was added because of some edge cases we found of queries doing strange things with mixed character sets. In the process, however, I suspect we lost sight of the woods for all of the trees, breaking the much more common and legitimate use case of simply taking non-utf8 strings from user input, and allowing MySQL to convert them to utf8.
I think it's reasonable for us to remove this (it was a little ridiculous of us to allow for changing character sets in a single query).
@nacin, @mdawaffe, @nbachiyski: As my co-conspirators on this functionality, your thoughts would be greatly appreciated (in the next day or two, please - this needs to be fixed in 4.2.3).
#42
@
10 years ago
The character set switching was put in place because of some bugs we found when testing the patch on WordPress.org, but I think we fixed it incorrectly - we decided to check by column character set, when we really should've been using the existing character set, and allowing MySQL to do the conversion.
This more accurately reflects how queries are run - if some code needs to change the character set, it'll change using $wpdb->set_charset()
, then run the query.
Because we're no longer switching character set for the sanity checks, this will also be fine if someone decides to use mysqli_set_charset()
, instead.
#43
@
10 years ago
The patch looks mostly good, at least as far as I understood things, it shouldn't break anything.
My biggest issue is that there isn't a test that breaks unless the wp-db.php
portion of the patch is applied. This way we don't have any regression safety net.
Two other tiny things:
- Why did we move
$this->check_current_query = false;
to outside of the loop? - Is the comment
// Split the CONVERT() calls by charset, so we can make sure the connection is right
still accurate?
#44
@
10 years ago
I realised that I was on the right track, but I was missing handling for the specific case mentioned in comment:36.
This particular case has UTF-8 input, DB_CHARSET
set to utf8
, but the table columns are ujis
. MySQL can handle the conversion when data is inserted, but our sanity checks weren't allowing for this case.
32165.5.diff also fixes up the couple of tiny things that @nbachiyski mentioned, and adds unit tests that fail when the wp-db.php
part of the patch is reverted.
#46
@
10 years ago
- Keywords fixed-major added
- Resolution fixed deleted
- Status changed from closed to reopened
#48
@
10 years ago
Is it possible that this bug has resurfaced after the 4.2.3 security update?
A WP site lost all it's posts and user-data. settings seem unaffected.
Related: #32090, #32104, #32136.