Make WordPress Core

Opened 8 years ago

Closed 8 years ago

#37689 closed defect (bug) (fixed)

Issues with utf8mb4 collation and the 4.6 update

Reported by: Hristo Sg (hristo-sg) Owned by: pento
Milestone: 4.6.1 Priority: normal
Severity: normal Version: 4.6
Component: Database Keywords: has-patch fixed-major
Focuses: performance Cc:

Description

If you have a pre-4.6 WP install with charset configured in the wp-config.php file and set to utf8mb4:

define('DB_CHARSET', 'utf8mb4');

After the update, all site symbols including those in the options table are converted into incorrect characters.

If you comment out the line:

#define('DB_CHARSET', 'utf8mb4');

The website starts showing characters correctly.

Attachments (1)

37689.diff (1.6 KB) - added by pento 8 years ago.

Change History (22)

#2 @swissspidy
8 years ago

IIRC leaving the default at define('DB_CHARSET', 'utf8'); works best as WordPress will automatically convert to utf8mb4 if possible. But of course that's not "the" solution.

#3 @danielkanchev
8 years ago

Yes, the nasty part is that I suspect everyone who has defined the charset to be utf8mb4 may see a broken site after the update to 4.6.

If you need more information about the site @Hristo Sg mentioned, I can provide the exact MySQL version, MySQL client version, PHP version, etc.

#4 @ocean90
8 years ago

  • Keywords reporter-feedback added

@hristo-sg Can you provide some details about your PHP and MySQL (client) versions? What's the current charset/collation of your tables?

#5 @danielkanchev
8 years ago

@ocean90 here is the requested information:

MySQL Server version:

Server version: 5.6.28-76.1-log Percona Server (GPL), Release 76.1, Revision 5759e76

MySQL client version:

mysql --version
mysql Ver 14.14 Distrib 5.6.27-75.0, for Linux (x86_64) using 5.1

PHP Details:

http://pandjarov.com/updatetest/info.php

Table Collation before the upgrade:

utf8mb4_unicode_ci

Table Collation after the upgrade:

utf8mb4_unicode_ci

So the issue is that before the upgrade the site worked as expected, and after the upgrade all the text became gibberish.

#6 @ocean90
8 years ago

@danielkanchev Thanks, this could be related to #37683 and the change in [37601]. Do you have a test install where you can test if a revert of [37601] will fix the issue?

#7 @danielkanchev
8 years ago

I reverted [37601] and the issue was not resolved: the site still shows gibberish unless the define('DB_CHARSET', 'utf8mb4') line is commented out. Other than that, DB_COLLATE is indeed empty in wp-config.php.

#8 @ocean90
8 years ago

There are two other related changes: [37523] and [37521]. Maybe it's one of these?

#9 @danielkanchev
8 years ago

I tried reverting those two as well, but the issue remains. @ocean90 if you want, I can give you access to a test site which is experiencing this issue, or I can revert other changes as well.

#10 @ocean90
8 years ago

  • Keywords reporter-feedback removed

I'm afraid I'm out of ideas.

@pento any ideas what could cause this?

#11 @pento
8 years ago

The cause is strange and exciting interactions between character sets. :-)

@danielkanchev: Could you please DM me on Slack? My username there is "pento". I'd like to have a look at your test site.

#12 @danielkanchev
8 years ago

@pento Thanks for the help! I sent you a DM :)

#13 @jeremyfelt
8 years ago

Noting that this ticket may affect the approach on #37683, which is marked for 4.6.1. We need to determine if this should be moved to the 4.6.1 milestone as well.

@pento Do you have any more details?

This ticket was mentioned in Slack in #core by jeremyfelt. View the logs.


8 years ago

#15 @SGr33n
8 years ago

Hi people,
I had this issue on a website, and commenting out DB_CHARSET didn't work, because a plugin stopped working. Looking at the database, I found that the problem was in the stored data itself (so I assume it comes from the utf8mb4 conversion script). Here is how to fix the database, but please test this in a staging environment first; don't do it on a production website.

According to http://www.i18nqa.com/debug/utf8-debug.html, if you see characters like ù or á, the original charset was latin1. So first create a dump of the database: open a shell console on the server and run:

mysqldump -uUSERNAME -p --default-character-set=latin1 DATABASE_NAME > dump-latin1.sql
[enter your password]

Then you have to edit this file to make a small correction. If the file is big, editing it will take a lot of RAM or time:

nano dump-latin1.sql

Change

/*!40101 SET NAMES latin1 */;

to

/*!40101 SET NAMES utf8 */;

Save by pressing CTRL+X, then enter Y.
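For large dumps, the same one-line change could be made without opening the file in an editor. The following is a minimal sketch, assuming the dump contains the header line exactly as shown above; the function name `fix_set_names` and the output file name are hypothetical and not part of the original instructions. It works on raw bytes so the dump's contents are never decoded or re-encoded.

```python
OLD_HEADER = b"/*!40101 SET NAMES latin1 */;"
NEW_HEADER = b"/*!40101 SET NAMES utf8 */;"


def fix_set_names(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path, rewriting the SET NAMES header line.

    Only a line that consists of exactly the header is changed, so row
    data that happens to contain the same text is left alone.
    Returns the number of lines that were rewritten.
    """
    changed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:  # streams the file one line at a time
            if line.rstrip(b"\r\n") == OLD_HEADER:
                line = line.replace(OLD_HEADER, NEW_HEADER)
                changed += 1
            dst.write(line)
    return changed


if __name__ == "__main__":
    # Hypothetical file names, matching the dump created earlier.
    print(fix_set_names("dump-latin1.sql", "dump-fixed.sql"))
```

Because the file is read line by line, memory use is bounded by the longest line in the dump rather than by the size of the whole file. If you use this approach, restore from the rewritten file instead of the original.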

Now your dump is fixed and ready. I suggest restoring it into a database with a different name, so that the old one remains as a backup and can easily be restored, or at least adding a prefix to the existing tables.

Restore it with:

mysql -uUSERNAME -p DATABASE_NAME < dump-latin1.sql
[enter your password]

Your WordPress should now work as expected.

#17 @pento
8 years ago

  • Milestone changed from Awaiting Review to 4.6.1

Thank you @danielkanchev for the use of your server. :-)

The root cause of this problem in @danielkanchev's case was [37320], and PHP 5.3. While the site was on WordPress 4.5, it was using PHP 5.3, which doesn't support utf8mb4. Because DB_CHARSET was set to utf8mb4, wpdb::set_charset() was silently failing and falling back to the default server character set, latin1.

The upgrade to WordPress 4.6 included [37320], which sets the server side character set, but it assumes that the client side character set has been set correctly. This caused MySQL to take latin1 strings from the database and convert them to utf8 before sending them to PHP. PHP was treating them as latin1, however, hence the mojibake.
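A rough way to see why a charset mismatch like this produces mojibake is to model it in Python. The sketch below is a simplified assumption, not the exact sequence of conversions inside MySQL or PHP: it takes text whose bytes are really UTF-8, reads each byte as a separate latin1 character, and returns what would end up on screen. The function names are hypothetical, and Python's latin-1 codec stands in for MySQL's latin1, which differs for a few byte values.

```python
def simulate_mojibake(text: str) -> str:
    """Return what `text` looks like when its UTF-8 bytes are read as latin1."""
    utf8_bytes = text.encode("utf-8")   # the bytes actually stored
    return utf8_bytes.decode("latin-1")  # each byte treated as one latin1 character


def repair_mojibake(garbled: str) -> str:
    """Undo simulate_mojibake() by reversing the two steps."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    garbled = simulate_mojibake("ù")
    print(garbled)                    # two characters instead of one
    print(repair_mojibake(garbled))   # back to the original character
```

Each non-ASCII character turns into two or more characters, which matches the kind of gibberish reported above. The repair only works when every byte survived the round trip, which is an assumption of this model.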

I think we could reasonably check the result of the mysqli_set_charset() before running the SET NAMES query, as it's better to try and use the server default character sets for everything if part of the process fails.
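The control flow proposed here can be sketched as follows. This is a language-neutral model written in Python, not the contents of 37689.diff: `FakeConnection` is a hypothetical stand-in whose `set_charset()` plays the role of mysqli_set_charset(), and the fallback to latin1 is an assumption taken from the case described above.

```python
class FakeConnection:
    """Hypothetical stand-in for a database handle; not a real driver API."""

    def __init__(self, supported_charsets):
        self.supported = set(supported_charsets)
        self.client_charset = "latin1"  # assumed server default
        self.queries = []

    def set_charset(self, charset: str) -> bool:
        # Plays the role of mysqli_set_charset(): False means "not supported".
        if charset not in self.supported:
            return False
        self.client_charset = charset
        return True

    def query(self, sql: str) -> None:
        self.queries.append(sql)


def configure_charset(conn: FakeConnection, charset: str) -> bool:
    """Run SET NAMES only when the client side charset was really set."""
    if conn.set_charset(charset):
        conn.query(f"SET NAMES {charset}")
        return True
    # Client could not switch: leave both sides on the server default.
    return False


if __name__ == "__main__":
    old_client = FakeConnection({"latin1", "utf8"})
    print(configure_charset(old_client, "utf8mb4"), old_client.queries)
```

The point of the check is that both sides stay on the same character set when the client cannot switch, rather than only the server side changing.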

#18 @pento
8 years ago

  • Keywords has-patch added

@danielkanchev: Could I get you to test 37689.diff with WordPress 4.6 and PHP 5.3, with DB_CHARSET set to utf8mb4?

#19 @danielkanchev
8 years ago

@pento I tested the provided patch and everything works with WP 4.6 + PHP 5.3 and DB_CHARSET set to utf8mb4 on the test site.

#20 @pento
8 years ago

  • Owner set to pento
  • Resolution set to fixed
  • Status changed from new to closed

In 38441:

Database: Don't force an unsupported character set that previously would've silently failed.

[37320] corrected some behaviour in how PHP and MySQL character sets are matched up. This was correct, but had the side effect of causing some incorrectly configured sites to start failing.

Prior to [37320], if DB_CHARSET was set to utf8mb4, but the PHP version didn't support utf8mb4, it would fall back to the default character set - usually latin1. After [37320], the SET NAMES query would force MySQL to treat the connection character set as utf8mb4, even if PHP wasn't able to understand it.

By checking if mysqli_set_charset() succeeded, we can simulate the old behaviour, while maintaining the fix in [37320].

Props danielkanchev for helping to diagnose this issue.
Fixes #37689 for trunk.

#21 @pento
8 years ago

  • Keywords fixed-major added
  • Resolution fixed deleted
  • Status changed from closed to reopened

#22 @pento
8 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 38442:

Database: Don't force an unsupported character set that previously would've silently failed.

[37320] corrected some behaviour in how PHP and MySQL character sets are matched up. This was correct, but had the side effect of causing some incorrectly configured sites to start failing.

Prior to [37320], if DB_CHARSET was set to utf8mb4, but the PHP version didn't support utf8mb4, it would fall back to the default character set - usually latin1. After [37320], the SET NAMES query would force MySQL to treat the connection character set as utf8mb4, even if PHP wasn't able to understand it.

By checking if mysqli_set_charset() succeeded, we can simulate the old behaviour, while maintaining the fix in [37320].

Merge of [38441] to the 4.6 branch.

Props danielkanchev for helping to diagnose this issue.
Fixes #37689.
