#21212 closed task (blessed) (fixed)
MySQL tables should use utf8mb4 character set
Reported by: | pento | Owned by: | |
---|---|---|---|
Milestone: | 4.2 | Priority: | normal |
Severity: | normal | Version: | 3.4.1 |
Component: | Database | Keywords: | |
Focuses: | Cc: |
Description (last modified by )
Historically, the MySQL utf8 character set has only supported the first character plane of UTF-8. With MySQL 5.5.3, it now supports the entire character plane, using the utf8mb4 character set. This character set is 100% backwards compatible, and does not require more space than utf8 for characters that fall within the utf8 set, only using an extra byte for characters outside of the utf8 set.
References:
http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
Attachments (13)
Change History (104)
#2
@
12 years ago
If it's 100% compatible, it sounds like something that could be implemented directly in $wpdb - but that makes it more complicated for existing installations (As you'd need it prior to install?)
We can't just add it to the config file either, as we support db.php dropins for other database setups (Such as hyperdb) which may or may not, use mysql_*() functions.
#3
@
12 years ago
$wpdb would be fine for new installs, it may also be an option for existing installs, too.
According to the MySQL upgrade notes, feeding 4 byte characters into a utf8 column is not allowed. My testing showed, however, that it was just replaced with a '?' (the same as it always has) regardless of the character set of the input string.
So, if DB_CHARSET is set to utf8 and the MySQL version >= 5.5.3, wpdb::ini_charset() could just force it to utf8mb4.
For new installs, the same thing in wp-admin/includes/schema.php.
#5
@
12 years ago
New patch implements this functionality in $wpdb instead. It's in wpdb::set_charset() instead of wpdb::init_charset() because mysql_get_server_info() needs to happen after the connection is opened.
This shouldn't affect existing installs - if they have MySQL < 5.5.3, the code won't be run. If they have MySQL >= 5.5.3, my testing shows that feeding a utf8mb4 string into a utf8 column has the same behaviour as it always did - replacing 4 byte characters with a placeholder.
#10
@
12 years ago
Our unique key on slug in the terms table fails miserably with utf8mb4.
WordPress database error Specified key was too long; max key length is 767 bytes for query CREATE TABLE wp_terms ( term_id bigint(20) unsigned NOT NULL auto_increment, name varchar(200) NOT NULL default '', slug varchar(200) NOT NULL default '', term_group bigint(10) NOT NULL default 0, PRIMARY KEY (term_id), UNIQUE KEY slug (slug), KEY name (name) ) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci made by wp_install, make_db_current_silent, dbDelta
#12
@
12 years ago
The 767 byte limit won't be increased any time soon, due to http://bugs.mysql.com/bug.php?id=32915 .
I'll keep an eye on the MySQL bug, we can come back to this for a future version.
#13
@
12 years ago
- Milestone 3.5 deleted
- Resolution set to maybelater
- Status changed from new to closed
#14
@
12 years ago
- Keywords has-patch commit removed
- Resolution maybelater deleted
- Status changed from closed to reopened
It turns out this is possible, thanks to the innodb_large_prefix
option, introduced in 5.5.14:
http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix
So, the requirements are:
MyISAM tables:
- MySQL >= 5.5.3 (Assuming we don't add any indexes larger than 250 characters.)
InnoDB tables:
- MySQL >= 5.5.14
- innodb_file_format=barracuda
- innodb_file_per_table=true
- innodb_large_prefix=true
- All tables with ROW_FORMAT=(DYNAMIC|COMPRESSED)
Any other table formats:
- *head explode*
We can certainly detect these settings, it's just a question of whether that kind of complex (and edge-case-y) test should be in core.
This ticket was mentioned in IRC in #wordpress-dev by nacin. View the logs.
10 years ago
#22
@
10 years ago
FWIW I've been trying to configure a brand new site to use utf8mb4
but I'm still hitting the 767 byte key length issue when installing. Running MySQL 5.5.37 and configuration as per Pento's handy guide, and the innodb_large_prefix
setting is definitely ON
so I'm a bit stumped.
Switching an existing site to utf8mb4
appears to have worked fine, it's the fresh install that's not happy.
This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.
10 years ago
#27
@
10 years ago
- Severity changed from minor to normal
@pento, any updates on this?
A number of people, especially in Japan, wants to use Emoji in WordPress!
#30
@
10 years ago
[30366] missed the ticket:
WPDB: When removing invalid characters from utf8 strings in older versions of PHP, the regex was too large to be compiled.
#34
@
10 years ago
I wonder if slipping [30400] into a late build of 4.1 is a good idea? There's likely a lot of plugins (maybe themes) out there that will have code that will start throwing warnings because of the strict checks. (We just noticed it with automated testing in our builds last night). It's a good idea, just needs to either be communicated REALLY well, or maybe delayed.
This ticket was mentioned in Slack in #core by nerrad. View the logs.
10 years ago
#36
@
10 years ago
So here's an example error in our unit testing that came up...
WordPress database error BLOB/TEXT column 'DTT_description' can't have a default value for query CREATE TABLE wptests_esp_datetime ( DTT_ID int(10) unsigned NOT NULL AUTO_INCREMENT, EVT_ID INT UNSIGNED NOT NULL , DTT_name VARCHAR(255) NOT NULL default '', DTT_description TEXT NOT NULL default '', DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00', DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00', DTT_reg_limit mediumint(8) DEFAULT -1, DTT_sold mediumint(8) unsigned DEFAULT 0, DTT_is_primary tinyint(1) UNSIGNED NOT NULL DEFAULT 0, DTT_order mediumint(3) unsigned DEFAULT 0, DTT_parent int(10) unsigned DEFAULT 0, DTT_deleted tinyint(1) UNSIGNED NOT NULL DEFAULT 0, PRIMARY KEY (DTT_ID), KEY EVT_ID (EVT_ID), KEY DTT_is_primary (DTT_is_primary) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci
Obviously, that reveals where our sql can be corrected. However I suspect there will be a number of plugin authors and potentially theme authors who will have similar sql that fails strict standards and depending on who coded, might not understand what the problem is right away.
#37
@
10 years ago
Setting STRICT_ALL_TABLES
means that queries that were passing before [30400] are now failing, even when WP_DEBUG is set to OFF.
I think the idea of making the MySQL modes filterable, so plugin or theme authors can choose to disable STRICT_ALL_TABLES
is good, but I agree with @nerrad: this will come as a surprise to plugin or theme developers, and even if they do put out a quick update to their plugin or theme (which might be as simple as using that filter to disable STRICT_ALL_TABLES
), many of their users may update wordpress core before they update the affected plugins or themes.
If we're going to be setting STRICT_ALL_TABLES
as a mode in 4.1, I'd suggest it just be the default when WP_DEBUG is on. That way affected plugins and themes will be made aware soon, but it won't affect users live installs.
#38
@
10 years ago
Should milestone be updated on this now that there are actual commits in 4.1.0 for this ticket?
#39
@
10 years ago
also I'm a little confused at how this HAS NOT been assigned a milestone, YET its changes are currently in 4.1 beta. Shouldn't it be assigned to the 4.1 milestone if it's work is currently in that branch?
#40
@
10 years ago
- Milestone changed from Awaiting Review to 4.1
- Type changed from enhancement to task (blessed)
We're human, we miss things sometimes, especially when working on the releases that just happened a few minutes ago.
Also, sometimes we go with aggressive changes in betas to see what kind of real-world breakage is out there, so we can communicate, mitigate, and pull back if we need to. That MySQL strict mode requires values or defaults for write queries yet doesn't support defaults for all column types is, IMO, ridiculous, but at least we know now, and can go from here. Thanks for reporting back with the specific breakage, nerrad.
#41
@
10 years ago
No problems, and no worries here on my end :) Just want to help things along and reporting seems to be the way we should do that. Mike and I are on the same team and its purely coincidence we posted the same time about the milestone thing - we had a laugh about it. Sorry if it comes across like making noise.
#42
follow-up:
↓ 43
@
10 years ago
Sorry to complicate things:
MySQL 5.7 will enable STRICT_TRANS_TABLE,NO_ENGINE_SUBSTITUTION and ONLY_FULL_GROUP_BY by default. Modes ERROR_FOR_DIVISION_BY_ZERO, NO_ZERO_DATE and NO_ZERO_IN_DATE have also been merged into the definitions of STRICT_TRANS_TABLES or STRICT_ALL_TABLES, and thus will also be enabled by default.
So this means that these three columns will all violate the STRICT_ALL_TABLES goal:
DTT_description TEXT NOT NULL default '', DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00', DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00',
The more future proof alternative:
DTT_description TEXT NOT NULL, DTT_EVT_start datetime NOT NULL, DTT_EVT_end datetime NOT NULL,
(Disclaimer for those that don't know me yet: I work on the MySQL team @ Oracle.)
#43
in reply to:
↑ 42
@
10 years ago
Replying to morgantocker:
Sorry to complicate things:
MySQL 5.7 will enable STRICT_TRANS_TABLE,NO_ENGINE_SUBSTITUTION and ONLY_FULL_GROUP_BY by default. Modes ERROR_FOR_DIVISION_BY_ZERO, NO_ZERO_DATE and NO_ZERO_IN_DATE have also been merged into the definitions of STRICT_TRANS_TABLES or STRICT_ALL_TABLES, and thus will also be enabled by default.
So this means that these three columns will all violate the STRICT_ALL_TABLES goal:
DTT_description TEXT NOT NULL default '', DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00', DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00',The more future proof alternative:
DTT_description TEXT NOT NULL, DTT_EVT_start datetime NOT NULL, DTT_EVT_end datetime NOT NULL,(Disclaimer for those that don't know me yet: I work on the MySQL team @ Oracle.)
ya @morgantocker, we were made aware of our plugin's invalid SQL since STRICT_ALL_TABLES
was enabled in WP core, and we are changing our code. But just as a reminder: the SQL @nerrad posted is specific to our plugin only, not WP core.
Generally, enabling STRICT_ALL_TABLES
has been useful for debugging like this which is why using it when WP_DEBUG is on may be handy.
This ticket was mentioned in Slack in #core by akumria. View the logs.
10 years ago
This ticket was mentioned in Slack in #core by netweb. View the logs.
10 years ago
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#49
@
10 years ago
21212-strict.diff was our implementation of strict mode, keeping it here in case we want to try it again in the future.
#55
@
10 years ago
21212.diff is the first pass at automatically upgrading utf8
sites to utf8mb4
, if they support it.
In WPDB:
init_charset()
now checks to see if the MySQL server is a high enough version to supportutf8mb4
. This means that it needs to be run after the connection happens, rather than in the constructor.- When run on
utf8mb4
strings, it turns out that PHP's multibyte function transforms the invalid characters into?
, instead of removing them. I upgraded the regex instrip_invalid_text()
to handle 4 byte UTF-8 characters, too.
During installation:
- If the MySQL supports it, set
DB_CHARSET
toutf8mb4
whenwp-config.php
is created.
In the DB upgrade routines:
- If we can do it, let's upgrade all tables to
utf8mb4
, including Multisite tables.
In the Charset tests:
- We had a couple of tests that assumed a valid 4-byte character would cause queries to fail. I changed it to an invalid 4-byte character.
This gives Emoji support for everyone using MySQL 5.5+ - about 60% of WordPress sites. What it doesn't do:
- Allow sites on older MySQL versions to use Emoji.
- Allow Chrome to display Emoji. We don't need to worry about that, though - Emoji support landed in Chrome Canary a few weeks ago, it should be in the stable branch in the next couple of weeks. By the time WordPress 4.2 is released, all Chrome users will have auto-updated.
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#57
@
10 years ago
Looks like [31051] broke test_query_tag_nun()
added in [UT1032].
$wpdb->get_row() in get_term_by()
returns unexpected result and causes some notices:
object(stdClass)#3499 (9) { ["Field"]=> string(7) "term_id" ["Type"]=> string(19) "bigint(20) unsigned" ["Collation"]=> NULL ["Null"]=> string(2) "NO" ["Key"]=> string(3) "PRI" ["Default"]=> NULL ["Extra"]=> string(14) "auto_increment" ["Privileges"]=> string(31) "select,insert,update,references" ["Comment"]=> string(0) "" } Notice: Undefined property: stdClass::$term_id in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 1415 Notice: Undefined property: stdClass::$term_id in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 2124 Notice: Undefined property: stdClass::$slug in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 2812
The same happens when adding a "test-נ" or "САПР" tag via UI, although these are valid UTF-8 characters.
#60
@
10 years ago
Here's a dump of my DB table collation https://gist.github.com/ntwb/00918b2f625281853502
The reason why I thought nw_term_relationships
was not being converted to utf8mb4
is because data types INT
, BIGINT
, DATETIME
do NOT have a collation, as the wp_terms_relationships
table only includes INT
and BIGINT
this is why MySQL was not displaying/showing this table as utf8mb4
for me.
One other note I read but cannot find reference now was:
"Altering a table can impact the auto increment values"
So, should we not alter numeric/datetime columns?
#61
@
10 years ago
In 21212.2.diff, wpdb now converts emoji into HTML entities, but only for post_content
, and only when post_content
is utf8
.
This expands emoji support to 99%+ of WordPress sites.
The good news is, I don't think we don't need to worry about converting back to proper characters, if the site later updates to utf8mb4
. The HTML entities will still continue to render correctly, and will automatically be updated to the proper character if the post is ever edited.
Still to do:
- Think about what other columns we want to allow emoji in, and whitelist them.
- Make sure that we're only running the
ALTER TABLE
queries onutf8
tables.utf8
is the only character set we can guarantee is an unaltered subset ofutf8mb4
. - Test that the
ALTER TABLE
works correctly for all indexes, with a wide range of data. - Probably other things.
So, should we not alter numeric/datetime columns?
No need to worry. The ALTER TABLE
only affects text-based columns, it won't touch numeric/datetime columns.
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#64
@
10 years ago
A little bit more MySQL version context on innodb_large_prefix
:
- It was created so that MySQL downgrades would be possible (i.e. 5.5 -> 5.1)
- This purpose is now obsolete, since 5.5+ support large prefix (and 5.1 is no longer officially supported).
- We are proposing to enable
innodb_large_prefix
by default in MySQL 5.7, since it is now a safe change. I have the call out on my blog here: http://www.tocker.ca/2015/01/05/what-defaults-would-you-like-to-see-changed-in-mysql-5-7.html
So I realize it's a while out for many people, but the story should be simpler with MySQL 5.7.
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#66
@
10 years ago
I've split the patch into two parts:
21212-utf8mb4.4.diff enables utf8mb4
, when available. It now sanity checks that all columns in a table (that have a character set) are utf8
, before doing the conversion. It also reduces the size of a handful of indexes, to fit within the 767 byte limit. This allows as to avoid the mess of checking for innodb_large_prefix
.
21212-emoji.diff adds Emoji support for databases that are using the utf8
character set. Emoji are allowed in post_title
, post_content
and 'post_excerpt', as well as the Site Name and Description. This patch is dependant upon 21212-utf8mb4.4.diff.
This ticket was mentioned in Slack in #core by ocean90. View the logs.
10 years ago
#69
@
10 years ago
21212-utf8mb4.5.diff fixes only one table being upgraded to utf8mb4
, instead of all of them.
#71
@
10 years ago
21212-utf8mb4.6.diff adds an extra sanity check to wpdb::init_charset()
.
Historically, it hasn't required a DB connection to be called, but now requires a connection to check for utf8mb4
compatibility. 21212-utf8mb4.6.diff allows it to be called without a connection, which reverts it to the old behaviour of not checking for utf8mb4
compatibility.
This ticket was mentioned in Slack in #core by drew. View the logs.
10 years ago
#73
@
10 years ago
With utf8mb4 on HHVM, texts are corrupt. All non-ASCII characters turns into "?" mark.
Problematic environment
- OS: CentOS 6.6
- Server: Nginx 1.6.2
- HHVM: HHVM 3.2 (Unofficial: http://www.hop5.in/yum/el6/repoview/hhvm.html)
- DB_CHARSET: utf8mb4
- DB_COLLATE: utf8mb4_general_ci (and utf8mb4_unicode_ci)
- WP: 4.2-alpha-31269 + 21212-utf8mb4.6.diff
- MySQL: 5.5.41(Percona Release37)
but… the problem is not happen on HHVM3.5.0 (or 3.2.0 too) + Ubuntu 14.04 + MySQL 5.5.41.
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#77
@
10 years ago
21212-utf8mb4.7.diff moves all of the index changes into single queries for each table, rather than relying on dbDelta()
to create the new indexes.
It also adds sanity checking to Multisite upgrades, so that we're not trying to upgrade any non-utf8 charsets.
This ticket was mentioned in Slack in #core by pento. View the logs.
10 years ago
#79
@
10 years ago
- Add a variable and comment explaining why we have the 191 character length indexes
- Use
$wpdb->tables()
instead of$wpdb->tables
- Add a
DO_NOT_UPGRADE_GLOBAL_TABLES
check, aswp_sitemeta
can easily be a big table - In
$wpdb->db_connect()
, we only need to call$wpdb->init_charset()
the first time we connect - Tidy up and clarify a few comments
#80
@
10 years ago
- Move the
utf8mb4
upgrade to a function, so that plugins can use it, too. - A little more comment tidying
#82
@
10 years ago
- Keywords has-patch removed
- Resolution set to fixed
- Status changed from reopened to closed
It's been a fun ride, everyone, but we're done here.
Please report any bugs in new tickets.
21212-emoji.diff will be merged into the work on #31242.
#86
@
10 years ago
Mentioned to @pento in Slack that I'm seeing real, live issues with an impromper collation_connection
, chat archive here: https://wordpress.slack.com/archives/core/p1423274312002892 (will probably open a new ticket for this later).
#88
follow-up:
↓ 89
@
10 years ago
I think wp_encode_emoji is not suitable for both escaping 4-byte characters and the standard api.
wp_encode_emoji function is leaky abstraction since new emoji characters will be added in Unicode Standard every year. Thus the users of function are forced to check the change of emoji-data.txt, emoji skin tone modifiers (U+1F3FB..U+1F3FF, see Unicode Technical Report 51 ) and use their own function.
Regional indicator symbols (U+1F1E6..U+1F1FF) are not emoji themself though, they are used for national flags. see GraphemeBreakProperty.txt or Unicode Standard Annex 29 for the details.
Another reason why I do not vote for wp_encode_emoji function is that all of 4-byte characters is not emoji.
A lot of 4-byte chinese characters are used for the names of places and the family names. U+20BB7 is used for Yoshinoya, which is Japanese fast food chain and have more than 1800 stores.
A part of Variation selectors supplements are used for variant form of chinese characters (U+E0100..U+E01EF). U+E0101 is used for Katsushika-ku (U+845B U+E0101 U+98FE U+533A). TAKARA TOMY which is famous for Pokemon has the head office in Katsushika-ku.
#89
in reply to:
↑ 88
@
10 years ago
Replying to masakielastic:
wp_encode_emoji function is leaky abstraction since new emoji characters will be added in Unicode Standard every year. Thus the users of function are forced to check the change of emoji-data.txt, emoji skin tone modifiers (U+1F3FB..U+1F3FF, see Unicode Technical Report 51 ) and use their own function.
Not supporting Unicode 8.0 changes is intentional. The Twemoji library also doesn't support Unicode 8.0 emoji or skin tone modifiers, we'll update both when Unicode 8.0 is finalised, and Twemoji adds support.
Regional indicator symbols (U+1F1E6..U+1F1FF) are not emoji themself though, they are used for national flags. see GraphemeBreakProperty.txt or Unicode Standard Annex 29 for the details.
For our purposes, I'm okay with treating the national flags as individual characters. For example, 🇬🇧
will still show as the GB flag.
There needs to be a little bit of extra logic to allow for the static image replacement, but that won't be too tricky. Thank you for bringing it to my attention - I've created a Github issue to track it.
Another reason why I do not vote for wp_encode_emoji function is that all of 4-byte characters is not emoji.
A lot of 4-byte chinese characters are used for the names of places and the family names. U+20BB7 is used for Yoshinoya, which is Japanese fast food chain and have more than 1800 stores.
A part of Variation selectors supplements are used for variant form of chinese characters (U+E0100..U+E01EF). U+E0101 is used for Katsushika-ku (U+845B U+E0101 U+98FE U+533A). TAKARA TOMY which is famous for Pokemon has the head office in Katsushika-ku.
I'm not following what the problem is here - none of these characters are encoded by wp_encode_emoji()
. Could you please expand on the problem?
First patch is closer to a proof of concept - adding the switch to wp-config-sample.php is fairly ugly.