Opened 10 months ago

Last modified 7 months ago

#21212 reopened enhancement

MySQL tables should use utf8mb4 character set

Reported by: pento Owned by:
Priority: normal Milestone: Awaiting Review
Component: Database Version: 3.4.1
Severity: minor Keywords:
Cc:

Description (last modified by SergeyBiryukov)

Historically, the MySQL utf8 character set has only supported the first character plane of UTF-8. With MySQL 5.5.3, it now supports the entire character plane, using the utf8mb4 character set. This character set is 100% backwards compatible, and does not require more space than utf8 for characters that fall within the utf8 set, only using an extra byte for characters outside of the utf8 set.

References:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Attachments (3)

21212-utf8mb4.diff (909 bytes) - added by pento 10 months ago.
21212-utf8mb4.2.diff (686 bytes) - added by pento 10 months ago.
21212-utf8mb4.3.diff (910 bytes) - added by pento 9 months ago.

Download all attachments as: .zip

Change History (18)

pento10 months ago

  • Keywords has-patch added

First patch is closer to a proof of concept - adding the switch to wp-config-sample.php is fairly ugly.

If it's 100% compatible, it sounds like something that could be implemented directly in $wpdb - but that makes it more complicated for existing installations (As you'd need it prior to install?)

We can't just add it to the config file either, as we support db.php dropins for other database setups (Such as hyperdb) which may or may not, use mysql_*() functions.

$wpdb would be fine for new installs, it may also be an option for existing installs, too.

According to the MySQL upgrade notes, feeding 4 byte characters into a utf8 column is not allowed. My testing showed, however, that it was just replaced with a '?' (the same as it always has) regardless of the character set of the input string.

So, if DB_CHARSET is set to utf8 and the MySQL version >= 5.5.3, wpdb::ini_charset() could just force it to utf8mb4.

For new installs, the same thing in wp-admin/includes/schema.php.

  • Description modified (diff)

Related: #13590

pento10 months ago

New patch implements this functionality in $wpdb instead. It's in wpdb::set_charset() instead of wpdb::init_charset() because mysql_get_server_info() needs to happen after the connection is opened.

This shouldn't affect existing installs - if they have MySQL < 5.5.3, the code won't be run. If they have MySQL >= 5.5.3, my testing shows that feeding a utf8mb4 string into a utf8 column has the same behaviour as it always did - replacing 4 byte characters with a placeholder.

pento9 months ago

Patch 3 moves the version check to wpdb::has_cap().

  • Milestone changed from Awaiting Review to 3.5
  • Keywords commit added

Patch also fixes #13590.

Our unique key on slug in the terms table fails miserably with utf8mb4.

WordPress database error Specified key was too long; max key length is 767 bytes for query 
CREATE TABLE wp_terms (
 term_id bigint(20) unsigned NOT NULL auto_increment,
 name varchar(200) NOT NULL default '',
 slug varchar(200) NOT NULL default '',
 term_group bigint(10) NOT NULL default 0,
 PRIMARY KEY  (term_id),
 UNIQUE KEY slug (slug),
 KEY name (name)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci made by wp_install, make_db_current_silent, dbDelta

The 767 byte limit won't be increased any time soon, due to http://bugs.mysql.com/bug.php?id=32915 .

I'll keep an eye on the MySQL bug, we can come back to this for a future version.

  • Milestone 3.5 deleted
  • Resolution set to maybelater
  • Status changed from new to closed
  • Keywords has-patch commit removed
  • Resolution maybelater deleted
  • Status changed from closed to reopened

It turns out this is possible, thanks to the innodb_large_prefix option, introduced in 5.5.14:
http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix

So, the requirements are:

MyISAM tables:

  • MySQL >= 5.5.3 (Assuming we don't add any indexes larger than 250 characters.)

InnoDB tables:

  • MySQL >= 5.5.14
  • innodb_file_format=barracuda
  • innodb_file_per_table=true
  • innodb_large_prefix=true
  • All tables with ROW_FORMAT=(DYNAMIC|COMPRESSED)

Any other table formats:

  • *head explode*

We can certainly detect these settings, it's just a question of whether that kind of complex (and edge-case-y) test should be in core.

Last edited 7 months ago by pento (previous) (diff)
  • Milestone set to Awaiting Review
Note: See TracTickets for help on using tickets.