WordPress.org

Make WordPress Core

Opened 22 months ago

Last modified 2 weeks ago

#21212 reopened enhancement

MySQL tables should use utf8mb4 character set

Reported by: pento Owned by:
Milestone: Awaiting Review Priority: normal
Severity: minor Version: 3.4.1
Component: Database Keywords:
Focuses: Cc:

Description (last modified by SergeyBiryukov)

Historically, the MySQL utf8 character set has only supported the first character plane of UTF-8. With MySQL 5.5.3, it now supports the entire character plane, using the utf8mb4 character set. This character set is 100% backwards compatible, and does not require more space than utf8 for characters that fall within the utf8 set, only using an extra byte for characters outside of the utf8 set.

References:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Attachments (3)

21212-utf8mb4.diff (909 bytes) - added by pento 22 months ago.
21212-utf8mb4.2.diff (686 bytes) - added by pento 21 months ago.
21212-utf8mb4.3.diff (910 bytes) - added by pento 21 months ago.

Download all attachments as: .zip

Change History (22)

pento22 months ago

comment:1 pento22 months ago

  • Keywords has-patch added

First patch is closer to a proof of concept - adding the switch to wp-config-sample.php is fairly ugly.

comment:2 dd3222 months ago

If it's 100% compatible, it sounds like something that could be implemented directly in $wpdb - but that makes it more complicated for existing installations (As you'd need it prior to install?)

We can't just add it to the config file either, as we support db.php dropins for other database setups (Such as hyperdb) which may or may not, use mysql_*() functions.

comment:3 pento22 months ago

$wpdb would be fine for new installs, it may also be an option for existing installs, too.

According to the MySQL upgrade notes, feeding 4 byte characters into a utf8 column is not allowed. My testing showed, however, that it was just replaced with a '?' (the same as it always has) regardless of the character set of the input string.

So, if DB_CHARSET is set to utf8 and the MySQL version >= 5.5.3, wpdb::ini_charset() could just force it to utf8mb4.

For new installs, the same thing in wp-admin/includes/schema.php.

comment:4 SergeyBiryukov22 months ago

  • Description modified (diff)

Related: #13590

pento21 months ago

comment:5 pento21 months ago

New patch implements this functionality in $wpdb instead. It's in wpdb::set_charset() instead of wpdb::init_charset() because mysql_get_server_info() needs to happen after the connection is opened.

This shouldn't affect existing installs - if they have MySQL < 5.5.3, the code won't be run. If they have MySQL >= 5.5.3, my testing shows that feeding a utf8mb4 string into a utf8 column has the same behaviour as it always did - replacing 4 byte characters with a placeholder.

pento21 months ago

comment:6 pento21 months ago

Patch 3 moves the version check to wpdb::has_cap().

comment:7 nacin21 months ago

  • Milestone changed from Awaiting Review to 3.5

comment:8 nacin21 months ago

  • Keywords commit added

comment:9 nacin21 months ago

Patch also fixes #13590.

comment:10 ryan20 months ago

Our unique key on slug in the terms table fails miserably with utf8mb4.

WordPress database error Specified key was too long; max key length is 767 bytes for query 
CREATE TABLE wp_terms (
 term_id bigint(20) unsigned NOT NULL auto_increment,
 name varchar(200) NOT NULL default '',
 slug varchar(200) NOT NULL default '',
 term_group bigint(10) NOT NULL default 0,
 PRIMARY KEY  (term_id),
 UNIQUE KEY slug (slug),
 KEY name (name)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci made by wp_install, make_db_current_silent, dbDelta

comment:12 pento20 months ago

The 767 byte limit won't be increased any time soon, due to http://bugs.mysql.com/bug.php?id=32915 .

I'll keep an eye on the MySQL bug, we can come back to this for a future version.

comment:13 SergeyBiryukov20 months ago

  • Milestone 3.5 deleted
  • Resolution set to maybelater
  • Status changed from new to closed

comment:14 pento19 months ago

  • Keywords has-patch commit removed
  • Resolution maybelater deleted
  • Status changed from closed to reopened

It turns out this is possible, thanks to the innodb_large_prefix option, introduced in 5.5.14:
http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix

So, the requirements are:

MyISAM tables:

  • MySQL >= 5.5.3 (Assuming we don't add any indexes larger than 250 characters.)

InnoDB tables:

  • MySQL >= 5.5.14
  • innodb_file_format=barracuda
  • innodb_file_per_table=true
  • innodb_large_prefix=true
  • All tables with ROW_FORMAT=(DYNAMIC|COMPRESSED)

Any other table formats:

  • *head explode*

We can certainly detect these settings, it's just a question of whether that kind of complex (and edge-case-y) test should be in core.

Last edited 19 months ago by pento (previous) (diff)

comment:15 SergeyBiryukov19 months ago

  • Milestone set to Awaiting Review

comment:16 sardisson7 months ago

  • Cc sardisson added

comment:17 tomdxw7 months ago

  • Cc tom@… added

comment:18 3flex5 months ago

  • Cc 3flex added
Note: See TracTickets for help on using tickets.