WordPress.org

Make WordPress Core

Opened 3 years ago

Closed 4 weeks ago

Last modified 3 weeks ago

#21212 closed task (blessed) (fixed)

MySQL tables should use utf8mb4 character set

Reported by: pento Owned by:
Milestone: 4.2 Priority: normal
Severity: normal Version: 3.4.1
Component: Database Keywords:
Focuses: Cc:

Description (last modified by SergeyBiryukov)

Historically, the MySQL utf8 character set has only supported the first character plane of UTF-8. With MySQL 5.5.3, it now supports the entire character plane, using the utf8mb4 character set. This character set is 100% backwards compatible, and does not require more space than utf8 for characters that fall within the utf8 set, only using an extra byte for characters outside of the utf8 set.

References:

http://en.wikipedia.org/wiki/Mapping_of_Unicode_character_planes
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

Attachments (13)

21212-utf8mb4.diff (909 bytes) - added by pento 3 years ago.
21212-utf8mb4.2.diff (686 bytes) - added by pento 3 years ago.
21212-utf8mb4.3.diff (910 bytes) - added by pento 3 years ago.
21212-strict.diff (8.2 KB) - added by pento 3 months ago.
21212.diff (5.7 KB) - added by pento 8 weeks ago.
21212.2.diff (6.5 KB) - added by pento 7 weeks ago.
21212-utf8mb4.4.diff (9.6 KB) - added by pento 6 weeks ago.
21212-emoji.diff (3.3 KB) - added by pento 6 weeks ago.
21212-utf8mb4.5.diff (9.6 KB) - added by pento 6 weeks ago.
21212-utf8mb4.6.diff (10.2 KB) - added by pento 6 weeks ago.
21212-utf8mb4.7.diff (11.4 KB) - added by pento 4 weeks ago.
21212-utf8mb4.8.diff (12.4 KB) - added by pento 4 weeks ago.
21212-utf8mb4.9.diff (12.2 KB) - added by pento 4 weeks ago.

Download all attachments as: .zip

Change History (102)

@pento3 years ago

comment:1 @pento3 years ago

  • Keywords has-patch added

First patch is closer to a proof of concept - adding the switch to wp-config-sample.php is fairly ugly.

comment:2 @dd323 years ago

If it's 100% compatible, it sounds like something that could be implemented directly in $wpdb - but that makes it more complicated for existing installations (As you'd need it prior to install?)

We can't just add it to the config file either, as we support db.php dropins for other database setups (Such as hyperdb) which may or may not, use mysql_*() functions.

comment:3 @pento3 years ago

$wpdb would be fine for new installs, it may also be an option for existing installs, too.

According to the MySQL upgrade notes, feeding 4 byte characters into a utf8 column is not allowed. My testing showed, however, that it was just replaced with a '?' (the same as it always has) regardless of the character set of the input string.

So, if DB_CHARSET is set to utf8 and the MySQL version >= 5.5.3, wpdb::ini_charset() could just force it to utf8mb4.

For new installs, the same thing in wp-admin/includes/schema.php.

comment:4 @SergeyBiryukov3 years ago

  • Description modified (diff)

Related: #13590

@pento3 years ago

comment:5 @pento3 years ago

New patch implements this functionality in $wpdb instead. It's in wpdb::set_charset() instead of wpdb::init_charset() because mysql_get_server_info() needs to happen after the connection is opened.

This shouldn't affect existing installs - if they have MySQL < 5.5.3, the code won't be run. If they have MySQL >= 5.5.3, my testing shows that feeding a utf8mb4 string into a utf8 column has the same behaviour as it always did - replacing 4 byte characters with a placeholder.

@pento3 years ago

comment:6 @pento3 years ago

Patch 3 moves the version check to wpdb::has_cap().

comment:7 @nacin3 years ago

  • Milestone changed from Awaiting Review to 3.5

comment:8 @nacin3 years ago

  • Keywords commit added

comment:9 @nacin3 years ago

Patch also fixes #13590.

comment:10 @ryan3 years ago

Our unique key on slug in the terms table fails miserably with utf8mb4.

WordPress database error Specified key was too long; max key length is 767 bytes for query 
CREATE TABLE wp_terms (
 term_id bigint(20) unsigned NOT NULL auto_increment,
 name varchar(200) NOT NULL default '',
 slug varchar(200) NOT NULL default '',
 term_group bigint(10) NOT NULL default 0,
 PRIMARY KEY  (term_id),
 UNIQUE KEY slug (slug),
 KEY name (name)
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci made by wp_install, make_db_current_silent, dbDelta

comment:12 @pento3 years ago

The 767 byte limit won't be increased any time soon, due to http://bugs.mysql.com/bug.php?id=32915 .

I'll keep an eye on the MySQL bug, we can come back to this for a future version.

comment:13 @SergeyBiryukov3 years ago

  • Milestone 3.5 deleted
  • Resolution set to maybelater
  • Status changed from new to closed

comment:14 @pento2 years ago

  • Keywords has-patch commit removed
  • Resolution maybelater deleted
  • Status changed from closed to reopened

It turns out this is possible, thanks to the innodb_large_prefix option, introduced in 5.5.14:
http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_large_prefix

So, the requirements are:

MyISAM tables:

  • MySQL >= 5.5.3 (Assuming we don't add any indexes larger than 250 characters.)

InnoDB tables:

  • MySQL >= 5.5.14
  • innodb_file_format=barracuda
  • innodb_file_per_table=true
  • innodb_large_prefix=true
  • All tables with ROW_FORMAT=(DYNAMIC|COMPRESSED)

Any other table formats:

  • *head explode*

We can certainly detect these settings, it's just a question of whether that kind of complex (and edge-case-y) test should be in core.

Last edited 2 years ago by pento (previous) (diff)

comment:15 @SergeyBiryukov2 years ago

  • Milestone set to Awaiting Review

comment:16 @sardisson18 months ago

  • Cc sardisson added

comment:17 @tomdxw17 months ago

  • Cc tom@… added

comment:18 @3flex16 months ago

  • Cc 3flex added

comment:20 @johnbillion11 months ago

#27961 was marked as a duplicate.

comment:21 @ircbot9 months ago

This ticket was mentioned in IRC in #wordpress-dev by nacin. View the logs.

comment:22 @johnbillion8 months ago

FWIW I've been trying to configure a brand new site to use utf8mb4 but I'm still hitting the 767 byte key length issue when installing. Running MySQL 5.5.37 and configuration as per Pento's handy guide, and the innodb_large_prefix setting is definitely ON so I'm a bit stumped.

Switching an existing site to utf8mb4 appears to have worked fine, it's the fresh install that's not happy.

comment:23 @SergeyBiryukov6 months ago

#29322 was marked as a duplicate.

comment:24 @ircbot6 months ago

This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.

comment:25 @SergeyBiryukov5 months ago

#29773 was marked as a duplicate.

comment:26 @SergeyBiryukov5 months ago

#29857 was marked as a duplicate.

comment:27 @extendwings5 months ago

  • Severity changed from minor to normal

@pento, any updates on this?

A number of people, especially in Japan, wants to use Emoji in WordPress!

comment:28 @pento4 months ago

In 30345:

When creating a post containing emoji, the post can be saved incorrectly if MySQL isn't using a character set that supports emoji.

This change prevents the save from occurring, so it can be handled correctly in the UI.

See #21212.

comment:29 @pento4 months ago

In 30346:

If a saving a post fails, remove any invalid characters (such as emoji) from the primary text fields, then try to save it again.

See #21212.

comment:30 @TobiasBg4 months ago

[30366] missed the ticket:

WPDB: When removing invalid characters from utf8 strings in older versions of PHP, the regex was too large to be compiled.

Last edited 4 months ago by TobiasBg (previous) (diff)

comment:31 @pento4 months ago

In 30375:

WPDB: When a db.php drop-in is being used, and it doesn't explicitly define itself as connecting to MySQL, skip the character set checks. This ensures that existing drop-ins won't accidentally run checks that they don't support.

See #21212.

comment:32 @pento3 months ago

In 30400:

WPDB: Force STRICT_ALL_TABLES to be enabled as soon as we connect to the MySQL server.

This improves data integrity when inserting and updating rows in the database, particularly when trying to insert emoji into posts stored with character sets that don't support emoji.

See #21212.

comment:33 @pento3 months ago

In 30401:

Fix a MySQL warning in the test_get_adjacent_post_exclude_self_term unit test, caused by [30400].

See #21212, #22112.

comment:34 @nerrad3 months ago

I wonder if slipping [30400] into a late build of 4.1 is a good idea? There's likely a lot of plugins (maybe themes) out there that will have code that will start throwing warnings because of the strict checks. (We just noticed it with automated testing in our builds last night). It's a good idea, just needs to either be communicated REALLY well, or maybe delayed.

comment:35 @slackbot3 months ago

This ticket was mentioned in Slack in #core by nerrad. View the logs.

comment:36 @nerrad3 months ago

So here's an example error in our unit testing that came up...

WordPress database error BLOB/TEXT column 'DTT_description' can't have a default value for query CREATE TABLE wptests_esp_datetime ( DTT_ID int(10) unsigned NOT NULL AUTO_INCREMENT,
				  EVT_ID INT UNSIGNED NOT NULL ,
				  DTT_name VARCHAR(255) NOT NULL default '',
				  DTT_description TEXT NOT NULL default '',
				  DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00',
				  DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00',
				  DTT_reg_limit mediumint(8) DEFAULT -1,
				  DTT_sold mediumint(8) unsigned DEFAULT 0,
				  DTT_is_primary tinyint(1) UNSIGNED NOT NULL DEFAULT 0,
				  DTT_order mediumint(3) unsigned DEFAULT 0,
				  DTT_parent int(10) unsigned DEFAULT 0,
				  DTT_deleted tinyint(1) UNSIGNED NOT NULL DEFAULT 0,
						PRIMARY KEY  (DTT_ID),
						KEY EVT_ID (EVT_ID),
						KEY DTT_is_primary (DTT_is_primary) ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci

Obviously, that reveals where our sql can be corrected. However I suspect there will be a number of plugin authors and potentially theme authors who will have similar sql that fails strict standards and depending on who coded, might not understand what the problem is right away.

comment:37 @mnelson43 months ago

Setting STRICT_ALL_TABLES means that queries that were passing before [30400] are now failing, even when WP_DEBUG is set to OFF.

I think the idea of making the MySQL modes filterable, so plugin or theme authors can choose to disable STRICT_ALL_TABLES is good, but I agree with @nerrad: this will come as a surprise to plugin or theme developers, and even if they do put out a quick update to their plugin or theme (which might be as simple as using that filter to disable STRICT_ALL_TABLES), many of their users may update wordpress core before they update the affected plugins or themes.

If we're going to be setting STRICT_ALL_TABLES as a mode in 4.1, I'd suggest it just be the default when WP_DEBUG is on. That way affected plugins and themes will be made aware soon, but it won't affect users live installs.

Last edited 3 months ago by mnelson4 (previous) (diff)

comment:38 @nerrad3 months ago

Should milestone be updated on this now that there are actual commits in 4.1.0 for this ticket?

comment:39 @mnelson43 months ago

also I'm a little confused at how this HAS NOT been assigned a milestone, YET its changes are currently in 4.1 beta. Shouldn't it be assigned to the 4.1 milestone if it's work is currently in that branch?

comment:40 @helen3 months ago

  • Milestone changed from Awaiting Review to 4.1
  • Type changed from enhancement to task (blessed)

We're human, we miss things sometimes, especially when working on the releases that just happened a few minutes ago.

Also, sometimes we go with aggressive changes in betas to see what kind of real-world breakage is out there, so we can communicate, mitigate, and pull back if we need to. That MySQL strict mode requires values or defaults for write queries yet doesn't support defaults for all column types is, IMO, ridiculous, but at least we know now, and can go from here. Thanks for reporting back with the specific breakage, nerrad.

comment:41 @nerrad3 months ago

No problems, and no worries here on my end :) Just want to help things along and reporting seems to be the way we should do that. Mike and I are on the same team and its purely coincidence we posted the same time about the milestone thing - we had a laugh about it. Sorry if it comes across like making noise.

comment:42 follow-up: @morgantocker3 months ago

Sorry to complicate things:

MySQL 5.7 will enable STRICT_TRANS_TABLE,NO_ENGINE_SUBSTITUTION and ONLY_FULL_GROUP_BY by default. Modes ERROR_FOR_DIVISION_BY_ZERO, NO_ZERO_DATE and NO_ZERO_IN_DATE have also been merged into the definitions of STRICT_TRANS_TABLES or STRICT_ALL_TABLES, and thus will also be enabled by default.

So this means that these three columns will all violate the STRICT_ALL_TABLES goal:

 DTT_description TEXT NOT NULL default '',
 DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00',
 DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00',

The more future proof alternative:

 DTT_description TEXT NOT NULL,
 DTT_EVT_start datetime NOT NULL,
 DTT_EVT_end datetime NOT NULL,

(Disclaimer for those that don't know me yet: I work on the MySQL team @ Oracle.)

comment:43 in reply to: ↑ 42 @mnelson43 months ago

Replying to morgantocker:

Sorry to complicate things:

MySQL 5.7 will enable STRICT_TRANS_TABLE,NO_ENGINE_SUBSTITUTION and ONLY_FULL_GROUP_BY by default. Modes ERROR_FOR_DIVISION_BY_ZERO, NO_ZERO_DATE and NO_ZERO_IN_DATE have also been merged into the definitions of STRICT_TRANS_TABLES or STRICT_ALL_TABLES, and thus will also be enabled by default.

So this means that these three columns will all violate the STRICT_ALL_TABLES goal:

 DTT_description TEXT NOT NULL default '',
 DTT_EVT_start datetime NOT NULL default '0000-00-00 00:00:00',
 DTT_EVT_end datetime NOT NULL default '0000-00-00 00:00:00',

The more future proof alternative:

 DTT_description TEXT NOT NULL,
 DTT_EVT_start datetime NOT NULL,
 DTT_EVT_end datetime NOT NULL,

(Disclaimer for those that don't know me yet: I work on the MySQL team @ Oracle.)

ya @morgantocker, we were made aware of our plugin's invalid SQL since STRICT_ALL_TABLES was enabled in WP core, and we are changing our code. But just as a reminder: the SQL @nerrad posted is specific to our plugin only, not WP core.

Generally, enabling STRICT_ALL_TABLES has been useful for debugging like this which is why using it when WP_DEBUG is on may be handy.

comment:44 @slackbot3 months ago

This ticket was mentioned in Slack in #core by akumria. View the logs.

comment:45 @slackbot3 months ago

This ticket was mentioned in Slack in #core by netweb. View the logs.

comment:46 @slackbot3 months ago

This ticket was mentioned in Slack in #core by pento. View the logs.

comment:47 @johnbillion3 months ago

  • Milestone changed from 4.1 to Future Release

comment:48 @pento3 months ago

In 30587:

WPDB: Remove the addition of STRICT_ALL_TABLES, added in [30400]. This proved to be incompatible with too many plugins, as well as being incompatible with MySQL 5.7.

See #21212.

@pento3 months ago

comment:49 @pento3 months ago

21212-strict.diff was our implementation of strict mode, keeping it here in case we want to try it again in the future.

comment:50 @DrewAPicture3 months ago

In 30768:

Improve inline documentation for wpdb::has_cap().

Adds a changelog entry for the addition of the 'utf8mb4' feature, accepted values for $db_cap, and a return description.

Props kpdesign.
See #21212.

comment:51 @pento3 months ago

In 30807:

WPDB: In [30345,30346,30366,30375], we added sanity checks for saving posts containing emoji. We need to revert these checks for now, as they're not ready for release in 4.1.

See #21212.

comment:52 @pento2 months ago

In 31050:

WPDB Docs: Update the @since version from 4.1.0 to 4.2.0, for all the methods/members that were punted from 4.1.

See #21212.

comment:53 @pento2 months ago

In 31051:

WPDB: To ensure we're not asking MySQL to read data using queries that it can't handle, make sure that we run all read queries through the character set sanity checks.

See #21212.

comment:54 @pento2 months ago

  • Milestone changed from Future Release to 4.2

@pento8 weeks ago

comment:55 @pento8 weeks ago

21212.diff is the first pass at automatically upgrading utf8 sites to utf8mb4, if they support it.

In WPDB:

  • init_charset() now checks to see if the MySQL server is a high enough version to support utf8mb4. This means that it needs to be run after the connection happens, rather than in the constructor.
  • When run on utf8mb4 strings, it turns out that PHP's multibyte function transforms the invalid characters into ?, instead of removing them. I upgraded the regex in strip_invalid_text() to handle 4 byte UTF-8 characters, too.

During installation:

  • If the MySQL supports it, set DB_CHARSET to utf8mb4 when wp-config.php is created.

In the DB upgrade routines:

  • If we can do it, let's upgrade all tables to utf8mb4, including Multisite tables.

In the Charset tests:

  • We had a couple of tests that assumed a valid 4-byte character would cause queries to fail. I changed it to an invalid 4-byte character.

This gives Emoji support for everyone using MySQL 5.5+ - about 60% of WordPress sites. What it doesn't do:

  • Allow sites on older MySQL versions to use Emoji.
  • Allow Chrome to display Emoji. We don't need to worry about that, though - Emoji support landed in Chrome Canary a few weeks ago, it should be in the stable branch in the next couple of weeks. By the time WordPress 4.2 is released, all Chrome users will have auto-updated.

comment:56 @slackbot8 weeks ago

This ticket was mentioned in Slack in #core by pento. View the logs.

comment:57 @SergeyBiryukov8 weeks ago

Looks like [31051] broke test_query_tag_nun() added in [UT1032].

$wpdb->get_row() in get_term_by() returns unexpected result and causes some notices:

object(stdClass)#3499 (9) {
  ["Field"]=>
  string(7) "term_id"
  ["Type"]=>
  string(19) "bigint(20) unsigned"
  ["Collation"]=>
  NULL
  ["Null"]=>
  string(2) "NO"
  ["Key"]=>
  string(3) "PRI"
  ["Default"]=>
  NULL
  ["Extra"]=>
  string(14) "auto_increment"
  ["Privileges"]=>
  string(31) "select,insert,update,references"
  ["Comment"]=>
  string(0) ""
}

Notice: Undefined property: stdClass::$term_id in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 1415
Notice: Undefined property: stdClass::$term_id in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 2124
Notice: Undefined property: stdClass::$slug in S:\home\wordpress\develop\src\wp-includes\taxonomy.php on line 2812

The same happens when adding a "test-נ" or "САПР" tag via UI, although these are valid UTF-8 characters.

comment:58 @pento8 weeks ago

In 31093:

WPDB: When wpdb::query() needs to sanity check a query string, make sure to run wpdb:flush() afterwards, to ensure the results from sanity check queries aren't mixed up with the results for the user query.

See #21212.

Fixes #30948.

comment:59 @pento8 weeks ago

In 31094:

WPDB: When sanity checking a string, check that it's a string, first - PHP notices can occur if an array or object is handled like a string.

See #21212

comment:60 @netweb8 weeks ago

Here's a dump of my DB table collation https://gist.github.com/ntwb/00918b2f625281853502

The reason why I thought nw_term_relationships was not being converted to utf8mb4 is because data types INT, BIGINT, DATETIME do NOT have a collation, as the wp_terms_relationships table only includes INT and BIGINT this is why MySQL was not displaying/showing this table as utf8mb4 for me.

One other note I read but cannot find reference now was:

"Altering a table can impact the auto increment values"

So, should we not alter numeric/datetime columns?

@pento7 weeks ago

comment:61 @pento7 weeks ago

In 21212.2.diff, wpdb now converts emoji into HTML entities, but only for post_content, and only when post_content is utf8.

This expands emoji support to 99%+ of WordPress sites.

The good news is, I don't think we don't need to worry about converting back to proper characters, if the site later updates to utf8mb4. The HTML entities will still continue to render correctly, and will automatically be updated to the proper character if the post is ever edited.

Still to do:

  • Think about what other columns we want to allow emoji in, and whitelist them.
  • Make sure that we're only running the ALTER TABLE queries on utf8 tables. utf8 is the only character set we can guarantee is an unaltered subset of utf8mb4.
  • Test that the ALTER TABLE works correctly for all indexes, with a wide range of data.
  • Probably other things.

So, should we not alter numeric/datetime columns?

No need to worry. The ALTER TABLE only affects text-based columns, it won't touch numeric/datetime columns.

comment:62 @slackbot7 weeks ago

This ticket was mentioned in Slack in #core by pento. View the logs.

comment:63 @pento7 weeks ago

#31019 was marked as a duplicate.

comment:64 @morgantocker6 weeks ago

A little bit more MySQL version context on innodb_large_prefix:

So I realize it's a while out for many people, but the story should be simpler with MySQL 5.7.

comment:65 @slackbot6 weeks ago

This ticket was mentioned in Slack in #core by pento. View the logs.

@pento6 weeks ago

@pento6 weeks ago

comment:66 @pento6 weeks ago

I've split the patch into two parts:

21212-utf8mb4.4.diff enables utf8mb4, when available. It now sanity checks that all columns in a table (that have a character set) are utf8, before doing the conversion. It also reduces the size of a handful of indexes, to fit within the 767 byte limit. This allows as to avoid the mess of checking for innodb_large_prefix.

21212-emoji.diff adds Emoji support for databases that are using the utf8 character set. Emoji are allowed in post_title, post_content and 'post_excerpt', as well as the Site Name and Description. This patch is dependant upon 21212-utf8mb4.4.diff.

comment:67 @pento6 weeks ago

In 31263:

Comments: When a comment fails to insert, remove any invalid characters and try again.

See #21212

comment:68 @slackbot6 weeks ago

This ticket was mentioned in Slack in #core by ocean90. View the logs.

@pento6 weeks ago

comment:69 @pento6 weeks ago

21212-utf8mb4.5.diff fixes only one table being upgraded to utf8mb4, instead of all of them.

comment:70 @MikeHansenMe6 weeks ago

  • Keywords has-patch added

@pento6 weeks ago

comment:71 @pento6 weeks ago

21212-utf8mb4.6.diff adds an extra sanity check to wpdb::init_charset().

Historically, it hasn't required a DB connection to be called, but now requires a connection to check for utf8mb4 compatibility. 21212-utf8mb4.6.diff allows it to be called without a connection, which reverts it to the old behaviour of not checking for utf8mb4 compatibility.

comment:72 @slackbot6 weeks ago

This ticket was mentioned in Slack in #core by drew. View the logs.

comment:73 @hnle5 weeks ago

With utf8mb4 on HHVM, texts are corrupt. All non-ASCII characters turns into "?" mark.

Problematic environment

  • OS: CentOS 6.6
  • Server: Nginx 1.6.2
  • HHVM: HHVM 3.2 (Unofficial: http://www.hop5.in/yum/el6/repoview/hhvm.html)
  • DB_CHARSET: utf8mb4
  • DB_COLLATE: utf8mb4_general_ci (and utf8mb4_unicode_ci)
  • WP: 4.2-alpha-31269 + 21212-utf8mb4.6.diff
  • MySQL: 5.5.41(Percona Release37)


but… the problem is not happen on HHVM3.5.0 (or 3.2.0 too) + Ubuntu 14.04 + MySQL 5.5.41.

comment:74 @boonebgorges5 weeks ago

In 31277:

In Tests_Ajax_ReplytoComment::test_blocked_comment(), don't stop blocking comments until test is complete.

The test uses the _block_comments() method to prevent comment inserts from
happening. Since [31263], failed comment inserts lead to the comment content's
being stripped of invalid characters and inserted again. By immediately
unhooking itself after first being run, _block_comments() was causing the
INSERT block to work only for the first attempt, while the second attempt was
going through, causing the test to fail. As a fix, we move the
remove_filter() call to the class's tearDown() method - sufficient for
cleanup, but late enough that *all* comment inserts will be blocked for the
test method in question.

See #21212.
Since [31263], comment INSERT queries that were pseudo-bl

comment:75 @SergeyBiryukov4 weeks ago

#13590 was marked as a duplicate.

comment:76 @slackbot4 weeks ago

This ticket was mentioned in Slack in #core by pento. View the logs.

@pento4 weeks ago

comment:77 @pento4 weeks ago

21212-utf8mb4.7.diff moves all of the index changes into single queries for each table, rather than relying on dbDelta() to create the new indexes.

It also adds sanity checking to Multisite upgrades, so that we're not trying to upgrade any non-utf8 charsets.

comment:78 @slackbot4 weeks ago

This ticket was mentioned in Slack in #core by pento. View the logs.

@pento4 weeks ago

comment:79 @pento4 weeks ago

In 21212-utf8mb4.8.diff:

  • Add a variable and comment explaining why we have the 191 character length indexes
  • Use $wpdb->tables() instead of $wpdb->tables
  • Add a DO_NOT_UPGRADE_GLOBAL_TABLES check, as wp_sitemeta can easily be a big table
  • In $wpdb->db_connect(), we only need to call $wpdb->init_charset() the first time we connect
  • Tidy up and clarify a few comments

@pento4 weeks ago

comment:80 @pento4 weeks ago

In 21212-utf8mb4.9.diff:

  • Move the utf8mb4 upgrade to a function, so that plugins can use it, too.
  • A little more comment tidying

comment:81 @pento4 weeks ago

In 31349:

WPDB: If a site is using the utf8 charset, and their version of MySQL supports utf8mb4, auto-upgrade them to utf8mb4.

This patch also resizes some indexes, to allow for the 767 byte index size limit in standard MySQL installs.

See #21212

comment:82 @pento4 weeks ago

  • Keywords has-patch removed
  • Resolution set to fixed
  • Status changed from reopened to closed

It's been a fun ride, everyone, but we're done here.

Please report any bugs in new tickets.

21212-emoji.diff will be merged into the work on #31242.

comment:83 @pento4 weeks ago

In 31351:

Upgrades: When deciding if we can upgrade a table to utf8mb4, make sure we're checking the character set, not the collation.

Bump the DB version as well, as the update in [31349] wouldn't have triggered for anyone who tried it.

See #21212

comment:84 @pento4 weeks ago

In 31354:

Upgrades: When upgrading a table to utf8mb4, we should run the ALTER TABLE query, even if it doesn't have any text columns. This will update the table's default character set, so that any text columns added in the future will have the expected character set.

See #21212

comment:85 @pento4 weeks ago

In 31358:

Upgrades: wp_usermeta is a global table in multisite, so it should only be upgraded by the single site routines when we're not in a multisite install.

See #21212

comment:86 @helen3 weeks ago

Mentioned to @pento in Slack that I'm seeing real, live issues with an impromper collation_connection, chat archive here: https://wordpress.slack.com/archives/core/p1423274312002892 (will probably open a new ticket for this later).

comment:87 @pento3 weeks ago

In 31391:

WPDB: When checking to see if we can use utf8mb4, we also need to make sure PHP's MySQL client library is capable of using utf8mb4.

See #21212

comment:88 follow-up: @masakielastic3 weeks ago

I think wp_encode_emoji is not suitable for both escaping 4-byte characters and the standard api.

wp_encode_emoji function is leaky abstraction since new emoji characters will be added in Unicode Standard every year. Thus the users of function are forced to check the change of emoji-data.txt, emoji skin tone modifiers (U+1F3FB..U+1F3FF, see Unicode Technical Report 51 ) and use their own function.

Regional indicator symbols (U+1F1E6..U+1F1FF) are not emoji themself though, they are used for national flags. see GraphemeBreakProperty.txt or Unicode Standard Annex 29 for the details.

Another reason why I do not vote for wp_encode_emoji function is that all of 4-byte characters is not emoji.

A lot of 4-byte chinese characters are used for the names of places and the family names. U+20BB7 is used for Yoshinoya, which is Japanese fast food chain and have more than 1800 stores.

A part of Variation selectors supplements are used for variant form of chinese characters (U+E0100..U+E01EF). U+E0101 is used for Katsushika-ku (U+845B U+E0101 U+98FE U+533A). TAKARA TOMY which is famous for Pokemon has the head office in Katsushika-ku.

comment:89 in reply to: ↑ 88 @pento3 weeks ago

Replying to masakielastic:

wp_encode_emoji function is leaky abstraction since new emoji characters will be added in Unicode Standard every year. Thus the users of function are forced to check the change of emoji-data.txt, emoji skin tone modifiers (U+1F3FB..U+1F3FF, see Unicode Technical Report 51 ) and use their own function.

Not supporting Unicode 8.0 changes is intentional. The Twemoji library also doesn't support Unicode 8.0 emoji or skin tone modifiers, we'll update both when Unicode 8.0 is finalised, and Twemoji adds support.

Regional indicator symbols (U+1F1E6..U+1F1FF) are not emoji themself though, they are used for national flags. see GraphemeBreakProperty.txt or Unicode Standard Annex 29 for the details.

For our purposes, I'm okay with treating the national flags as individual characters. For example, &#x1f1ec;&#x1f1e7; will still show as the GB flag.

There needs to be a little bit of extra logic to allow for the static image replacement, but that won't be too tricky. Thank you for bringing it to my attention - I've created a Github issue to track it.

Another reason why I do not vote for wp_encode_emoji function is that all of 4-byte characters is not emoji.

A lot of 4-byte chinese characters are used for the names of places and the family names. U+20BB7 is used for Yoshinoya, which is Japanese fast food chain and have more than 1800 stores.

A part of Variation selectors supplements are used for variant form of chinese characters (U+E0100..U+E01EF). U+E0101 is used for Katsushika-ku (U+845B U+E0101 U+98FE U+533A). TAKARA TOMY which is famous for Pokemon has the head office in Katsushika-ku.

I'm not following what the problem is here - none of these characters are encoded by wp_encode_emoji(). Could you please expand on the problem?

Note: See TracTickets for help on using tickets.