Make WordPress Core

Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#32105 closed defect (bug) (fixed)

Database collate should be utf8mb4_unicode_520_ci

Reported by: miyauchi's profile miyauchi Owned by: pento's profile pento
Milestone: 4.6 Priority: normal
Severity: normal Version: 4.2
Component: Database Keywords:
Focuses: Cc:

Description

If the collate is utf8mb4_unicode_ci, All the 4 byte characters are recognized that they are identical.

For example, I set a sushi at line 16 and when searching for a beer, it's not null.
https://gist.github.com/miya0001/91c837a1d3a0b75cd233

<?php

class SampleTest extends WP_UnitTestCase {

	/**
	 * @test
	 */
	function utf8_test()
	{
		global $wp_version;
		global $wpdb;

		$this->assertSame( '4.2', $wp_version );

		// post_title is sushi
		$post_id = $this->factory->post->create( array( 'post_title' => '

Change History (16)

#1 @miyauchi
9 years ago

<?php

class SampleTest extends WP_UnitTestCase {

/**
 * @test
 */
function utf8_test()
{
	global $wp_version;
	global $wpdb;

	$this->assertSame( '4.2', $wp_version );

	// post_title is sushi
	$post_id = $this->factory->post->create( array( 'post_title' => '
Last edited 9 years ago by miyauchi (previous) (diff)

#2 @miyauchi
9 years ago

Umm I can't paste code...

#3 @netweb
9 years ago

  • Component changed from General to Database

Miyauchi, thanks for the unit test, as you point out you expect null yet return post id 3 which will be the result of the test for any version of MySQL before version 5.6 and not using utf8mb4_unicode_520_ci.

My take on this is that utf8mb4_unicode_520_ci is only available in MySQL 5.6 and above and as such unless you are running at least MySQL 5.6 and utf8mb4_unicode_520_ci then there is no way to compare one emoji against another emoji (or any other character in that unicode plane) in older MySQL versions.

For reference this has been previously discussed in #31328

Via https://core.trac.wordpress.org/ticket/31328#comment:6

The problem stems from MySQL's collation behaviour - it treats all Unicode Supplementary Characters (which emoji fall under) as being equivalent. It's not until MySQL 5.6, wich the addition of the utf8mb4_unicode_520_ci collation that this changes.

Via https://core.trac.wordpress.org/ticket/31328#comment:20

This is reproducible for any two terms containing the same number of emoji characters (as opposed to glyphs).

This is because we were searching for duplicates by term name. All utf8mb4_* collations (prior to utf8mb4_unicode_520_ci) treat emoji as being equivalent characters, so would just match the first one found. Now that we're searching for duplicates by term term slug, this is no longer a problem. Term slugs store the URL-encoded version of the emoji character, which the utf8mb4_ collations correctly interpret as a string of ASCII text.

Last edited 9 years ago by netweb (previous) (diff)

#4 @pento
9 years ago

  • Milestone changed from Awaiting Review to Future Release

I absolutely agree that we should be looking at moving to utf8mb4_unicode_520_ci.

My primary concern is that we're already splitting behaviour between installs that can support utf8mb4 and those that can't - figuring out a nice way to maintain splitting between utf8mb4_unicode_ci and utf8mb4_unicode_520_ci is yet another problem. It's possible that we'll need a general solution, for when MySQL adds support for later Unicode Collation Algorithms.

@morgantocker, do you have any insight into when we need to start planning for MySQL to have later UCA support?

#5 follow-up: @kraftbj
9 years ago

@miyauchi - Quick note to say the code pasting failed because Trac doesn't support utf8mb4, so the insert is dropping everything after the first four-byte character.

#6 @morgantocker
9 years ago

@pento, I believe what has been described is MySQL bug http://bugs.mysql.com/bug.php?id=76553 - Please subscribe for updates.

#7 @pento
9 years ago

@morgantocker: It looks like that bug is referring to a problem in utf8mb4_unicode_520_ci. I was interested if there any public plans for MySQL to support newer versions of the UCA. There was WL#2673 when utf8mb4_unicode_520_ci was introduced, but I couldn't find anything for the newer UCA versions.

#8 @morgantocker
9 years ago

@pento: I will pass this feature request onto the development team. MySQL 5.7 has just hit RC, we are collecting feedback on what should go into 5.8.

#9 in reply to: ↑ 5 @miyauchi
9 years ago

Replying to kraftbj:

@miyauchi - Quick note to say the code pasting failed because Trac doesn't support utf8mb4, so the insert is dropping everything after the first four-byte character.

Oh, now I get it. Thanks. :)

#10 follow-up: @morgantocker
9 years ago

An update on character set/collation support. Here is the plan for MySQL 5.8:
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/

#11 in reply to: ↑ 10 @netweb
9 years ago

Replying to morgantocker:

An update on character set/collation support. Here is the plan for MySQL 5.8:
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/

Nice, thanks for this

A couple of the key points:

We are considering changing the default character set to utf8mb4. Modern applications frequently store 4 byte characters, as emoji input is common for mobile devices.

In addition to utf8mb4, we are also considering switching the default collation to be utf8mb4_unicode_520_ci. We plan to add collations to include per-language collation rules as we currently offer with latin1 based collations. They will have a more modern version of the Unicode Collation Algorithm than the utf8mb4_unicode_520_ci.

#12 @pento
8 years ago

  • Milestone changed from Future Release to 4.6
  • Owner set to pento
  • Status changed from new to assigned

#13 follow-up: @pento
8 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 37523:

Database: Use the utf8mb4_unicode_520_ci collation, when available.

The utf8mb4_unicode_520_ci (Unicode Collation Algorithm 5.2.0, October 2010) collation is an improvement over utf8mb4_unicode_ci (UCA 4.0.0, November 2003).

There is no word on when MySQL will support later UCAs.

Fixes #32105.

#14 @pento
8 years ago

In 37601:

Database: Split the logic of wpdb::init_charset() into a separate method.

The logic for determining the appropriate character set and collation to use is becoming more complex, particularly with the recent additions of [37522] and [37523]. As init_charset() has side effects, and makes use of constants instead of parameters, it's not possible to unit test this logic.

This commit splits the logic part of init_charset() out into a new method, wpdb::determine_charset(), along with appropriate unit tests.

See #32105, #37522.

Fixes #36917.

#15 in reply to: ↑ 13 @morgantocker
8 years ago

Replying to pento:

There is no word on when MySQL will support later UCAs.

Just an update on this specific point. MySQL 8.0 will support UCA 9.0 (the latest):
http://mysqlserverteam.com/the-mysql-8-0-0-milestone-release-is-available/
http://dev.mysql.com/worklog/task/?id=9479
http://dev.mysql.com/worklog/task/?id=9125

#16 @pento
8 years ago

Thanks, @morgantocker! Tracking this in #38036.

Note: See TracTickets for help on using tickets.