#32105 closed defect (bug) (fixed)
Database collate should be utf8mb4_unicode_520_ci
Reported by: | miyauchi | Owned by: | pento |
---|---|---|---|
Milestone: | 4.6 | Priority: | normal |
Severity: | normal | Version: | 4.2 |
Component: | Database | Keywords: | |
Focuses: | Cc: |
Description
If the collate is utf8mb4_unicode_ci, All the 4 byte characters are recognized that they are identical.
For example, I set a sushi at line 16 and when searching for a beer, it's not null.
https://gist.github.com/miya0001/91c837a1d3a0b75cd233
<?php class SampleTest extends WP_UnitTestCase { /** * @test */ function utf8_test() { global $wp_version; global $wpdb; $this->assertSame( '4.2', $wp_version ); // post_title is sushi $post_id = $this->factory->post->create( array( 'post_title' => '
Change History (16)
#3
@
9 years ago
- Component changed from General to Database
Miyauchi, thanks for the unit test, as you point out you expect null
yet return post id 3
which will be the result of the test for any version of MySQL before version 5.6 and not using utf8mb4_unicode_520_ci
.
My take on this is that utf8mb4_unicode_520_ci
is only available in MySQL 5.6 and above and as such unless you are running at least MySQL 5.6 and utf8mb4_unicode_520_ci
then there is no way to compare one emoji against another emoji (or any other character in that unicode plane) in older MySQL versions.
For reference this has been previously discussed in #31328
Via https://core.trac.wordpress.org/ticket/31328#comment:6
The problem stems from MySQL's collation behaviour - it treats all Unicode Supplementary Characters (which emoji fall under) as being equivalent. It's not until MySQL 5.6, wich the addition of the
utf8mb4_unicode_520_ci
collation that this changes.
Via https://core.trac.wordpress.org/ticket/31328#comment:20
This is reproducible for any two terms containing the same number of emoji characters (as opposed to glyphs).
This is because we were searching for duplicates by term name. Allutf8mb4_*
collations (prior toutf8mb4_unicode_520_ci
) treat emoji as being equivalent characters, so would just match the first one found. Now that we're searching for duplicates by term term slug, this is no longer a problem. Term slugs store the URL-encoded version of the emoji character, which theutf8mb4_
collations correctly interpret as a string of ASCII text.
#4
@
9 years ago
- Milestone changed from Awaiting Review to Future Release
I absolutely agree that we should be looking at moving to utf8mb4_unicode_520_ci
.
My primary concern is that we're already splitting behaviour between installs that can support utf8mb4
and those that can't - figuring out a nice way to maintain splitting between utf8mb4_unicode_ci
and utf8mb4_unicode_520_ci
is yet another problem. It's possible that we'll need a general solution, for when MySQL adds support for later Unicode Collation Algorithms.
@morgantocker, do you have any insight into when we need to start planning for MySQL to have later UCA support?
#5
follow-up:
↓ 9
@
9 years ago
@miyauchi - Quick note to say the code pasting failed because Trac doesn't support utf8mb4, so the insert is dropping everything after the first four-byte character.
#6
@
9 years ago
@pento, I believe what has been described is MySQL bug http://bugs.mysql.com/bug.php?id=76553 - Please subscribe for updates.
#7
@
9 years ago
@morgantocker: It looks like that bug is referring to a problem in utf8mb4_unicode_520_ci
. I was interested if there any public plans for MySQL to support newer versions of the UCA. There was WL#2673 when utf8mb4_unicode_520_ci
was introduced, but I couldn't find anything for the newer UCA versions.
#8
@
9 years ago
@pento: I will pass this feature request onto the development team. MySQL 5.7 has just hit RC, we are collecting feedback on what should go into 5.8.
#9
in reply to:
↑ 5
@
9 years ago
Replying to kraftbj:
@miyauchi - Quick note to say the code pasting failed because Trac doesn't support utf8mb4, so the insert is dropping everything after the first four-byte character.
Oh, now I get it. Thanks. :)
#10
follow-up:
↓ 11
@
9 years ago
An update on character set/collation support. Here is the plan for MySQL 5.8:
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/
#11
in reply to:
↑ 10
@
9 years ago
Replying to morgantocker:
An update on character set/collation support. Here is the plan for MySQL 5.8:
http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/
Nice, thanks for this
A couple of the key points:
We are considering changing the default character set to
utf8mb4
. Modern applications frequently store 4 byte characters, as emoji input is common for mobile devices.
In addition to
utf8mb4
, we are also considering switching the default collation to beutf8mb4_unicode_520_ci
. We plan to add collations to include per-language collation rules as we currently offer with latin1 based collations. They will have a more modern version of the Unicode Collation Algorithm than theutf8mb4_unicode_520_ci
.
#12
@
8 years ago
- Milestone changed from Future Release to 4.6
- Owner set to pento
- Status changed from new to assigned
#13
follow-up:
↓ 15
@
8 years ago
- Resolution set to fixed
- Status changed from assigned to closed
In 37523:
#15
in reply to:
↑ 13
@
8 years ago
Replying to pento:
There is no word on when MySQL will support later UCAs.
Just an update on this specific point. MySQL 8.0 will support UCA 9.0 (the latest):
http://mysqlserverteam.com/the-mysql-8-0-0-milestone-release-is-available/
http://dev.mysql.com/worklog/task/?id=9479
http://dev.mysql.com/worklog/task/?id=9125