Opened 14 months ago
Last modified 12 months ago
#58871 new enhancement
support uca14.0.0 collation in database where available
Reported by: | danielblack | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | 6.3 |
Component: | Database | Keywords: | has-patch has-unit-tests needs-testing |
Focuses: | Cc: |
Description
Like #32105 supporting a modern collation is a good idea.
MariaDB added this in 10.10.1 (ref: https://jira.mariadb.org/browse/MDEV-27009).
As the current default is case insensitive, we'll follow the same insensitivity and use accent insensitive.
So of the collations available, use uca1400_ai_ci.
ref: https://mariadb.com/kb/en/supported-character-sets-and-collations/#collations
Change History (11)
This ticket was mentioned in PR #4878 on WordPress/wordpress-develop by grooverdan.
14 months ago
#1
- Keywords has-patch has-unit-tests added
This ticket was mentioned in Slack in #core by danielblack. View the logs.
13 months ago
#4
@
12 months ago
I don't think we can simply use:
return version_compare( $db_version, '10.10.1', '>=' );
As it depends on the database in use (MySQL vs MariaDB), and while MariaDB is currently using version 10, I assume MySQL will move from 8.1 to 10.10 at some point.
Maybe use SHOW CHARACTER SET?
Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.
Notes:
- WordPress currently tries to use utf8mb4_unicode_520_ci when the database version is >= 5.6.
- MariaDB added
xxx_unicode_520_ci
in version 10.0.6 (which was a beta release). For reference, MariaDB forked MySQL at version 5.1, and followed the numbering scheme up to version 5.5, then they jumped to version 10.0.10 (so the >= 5.6 version check works fairly well).
- The "520" in the name represents UCA 5.2.0.
- UCA 5.2.0 was a big improvement from
utf8_*
andutf8mb3_*
(no support for Emoji, missing CJK characters, etc); it's also better thanutf8mb4_general_ci
andutf8mb4_unicode_ci
which are affected by the "Sushi-Beer" problem (treating all characters in SMP as equal, see MySQL bug #76553).
- But, UCA 5.2.0 has the "Mother-Father issue in Japanese", this affects sorting of multi-byte characters, where MySQL does not recognise “ハ” (U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ” (U+30D0 KATAKANA LETTER BA) as different characters (see Problem #3 Sorting level).
- A MySQL blog post from July 2021 suggests using
utf8mb4_0900_*
(UCA 9.0.0) to address this issue; and specificallyutf8mb4_0900_ai_ci
if you want to be accent-insensitive and case-insensitive.
- The
utf8mb4_0900_ai_ci
charset was added in MySQL 8.0, released 2016-09-12. MySQL 8.1 was released 2023-07-18.
- The MySQL Unicode Character Sets page only notes support for UCA 9.0.0 (
utf8mb4_0900_ai_ci
), not UCA 14.0.0.
- MariaDB 10.10 (where GA 10.10.2 was released 2022-11-17), added support for UCA 14.0.0 collations (e.g. uca1400_ai_ci, ref MDEV-27009).
- The MariaDB character sets documentation does not list anything for
utf8mb4_0900_*
(and a quick test on MariaDB 10.6.14 returns an "unknown collation" error).
Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:
SET @@character_set_collations='utf8mb4=uca1400_ai_ci';
So a CREATE TABLE
which uses utf8mb4
, will use uca1400_ai_ci
instead.
#5
@
12 months ago
MySQL will move from 8.1 to 10.10 at some point.
Yep, seems about 5 years at a rough guess.
Maybe use SHOW CHARACTER SET?
SHOW COLLATION is a bit more direct, and MySQL have supported it for a while too (5.5 example - https://dbfiddle.uk/PCg3HiK3).
MariaDB [(none)]> show collation where Collation='uca1400_ai_ci';
+---------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+------+---------+----------+---------+
| uca1400_ai_ci | NULL | NULL | NULL | Yes | 8 |
+---------------+---------+------+---------+----------+---------+
I'll get this incorporated.
Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.
Yes, I was just following the insensitive theme there. I haven't any justification either way to thing accent sensitive is more important than case sensitive.
Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:
Nice. I hadn't come across this change. By "slightly off topic" I assume its not necessary to handle here, however if it was, and its desired, can you create a ticket with rough guidance on how you want it implemented.
Yes, you're right, MariaDB has no 9.0.0 collation support. Bumping to 14.0.0 was a large enough body of work without incorporating an intermediate version.
#6
@
12 months ago
Thanks @danielblack.
Just a thought (as I'm not sure what the repercussions are), but if we added support for MySQL's utf8mb4_0900_ai_ci
as well, to avoid multiple SHOW COLLATION
queries, we could use:
SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');
Then store the results on a private wpdb property, so it's cached, and can be used by has_cap()
?
Note that determine_charset()
is called by db_connect()
, via init_charset()
; and while it's fairly fast (on my localhost ~0.0006s, which does not use a network connection), it won't be as fast as mysqli_get_server_info()
to guess the supported character sets based on version number (~0.0000001s).
---
And misc points...
- I'm fine with accent-insensitive (like case-insensitive), I just don't know if it would cause any problems for anyone else (only reason I'm noting it).
- Agreed, I don't think
@@character_set_collations
is useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.
- The MaraDB documentation says "the character set name is always part of the collation name" (source), I assume that's incorrect as collation
uca1400_ai_ci
would imply a different character set.
- Running
SHOW COLLATION WHERE Collation LIKE "%uca1400%"
provides NULL for theCharset
?
- I assume it's still correct to use the
utf8mb4
character set, along withmysqli_set_charset('utf8mb4')
for the connection?
- Also, tables that exist today will use
utf8mb4_unicode_520_ci
, I don't think these will be changed during an update, seemaybe_convert_table_to_utf8mb4()
; would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?
- Oddly, if I manually run
ALTER TABLE wp_commentmeta CHANGE meta_key meta_key VARCHAR(255) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci NULL DEFAULT NULL
, then themeta_key
field collation is set toutf8mb4_uca1400_ai_ci
, which does kinda work withmaybe_convert_table_to_utf8mb4()
with it's use ofexplode('_')
.
#7
@
12 months ago
SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');
Sounds useful, and this could be just implemented in determine_charset
, with cache, so there's one query. If we do that, maybe has_cap( 'uca1400' )
need not be implemented. We'll see how easy the test cases are to write.
I'll prepare another draft implemented in determine_charset
and maybe_convert_table_to_utf8mb4
doing a collation conversion too.
From Misc:
@@character_set_collations
is useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.
It has some possibly useful implications as a default connection for coercing the collation that are probably worth while.
https://mariadb.com/kb/en/setting-character-sets-and-collations/#changing-default-collation
The MaraDB documentation says "the character set name is always part of the collation name...
Yep needs an update. I'll see what can be written.
Running SHOW COLLATION WHERE Collation LIKE "%uca1400%" provides NULL for the Charset
Finally got to the bottom of this with the original commit - https://github.com/MariaDB/server/commit/133446828c9dcb484476e4b3598af0d63d056a6e (also a documentation task to pick up)
Null implies it can apply to multiple character sets.
MariaDB [test]> select * from INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY where COLLATION_NAME='uca1400_ai_ci'; +----------------+--------------------+-----------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +----------------+--------------------+-----------------------+------+------------+ | uca1400_ai_ci | utf8mb3 | utf8mb3_uca1400_ai_ci | 2048 | | | uca1400_ai_ci | ucs2 | ucs2_uca1400_ai_ci | 2560 | | | uca1400_ai_ci | utf8mb4 | utf8mb4_uca1400_ai_ci | 2304 | | | uca1400_ai_ci | utf16 | utf16_uca1400_ai_ci | 2816 | | | uca1400_ai_ci | utf32 | utf32_uca1400_ai_ci | 3072 | | +----------------+--------------------+-----------------------+------+------------+
MySQL-5.5 still has the first two columns.
I assume it's still correct to use the utf8mb4 character set, along with mysqli_set_charset('utf8mb4') for the connection?
Yes. Or any charset from above it seems.
Also, tables that exist today will use utf8mb4_unicode_520_ci, I don't think these will be changed during an update, see
maybe_convert_table_to_utf8mb4()
;
But should they? I suspect doing so would be prudent.
would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?
Only when the SQL use corresponds to existing tables as well.
MariaDB [test]> create table t520 (t varchar(30) character set utf8mb4 collate utf8mb4_unicode_520_ci); MariaDB [test]> create table t1400 (t varchar(30) character set utf8mb4 collate utf8mb4_uca1400_ai_ci); MariaDB [test]> insert into t520 values ('bob'),('jack'), ('jane'); MariaDB [test]> insert into t1400 values ('bob'),('jack'), ('jane'); MariaDB [test]> select * from t1400 join t520 on t1400.t = t520.t; ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_uca1400_ai_ci,IMPLICIT) and (utf8mb4_unicode_520_ci,IMPLICIT) for operation '='
(and bug https://jira.mariadb.org/browse/MDEV-32192 for using @@character_set_collations
to resolve this (for 11.2+)).
Given the implicitness of this and compatibility with existing tables a conversion in update seems a way to avoid some problems.
Oddly, if I manually run ... which does kinda work with
maybe_convert_table_to_utf8mb4()
with it's use of explode('_').
I assume that was intentional.
#8
@
12 months ago
Thanks @danielblack
Not got time to check atm, but assuming we will ALTER TABLE
to avoid the "Illegal mix of collations" issue, do you know how the system will respond; i.e. will it be a full table lock while it's being updated (imagine a 10GB table, will that cause the website to be unavailable for seconds, minutes, hours? I suspect InnoDB might do better than MyISAM).
Also, failure on my part, but thought I should share... I updated a database yesterday (MariaDB 10.11.5 on RDS), but I forgot my backup servers run Ubuntu 22.04.3 LTS, which provides MariaDB 10.6.12 (the "long-term maintenance stable version"), so it could not import the dump (I do this to verify the backups work).
#9
follow-up:
↓ 10
@
12 months ago
At the minium an ALTER TABLE changing collation of an index column will result in rebuilding that index. There's nothing explicit about changing collation - https://mariadb.com/kb/en/innodb-online-ddl-operations-with-the-instant-alter-algorithm so I'd expect a rebuild.
So what's the implication here, need to pin to the charset/collation of the original install?
There are 10.11 packages for 22.04 - https://mariadb.org/download/?t=repo-config&d=22.04+%22jammy%22&v=10.11
Have a SQL query that's working: https://dbfiddle.uk/-AipUQih for detect_charset to pick the optimal:
select * from INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY where CHARACTER_SET_NAME IN ('utf8mb4', 'utf8mb3','utf8') and COLLATION_NAME IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci', 'utf8mb4_unicode_520_ci','utf8mb4_unicode_ci','utf8mb4_general_ci','utf8_general_ci','utf8_general_ci') ORDER BY FIND_IN_SET(COLLATION_NAME,'uca1400_ai_ci,utf8mb4_unicode_520_ci,utf8mb4_unicode_ci,utf8mb4_general_ci,utf8_general_ci,utf8_general_ci'), FIND_IN_SET(CHARACTER_SET_NAME, 'utf8mb4,utf8mb3,utf8') LIMIT 1;
Code (that isn't quite working yet).
https://github.com/grooverdan/wordpress-develop/commit/370444364ca6455cc9dd28e61abdbe8f0b9bd6c5
#10
in reply to:
↑ 9
@
12 months ago
Replying to danielblack:
So what's the implication here, need to pin to the charset/collation of the original install?
I think that will be for someone else to decide... @pento, as you looked at #38036, and worked on #21212, do you have any thoughts on changing utf8mb4_unicode_520_ci
to utf8mb4_uca1400_ai_ci
(MariaDB) or utf8mb4_0900_ai_ci
(MySQL)?
i.e. using a query to find out which collations are supported, and running an ALTER TABLE
during upgrade.
There are 10.11 packages for 22.04 (link)
Thanks, but apt-get upgrade
on a nightly cron is safer than me forgetting to apply patches :-)
#11
@
12 months ago
Seems there's only a few JOIN
s in the and codebase src/wp-includes/class-wp-meta-query.php
could gain an explicit collate on comparison criteria to coerce comparison collation to the larger table.
select * from t1400 join t520 on t1400.t = t520.t collate utf8mb4_uca1400_ai_ci
Simpler however is:
Alternately in the upgrade.php
there is already a wp_should_upgrade_global_tables
that could apply in the upgrade where existing criteria is around character set conversions. If the same criteria is used with collations that would be consistent.
Use UCA-14.0.0 that was added in MariaDB-10.10.1 as default.