Opened 3 years ago
Last modified 5 months ago
#58871 new enhancement
support uca14.0.0 collation in database where available
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | 6.3 |
| Component: | Database | Keywords: | has-patch has-unit-tests needs-testing needs-refresh |
| Focuses: | performance | Cc: |
Description
Like #32105 supporting a modern collation is a good idea.
MariaDB added this in 10.10.1 (ref: https://jira.mariadb.org/browse/MDEV-27009).
As the current default is case insensitive, we'll follow the same insensitivity and use accent insensitive.
So of the collations available, use uca1400_ai_ci.
ref: https://mariadb.com/kb/en/supported-character-sets-and-collations/#collations
Change History (17)
This ticket was mentioned in PR #4878 on WordPress/wordpress-develop by grooverdan.
3 years ago
#1
- Keywords has-patch has-unit-tests added
This ticket was mentioned in Slack in #core by danielblack. View the logs.
2 years ago
#4
@
2 years ago
I don't think we can simply use:
return version_compare( $db_version, '10.10.1', '>=' );
As it depends on the database in use (MySQL vs MariaDB), and while MariaDB is currently using version 10, I assume MySQL will move from 8.1 to 10.10 at some point.
Maybe use SHOW CHARACTER SET?
Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.
Notes:
- WordPress currently tries to use utf8mb4_unicode_520_ci when the database version is >= 5.6.
- MariaDB added
xxx_unicode_520_ciin version 10.0.6 (which was a beta release). For reference, MariaDB forked MySQL at version 5.1, and followed the numbering scheme up to version 5.5, then they jumped to version 10.0.10 (so the >= 5.6 version check works fairly well).
- The "520" in the name represents UCA 5.2.0.
- UCA 5.2.0 was a big improvement from
utf8_*andutf8mb3_*(no support for Emoji, missing CJK characters, etc); it's also better thanutf8mb4_general_ciandutf8mb4_unicode_ciwhich are affected by the "Sushi-Beer" problem (treating all characters in SMP as equal, see MySQL bug #76553).
- But, UCA 5.2.0 has the "Mother-Father issue in Japanese", this affects sorting of multi-byte characters, where MySQL does not recognise “ハ” (U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ” (U+30D0 KATAKANA LETTER BA) as different characters (see Problem #3 Sorting level).
- A MySQL blog post from July 2021 suggests using
utf8mb4_0900_*(UCA 9.0.0) to address this issue; and specificallyutf8mb4_0900_ai_ciif you want to be accent-insensitive and case-insensitive.
- The
utf8mb4_0900_ai_cicharset was added in MySQL 8.0, released 2016-09-12. MySQL 8.1 was released 2023-07-18.
- The MySQL Unicode Character Sets page only notes support for UCA 9.0.0 (
utf8mb4_0900_ai_ci), not UCA 14.0.0.
- MariaDB 10.10 (where GA 10.10.2 was released 2022-11-17), added support for UCA 14.0.0 collations (e.g. uca1400_ai_ci, ref MDEV-27009).
- The MariaDB character sets documentation does not list anything for
utf8mb4_0900_*(and a quick test on MariaDB 10.6.14 returns an "unknown collation" error).
Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:
SET @@character_set_collations='utf8mb4=uca1400_ai_ci';
So a CREATE TABLE which uses utf8mb4, will use uca1400_ai_ci instead.
#5
@
2 years ago
MySQL will move from 8.1 to 10.10 at some point.
Yep, seems about 5 years at a rough guess.
Maybe use SHOW CHARACTER SET?
SHOW COLLATION is a bit more direct, and MySQL have supported it for a while too (5.5 example - https://dbfiddle.uk/PCg3HiK3).
MariaDB [(none)]> show collation where Collation='uca1400_ai_ci';
+---------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+------+---------+----------+---------+
| uca1400_ai_ci | NULL | NULL | NULL | Yes | 8 |
+---------------+---------+------+---------+----------+---------+
I'll get this incorporated.
Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.
Yes, I was just following the insensitive theme there. I haven't any justification either way to thing accent sensitive is more important than case sensitive.
Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:
Nice. I hadn't come across this change. By "slightly off topic" I assume its not necessary to handle here, however if it was, and its desired, can you create a ticket with rough guidance on how you want it implemented.
Yes, you're right, MariaDB has no 9.0.0 collation support. Bumping to 14.0.0 was a large enough body of work without incorporating an intermediate version.
#6
@
2 years ago
Thanks @danielblack.
Just a thought (as I'm not sure what the repercussions are), but if we added support for MySQL's utf8mb4_0900_ai_ci as well, to avoid multiple SHOW COLLATION queries, we could use:
SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');
Then store the results on a private wpdb property, so it's cached, and can be used by has_cap()?
Note that determine_charset() is called by db_connect(), via init_charset(); and while it's fairly fast (on my localhost ~0.0006s, which does not use a network connection), it won't be as fast as mysqli_get_server_info() to guess the supported character sets based on version number (~0.0000001s).
---
And misc points...
- I'm fine with accent-insensitive (like case-insensitive), I just don't know if it would cause any problems for anyone else (only reason I'm noting it).
- Agreed, I don't think
@@character_set_collationsis useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.
- The MaraDB documentation says "the character set name is always part of the collation name" (source), I assume that's incorrect as collation
uca1400_ai_ciwould imply a different character set.
- Running
SHOW COLLATION WHERE Collation LIKE "%uca1400%"provides NULL for theCharset?
- I assume it's still correct to use the
utf8mb4character set, along withmysqli_set_charset('utf8mb4')for the connection?
- Also, tables that exist today will use
utf8mb4_unicode_520_ci, I don't think these will be changed during an update, seemaybe_convert_table_to_utf8mb4(); would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?
- Oddly, if I manually run
ALTER TABLE wp_commentmeta CHANGE meta_key meta_key VARCHAR(255) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci NULL DEFAULT NULL, then themeta_keyfield collation is set toutf8mb4_uca1400_ai_ci, which does kinda work withmaybe_convert_table_to_utf8mb4()with it's use ofexplode('_').
#7
@
2 years ago
SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');
Sounds useful, and this could be just implemented in determine_charset, with cache, so there's one query. If we do that, maybe has_cap( 'uca1400' ) need not be implemented. We'll see how easy the test cases are to write.
I'll prepare another draft implemented in determine_charset and maybe_convert_table_to_utf8mb4 doing a collation conversion too.
From Misc:
@@character_set_collationsis useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.
It has some possibly useful implications as a default connection for coercing the collation that are probably worth while.
https://mariadb.com/kb/en/setting-character-sets-and-collations/#changing-default-collation
The MaraDB documentation says "the character set name is always part of the collation name...
Yep needs an update. I'll see what can be written.
Running SHOW COLLATION WHERE Collation LIKE "%uca1400%" provides NULL for the Charset
Finally got to the bottom of this with the original commit - https://github.com/MariaDB/server/commit/133446828c9dcb484476e4b3598af0d63d056a6e (also a documentation task to pick up)
Null implies it can apply to multiple character sets.
MariaDB [test]> select * from INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY where COLLATION_NAME='uca1400_ai_ci'; +----------------+--------------------+-----------------------+------+------------+ | COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME | ID | IS_DEFAULT | +----------------+--------------------+-----------------------+------+------------+ | uca1400_ai_ci | utf8mb3 | utf8mb3_uca1400_ai_ci | 2048 | | | uca1400_ai_ci | ucs2 | ucs2_uca1400_ai_ci | 2560 | | | uca1400_ai_ci | utf8mb4 | utf8mb4_uca1400_ai_ci | 2304 | | | uca1400_ai_ci | utf16 | utf16_uca1400_ai_ci | 2816 | | | uca1400_ai_ci | utf32 | utf32_uca1400_ai_ci | 3072 | | +----------------+--------------------+-----------------------+------+------------+
MySQL-5.5 still has the first two columns.
I assume it's still correct to use the utf8mb4 character set, along with mysqli_set_charset('utf8mb4') for the connection?
Yes. Or any charset from above it seems.
Also, tables that exist today will use utf8mb4_unicode_520_ci, I don't think these will be changed during an update, see
maybe_convert_table_to_utf8mb4();
But should they? I suspect doing so would be prudent.
would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?
Only when the SQL use corresponds to existing tables as well.
MariaDB [test]> create table t520 (t varchar(30) character set utf8mb4 collate utf8mb4_unicode_520_ci);
MariaDB [test]> create table t1400 (t varchar(30) character set utf8mb4 collate utf8mb4_uca1400_ai_ci);
MariaDB [test]> insert into t520 values ('bob'),('jack'), ('jane');
MariaDB [test]> insert into t1400 values ('bob'),('jack'), ('jane');
MariaDB [test]> select * from t1400 join t520 on t1400.t = t520.t;
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_uca1400_ai_ci,IMPLICIT) and (utf8mb4_unicode_520_ci,IMPLICIT) for operation '='
(and bug https://jira.mariadb.org/browse/MDEV-32192 for using @@character_set_collations to resolve this (for 11.2+)).
Given the implicitness of this and compatibility with existing tables a conversion in update seems a way to avoid some problems.
Oddly, if I manually run ... which does kinda work with
maybe_convert_table_to_utf8mb4()with it's use of explode('_').
I assume that was intentional.
#8
@
2 years ago
Thanks @danielblack
Not got time to check atm, but assuming we will ALTER TABLE to avoid the "Illegal mix of collations" issue, do you know how the system will respond; i.e. will it be a full table lock while it's being updated (imagine a 10GB table, will that cause the website to be unavailable for seconds, minutes, hours? I suspect InnoDB might do better than MyISAM).
Also, failure on my part, but thought I should share... I updated a database yesterday (MariaDB 10.11.5 on RDS), but I forgot my backup servers run Ubuntu 22.04.3 LTS, which provides MariaDB 10.6.12 (the "long-term maintenance stable version"), so it could not import the dump (I do this to verify the backups work).
#9
follow-up:
↓ 10
@
2 years ago
At the minium an ALTER TABLE changing collation of an index column will result in rebuilding that index. There's nothing explicit about changing collation - https://mariadb.com/kb/en/innodb-online-ddl-operations-with-the-instant-alter-algorithm so I'd expect a rebuild.
So what's the implication here, need to pin to the charset/collation of the original install?
There are 10.11 packages for 22.04 - https://mariadb.org/download/?t=repo-config&d=22.04+%22jammy%22&v=10.11
Have a SQL query that's working: https://dbfiddle.uk/-AipUQih for detect_charset to pick the optimal:
select * from INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
where CHARACTER_SET_NAME IN ('utf8mb4', 'utf8mb3','utf8')
and COLLATION_NAME IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci', 'utf8mb4_unicode_520_ci','utf8mb4_unicode_ci','utf8mb4_general_ci','utf8_general_ci','utf8_general_ci')
ORDER BY
FIND_IN_SET(COLLATION_NAME,'uca1400_ai_ci,utf8mb4_unicode_520_ci,utf8mb4_unicode_ci,utf8mb4_general_ci,utf8_general_ci,utf8_general_ci'),
FIND_IN_SET(CHARACTER_SET_NAME, 'utf8mb4,utf8mb3,utf8') LIMIT 1;
Code (that isn't quite working yet).
https://github.com/grooverdan/wordpress-develop/commit/370444364ca6455cc9dd28e61abdbe8f0b9bd6c5
#10
in reply to:
↑ 9
@
2 years ago
Replying to danielblack:
So what's the implication here, need to pin to the charset/collation of the original install?
I think that will be for someone else to decide... @pento, as you looked at #38036, and worked on #21212, do you have any thoughts on changing utf8mb4_unicode_520_ci to utf8mb4_uca1400_ai_ci (MariaDB) or utf8mb4_0900_ai_ci (MySQL)?
i.e. using a query to find out which collations are supported, and running an ALTER TABLE during upgrade.
There are 10.11 packages for 22.04 (link)
Thanks, but apt-get upgrade on a nightly cron is safer than me forgetting to apply patches :-)
#11
@
2 years ago
Seems there's only a few JOINs in the and codebase src/wp-includes/class-wp-meta-query.php could gain an explicit collate on comparison criteria to coerce comparison collation to the larger table.
select * from t1400 join t520 on t1400.t = t520.t collate utf8mb4_uca1400_ai_ci
Simpler however is:
Alternately in the upgrade.php there is already a wp_should_upgrade_global_tables that could apply in the upgrade where existing criteria is around character set conversions. If the same criteria is used with collations that would be consistent.
#12
@
9 months ago
@craigfrancis Can you please help Daniel finalize this? It would potentially improve performance and reduce disk space usage on millions of WP installations. Thanks!
#13
follow-up:
↓ 14
@
5 months ago
- Focuses performance added
- Keywords needs-refresh added
- Severity changed from normal to major
@craigfrancis Hello! Shall you please finalize this? Just to report that we are happily working with utf8mb4_uca1400_ai_ci yet if this collation is not defined in wp-config withdefine('DB_COLLATE', 'utf8mb4_uca1400_ai_ci'); all new tables are defaulting to utf8mb4_unicode_520_ci, due to the instructions here:
<?php public function determine_charset( $charset, $collate ) { if ( ( ! ( $this->dbh instanceof mysqli ) ) || empty( $this->dbh ) ) { return compact( 'charset', 'collate' ); } if ( 'utf8' === $charset ) { $charset = 'utf8mb4'; } if ( 'utf8mb4' === $charset ) { // _general_ is outdated, so we can upgrade it to _unicode_, instead. if ( ! $collate || 'utf8_general_ci' === $collate ) { $collate = 'utf8mb4_unicode_ci'; } else { $collate = str_replace( 'utf8_', 'utf8mb4_', $collate ); } } // _unicode_520_ is a better collation, we should use that when it's available. if ( $this->has_cap( 'utf8mb4_520' ) && 'utf8mb4_unicode_ci' === $collate ) { $collate = 'utf8mb4_unicode_520_ci'; } return compact( 'charset', 'collate' ); }
Also please take into account this paper by MariaDB: https://mariadb.org/wp-content/uploads/2023/10/Improving-Consistency-Performance-Compatibility-Ease-of-use.pdf
Never mind asking "any" AI chatbot like:
https://claude.ai/share/9c9849b8-7f45-4e2b-aaa5-a51d1f5454e4
https://g.co/gemini/share/06fccd03103c
https://chatgpt.com/share/68da27d7-ee5c-800a-9035-574623b58e2c
#14
in reply to:
↑ 13
@
5 months ago
- Severity changed from major to normal
Replying to oxfordmetadata:
@craigfrancis Hello! Shall you please finalize this?
Please be nice; and the AI output isn't helping... as noted above (#4 and #6), mysqli_get_server_info() is currently used to determine if utf8mb4 and utf8mb4_520 is available, it's not ideal, but it works and is fast... unfortunately it doesn't work with the requested change, and the alternatives suggested so far have their own problems (e.g. performance impact; and the different names in use - uca1400_ai_ci, utf8mb4_0900_ai_ci, and utf8mb4_unicode_520_ci).
#15
follow-up:
↓ 16
@
5 months ago
My apologies for the misunderstanding in my tone, in my fast and furious indeed message.
But the "performance", or efficiency shall I say, improvement from the transition to utf8mb4_uca1400_ai_ci are pretty noticeable (thought I cannot really tell if it is more from the new collation itself or from the fact that now all objects have the same collation).
Indeed I was pretty pissed off noticing that our Pods plugin when creating Advanced Content Types (in their own tables) kept using unicode_520_ci, although the other tables were unicode_ci, leading to the need of explicit COLLATE statement mess and so on.
Earlier today we found that the "culprit" was that piece of code.
From MariaDB's presentation it appeared to me that these naming differences are here to stay, so we have to figure out something that if anything, it will maintain all the tables to a homogeneous collation.
I guess I overstayed my welcome.
Once more, please accept my apologies.
#16
in reply to:
↑ 15
@
5 months ago
Replying to oxfordmetadata:
My apologies for the misunderstanding in my tone.
It's fine, I'm just really tired atm... as to a solution, I'm not sure what it will be, I've looked at this problem a couple of times, I can't find anything obvious, hopefully one will be found soon (I would also like to use the more modern character sets).
#17
@
5 months ago
@craigfrancis I just do not have enough data to convey to you how much better a server with about 40 applications performs. I am in the process of migrating to uca_1400 as we speak :)
I think this function ought to run once during initial installation. Else as you move from one service provider to the other, you may end up with 4 (four) different collations - as is the case with many of our customers.
I mean we even had one guy who moved from say MariaDB 8 or 9, to MySQL 5.7, then to MySQL 8, then to MariaDB 10.4 and now with us on MariaDB 11.4. We even had utf8mb3 collations present. It seemed to me that as they were adding plugins/features in each host, the collations were proliferating.
Keep well!
Use UCA-14.0.0 that was added in MariaDB-10.10.1 as default.