Make WordPress Core

Opened 14 months ago

Last modified 12 months ago

#58871 new enhancement

support uca14.0.0 collation in database where available

Reported by: danielblack's profile danielblack Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 6.3
Component: Database Keywords: has-patch has-unit-tests needs-testing
Focuses: Cc:

Description

Like #32105 supporting a modern collation is a good idea.

MariaDB added this in 10.10.1 (ref: https://jira.mariadb.org/browse/MDEV-27009).

As the current default is case insensitive, we'll follow the same insensitivity and use accent insensitive.

So of the collations available, use uca1400_ai_ci.

ref: https://mariadb.com/kb/en/supported-character-sets-and-collations/#collations

Change History (11)

This ticket was mentioned in PR #4878 on WordPress/wordpress-develop by grooverdan.


14 months ago
#1

  • Keywords has-patch has-unit-tests added

Use UCA-14.0.0 that was added in MariaDB-10.10.1 as default.

This ticket was mentioned in Slack in #core by danielblack. View the logs.


13 months ago

#3 @danielblack
13 months ago

  • Keywords needs-testing added

#4 @craigfrancis
12 months ago

I don't think we can simply use:

return version_compare( $db_version, '10.10.1', '>=' );

As it depends on the database in use (MySQL vs MariaDB), and while MariaDB is currently using version 10, I assume MySQL will move from 8.1 to 10.10 at some point.

Maybe use SHOW CHARACTER SET?

Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.


Notes:

  • MariaDB added xxx_unicode_520_ci in version 10.0.6 (which was a beta release). For reference, MariaDB forked MySQL at version 5.1, and followed the numbering scheme up to version 5.5, then they jumped to version 10.0.10 (so the >= 5.6 version check works fairly well).
  • The "520" in the name represents UCA 5.2.0.
  • UCA 5.2.0 was a big improvement from utf8_* and utf8mb3_* (no support for Emoji, missing CJK characters, etc); it's also better than utf8mb4_general_ci and utf8mb4_unicode_ci which are affected by the "Sushi-Beer" problem (treating all characters in SMP as equal, see MySQL bug #76553).
  • But, UCA 5.2.0 has the "Mother-Father issue in Japanese", this affects sorting of multi-byte characters, where MySQL does not recognise “ハ” (U+30CF KATAKANA LETTER HA), “パ” (U+30D1 KATAKANA LETTER PA), and “バ” (U+30D0 KATAKANA LETTER BA) as different characters (see Problem #3 Sorting level).
  • A MySQL blog post from July 2021 suggests using utf8mb4_0900_* (UCA 9.0.0) to address this issue; and specifically utf8mb4_0900_ai_ci if you want to be accent-insensitive and case-insensitive.

Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:

SET @@character_set_collations='utf8mb4=uca1400_ai_ci';

So a CREATE TABLE which uses utf8mb4, will use uca1400_ai_ci instead.

#5 @danielblack
12 months ago

MySQL will move from 8.1 to 10.10 at some point.

Yep, seems about 5 years at a rough guess.

Maybe use SHOW CHARACTER SET?

SHOW COLLATION is a bit more direct, and MySQL have supported it for a while too (5.5 example - https://dbfiddle.uk/PCg3HiK3).

MariaDB [(none)]> show collation where Collation='uca1400_ai_ci';
+---------------+---------+------+---------+----------+---------+
| Collation | Charset | Id | Default | Compiled | Sortlen |
+---------------+---------+------+---------+----------+---------+
| uca1400_ai_ci | NULL | NULL | NULL | Yes | 8 |
+---------------+---------+------+---------+----------+---------+

I'll get this incorporated.

Also, do we want WordPress to be accent-insensitive by default? WordPress is currently case-insensitive by default.

Yes, I was just following the insensitive theme there. I haven't any justification either way to thing accent sensitive is more important than case sensitive.

Slightly off topic, the MariaDB Jira ticket MDEV-30164 gets MariaDB 11.2.1 to support:

Nice. I hadn't come across this change. By "slightly off topic" I assume its not necessary to handle here, however if it was, and its desired, can you create a ticket with rough guidance on how you want it implemented.

Yes, you're right, MariaDB has no 9.0.0 collation support. Bumping to 14.0.0 was a large enough body of work without incorporating an intermediate version.

#6 @craigfrancis
12 months ago

Thanks @danielblack.

Just a thought (as I'm not sure what the repercussions are), but if we added support for MySQL's utf8mb4_0900_ai_ci as well, to avoid multiple SHOW COLLATION queries, we could use:

SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');

Then store the results on a private wpdb property, so it's cached, and can be used by has_cap()?

Note that determine_charset() is called by db_connect(), via init_charset(); and while it's fairly fast (on my localhost ~0.0006s, which does not use a network connection), it won't be as fast as mysqli_get_server_info() to guess the supported character sets based on version number (~0.0000001s).

---

And misc points...

  • I'm fine with accent-insensitive (like case-insensitive), I just don't know if it would cause any problems for anyone else (only reason I'm noting it).
  • Agreed, I don't think @@character_set_collations is useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.
  • The MaraDB documentation says "the character set name is always part of the collation name" (source), I assume that's incorrect as collation uca1400_ai_ci would imply a different character set.
  • Running SHOW COLLATION WHERE Collation LIKE "%uca1400%" provides NULL for the Charset?
  • I assume it's still correct to use the utf8mb4 character set, along with mysqli_set_charset('utf8mb4') for the connection?
  • Also, tables that exist today will use utf8mb4_unicode_520_ci, I don't think these will be changed during an update, see maybe_convert_table_to_utf8mb4(); would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?
  • Oddly, if I manually run ALTER TABLE wp_commentmeta CHANGE meta_key meta_key VARCHAR(255) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci NULL DEFAULT NULL, then the meta_key field collation is set to utf8mb4_uca1400_ai_ci, which does kinda work with maybe_convert_table_to_utf8mb4() with it's use of explode('_').

#7 @danielblack
12 months ago

 SHOW COLLATION where Collation IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci','utf8mb4_unicode_520_ci');

Sounds useful, and this could be just implemented in determine_charset, with cache, so there's one query. If we do that, maybe has_cap( 'uca1400' ) need not be implemented. We'll see how easy the test cases are to write.

I'll prepare another draft implemented in determine_charset and maybe_convert_table_to_utf8mb4 doing a collation conversion too.

From Misc:

@@character_set_collations is useful here, I just thought I would mention it incase it gave any inspiration for alternative solutions.

It has some possibly useful implications as a default connection for coercing the collation that are probably worth while.

https://mariadb.com/kb/en/setting-character-sets-and-collations/#changing-default-collation

The MaraDB documentation says "the character set name is always part of the collation name...

Yep needs an update. I'll see what can be written.

Running SHOW COLLATION WHERE Collation LIKE "%uca1400%" provides NULL for the Charset

Finally got to the bottom of this with the original commit - https://github.com/MariaDB/server/commit/133446828c9dcb484476e4b3598af0d63d056a6e (also a documentation task to pick up)

Null implies it can apply to multiple character sets.

MariaDB [test]> select * from  INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY where COLLATION_NAME='uca1400_ai_ci';
+----------------+--------------------+-----------------------+------+------------+
| COLLATION_NAME | CHARACTER_SET_NAME | FULL_COLLATION_NAME   | ID   | IS_DEFAULT |
+----------------+--------------------+-----------------------+------+------------+
| uca1400_ai_ci  | utf8mb3            | utf8mb3_uca1400_ai_ci | 2048 |            |
| uca1400_ai_ci  | ucs2               | ucs2_uca1400_ai_ci    | 2560 |            |
| uca1400_ai_ci  | utf8mb4            | utf8mb4_uca1400_ai_ci | 2304 |            |
| uca1400_ai_ci  | utf16              | utf16_uca1400_ai_ci   | 2816 |            |
| uca1400_ai_ci  | utf32              | utf32_uca1400_ai_ci   | 3072 |            |
+----------------+--------------------+-----------------------+------+------------+

MySQL-5.5 still has the first two columns.

I assume it's still correct to use the utf8mb4 character set, along with mysqli_set_charset('utf8mb4') for the connection?

Yes. Or any charset from above it seems.

Also, tables that exist today will use utf8mb4_unicode_520_ci, I don't think these will be changed during an update, see maybe_convert_table_to_utf8mb4();

But should they? I suspect doing so would be prudent.

would that cause any problems (e.g. adding new tables/columns that would then use a different collation)?

Only when the SQL use corresponds to existing tables as well.

MariaDB [test]> create table t520 (t varchar(30) character set  utf8mb4 collate utf8mb4_unicode_520_ci);
MariaDB [test]> create table t1400 (t varchar(30) character set  utf8mb4 collate utf8mb4_uca1400_ai_ci);
MariaDB [test]> insert into t520 values ('bob'),('jack'), ('jane');
MariaDB [test]> insert into t1400 values ('bob'),('jack'), ('jane');

MariaDB [test]> select * from t1400 join t520 on t1400.t = t520.t;
ERROR 1267 (HY000): Illegal mix of collations (utf8mb4_uca1400_ai_ci,IMPLICIT) and (utf8mb4_unicode_520_ci,IMPLICIT) for operation '='

(and bug https://jira.mariadb.org/browse/MDEV-32192 for using @@character_set_collations to resolve this (for 11.2+)).

Given the implicitness of this and compatibility with existing tables a conversion in update seems a way to avoid some problems.

Oddly, if I manually run ... which does kinda work with maybe_convert_table_to_utf8mb4() with it's use of explode('_').

I assume that was intentional.

#8 @craigfrancis
12 months ago

Thanks @danielblack

Not got time to check atm, but assuming we will ALTER TABLE to avoid the "Illegal mix of collations" issue, do you know how the system will respond; i.e. will it be a full table lock while it's being updated (imagine a 10GB table, will that cause the website to be unavailable for seconds, minutes, hours? I suspect InnoDB might do better than MyISAM).

Also, failure on my part, but thought I should share... I updated a database yesterday (MariaDB 10.11.5 on RDS), but I forgot my backup servers run Ubuntu 22.04.3 LTS, which provides MariaDB 10.6.12 (the "long-term maintenance stable version"), so it could not import the dump (I do this to verify the backups work).

#9 follow-up: @danielblack
12 months ago

At the minium an ALTER TABLE changing collation of an index column will result in rebuilding that index. There's nothing explicit about changing collation - https://mariadb.com/kb/en/innodb-online-ddl-operations-with-the-instant-alter-algorithm so I'd expect a rebuild.

So what's the implication here, need to pin to the charset/collation of the original install?

There are 10.11 packages for 22.04 - https://mariadb.org/download/?t=repo-config&d=22.04+%22jammy%22&v=10.11

Have a SQL query that's working: https://dbfiddle.uk/-AipUQih for detect_charset to pick the optimal:

select * from  INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
 where CHARACTER_SET_NAME IN ('utf8mb4', 'utf8mb3','utf8') 
  and  COLLATION_NAME IN ('uca1400_ai_ci','utf8mb4_0900_ai_ci', 'utf8mb4_unicode_520_ci','utf8mb4_unicode_ci','utf8mb4_general_ci','utf8_general_ci','utf8_general_ci') 
  ORDER BY 
  FIND_IN_SET(COLLATION_NAME,'uca1400_ai_ci,utf8mb4_unicode_520_ci,utf8mb4_unicode_ci,utf8mb4_general_ci,utf8_general_ci,utf8_general_ci'), 
  FIND_IN_SET(CHARACTER_SET_NAME, 'utf8mb4,utf8mb3,utf8') LIMIT 1;

Code (that isn't quite working yet).
https://github.com/grooverdan/wordpress-develop/commit/370444364ca6455cc9dd28e61abdbe8f0b9bd6c5

#10 in reply to: ↑ 9 @craigfrancis
12 months ago

Replying to danielblack:

So what's the implication here, need to pin to the charset/collation of the original install?

I think that will be for someone else to decide... @pento, as you looked at #38036, and worked on #21212, do you have any thoughts on changing utf8mb4_unicode_520_ci to utf8mb4_uca1400_ai_ci (MariaDB) or utf8mb4_0900_ai_ci (MySQL)?

i.e. using a query to find out which collations are supported, and running an ALTER TABLE during upgrade.


There are 10.11 packages for 22.04 (link)

Thanks, but apt-get upgrade on a nightly cron is safer than me forgetting to apply patches :-)

#11 @danielblack
12 months ago

Seems there's only a few JOINs in the and codebase src/wp-includes/class-wp-meta-query.php could gain an explicit collate on comparison criteria to coerce comparison collation to the larger table.

select * from t1400 join t520 on t1400.t = t520.t collate utf8mb4_uca1400_ai_ci

Simpler however is:

Alternately in the upgrade.php there is already a wp_should_upgrade_global_tables that could apply in the upgrade where existing criteria is around character set conversions. If the same criteria is used with collations that would be consistent.

Note: See TracTickets for help on using tickets.