Make WordPress Core

Opened 5 years ago

Last modified 19 months ago

#45697 new enhancement

Database: Upgrading field collations

Reported by: pento's profile pento Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Database Keywords:
Focuses: Cc:

Description

[37523] added support for the utf8mb4_unicode_520_ci collation, and #38036 exists to look at adding support for the utf8mb4_0900_ai_ci collation.

With these variations, it's possible for a WordPress site to be installed, then MySQL is upgraded sometime later. Subsequent table changes (for example, a plugin creating a new table) would use the newer collation, while the original tables would use the older collation. This will cause JOIN queries between fields of different collations to fail.

Change History (6)

#1 @pento
5 years ago

There are a few options that come to mind for this:

  • Check during the upgrade routine, and upgrade the collations then, if needed. While this makes the upgrade predictable, it means there's a period of time where the collations might be out of sync.
  • Schedule background upgrades when the MySQL change is detected. This avoids problems with collations being out of sync, but could generate long running queries that lock the tables. This may be less of an issue with newer versions of MySQL that support online DDL changes.
  • Add this to Health Check, and require it to be triggered manually.

#2 @knutsp
5 years ago

Will this sort Norwegian/Danish 'æ','ø','å' or Swedish 'æ','ö','å', correctly or will we have to wait for language specific collations?

I have been doing manual collation changes for more than ten years, to get terms and titles sorted correctly.

By only changing meta values, term names and post titles, not slugs, collation mismatch for, some short time, is mostly avoided. Just and idea.

#3 @pento
5 years ago

That's a good thing to take into account, thanks @knutsp.

utf8mb4_0900_ai_ci does accent-insensitive ("ai") sorting, as well as case insensitive ("ci") sorting.

For language-specific sorting, there are language-specific collations (eg, utf8mb4_sv_0900_ai_ci), as well as accent and case sensitive collations (eg, utf8mb4_sv_0900_as_cs). Any upgrade process should take these into account.

Ref: https://dev.mysql.com/doc/refman/8.0/en/charset-mysql.html

#4 @johnjamesjacoby
4 years ago

Running into this already, though I'm able to work around it manually for the time being.

Relatedly, it is not uncommon to run into plugins that have not moved to utf8mb4 yet, resulting in collections of tables with varying character sets and collations, but I think it's rare that plugins JOIN their own tables with core ones.

[rant]
I believe plugins should be responsible for their own database tables, but I wish WordPress came with more robust tools to help plugins take on that responsibility in a more uniform way.

BerlinDB versions individual tables with an option for each, and upgrade routines run automatically when tables are out of date. They run their own character set and collation changes like any other upgrade. It's cheap and effective.

Plugins probably should not be JOINing with WordPress core tables, as there is no guarantee all of the tables will be on the same database server and accessible to each other.
[/rant]

#5 @JavierCasares
2 years ago

Setting a charset is important and will be great or having a universal one (like "utf8mb4") because, right now, "utf8" and "utf8mb3" are confusing between versions.

About the collation, each database have their own. For example, default collations:

  • MySQL / UTF8mb3: utf8_general_ci
  • MariaDB / UTF8mb3: utf8mb3_general_ci
  • MySQL / UTF8mb4: utf8mb4_0900_ai_ci
  • MariaDB / UTF8mb4: utf8mb4_general_ci

Also, about the collation... Should we explain (like at Sie Health) that, if you write in Spanish, may be better to have something like utf8mb4_spanish_ci so you have order like "...m, n, ñ, o, p..."?

Note: See TracTickets for help on using tickets.