WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 3 years ago

#38186 new defect (bug)

Database Collations Bypassed by determine_charset() in wp-db.php

Reported by: natecf Owned by:
Milestone: Awaiting Review Priority: normal
Severity: major Version: 4.6.1
Component: Charset Keywords:
Focuses: Cc:
PR Number:

Description

The function 'determine_charset' in wp-db.php, since 4.6.0 will try to set the database collation to the best option available, 'utf8mb4_unicode_520_ci'. This makes sense when the collation is not explicitly defined in wp-config.php, and most likely beneficial to novice developers.

However, this assumptive behavior occurs even when the DB collation is explicitly defined in wp-config, effectively ignoring 'utf8_general_ci', 'utf8mb4_unicode_ci', and collations starting with 'utf8_' in certain environments.

There appears to be no way to force these collations when defined by the developer. If the developer has set the collation in the config file, shouldn't the software obey the word of the developer?

In short, a valid database collation set in wp-config should not be bypassed by WP Core. There is no workaround that vindicates the logic in determine_charset().

A possible solution would involve determine_charset() checking for a config setting such as 'FORCE_DB_COLLATE'. If true, use the collation defined in wp-config without assuming otherwise.

Change History (2)

#1 in reply to: ↑ description @MikeGillihan
3 years ago

Replying to natecf:

Totally agree that this is a problem that should be solved. Not sure adding another constant is the best approach though. I submitted a patch for this (see #37956), but it was closed without any discussion.

Hopefully, you'll have better luck!

#2 @cimatti
3 years ago

  • Severity changed from normal to major

I think this is a deep issue with many potential consequence, because even if WordPress changed the default result of $wpdb->charset and $wpdb->collate , charset and collations in databases of already existings installations are not updated.

Even plugins are involved, because they should use $wpdb->charset and $wpdb->collate to create tables. So plugins that created tables with collation utf8mb4_unicode_ci with an older WordPress version, now may create new tables and columns with collation utf8mb4_unicode_520_ci

I already noticed in an old WordPress installation that wordpress columns remained on collation utf8mb4_unicode_ci but a plugin created a table with utf8mb4_unicode_520_ci. I have a plugin that has to create a temporary table and join it to existing tables to do a task. This stopped to work because old tables uses utf8mb4_unicode_ci and the new temporary table uses utf8mb4_unicode_520_ci

So the big problem is that if you make a join or an operation between two columns with collation utf8mb4_unicode_ci and utf8mb4_unicode_520_ci the query fails

The passage from utf8 to utf8mb4 could be problematic, because MySQL normally has a limit of 1000 byte for keys, so with utf8 the key can't hold more than 333 characters, and with utf8mb4 the limit is 250 and if you have a key valid with utf8 it may be too long with utf8mb4

Changing a collation may be problematic too if you change it on a column with an unique key, because values that were considered different before, may be considered equal with the new collation

I propose to follow this path:

  • default charset and collation should be chosen during installation, and you should stick with that
  • you should define a standard procedure to change collation, and plugins should implement a callback to change it in their tables when called
  • anyway migration to another collation should be discouraged, and if necessary it should be tested before on a copy of the site, and in any case a backup is strongly suggested
Note: See TracTickets for help on using tickets.