Make WordPress Core

Opened 15 months ago

Last modified 2 weeks ago

#59868 new defect (bug)

Database insert with emoji fails when table has columns with both utf8mb3 (utf8) and utf8mb4 charsets

Reported by: ianmjones's profile ianmjones Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.2
Component: Charset Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

The wpdb::get_table_charset() function currently sets the charset to utf8 when it detects that both utf8 and utf8mb4 charsets are present in the table's column definitions.

That same function also swaps in utf8 for utf8mb3 as they are effectively the same thing.

This means that the wpdb::strip_invalid_text_from_query() function used early by the wpdb::query() function to determine whether text is safe to be inserted, ends up stripping utf8mb4 safe characters because it forces the use of utf8 in the called wpdb::strip_invalid_text() function.

This results in insert queries failing where a table has columns with both utf8mb3/utf8 and utf8mb4 collations used, and there are emojis or other 4 byte characters being used in the column that has a utf8mb4 charset and collation defined.

I propose that the wpdb::get_table_charset() function should use utf8mb4 as the returned charset when it detects that 2 charsets are defined on the table, and they are utf8 and utf8mb4, instead of the current behaviour of returning utf8.

Change History (6)

This ticket was mentioned in Slack in #core by jorbin. View the logs.


14 months ago

#2 @jorbin
14 months ago

  • Milestone changed from Awaiting Review to Future Release
  • Version changed from trunk to 4.2

I think the first step for this is an automated test to demonstrate the issue. This might only affect specific versions of MySQL or MariaDB, so it would be good to know if this is an all version bug or a specific version one. Once there is a test and it's known, I think this can get moved into a milestone. Depending on the scope of the fix, it may also need the early tag but I think it's too early for that decision.

Updating the version to 4.2 since that is when utf8mb support was added and from the description, it sounds like something that has been around since then.

#3 @jondaley
4 months ago

  • Keywords has-patch added; needs-patch removed

I'm not sure how I only ran into this now, but my users are now complaining that they can't submit emojis.

My setup:
Wordpress 6.6.2
MariaDB: 10.11.6-MariaDB-0+deb12u1-log
PHP 8.2.20

wp_posts.post_content collation: utf8mb3_unicode_ci
I just upgraded it to utf8mb4 to see if that would fix the problem, as I saw some forum posts saying that WordPress is using utf8mb4 and so maybe an upgrade failed somewhere along the way, and I didn't notice. (the alter table command did take over 20 minutes - my database is fairly large: 3.2GB)

So, now the wp_posts.post_content collation: utf8mb4_unicode_520_ci

And I have the same problem - inserting emojis (using the buddyboss plugin, that tries to save 😃 gets tripped up in wp_insert_post because wp_encode_emoji() is NOT called, because the code is checking for exactly "utf8".

I was able to fix it:

4557c4557
< 			if ( strpos($charset, 'utf8') !== FALSE ) {
---
> 			if ( 'utf8' === $charset ) {

I don't know if that would be an acceptable fix for everyone. Since this code is emoji specific and not trying to cover all UTF8 characters, maybe it would be ok?

I saw some references to Japanese emojis having problems, so maybe they have 4 byte emojis?

Perhaps a safer change would be to check for utf8mb4
... (I'm not sure what goes in the ... - if I understand the character sets correctly, utf8 is 3 bytes by default, and so anyone using utf8mb3 or utf8mb4 would be at least as good as someone using utf8, so I'm thinking that all three of those choices would work fine, and I don't know if the strpos() solution is easier/safe enough to implement?

Let me know if I can help debug this any further. I have a development site that I can play with and not affect live users.

This ticket was mentioned in Slack in #core-test by sppramodh. View the logs.


3 months ago

This ticket was mentioned in PR #7662 on WordPress/wordpress-develop by @debarghyabanerjee.


3 months ago
#5

  • Keywords has-unit-tests added; needs-unit-tests removed

Trac Ticket: Core-59868

## Summary

  • This pull request addresses an issue with the wpdb::get_table_charset function regarding character set determination for database tables with mixed charset definitions. The previous implementation prioritized utf8 when both utf8 and utf8mb4 were present, leading to potential data loss for 4-byte characters (e.g., emojis). This change ensures that utf8mb4 is used when applicable.

## Changes Made

### Updated Charset Logic:

  • The logic in the get_table_charset function was modified to return utf8mb4 when both utf8 and utf8mb4 character sets are detected in the table's columns.
  • This ensures that columns defined with utf8mb4 can store all valid characters, including 4-byte characters.

## Test Updates:

  • Adjusted the test cases in Tests_DB_Charset::test_get_table_charset to reflect the new behavior, specifically modifying expectations where the charset was previously set to utf8 when utf8mb4 should now be expected.

## Rationale

  • The changes are crucial for maintaining data integrity and supporting modern applications that require the storage of a wider range of characters. By ensuring that the database can correctly handle 4-byte characters, we prevent potential insert errors and improve the overall robustness of the character handling in WordPress.

## Impact

  • This improvement should enhance compatibility with rich media content and internationalization, benefiting developers and users alike.

## Testing

  • Updated tests to reflect new charset logic.
  • All tests have been run and passed successfully.

#6 @jondaley
2 weeks ago

I think the code has been written and a PR generated, but not merged into the core yet? Is this waiting for someone? I see the PR was checked by a few people.

My customer complained that my fix stopped working.

I currently have the following in post.php. I'll do some debugging to see if I can figure out what has changed.

        foreach ( $emoji_fields as $emoji_field ) {
                if ( isset( $data[ $emoji_field ] ) ) {
                        $charset = $wpdb->get_col_charset( $wpdb->posts, $emoji_field );

            // TSL mod: https://core.trac.wordpress.org/ticket/59868
            if ( strpos($charset, 'utf8') !== FALSE ) {
                                $data[ $emoji_field ] = wp_encode_emoji( $data[ $emoji_field ] );
                        }
                }
        }

Note: See TracTickets for help on using tickets.