Opened 8 years ago
Last modified 5 years ago
#36610 new defect (bug)
Loss of multibyte category and tag names
Reported by: | cfinke | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | normal | Version: | |
Component: | Taxonomy | Keywords: | needs-patch |
Focuses: | | Cc: | |
Description
Some multibyte category and tag names can be lost during creation.
Example: create a category with the name テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAAA. It is 201 bytes long and will be truncated by $wpdb->strip_invalid_text_for_column() to 200 bytes (テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAA) before the category is created.
However, the category name AAAテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ is also 201 bytes, but truncating it to 200 bytes splits a multibyte character. When wp_check_invalid_utf8() is then called, it reduces the string to zero bytes out of an abundance of caution, because the string now ends with a byte sequence that is not valid UTF-8.
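The byte arithmetic behind the two cases can be reproduced outside WordPress. The following is a minimal sketch, not the actual $wpdb code path: plain substr() stands in for the 200-byte truncation, and the hypothetical helper demo_is_valid_utf8() stands in for the validity check.

```php
<?php
// "テ" (U+30C6) is three bytes in UTF-8: E3 83 86.
$kana = str_repeat( "\xE3\x83\x86", 66 ); // 66 characters x 3 bytes = 198 bytes

$ascii_last = $kana . 'AAA'; // 201 bytes, ends in single-byte characters
$kana_last  = 'AAA' . $kana; // 201 bytes, ends in a three-byte character

// Hypothetical helper: with the /u modifier, preg_match() returns false
// when the subject is not valid UTF-8, and 1 when this empty pattern matches.
function demo_is_valid_utf8( $text ) {
    return 1 === preg_match( '//u', $text );
}

// Stand-in for the truncation to the varchar(200) byte limit.
$cut_ascii_last = substr( $ascii_last, 0, 200 ); // drops one whole "A"
$cut_kana_last  = substr( $kana_last, 0, 200 );  // drops one byte of the last "テ"

var_dump( strlen( $ascii_last ), strlen( $kana_last ) );
var_dump( demo_is_valid_utf8( $cut_ascii_last ) );
var_dump( demo_is_valid_utf8( $cut_kana_last ) );
```

The first truncated string is still well formed, while the second ends in two stray bytes of a three-byte sequence, which is the condition that causes the whole name to be discarded.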
It's clear that the category creator was not submitting invalid UTF-8, and the true goal of $wpdb->strip_invalid_text_for_column() was to ensure that the text would fit in the DB column without auto-truncation by the DB engine. The ideal behavior is therefore to truncate the string to the longest possible length that remains valid and fits within the column.
One way to get around this data loss would be a wrapper around wp_check_invalid_utf8(). If wp_check_invalid_utf8() fails, chop a single byte off the end of the string and check it again, up to the point where you have checked the string without the last five bytes (the original UTF-8 design allowed a single character to be up to six bytes long, although RFC 3629 now limits it to four, so anything longer than four bytes is theoretical). Or, fix $wpdb->strip_invalid_text_for_column() so that it doesn't truncate in the middle of a multibyte character.
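The retry idea could look roughly like the sketch below. The function name demo_truncate_utf8_bytes() is hypothetical, preg_match() with the /u modifier is used as the validity check instead of wp_check_invalid_utf8() so that the snippet runs without WordPress, and the five-byte limit is taken from the proposal above.

```php
<?php
// Hypothetical helper: truncate $text to at most $max_bytes bytes, then drop
// up to five trailing bytes until what remains is valid UTF-8.
function demo_truncate_utf8_bytes( $text, $max_bytes ) {
    if ( strlen( $text ) <= $max_bytes ) {
        return $text;
    }

    $cut   = (string) substr( $text, 0, $max_bytes );
    $limit = min( 5, strlen( $cut ) );

    for ( $dropped = 0; $dropped <= $limit; $dropped++ ) {
        // substr( $cut, 0, -0 ) would return an empty string, so handle 0 separately.
        $candidate = 0 === $dropped ? $cut : (string) substr( $cut, 0, -$dropped );

        // preg_match() returns false for subjects that are not valid UTF-8.
        if ( 1 === preg_match( '//u', $candidate ) ) {
            return $candidate;
        }
    }

    // Still invalid after the retries: the input was malformed for another reason.
    return '';
}

$name   = 'AAA' . str_repeat( "\xE3\x83\x86", 66 ); // the 201-byte example
$result = demo_truncate_utf8_bytes( $name, 200 );
var_dump( strlen( $result ) );
```

For the 201-byte example the loop stops after dropping the two stray bytes, so the stored name would be AAA followed by 65 whole characters rather than an empty string.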
There might be a solution lurking in mb_strlen(). If wp_check_invalid_utf8() returns an empty string, take bytes off of the original string (up to 5 bytes) until mb_strlen() returns a smaller number and then try wp_check_invalid_utf8() again.
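A related mbstring function may be worth considering here, although it is not what this ticket proposes: mb_strcut() takes a byte length but moves the cut back to a character boundary. The sketch below assumes the mbstring extension is loaded, which is not guaranteed on every install, and the variable names are illustrative only.

```php
<?php
// Assumes the mbstring extension is available.
$kana_last  = 'AAA' . str_repeat( "\xE3\x83\x86", 66 ); // 201 bytes
$ascii_last = str_repeat( "\xE3\x83\x86", 66 ) . 'AAA'; // 201 bytes

// mb_strcut() counts bytes, not characters, but never returns a partial character.
$safe_kana_last  = mb_strcut( $kana_last, 0, 200, 'UTF-8' );
$safe_ascii_last = mb_strcut( $ascii_last, 0, 200, 'UTF-8' );

var_dump( strlen( $safe_kana_last ), strlen( $safe_ascii_last ) );
var_dump( mb_check_encoding( $safe_kana_last, 'UTF-8' ) );
```

Because the cut lands on a character boundary, a later validity check would have nothing to reject, at the cost of requiring mbstring or a fallback for sites without it.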
Configuration details: Tested in WordPress trunk (4.5-RC1-37153) and PHP 5.2.17
Here's my wp_terms structure:
CREATE TABLE `wp_terms` (
  `term_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(200) NOT NULL DEFAULT '',
  `slug` varchar(200) NOT NULL DEFAULT '',
  `term_group` bigint(10) NOT NULL DEFAULT '0',
  PRIMARY KEY (`term_id`),
  KEY `slug` (`slug`(191)),
  KEY `name` (`name`(191))
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
See #36393 for discussion of a similar (but now-fixed) bug.