Opened 9 years ago
Closed 9 years ago
#36393 closed defect (bug) (fixed)
Loss of multibyte comment author names
Reported by: | cfinke | Owned by: | |
---|---|---|---|
Milestone: | 4.5 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Comments | Keywords: | close |
Focuses: | Cc: |
Description
Some multibyte comment author names can be lost during comment submission.
Example: consider a comment authored by a user named テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
. This name is a 258-byte string, longer than the maximum length of the comment_author
column. $wpdb->strip_invalid_text_for_column()
will truncate it to 255 bytes, and because each character is three bytes, the string is still "valid," albeit one character shorter.
After $wpdb->strip_invalid_text_for_column()
runs, sanitize_text_field()
will run, which calls wp_check_invalid_utf8()
, which will do nothing, because the string is still valid utf8.
If this commenter's older sister, Aテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
, also tries to comment, the result is very different. This name is a 259 byte string. $wpdb->strip_invalid_text_for_column()
will truncate it to 255 bytes, taking off one character and 1/3 of another. When wp_check_invalid_utf8()
gets called, it will truncate the string to zero bytes out of an abundance of caution, since the string ends with something that is not valid utf8.
It's clear that the commenter was not submitting invalid utf8, and the true goal of $wpdb->strip_invalid_text_for_column()
was to ensure that the text would fit in the DB column without auto-truncation by the DB engine, so the ideal behavior should be that the string is truncated to the longest possible length that remains valid and fits within the column.
One way to get around this data loss would be a wrapper around wp_check_invalid_utf8()
. If wp_check_invalid_utf8()
fails, chop a single byte off the end of the string and check it again, up to the point where you have checked the string without the last five bytes (as I believe that the longest a single character can be is six bytes, although I'm not positive about that and I think anything longer than four bytes is mostly theoretical). Or, fix $wpdb->strip_invalid_text_for_column()
so that it doesn't truncate in the middle of a multibyte character.
Configuration details: Tested in both WordPress 4.4.2 and trunk (4.5-RC1-37153); PHP 5.2.17
I noticed this issue in regards to commenter names, so here's the structure of my comments DB table (created in 2006, FWIW):
CREATE TABLE `wp_comments` ( `comment_ID` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `comment_post_ID` bigint(20) unsigned NOT NULL DEFAULT '0', `comment_author` tinytext NOT NULL, `comment_author_email` varchar(100) NOT NULL DEFAULT '', `comment_author_url` varchar(200) NOT NULL DEFAULT '', `comment_author_IP` varchar(100) NOT NULL DEFAULT '', `comment_date` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `comment_date_gmt` datetime NOT NULL DEFAULT '0000-00-00 00:00:00', `comment_content` text NOT NULL, `comment_karma` int(11) NOT NULL DEFAULT '0', `comment_approved` varchar(20) NOT NULL DEFAULT '1', `comment_agent` varchar(255) NOT NULL DEFAULT '', `comment_type` varchar(20) NOT NULL DEFAULT '', `comment_parent` bigint(20) unsigned NOT NULL DEFAULT '0', `user_id` bigint(20) unsigned NOT NULL DEFAULT '0', PRIMARY KEY (`comment_ID`), KEY `comment_post_ID` (`comment_post_ID`), KEY `comment_approved_date_gmt` (`comment_approved`,`comment_date_gmt`), KEY `comment_date_gmt` (`comment_date_gmt`), KEY `comment_parent` (`comment_parent`), KEY `comment_author_email` (`comment_author_email`(10)) ) ENGINE=MyISAM AUTO_INCREMENT=2130254 DEFAULT CHARSET=latin1;
In case the strings I used as example commenter names above get mangled, here are their base64 encodings:
commenter1: string(344) "776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D776D"
commenter2: string(348) "Qe++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++g+++gw=="
I'm attaching a POC plugin that manually walks through how the commenter name gets handled in the comment submission process (but only when the first attempt to save the comment fails and then requires the $wpdb->strip_invalid_text_for_column()
call).
Attachments (1)
Change History (8)
#1
@
9 years ago
There might be a solution lurking in mb_strlen()
. If wp_check_invalid_utf8()
returns an empty string, take bytes off of the original string (up to 5 bytes) until mb_strlen()
returns a smaller number and then try wp_check_invalid_utf8()
.
#2
follow-up:
↓ 3
@
9 years ago
This is (more or less) taken care of in 4.5 by checking the submitted strings lengths, see #10377. Trying to use テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
as comment author name in trunk blocks the comment and triggers an error.
In this particular case showing an error is quite better UX than silently truncating somebody's name :)
#3
in reply to:
↑ 2
;
follow-up:
↓ 5
@
9 years ago
Replying to azaozz:
This is (more or less) taken care of in 4.5 by checking the submitted strings lengths, see #10377.
Agreed that #10377 covers the situation I encountered. I had tested it in 4.4 by actually doing the comment form submission, but when I tested it against trunk, I was using the code I had written to simulate the process, which didn't trigger those checks.
This same bug is present in term creation (tested in trunk just now): create a tag or category with the name AAAテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテ
(201 bytes, destined for a 200-byte field). The category or tag will be created, but with a blank name. If you use the string テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAAA
(again 201 bytes, but with the 200th byte being a complete character), the name will be properly truncated to テテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテテAA
before creating the category.
Here's my wp_terms
structure:
CREATE TABLE `wp_terms` ( `term_id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, `name` varchar(200) NOT NULL DEFAULT '', `slug` varchar(200) NOT NULL DEFAULT '', `term_group` bigint(10) NOT NULL DEFAULT '0', PRIMARY KEY (`term_id`), KEY `slug` (`slug`(191)), KEY `name` (`name`(191)) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
#5
in reply to:
↑ 3
;
follow-up:
↓ 6
@
9 years ago
- Keywords close added
Replying to cfinke:
Agreed that #10377 covers the situation I encountered. I had tested it in 4.4 by actually doing the comment form submission, but when I tested it against trunk, I was using the code I had written to simulate the process, which didn't trigger those checks.
@cfinke Since the particular scope of this ticket was resolved with #10377 would you be able to open a new ticket or at least modify the title and description of this one to accurately match the remaining multibyte issues?
#6
in reply to:
↑ 5
@
9 years ago
Replying to rachelbaker:
Replying to cfinke:
Agreed that #10377 covers the situation I encountered. I had tested it in 4.4 by actually doing the comment form submission, but when I tested it against trunk, I was using the code I had written to simulate the process, which didn't trigger those checks.
@cfinke Since the particular scope of this ticket was resolved with #10377 would you be able to open a new ticket or at least modify the title and description of this one to accurately match the remaining multibyte issues?
Sure, I've created #36610 to cover the category/tag version of this bug.
A POC plugin for observing how a commenter name can be truncated.