#13590 closed enhancement (duplicate)
Inserting a 4-byte UTF-8 character truncates data
Reported by: | sardisson | Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 2.9.2 |
Component: | Database | Keywords: | has-patch |
Focuses: | Cc: |
Description
WordPress 2.9.2 on Apache 2.2.15, MySQL 5.1.46-log, PHP 5.2.13
I was writing a post in which I used the "tetragram for advance" (U+1D319) in both the post title and in the body of the post (as a raw UTF-8 glyph rather than an entity).
When I had WordPress save a draft of the post, both the title and the post body were truncated at the point where U+1D319 had been (U+1D319 was also removed).
(In addition, in the permalink field, WP generated something that was represented by the glyph for "invalid codepoint" [black diamond with ? inside] on Mac OS X, rather than successfully percent-encoding the glyph as WordPress does for other non-ASCII characters in post titles when generating permalinks. And, although I manually percent-encoded the glyph for the URL, the permalink ended up being 404. I suspect there's a whole host of places where unexpected glyphs cause problems?)
On the one hand, this is very much a dataloss issue (I lost 1/3 of my post), but on the other hand it's probably not likely to happen often in real-world usage, so I've left priority and severity set to default values ;)
Attachments (1)
Change History (16)
#1
@
14 years ago
- Component changed from General to Charset
- Milestone changed from Awaiting Review to Future Release
#4
@
14 years ago
- Cc aercolino added
- Component changed from Charset to Database
- Keywords utf8 added
- Resolution invalid deleted
- Status changed from closed to reopened
- Summary changed from Inserting a tetragram (SMP/Plane 1) character truncates post fields to Inserting a 4-byte UTF-8 character truncates data
- Version changed from 2.9.2 to 3.0.4
I've recently developed a class for escaping / unescaping UTF-8 characters. I've released it as a Zend Framework class (Zend_Utf8, now in the proposal stage), and as stand-alone class for a WordPress plugin (Full UTF-8).
The plugin was meant to fix the same issue of this ticket, which I stumbled upon some days ago when trying to write an article about the RFC4627 (JSON).
The plugin works pretty well for post content (and title, excerpt and search) but it doesn't cover custom fields. For them I had to write a patch that changed 8 different files. Anyway, a plugin + a patch is not a clean solution. And it's possible that some data string get's through to the db, following an alternative path I couldn't find with my reverse engineering.
So I thought: Why not to wrap db queries inside escape / unescape parentheses? In this way
- nothing will ever hit the db without taking care of
- the patch will be extremely localized
I wrote the patch, ran the wordpress tests, saw that the issue got solved, and all seemed fine to me.
There are some questions that need to be answered:
- Does this solution slow WP down too much?
- Does this solution fail sometime?
I've no clear answers, but hints.
- The escaping/unescaping are cheap operations, but they do examine a string char by char. In this use case, I already short-circuited any char MySQL can handle by itself (1 to 3 bytes UTF-8).
- Only strings are escaped/unescaped, the rest is short-circuited (at least when writing to the db, when reading all is a string), so I think that only a binary string could cause some troubles.
I'd like to know your thoughts, and if the patch could find its way into some next (close) WP release.
Here are the parts of the patch that change WP, the whole patch (with two added files) is instead attached.
diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-includes/wp-db.php wp-db-patched/wp-includes/wp-db.php --- wordpress-3.0.4/wp-includes/wp-db.php 2010-07-25 08:34:50.000000000 +0200 +++ wp-db-patched/wp-includes/wp-db.php 2011-01-18 12:51:56.000000000 +0100 @@ -1108,7 +1108,8 @@ class wpdb { $dbh =& $this->dbh; $this->last_db_used = "other/read"; } - + + full_utf8_escape($query); $this->result = @mysql_query( $query, $dbh ); $this->num_queries++; @@ -1136,8 +1137,9 @@ class wpdb { $i++; } $num_rows = 0; - while ( $row = @mysql_fetch_object( $this->result ) ) { - $this->last_result[$num_rows] = $row; + while ( $row = @mysql_fetch_assoc( $this->result ) ) { + array_walk($row, 'full_utf8_unescape'); + $this->last_result[$num_rows] = (object) $row; $num_rows++; } diff -rupN --exclude-from wpdiffexclude.txt wordpress-3.0.4/wp-settings.php wp-db-patched/wp-settings.php --- wordpress-3.0.4/wp-settings.php 2010-05-02 23:18:36.000000000 +0200 +++ wp-db-patched/wp-settings.php 2011-01-16 20:19:32.000000000 +0100 @@ -66,6 +66,7 @@ wp_set_lang_dir(); require( ABSPATH . WPINC . '/compat.php' ); require( ABSPATH . WPINC . '/functions.php' ); require( ABSPATH . WPINC . '/classes.php' ); +require( ABSPATH . WPINC . '/full-utf8.php' ); // Include the wpdb class, or a db.php database drop-in if present. require_wp_db();
#5
@
14 years ago
This is an interesting idea.
A few comments:
- This code doesn't follow our Coding Standards - http://codex.wordpress.org/WordPress_Coding_Standards
- Would you be willing to make it available under the same license as WordPress ? i.e GPL2 or later
- How much of a performance hit is this going to be on every query - have you done any profiling on the time it takes on large posts?
#6
@
14 years ago
It seems like a lot of overhead, considering it would run for each and every query and most installs wouldn't even need it.
Also, there are some installs that don't even use UTF-8, which I presume would break badly with this patch.
Instead of patching WP, you could make a db.php drop-in, but I realize that's not an ideal solution.
#8
@
14 years ago
About the performance hit, I have tried a post with 160KB of 1-byte UTF-8 (ASCII) text and just one 4-byte UTF-8 character (a G clef), near the end. It works fine, I think around a single second roundtrip (using localhost). I could time this better, but I don't feel the urge.
I had to change a bit the internal workings of my class, because the pattern matching I chose in the unescape method was driving PHP crazy after 16KB of text. This is why I updated the patch. Thanks to westi for suggesting this test.
About the license, I chose the same used in Zend Framework, and AFAIK the New BSD License is GPLv2 and GPLv3 compatible: see this Wikipedia article and this GNU page.
About the WP coding standards, I'd leave my class as is, because it's intended as a drop in, it's not WP specific, but my style on purpose. Anyone can easily adapt the code in the full-utf8.php file to the WP coding standards. It's just 7 misplaced open braces away, I think.
About the installs that don't even use UTF-8, I know it's possible in WP, but I cannot understand why, except for historical reasons. Anyway, I think it's very simple to adapt the patch so that it only escapes / unescapes if the install is UTF-8, something like
if (DB_CHARSET == 'utf8') ...
Incidentally, during my 160KB test, I discovered #16311.
#9
@
14 years ago
I've updated my plugin so that it drops in a db replacement, as suggested by scribu.
It works pretty well and fast.
#11
@
12 years ago
This will be fixed with #21212. Given that MySQL 5.5 has been stable for nearly 2 years, I think this is a reasonable minimum requirement for this edge case.
#12
@
12 years ago
- Milestone changed from Awaiting Review to 3.5
I agree with pento on this one. If you want 4-byte UTF-8, you need to run MySQL 5.5 (plus 3.5 with #21212).
MySQL versions earlier than 6.0 don't support 4-byte UTF-8 characters.