WordPress.org

Make WordPress Core

Opened 8 years ago

Closed 6 years ago

#3517 closed defect (bug) (invalid)

WordPress should be 100% UTF-8

Reported by: sehh Owned by:
Milestone: Priority: normal
Severity: normal Version: 2.0.5
Component: General Keywords: UTF-8
Focuses: Cc:

Description (last modified by foolswisdom)

WP is running in semi-unicode and ascii/latin mode. As a result, people with weird languages that require UTF-8 character sets are having major problems. The issue isn't easily detectable, since storing and retreiving UTF-8 data to an SQL database with latin character set seems to work. Unfortuantely, it doesn't really work. WP can store UTF-8 data on a database/table/field with latin character set, but all SQL-based text functions return wrong values.

For example: SORTING, COMPARING, MANIPULATING of any string returns invalid data (not sorted properly, etc). Its about time WP started using UTF-8 everywhere.

The change to UTF-8 isn't simple. Some people thing that they can just "ALTER TABLE" to UTF-8 charset and then use "SET NAMES utf-8" that they'll be fine, WRONG!

For a new installation, its rather easy:

1) All database and table definitions must be set to UTF-8, some examples:
create database wordpress DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
create table wp_users (etc...) DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci;

2) Modify the WP database connection to execute the following:
SET NAMES utf8;
SET COLLATION_CONNECTION=utf8_general_ci;

Thats about it, a new installation can easily run with full UTF-8 support without any more changes.

Now, how about upgrading from an existing database? Thats more complex. Read this carefuly:

When doing an ALTER TABLE to change the character set, all TEXT (and similar) fields are converted to UTF-8. The conversion BREAKS existing text because the conversion expects the data to be in Latin, but they are not since WP has stored unicode characters in a latin database, as a result we get garbage after the conversion!

The solution is to ALTER all TEXT and related fields to BLOB, then alter the character set and finaly change back the BLOB fields to TEXT.

Example steps:

1) ALTER TABLE users MODIFY Last_Name BLOB;
2) ALTER DATABASE wordpress charset=utf8;
3) ALTER TABLE users charset=utf8;
4) ALTER TABLE users MODIFY Last_Name TEXT CHARACTER SET utf8;

so, we change our text fields to BLOB, switch our database and tables to UTF-8 and finaly in one go we return our initial TEXT fields and switch them to UTF-8.

the key here is that a BLOB field will not be converted to garbage when switched to UTF-8, unlike a TEXT field.

Hopefuly, the developers of WP will be able to create a conversion script to upgrade old latin databases.

Some of the related tickets: #2828, #2942, #3184

Attachments (1)

charset.diff (5.4 KB) - added by ryan 8 years ago.
Support for declaring charset for new installs under mysql >= 4.1

Download all attachments as: .zip

Change History (31)

comment:1 @foolswisdom8 years ago

  • Description modified (diff)

comment:2 @foolswisdom8 years ago

  • Milestone changed from 2.0.7 to 2.2

comment:3 @sehh8 years ago

  • Cc sehh@… added

comment:4 @sehh8 years ago

milestone 2.2? OMG!

Couldn't we make it 2.1 at least? This is rather important and i can help with development since i've already done this for my own applications.

comment:5 @foolswisdom8 years ago

  • Summary changed from 100% UTF-8 to WordPress should be 100% UTF-8

sehh, Thank you for this bug report. It is very well written!

Milestone is an option that WordPress developers will set. 2.1 is currently close to beta, and without a developer taking ownership of this issue, it will not make it.

A patch would definitely improve the likelihood of a developer marking this milestone 2.1 . Thorough testing info will earn you a gold star.

comment:6 @sehh8 years ago

Thank you for the prompt explanation.

If i have enough time i'll work on a patch within the next few days and hopefuly i'll have something ready during the weekend. If i don't make it in time then we can use my work in the future for the 2.2 milestone as suggested.

comment:7 @tenpura8 years ago

The issue is all about how we setup WP and MySQL character set in the right manner, not only about UTF-8. Since some people prefer to use WP in other encodings than UTF-8, we need to think about the solution more globally.

since storing and retreiving UTF-8 data to an SQL database with latin character set seems to work.

This is true only when all the related MySQL system variables are set to the same character set. In this state (so-called skip-character-set-client-handshake), MySQL 4.1 and higher won't auto-convert character set like good old 4.0, so we can retrieve the original input. However, the situation is not always like this.

comment:8 @sehh8 years ago

I am sorry but i believe the issue is not just about storing/retreiving. Storing and retreiving UTF-8 data with latin character set is buggy to say the least. The issue here is that all character related functions break.

So if you store the character U+9AA8, you'll retreive it correctly as U+9AA8.

BUT if you try to SORT() or COUNT() you'll get wrong values.

For example, COUNT() for U+0041 should return 1, instead it returns 6 because the function sees the character as "U", "+", etc.

Now, about the issue with other character sets. I don't know anyone who would rather use anything other than UTF-8. Unicode is now a worldwide standard for everything, all current operating systems run in full Unicode (Windows, MacOSX, Linux, *BSD, OS/2 even BeOS and other more obscure systems). Ofcourse i don't know everything, so i'm open to suggestions here and reasons why someone would want to use non-unicode.

Remember that only unicode allows you to display multiple languages on the same page.

comment:9 @foolswisdom8 years ago

  • Description modified (diff)

comment:10 @ryan8 years ago

Here's a patch that will SET NAMES and create tables with DEFAULT CHARACTER SET for new installs running under mysql 4.1 or greater. The charset and collation are set in wp-config.php as DB_CHARSET and DB_COLLATE. Old WP installs will not have these defined in their existing wp-config.php so SET NAMES will not be run. New installs will pick these values up from wp-config-sample.php. The defaults are utf8 and utf8_general_ci, but those can be changed. Those upgrading from older WP versions can set the charset if they like. It is currently up to them to upgrade their databases to use the new character set. We can look at what it takes to upgrade old installations. Drupal has some code in the update_convert_table_utf8() function in update.php. I think there can be some problems doing that in some scenarios though.

@ryan8 years ago

Support for declaring charset for new installs under mysql >= 4.1

comment:11 @sehh8 years ago

I looked at the Drupal function and it looks good to me. It does the proper conversion to binary and back again.

Could you please elaborate on the problems with some scenarios? Are there any test cases we can look at?

comment:12 @ryan8 years ago

I'm curious if it will work in the scenario where the table/column charset does not match the blog charset. Often times UTF-8 blog data is saved to latin2 tables, for example. Will that code handle that?

comment:13 @sehh8 years ago

As long as the UTF-8 data are stored without modification, then it will work.

Please read the description of the bug report at the top. Currently, we store UTF-8 data on latin1 or latin2 etc tables (non-utf8 tables that is). It works at first glance because Input data matches Output data (but as i described SQL functions break).

So i believe, latin2 will work fine, since its similar to latin1.

The only exception would be in a table that somehow modifed the input UTF8 data or modifies the output (thus input!=output).

I'd be interested to test this on multibyte languages...

comment:14 @ryan8 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [4860]) First cut at mysql utf-8 charset suport. Props to sehh, drupal, and textpattern. fixes #3517

comment:15 @ryan8 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

[4860] takes a whack at fixing fresh installs. Let's see what can be done with upgrades next.

comment:16 @sehh8 years ago

Thank you for taking the time to work on this, much appriciated.

Regarding the collation, would it be a better idea if we didn't specify the collation by hand and instead let MySQL use the default collation for the specified charset?

Based on the MySQL docs, each charset has a default collation and its enough to specify just the charset.

That should also accomodate for future MySQL versions which change the defaults to include new languages/character sets.

Not a big issue really.

comment:17 @ryan8 years ago

We can leave collation blank by default. The define would still be there for people who want a specific collation.

comment:18 @tenpura8 years ago

ryan,

I have just tested your patch. It seems perfect for new installs. I think you spent hours on this one, the code is beautifully minimized!

If you planning to do the conversion script thing in this ticket, I will come back with examples of some bad scenarios.

comment:19 @tenpura8 years ago

Here I bring up 4 different scenarios. I hope this info will help... Case2 is the case ryan mentioned.

Case 1: Retrievable

[Configuration]
table/column: latin1 (default)
MySQL system variables related to client (character_set_client, character_set_connection, character_set_result): latin1 (default)
input: UTF-8

[Description]
All the related variables and table/column character set have the same value (=latin1). In this case MySQL stores input data without conversion so that the data in the database can be retrievable. These datas are good to go to the UTF-8 table converter.

Case 2: Unretrievable

[Configuration]
table/column: latin2
MySQL system variables related to client: latin1 (default)
input: UTF-8

[Description]
Only ascii survives. Multybyte based characters (e.g. UTF-8 Japanese) are converted wrongly and destroyed when they are stored. These datas are already broken and unretrievable.

Case 3: Retrievable by reverse process

[Configuration]
table/column: utf8
MySQL system variables related to client: latin1 (default)
input: UTF-8(I have tested with UTF-8 Japanese)

[Description]
This may be a common misconfiguration. In this case, web browsers can display the contents as expected, but in the database, datas are force converted and garbled. This type of datas can only be retrieved by reverse process, and thus the regular UTF-8 table converter won't help them.

Case 4: Case1 with multiple input encodings

[Configuration]
table/column: latin1 (default)
MySQL system variables related to client: latin1 (default)
input: UTF-8, EUC-JP (e.g. UTF-8 articles and EUC-JP pingbacks)

[Description]
Both data can be retrievable, but of course they need to be treated in different way. If EUC-JP pingbacks is processed regardlessly with the UTF-8 table converter, the data will be lost.

comment:20 @tenpura8 years ago

Maybe DB_CHARSET had better be picked up (if exists) for blog_charset instead of one in upgrade-schema.php? (Function to convert names like utf8 -> UTF-8 is needed, though.)

comment:21 @maxsite8 years ago

I offer to use the coding for WordPress and apart coding for database.

1) "External" part: all files WordPress must work in UTF8. Since needs correct functioning AJAX, that other way no.

2) Database can work in any coding, for instance CP1251. Work with the base is realized through one interface (class wpdb). If in it to add several functions, which:

  • before performing the request convert the text of the request from UTF8 in coding of the base;
  • after performing the request convert the answer from coding of the base in UTF8.

I this has already done and has published for testing (http://maxsite.org/wp210rus-all). If this interesting developer WordPress, that I shall show ready files.

MAX - Русский WordPress (http://maxsite.org/)

comment:22 @foolswisdom8 years ago

  • Milestone changed from 2.2 to 2.3

comment:23 follow-up: @smalldust8 years ago

I think the problem is not whether wordpress should be 100% UTF-8, but wordpress MUST tell MySQL which encoding it is using.

The problem now is, system variable "character_set_client" is set to "latin1" defaultly, but the characters wordpress is writing into database are not in latin1, but in UTF-8. This results in that the UTF-8 characters are considers as latin1, and are converted to UTF-8 again.

comment:24 in reply to: ↑ 23 @foolswisdom8 years ago

Replying to smalldust:

I created a new ticket #4219 "character_set_client" should be set to UTF-8 by default
for that specific problem. Thank you for the bug report!

comment:25 @MichaelH8 years ago

Since this is still an open ticket, adding references regarding converting database character sets, and a conversion script submitted via the Forums by andersapt.

http://codex.wordpress.org/Converting_Database_Character_Sets
http://kunde.apt.no/aso/wordpress/convert_to_utf8_sql_generator.txt
http://wordpress.org/support/topic/117955?replies=6#post-571753

comment:26 @markjaquith8 years ago

  • Milestone changed from 2.3 to 2.4

comment:27 @foolswisdom8 years ago

  • Milestone changed from 2.4 to 2.5

comment:28 @mrmist6 years ago

  • Milestone 2.9 deleted
  • Resolution set to fixed
  • Severity changed from major to normal
  • Status changed from reopened to closed

This seems to have run out of steam, and has beeen variously either fixed or not.

Closing as fixed. I guess it can be re-opened if there is sufficient interest.

comment:29 @jacobsantos6 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:30 @jacobsantos6 years ago

  • Resolution set to invalid
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.