Opened 8 years ago
Last modified 5 weeks ago
#44386 new enhancement
Problem with utf8mb4_unicode_ci collation for Arabic content
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | major | Version: | 4.9.6 |
| Component: | Database | Keywords: | 2nd-opinion |
| Focuses: | | Cc: | |
Description
I see that since version 4.6, WordPress uses utf8mb4_unicode_ci as the default collation. I see this in the determine_charset function in the /wp-includes/wp-db.php file (CMIIW).
In my experience, utf8mb4_unicode_ci has problems with content that uses Arabic letters.
Example:
I created a tag with the name:
And I created another tag with the name:
Then when I do a tag search (via wp-admin), with keyword:
the search results that appear are:
and
tags, whereas only the following tag should appear:
according to the search keyword.
This becomes a problem when a post needs to use the tag
, but cannot because of the existing tag
My guess is that this is not a bug in WordPress, but a bug in MySQL.
For information, perhaps this link is a related issue:
https://bugs.mysql.com/bug.php?id=76218
(CMIIW).
Change History (4)
#2
@
5 weeks ago
- Keywords close added; needs-testing removed
Reproduction Report
Environment
- WordPress: 6.9
- PHP: 8.4.17
- Server: PHP.wasm
- Database: WP_SQLite_Driver (Server: 8.0.38 / Client: 3.51.0)
- Browser: Chrome 144.0.0.0
- OS: Windows 10/11
- Theme: Twenty Twenty-Five 1.4
- MU Plugins: None activated
- Plugins:
- Test Reports 1.2.1
Steps taken
- Head over to Posts > Tags.
- Add these two tags provided by the reporter above.
- Use the first tag provided by the reporter to search for a tag.
- ❌ Bug is not occurring
Expected behavior
- Only one result is shown for that search term, matching the search term.
Additional Notes
- I was not able to reproduce the behavior where both terms appear.
- This issue may no longer be reproducible on current WordPress/MySQL versions.
- I will be marking this as close unless further reports confirm otherwise.
Screencast with results
#3
@
5 weeks ago
- Keywords needs-patch added; close removed
Reproduction Report
Description
This report validates whether the issue can be reproduced.
Environment
- WordPress: 6.9
- PHP: 8.3.29
- Server: nginx/1.27.5
- Database: mysqli (Server: 8.0.33 / Client: mysqlnd 8.3.29)
- Browser: Firefox 147.0
- OS: Windows 10/11
- Theme: Astra 4.12.1
- MU Plugins:
- Object Cache Pro (MU) 1.19.0
- Plugins:
- Test Reports 1.2.1
Steps taken
- Went to Posts > Tags.
- Created 2 tags: ٱللَّهِ and ٱللَّهُ.
- Performed a search with ٱللَّهُ.
- Both terms, ٱللَّهِ and ٱللَّهُ, are shown in the results.
- ✅ Error condition occurs (reproduced).
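One plausible explanation for this result is that utf8mb4_unicode_ci compares strings while ignoring combining marks, and the two tags differ only in their final vowel mark (kasra versus damma). The sketch below approximates that behavior in Python; it is not MySQL's actual collation code. The function names are hypothetical, and the escape sequences (including the order of the marks) are an assumed spelling of the two tags above.

```python
import unicodedata

# Assumed code point spelling of the two tags from the steps above:
# alef wasla, lam, lam, shadda, fatha, heh, then kasra or damma.
TAG_KASRA = "\u0671\u0644\u0644\u0651\u064E\u0647\u0650"
TAG_DAMMA = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"


def strip_marks(text):
    """Drop every combining mark, keeping only the base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def equal_ignoring_marks(a, b):
    """Rough stand-in for an accent-insensitive comparison."""
    return strip_marks(a) == strip_marks(b)


def equal_by_code_point(a, b):
    """Rough stand-in for a comparison that weighs every code point."""
    return a == b


if __name__ == "__main__":
    print("ignoring marks:", equal_ignoring_marks(TAG_KASRA, TAG_DAMMA))
    print("by code point: ", equal_by_code_point(TAG_KASRA, TAG_DAMMA))
```

Under this approximation the two tags compare equal once marks are ignored and differ when every code point counts, which is consistent with both terms being returned by the search.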
Screenshots/Screencast with results
#4
@
5 weeks ago
- Keywords 2nd-opinion added; needs-patch removed
While I successfully reproduced the reported issue with Arabic content and utf8mb4_unicode_ci collation, the original reporter's assessment appears correct - this seems to be MySQL collation behavior rather than a WordPress core bug.
The issue occurs at the MySQL level when comparing/sorting Arabic text with utf8mb4_unicode_ci. Using utf8mb4_general_ci works around the issue, though this may not be the ideal solution as utf8mb4_unicode_ci is generally recommended for non-Latin scripts according to MySQL documentation.
Before proceeding with any patch, we need guidance from component maintainers on:
- Whether this is something WordPress core should address
- If WordPress can/should work around MySQL collation behavior
- Whether this should be documented as a known limitation
Removing needs-patch and adding 2nd-opinion to get input from database component maintainers on the appropriate path forward.
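To make the "work around" option concrete, here is a minimal sketch of one possible application-level approach: post-filtering the rows returned by the database so that only names containing the keyword code point for code point remain. It is written in Python purely for illustration; `exact_matches` is a hypothetical helper, not a WordPress or MySQL API, and the tag spellings are assumed from the reproduction report.

```python
import unicodedata

# Assumed code point spelling of the two tags from the reproduction report.
TAG_KASRA = "\u0671\u0644\u0644\u0651\u064E\u0647\u0650"
TAG_DAMMA = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"


def exact_matches(keyword, candidates):
    """Keep only candidates that contain the keyword, mark for mark.

    Both sides are normalized to NFC first so that the same text typed
    with a different mark order still matches.
    """
    needle = unicodedata.normalize("NFC", keyword)
    return [
        name
        for name in candidates
        if needle in unicodedata.normalize("NFC", name)
    ]


if __name__ == "__main__":
    # Pretend the database returned both tags for the damma keyword.
    rows = [TAG_KASRA, TAG_DAMMA]
    print(exact_matches(TAG_DAMMA, rows))
```

A filter like this would also make searches sensitive to diacritics for every other script, which is part of what the maintainers would need to weigh.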
I forgot to write this:
The above problem does not occur when using utf8mb4_general_ci (or utf8_general_ci) as the collation.
So when installing WordPress, I set the above collation in wp-config.php and in MySQL for some of my websites that contain Arabic text.