Make WordPress Core

Opened 8 years ago

Last modified 5 weeks ago

#44386 new enhancement

Problem with utf8mb4_unicode_ci collation for arabic content

Reported by: array064 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: major Version: 4.9.6
Component: Database Keywords: 2nd-opinion
Focuses: Cc:

Description

I see that since version 4.6, WordPress uses utf8mb4_unicode_ci as the default collation. I see this in the determine_charset function in the /wp-includes/wp-db.php file (CMIIW).

In my experience, utf8mb4_unicode_ci appears to have problems with content that uses Arabic letters.

Example:

I created a tag with the name:

ٱللَّهِ

And I created another tag with the name:

ٱللَّهُ

Then when I do a tag search (via wp-admin), with keyword:

ٱللَّهُ

the search results that appear are:

ٱللَّهِ

and

ٱللَّهُ

tags. However, the only tag that should appear is:

ٱللَّهُ

according to the search keyword.

This becomes a problem when a post needs to use the tag

ٱللَّهُ

but cannot, because of the existing tag

ٱللَّهِ

My guess is that this is not a bug in WordPress but a bug in MySQL.

For information, perhaps this link is a related issue:

https://bugs.mysql.com/bug.php?id=76218

(CMIIW).

Change History (4)

#1 @array064
8 years ago

I forgot to write this:

The above problem does not occur when utf8mb4_general_ci (or utf8_general_ci) is used as the collation.

So when installing WordPress for websites of mine that contain Arabic text, I set the above collation both in wp-config.php and in MySQL.

#2 @r1k0
5 weeks ago

  • Keywords close added; needs-testing removed

Reproduction Report

Environment

  • WordPress: 6.9
  • PHP: 8.4.17
  • Server: PHP.wasm
  • Database: WP_SQLite_Driver (Server: 8.0.38 / Client: 3.51.0)
  • Browser: Chrome 144.0.0.0
  • OS: Windows 10/11
  • Theme: Twenty Twenty-Five 1.4
  • MU Plugins: None activated
  • Plugins:
    • Test Reports 1.2.1

Steps taken

  1. Head over to Posts > Tags.
  2. Add these two tags provided by the reporter above.
  3. Use the first tag provided by the reporter to search for a tag.
  4. ❌ Bug is not occurring

Expected behavior

  • Only one result is shown for that search term, matching the search term.

Additional Notes

  • I was not able to reproduce the behavior where both terms appear.
  • This issue may no longer be reproducible on current WordPress/MySQL versions.
  • I will mark this with the close keyword unless further reports confirm the issue.

Screencast with results

https://imgur.com/a/qBVlSmr

#3 @sajib1223
5 weeks ago

  • Keywords needs-patch added; close removed

Reproduction Report

Description

This report validates whether the issue can be reproduced.

Environment

  • WordPress: 6.9
  • PHP: 8.3.29
  • Server: nginx/1.27.5
  • Database: mysqli (Server: 8.0.33 / Client: mysqlnd 8.3.29)
  • Browser: Firefox 147.0
  • OS: Windows 10/11
  • Theme: Astra 4.12.1
  • MU Plugins:
    • Object Cache Pro (MU) 1.19.0
  • Plugins:
    • Test Reports 1.2.1

Steps taken

  1. Went to Posts > Tags.
  2. Created 2 tags with ٱللَّهِ & ٱللَّهُ.
  3. Performed a search with ٱللَّهُ.
  4. Both ٱللَّهِ & ٱللَّهُ terms are shown in the results.
  5. ✅ Error condition occurs (reproduced).

Screenshots/Screencast with results

Search result:
https://imgur.com/QcotVe0

#4 @sajib1223
5 weeks ago

  • Keywords 2nd-opinion added; needs-patch removed

While I successfully reproduced the reported issue with Arabic content and the utf8mb4_unicode_ci collation, the original reporter's assessment appears correct: this seems to be MySQL collation behavior rather than a WordPress core bug.

The issue occurs at the MySQL level when comparing/sorting Arabic text with utf8mb4_unicode_ci. Using utf8mb4_general_ci works around the issue, though this may not be the ideal solution as utf8mb4_unicode_ci is generally recommended for non-Latin scripts according to MySQL documentation.
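
To illustrate the likely mechanism, here is a minimal Python sketch. It does not call MySQL. It assumes the two tag names use the code points spelled out below (so they differ only in the final vowel mark) and that utf8mb4_unicode_ci gives such combining marks no primary weight, which is approximated here by stripping characters of Unicode category Mn. The names TAG_KASRA, TAG_DAMMA, strip_marks and differing_code_points are hypothetical helpers for this illustration only.

```python
import unicodedata

# Assumed code points for the two tag names from the report; they differ
# only in the final vowel mark (kasra U+0650 vs damma U+064F).
TAG_KASRA = "\u0671\u0644\u0644\u0651\u064E\u0647\u0650"
TAG_DAMMA = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (category Mn): a rough stand-in for a
    comparison that ignores Arabic diacritics, not MySQL's real algorithm."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def differing_code_points(a: str, b: str) -> list[tuple[str, str]]:
    """Return the Unicode names at each position where a and b differ."""
    return [
        (unicodedata.name(x), unicodedata.name(y))
        for x, y in zip(a, b)
        if x != y
    ]


if __name__ == "__main__":
    print(TAG_KASRA == TAG_DAMMA)                            # exact comparison
    print(strip_marks(TAG_KASRA) == strip_marks(TAG_DAMMA))  # mark-insensitive
    print(differing_code_points(TAG_KASRA, TAG_DAMMA))
```

Under these assumptions the exact comparison distinguishes the two names while the mark-insensitive one does not, which would be consistent with both tags matching a single search keyword.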

Before proceeding with any patch, we need guidance from component maintainers on:

  1. Whether this is something WordPress core should address
  2. If WordPress can/should work around MySQL collation behavior
  3. Whether this should be documented as a known limitation
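
For discussion of point 2, one conceivable application-level workaround is to re-check the rows returned by the database with an exact code-point comparison. The Python sketch below only illustrates that idea and is not a proposed core patch: exact_matches is a hypothetical helper, the candidate list stands in for whatever the collation-based query returned, and the tag spellings are assumed code points. Unlike the database search, this filter is also case-sensitive.

```python
import unicodedata


def exact_matches(candidates: list[str], keyword: str) -> list[str]:
    """Keep only names that contain the keyword code point for code point.

    Both sides are NFC-normalized first so that equivalent orderings of
    combining marks do not cause false mismatches.
    """
    needle = unicodedata.normalize("NFC", keyword)
    return [
        name
        for name in candidates
        if needle in unicodedata.normalize("NFC", name)
    ]


if __name__ == "__main__":
    # Assumed spellings of the two tags; only the last mark differs.
    tag_kasra = "\u0671\u0644\u0644\u0651\u064E\u0647\u0650"
    tag_damma = "\u0671\u0644\u0644\u0651\u064E\u0647\u064F"
    rows_from_db = [tag_kasra, tag_damma]  # what a mark-insensitive search might return
    print(exact_matches(rows_from_db, tag_damma) == [tag_damma])
```

A filter like this would change search semantics for every language, so it would need the same maintainer guidance requested above.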

Removing needs-patch and adding 2nd-opinion to get input from database component maintainers on the appropriate path forward.
