Opened 3 months ago
Last modified 7 weeks ago
#63825 new defect (bug)
Creating a post containing UTF-8, then changing DB_CHARSET to "latin1", makes posts un-editable.
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | 6.6.2 |
| Component: | Database | Keywords: | needs-testing has-screenshots close needs-test-info |
| Focuses: | ui | Cc: |
Description
Context: I have a very old (circa 2007) WordPress install which I have been dutifully updating for 18 years. I post infrequently (my last two posts were in 2024 and 2016 respectively). Apparently when I first set this blog up, my posts were being made with the latin1 charset. Somewhere around 2020, WordPress silently switched to a different charset (I assume UTF-8). As a result, my old posts (the 2016 one and before) had all special characters become "garbage", for example "…" became "â€¦" (see attachment "mojibake1") whereas the Unicode characters in my 2024 posts were displayed correctly. (This is not the problem.)
In a bid to fix this, today I tried adding "define('DB_CHARSET', 'latin1');" to my wp-config.php and restarted Apache. I then found the 2016-and-before posts had correct special characters, whereas the 2024 post, I guess unsurprisingly, now showed garbage for all Unicode characters (see attachment "mojibake2"). (This is still not the problem.)
The problem: Once I had added "define('DB_CHARSET', 'latin1');" to wp-config.php, my 2024 ("UTF-8 containing") post became non-editable. Clicking "edit" on it brings up a blank white page in the block editor (see attachment "editor1"). With DB_CHARSET unset, opening the 2024 blog post shows me the block editor as expected (see attachment "editor2"), and commenting the DB_CHARSET line out in wp-config.php restores block editor functionality. Oddly, if I install the "Classic Editor" plugin and try to edit the text of the 2024 post, even in text mode the classic editor shows only a blank window (see attachment "editor3"). In none of these scenarios is an error message shown.
The problem is not the date at which the 2024 post was created; the problem is the actual presence of the UTF-8 character bytes. I know this because I did the following: I commented out my DB_CHARSET line (i.e., reset DB_CHARSET to its default). I edited my 2024 post in the block editor and removed all Unicode symbols (¹²³…), replacing them with plain ASCII equivalents. I then turned "'DB_CHARSET', 'latin1'" back on. Going back into the backend, I now find that the 2024 post is entirely editable, even in latin1 mode, and both the 2024 and 2016 posts appear correctly. Conclusion: The editor is seeing UTF-8 bytes while constructing the editor page and bailing out silently.
Expected behavior: The post should be editable regardless of what the DB_CHARSET is set to. But if there is a technical reason why a post containing UTF-8 cannot be edited when DB_CHARSET is set to latin1, then I would expect a clear error message. I believe showing an error message is more important than the editor being functional, as a blank white page is *very* hostile and raises fears of data loss.
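To illustrate the kind of error reporting requested above, here is a minimal Python sketch (not WordPress code). It assumes, without confirmation from this ticket, that the failure comes from bytes that are not valid UTF-8 reaching a strict decoder. The function name `decode_post_content` is hypothetical.

```python
def decode_post_content(raw: bytes) -> str:
    """Decode bytes as UTF-8, raising a descriptive error instead of failing silently.

    Hypothetical helper for illustration only; not part of WordPress.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Report where decoding failed so the user sees a cause, not a blank page.
        raise ValueError(
            f"post content is not valid UTF-8 at byte offset {exc.start} "
            f"(byte 0x{raw[exc.start]:02x}); check the database charset setting"
        ) from exc


# "…" encoded as cp1252 (the encoding MySQL calls latin1) is the single byte 0x85,
# which is not valid UTF-8 on its own.
if __name__ == "__main__":
    print(decode_post_content("…".encode("utf-8")))
    try:
        decode_post_content("caf…".encode("cp1252"))
    except ValueError as err:
        print(err)
```

The point of the sketch is only that a decoding failure can be turned into a visible message naming the offending byte.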
This is not a support request. I have a workaround (which, as described above, I have already applied) and my blog is currently working. But I believe you should fix this bug.
Attachments (5)
Change History (11)
#1
@
3 months ago
I'm sorry, I forgot to mention: In addition to running WordPress 6.6.2, I am running with the plugins "Akismet Anti-spam: Spam Protection" and "reCaptcha by BestWebSoft". I also usually run with "Cachify", "Classic Editor", and a third entirely custom plugin which makes small changes, but I turned these three plugins off while writing this bug. I expect Akismet and reCaptcha should have no effect on the backend.
#3
@
3 months ago
- Component changed from Posts, Post Types to Database
- Keywords close added
Thank you for the detailed report!
Whatever happened to your site around 2020 when the encodings got confused should not have happened. I don't know how to debug that without a lot more detail like a more specific time frame or details about WordPress updates.
Data corruption is expected when content is already stored in the database and the charset is then changed on a site without the appropriate database migrations. A site should not do that on its own, and certainly not in a way that mangles content.
See #62172 which seeks to standardize on UTF-8 and would help reduce this type of problem in the future.
#4
@
3 months ago
Hi jonsurrell, the bug report was not about why the encodings got confused. I do not even know for a fact this happened in WordPress at all; as far as I know the mixup could have happened in MySQL. The problem is:
- I need to run in DB_CHARSET=latin1, for a valid reason (my site appears wrong without it)
- Running the site with the option DB_CHARSET=latin1 causes an unacceptable breakage (silent failure with symptoms that actively mislead the user)
It seems you've closed my bug arguing the reason I want DB_CHARSET=latin1 is invalid (i.e., you believe that my 2020 failure "shouldn't happen", or isn't debuggable). I think this is unreasonable because it doesn't matter why I set DB_CHARSET to latin1: anyone who sets DB_CHARSET to latin1 for any reason while having at least one UTF-8 character in a post will see this same failure (the 2024 failure, the UI-failure-without-error-message). I only mentioned the 2020 failure to provide context.
Now, it might be the case #62172 will make this irrelevant anyway by removing DB_CHARSET=latin1 completely. If so I guess that does obviate the need for good error messages around DB_CHARSET=latin1, but it will create a larger and potentially harder-to-solve set of problems for me (as currently DB_CHARSET=latin1 solves a problem I have).
#5
@
3 months ago
- Keywords needs-test-info added
Thanks for the additional detail.
One clarification: I did not close the ticket. I added a workflow keyword close that marks it as a candidate for closure. Subtle and somewhat confusing, but not the same thing. This ticket remains open, but other folks may close it if they agree with my reasoning.
I marked it as a closure candidate because it's unclear to me that this has a productive path forward. The information I understand is that:
- This site was working fine with encoding X
- Something happened to cause the site to have encoding Y
- At this point it's likely there's data with different encoding in the database, things are already corrupt
- You were able to work around the issue and fix the problem to some extent by changing DB_CHARSET=latin1.
- A post that may have been stored with encoding X, Y or latin1 could not be edited in the editor
At the final step, data was likely already corrupt. I don't want to dismiss that, it's terrible for a user. But, with corrupt data, editing a post could make things worse! Depending on the contents of the post, it's possible that the editor made a best effort to show you the post but found it so corrupt that there was nothing to show. At that point it doesn't seem like there's much to be done.
If there are more details you can provide such as the post content for the post in question, the database encoding before and after the problem, or some minimal reproduction examples, it may help this move forward.
#6
@
7 weeks ago
@mcc111 what you show with … is evidence of double-encoded UTF-8 content. That is, it was already UTF-8 and was then encoded into UTF-8 again, as if it were actually latin1.
I wonder if any database tables might have been updated by your host at some point? Or some database migration took place?
My guess is that something looked at the DB_CHARSET and thought it needed encoding from latin1 to UTF-8 when it was already in UTF-8.
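The double-encoding described here can be reproduced in a few lines of Python. This is a sketch for illustration, not WordPress or MySQL code. It uses Python's cp1252 codec as a stand-in for what MySQL calls latin1, and the function names are hypothetical.

```python
def double_encode(text: str) -> str:
    """Simulate UTF-8 bytes being misread as latin1 (cp1252) text."""
    return text.encode("utf-8").decode("cp1252")


def undo_double_encoding(text: str) -> str:
    """Reverse the misreading; raises if the text was not double-encoded this way."""
    return text.encode("cp1252").decode("utf-8")


if __name__ == "__main__":
    mangled = double_encode("…")
    print(mangled)                        # the three-character mojibake form
    print(mangled.encode("utf-8").hex())  # bytes as they would be stored after re-encoding
    print(undo_double_encoding(mangled))  # back to the original character
```

The reversal only works when every mangled character maps back to a cp1252 byte, so it is a diagnostic aid rather than a general repair tool.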
Unfortunately, part of the reason for #62172 is to acknowledge that this has almost always been broken in WordPress. PHP itself changed the internal encoding from latin1 to UTF-8 many years ago.
- Did you ever export and re-import your site to migrate it from one environment to another?
- Can you verify the collation of the database table and also show the raw post_content for one of the posts with the double-encoded UTF-8?
It can be tricky knowing exactly what bytes are stored in the database because there are multiple levels of implicit text re-encoding in the process, but knowing what’s there could be a good start.
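Once the raw bytes are in hand as a hex string (for example from MySQL's HEX() function applied to post_content), a rough classification like the following Python sketch may help tell the cases apart. The function name `classify_hex` and its labels are hypothetical, and the double-encoding check is a heuristic that uses Python's cp1252 codec as a stand-in for MySQL's latin1.

```python
def classify_hex(hex_dump: str) -> str:
    """Roughly classify stored bytes given as a hex string (heuristic only)."""
    raw = bytes.fromhex(hex_dump)
    if raw.isascii():
        return "ASCII only"
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid UTF-8 (possibly latin1/cp1252)"
    try:
        # If the decoded text maps back to cp1252 bytes that are themselves
        # valid UTF-8, the content was probably encoded twice.
        text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "valid UTF-8"
    return "possibly double-encoded UTF-8"


if __name__ == "__main__":
    for sample in ("616263", "E280A6", "C3A2E282ACC2A6", "85"):
        print(sample, "->", classify_hex(sample))
```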