Opened 16 months ago
Last modified 2 weeks ago
#62172 new enhancement
Deprecate non-UTF-8 Support
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Future Release | Priority: | normal |
| Severity: | normal | Version: | 6.7 |
| Component: | General | Keywords: | |
| Focuses: | | Cc: | |
Description (last modified by )
WordPress' code and history are full of ambiguity around character encoding. When WordPress was created, many websites and systems still used various single-byte, region-specific text encodings, and some used more complicated shifting encodings; today, however, UTF-8 is near-universal and the standard recommendation for interoperability between systems.
Significant complexity exists in the WordPress codebase in an attempt to properly handle various character encodings. Unfortunately, in many (if not most) of these cases, the code is confused about which strings are in which encodings and how those need to be transformed in order to make proper sense of it all.
Furthermore, `blog_charset` appears to have been introduced for the purpose of writing a `<meta>` tag on rendered pages to let a browser know what encoding to expect, while WordPress itself was to remain agnostic with regard to that same encoding. Over time, the option has been used as a mechanism for indicating how to transform strings, which doesn't resolve any of the problems introduced by working with multiple character encodings. (thanks @mdawaffe for the history there).
In any given WordPress request:
- data in the database is stored in one of two ways: either as encoded text in some character encoding, or as the raw bytes of some encoding which are mislabeled as `latin1` so that MySQL doesn't attempt to interpret the bytes.
- data is read from MySQL and possibly transformed from the stored bytes into a connection/session-determined encoding and collation, unless a query-specified encoding is also provided.
- PHP source code is stored as UTF-8 or is US-ASCII compatible, making string-based operations against possibly-transformed data from the database.
- Various PHP code will read the currently-set locale or `default_charset`, `input_encoding`, `output_encoding`, or `internal_encoding` and operate differently because of an assumption that the bytes on which they are operating are in those other encodings.
- Files are read from the filesystem which are probably encoded in UTF-8.
- Query args are parsed and percent-escaping is decoded, whose source encoding is not guaranteed to be UTF-8.
- POST arguments are read, parsed, and percent-decoded, again without clarity on which byte encoding they are escaping.
- HTML named character references are encoded and decoded, which translate into different byte sequences based on the configured character encoding, often set by `blog_charset`.
- Various filters and functions in Core, like `wp_spaces_regex()`, examine specific byte sequences, which are UTF-8-specific, against strings which may have the same character sequence but in a different byte sequence.
- Network requests might be made, which are read and parsed, and which may come in different encodings according to the `Content-type` header.
- HTML is sent to the browser and a `<meta charset="">` tag is produced to instruct the browser how to interpret the bytes it receives. This may or may not match the HTML which WordPress is generating, as most block code and most filters are hard-coded PHP strings in UTF-8 or are at least isomorphic to it up to US-ASCII.
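As a concrete illustration of the byte-level mismatch running through the list above, here is a minimal PHP sketch (not WordPress code): the same character has different byte sequences in different encodings, so a byte comparison written for UTF-8 silently fails against latin1 data.

```php
<?php
// "é" has different byte sequences in different encodings, so a
// byte-level check written for UTF-8 misses the latin1 form entirely.
$utf8   = "\xC3\xA9"; // "é" encoded as UTF-8
$latin1 = "\xE9";     // "é" encoded as ISO-8859-1 (latin1)

var_dump( $utf8 === $latin1 ); // bool(false): same character, different bytes

// Converting at the boundary makes byte comparisons meaningful again.
$converted = mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );
var_dump( $utf8 === $converted ); // bool(true)
```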
So, as with deprecating XHTML and HTML4 support, deprecating non-UTF-8 support is mostly about being honest with ourselves and officially making space to remove complex and risky parts of the codebase that often do more harm than good. There's a good chance that WordPress today is already extremely fragile when working with non-UTF-8 systems, and deprecating that support would make it possible to fix those existing issues.
Deprecating non-UTF-8 support means WordPress can stop attempting to support an N-to-M text-encoding architecture and replace it with an N-to-1 architecture, where strings that need to be converted are converted at the boundary of the system while everything inside the system is UTF-8, harmonizing all of the different levels of encoding and code.
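The N-to-1 approach could look something like the following sketch, where every input is normalized once at the boundary. The function name and exact behavior here are illustrative assumptions, not a WordPress API.

```php
<?php
// Hypothetical boundary normalizer: convert any labeled input to UTF-8
// once, so that everything inside the system can assume UTF-8.
function convert_to_utf8_at_boundary( string $bytes, string $source_encoding ): string {
	if ( 0 === strcasecmp( 'UTF-8', $source_encoding ) ) {
		// Validate rather than trust the label; a UTF-8-to-UTF-8
		// conversion scrubs invalid byte sequences, replacing them
		// with the substitute character ("?" by default).
		return mb_check_encoding( $bytes, 'UTF-8' )
			? $bytes
			: mb_convert_encoding( $bytes, 'UTF-8', 'UTF-8' );
	}

	return mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
}
```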
Change History (3)
#1 (in reply to: ↑ description) @ 16 months ago
#2 @ 16 months ago
- Description modified (diff)
Updated to fix the inverted deprecation (let's go back to US-ASCII-only 🙃), and thanks @mdawaffe!
#3 @ 2 weeks ago
While working on #64427 I realized that the full set of text encodings we are ever likely to want to support is the one given in the `WHATWG Encoding` specification. It’s possible that some WordPress installations interact with content in other encodings, but this is the set a browser will/should recognize and operate on.
There are 37 encodings, one slight variant, a fake encoding that shifts non-US-ASCII into the Private Use Area, and a replacement encoding which always fails and produces an empty string (mitigating security issues from legacy and tricky encodings).
From my own laptop with PHP 8.5 installed with the mbstring and intl extensions, most of these are supported. Given the list, I believe it’s feasible for us to polyfill text conversion among these encodings, removing our dependence on the PHP extensions to process them. Polyfills would be slow, but could follow the native-by-default approach taken with UTF-8 support in WordPress 6.9.
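One way to check which encodings a given mbstring build handles natively is to attempt a conversion and catch the failure; in PHP 8, `mb_convert_encoding()` throws a `ValueError` for unknown encoding names. The sample labels below are a small, illustrative subset of the WHATWG list, not the full 37.

```php
<?php
// Probe whether the local mbstring build can convert from a labeled
// encoding; unknown labels would need a polyfill.
$probe = static function ( string $encoding ): bool {
	try {
		mb_convert_encoding( 'probe', 'UTF-8', $encoding );
		return true;
	} catch ( \ValueError $e ) {
		return false;
	}
};

foreach ( array( 'UTF-8', 'Windows-1252', 'Shift_JIS', 'EUC-KR', 'KOI8-R' ) as $label ) {
	echo $label, ': ', $probe( $label ) ? 'native' : 'needs polyfill', "\n";
}
```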
In any potential future where we prune support for non-UTF-8 we might have a plausible phase-out mechanism:
- Dynamically convert content from the database into UTF-8. It’s generally not safe to text-convert HTML, but it is probably safe enough to do so for a support phase-out.
- Provide an option in Site Health to backup and migrate a site to UTF-8.
- Provide a “dry run” check to see if a site can be safely migrated without data loss.
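The “dry run” check could rest on a round-trip test: a value converts to UTF-8 without data loss if converting it and then converting it back reproduces the original bytes exactly. The function below is a sketch under that assumption, not an existing API.

```php
<?php
// Hypothetical dry-run helper: report whether a value can be migrated
// to UTF-8 and back without changing a single byte.
function converts_losslessly( string $bytes, string $source_encoding ): bool {
	$utf8       = mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
	$round_trip = mb_convert_encoding( $utf8, $source_encoding, 'UTF-8' );

	return $round_trip === $bytes;
}
```

A migration tool could run this over every row before committing, flagging any value that fails the round trip for manual review.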
Further, we could start with simpler steps: ensure that WXR exports are fully and universally UTF-8, performing the costlier-but-more-reliable HTML conversion which is syntax-aware, as highlighted in the linked blog post.
At one point I ran some analysis on the declared charset of HTML sites on the web; however, I did not inspect the Content-type header in that analysis. I will attempt to rerun the analysis at some point to understand at-large encodings better. Of all of the top N sites on the Internet, for each one:
- Does the document contain non-US-ASCII?
- Does it parse as valid UTF-8?
- What is the set of declared encodings (a surprising number of websites report multiple and conflicting encodings, like UTF-8 and UTF-16)?
- Does it parse as valid in each of the declared encodings?
- What does the HTML Determining the Character Encoding algorithm report?
- Does it parse in the detected encoding?
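The first two checks in the list are cheap to sketch. Against a sample document (the byte string here is illustrative, not real survey data): a bytewise regex finds non-US-ASCII content, and `mb_check_encoding()` validates UTF-8.

```php
<?php
$html = "<p>caf\xC3\xA9</p>"; // "café" as UTF-8 bytes (sample input)

// Bytewise scan: any byte >= 0x80 means the document is not pure US-ASCII.
$contains_non_ascii = (bool) preg_match( '/[\x80-\xFF]/', $html );

// Strict validation of the byte stream as UTF-8.
$is_valid_utf8 = mb_check_encoding( $html, 'UTF-8' );

var_dump( $contains_non_ascii ); // bool(true)
var_dump( $is_valid_utf8 );      // bool(true)
```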
Replying to dmsnell:
I'm not sure that that's completely accurate (especially the strict aspiration that "WordPress itself was to remain agnostic…"), but I think it is a practical description of history.
I and my poor memory of course welcome others' remembrances :)
PS: @dmsnell, in several places above you mention "deprecating UTF-8". I think you mean "deprecating non-UTF-8".