Opened 6 weeks ago
Last modified 6 weeks ago
#64473 assigned enhancement
Embrace WHATWG Encoding Standards
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | Charset | Keywords: | has-patch |
| Focuses: | | Cc: | |
Description (last modified by )
Text encoding can be extremely complicated, and mistakes in handling it can open a wide array of security issues. Different systems can interpret the same text differently, even when the only difference is the internal behavior of their text decoders. To address this complexity, the WHATWG established the Encoding Standard.
This specification standardizes many different aspects of the text data flow, including, but not limited to:
- How can the encoding for a stream of bytes be guessed?
- When someone says their text is “1252” or “UTF7” or “UTF-8;ASCII” or any number of invalid or non-standard declarations, what should the system pick as the correct encoding declaration?
- How should certain security-sensitive encodings be handled?
- How exactly should certain kinds of errors be handled when decoding multibyte characters?
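As an illustration of the label question above, here is a minimal Python sketch of the Encoding Standard's "get an encoding" step: strip ASCII whitespace, compare ASCII case-insensitively, then look the result up in a label table. The table below is only a small excerpt of the standard's labels, and the function name `name_from_label` is hypothetical; none of this is WordPress' actual API.

```python
import string

# ASCII whitespace as the Encoding Standard defines it: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Lowercase only A-Z, so non-ASCII look-alikes are never folded into a label.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

# Excerpt of the standard's label table (assumption: enough for a demo,
# far from complete).
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
    "cp1252": "windows-1252",
    "windows-1252": "windows-1252",
    "utf-16": "UTF-16LE",
    "utf-16le": "UTF-16LE",
    "utf-16be": "UTF-16BE",
    "shift_jis": "Shift_JIS",
    "sjis": "Shift_JIS",
}


def name_from_label(label):
    """Return the canonical encoding name for a label, or None on failure."""
    key = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(key)
```

With this excerpt, `name_from_label(" Latin1\n")` yields `windows-1252`, while `1252`, `UTF7`, and `UTF-8;ASCII` are rejected rather than guessed at.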
It also strongly asserts that all systems should ideally use UTF-8 (see #62172).
The specification is rather short, and following it would resolve many of the tricky parts of WordPress’ encoding woes.
The implementation should be designed to answer the questions developers have when handling text in WordPress, touching notable areas such as:
- Parsing HTML when an encoding is uncertain or unknown.
- Converting text from the database to HTML.
- Converting text when exporting to WXR.
- Converting text when existing decoders aren’t available (polyfilling conversion).
- Providing security-sensitive aids to text-handling code.
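To make the polyfilling item concrete, the following Python sketch converts windows-1252 bytes to UTF-8 without relying on a platform decoder. It assumes the windows-1252 index from the Encoding Standard, in which the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the C1 control of the same number instead of raising an error. The function name `windows_1252_to_utf8` is hypothetical and is not part of the proposed patch.

```python
# Bytes 0x80-0x9F of windows-1252, per the Encoding Standard's index.
# Every other byte maps to the code point with the same number.
_CP1252_80_9F = (
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
)


def windows_1252_to_utf8(data: bytes) -> bytes:
    """Decode windows-1252 bytes by table lookup and re-encode as UTF-8."""
    chars = []
    for byte in data:
        if 0x80 <= byte <= 0x9F:
            chars.append(chr(_CP1252_80_9F[byte - 0x80]))
        else:
            chars.append(chr(byte))
    return "".join(chars).encode("utf-8")
```

A single-byte encoding never fails to decode under this mapping, which keeps the polyfill small. Multibyte encodings need a real state machine plus the standard's error handling, so they are out of scope for this sketch.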
Related Tickets
- #7813, #38479, #39190 export functions need reliable conversion from a likely-unknown legacy encoding into UTF-8 (and not `utf8_encode()`; see #55603).
- #20368 `htmlspecialchars()` woes when charset not provided. A separate issue, but the proposed patch includes a simplified form of a `name_from_label` table.
- #49355 seems like posts can fail to save into a database when supplied invalid encodings. This is a bigger issue requiring coordination with `wpdb`.
- #63864 MIME decoding from email should be cautious about what it decodes.
Trac ticket: Core-64473
Prep work for Core-64427
Introduces a class to contain relevant WHATWG spec-compliant handling of character encodings, conversions, and recognition.
Answers two valuable questions:
Later on, this will: