Make WordPress Core

Opened 6 weeks ago

Last modified 6 weeks ago

#64473 assigned enhancement

Embrace WHATWG Encoding Standards

Reported by: dmsnell Owned by: dmsnell
Milestone: 7.0 Priority: normal
Severity: normal Version:
Component: Charset Keywords: has-patch
Focuses: Cc:

Description (last modified by dmsnell)

Text encoding can be extremely complicated, and it can draw in a wide array of security issues. Problems arise when different systems interpret the same text differently, even through something as basic as using text decoders with different internal behaviors. To address this complexity, the WHATWG established the Encoding Standard.

This specification standardizes many different aspects of the text data flow, including, but not limited to:

  • How can the encoding for a stream of bytes be guessed?
  • When someone says their text is “1252” or “UTF7” or “UTF-8;ASCII” or any number of invalid or non-standard declarations, what should the system pick as the correct encoding declaration?
  • How should certain security-sensitive encodings be handled?
  • How exactly should certain kinds of errors be handled when decoding multibyte characters?

It also strongly asserts that all systems should ideally use UTF-8 (see #62172).
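
As a rough illustration of the label question in the list above, the following sketch normalizes a label the way the Encoding Standard describes (strip leading and trailing ASCII whitespace, then compare ASCII case-insensitively) and looks it up in a table. The table here is a small hand-picked subset of the standard's label list, and the function is a hypothetical stand-in, not the code from the proposed patch.

```python
import string
from typing import Optional

# Illustrative subset of the Encoding Standard's label table (assumption:
# only a handful of rows are reproduced; the real table is much larger).
LABELS = {
    "utf-8": "UTF-8",
    "utf8": "UTF-8",
    "unicode-1-1-utf-8": "UTF-8",
    "ascii": "windows-1252",
    "us-ascii": "windows-1252",
    "latin1": "windows-1252",
    "iso-8859-1": "windows-1252",
    "cp1252": "windows-1252",
    "windows-1252": "windows-1252",
}

# ASCII whitespace as defined by the WHATWG: TAB, LF, FF, CR, SPACE.
ASCII_WHITESPACE = "\t\n\x0c\r "

# Lowercase only A-Z so that non-ASCII characters are never folded into ASCII.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def name_from_label(label: str) -> Optional[str]:
    """Return the canonical encoding name for a label, or None if unrecognized."""
    normalized = label.strip(ASCII_WHITESPACE).translate(_ASCII_LOWER)
    return LABELS.get(normalized)
```

With this subset, `name_from_label(" UTF8 ")` resolves to `UTF-8` and `name_from_label("Latin1")` resolves to `windows-1252`, while declarations such as "1252", "UTF7", and "UTF-8;ASCII" match no row and return `None`, leaving the caller to decide on a fallback.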


The specification is rather short, and adopting it would provide considerable value for the tricky parts of WordPress’ encoding handling.

Support for it should be designed to answer the questions developers have when using WordPress, touching notable areas such as:

  • Parsing HTML when an encoding is uncertain or unknown.
  • Converting text from the database to HTML.
  • Converting text when exporting to WXR.
  • Converting text when existing decoders aren’t available (polyfilling conversion).
  • Providing security-sensitive aids to text-handling code.

Related Tickets

  • #7813, #38479, #39190: export functions need reliable conversion from a likely-unknown legacy encoding into UTF-8 (and not utf8_encode(); see #55603).
  • #20368: htmlspecialchars() woes when the charset is not provided. A separate issue, but the proposed patch includes a simplified form of a name_from_label table.
  • #49355: it seems posts can fail to save into a database when supplied invalid encodings. This is a bigger issue requiring coordination with wpdb.
  • #63864: MIME decoding from email should be cautious about what it decodes.
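
To make the export concern in the first item concrete: a converter that always assumes one legacy encoding (as utf8_encode() does with ISO-8859-1) will double-encode input that is already UTF-8. The sketch below shows one possible alternative, which is an assumption for illustration and not the approach taken in any of these tickets: accept the bytes as UTF-8 if they validate, and only otherwise decode them with a legacy fallback. The helper name and the `cp1252` default are hypothetical. Python's `cp1252` codec also differs slightly from the Encoding Standard's windows-1252 for the five bytes it leaves undefined, which `errors="replace"` turns into U+FFFD here.

```python
def to_utf8_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes of uncertain origin without double-encoding valid UTF-8."""
    try:
        # Strict decoding doubles as validation: well-formed UTF-8 passes through.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as the
        # legacy fallback; undecodable bytes become U+FFFD instead of raising.
        return data.decode(fallback, errors="replace")
```

For example, `to_utf8_text("café".encode("utf-8"))` and `to_utf8_text(b"caf\xe9")` both yield the same four-character string, whereas a Latin-1-only converter would mangle the first input.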

Change History (3)

#1 This ticket was mentioned in PR #10677 on WordPress/wordpress-develop by @dmsnell.
6 weeks ago

  • Keywords has-patch added

Trac ticket: Core-64473

Prep work for Core-64427

Introduces a class to contain relevant WHATWG spec-compliant handling of character encodings, conversions, and recognition.

Answers two valuable questions:

  • Given this charset description, what charset is it?
  • What charsets should WordPress support?

Later on, this will:

  • Provide fallback decoders/encoders for supported types.
  • Infer charset from a byte stream.
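
One small piece of inferring a charset from a byte stream is checking for a byte order mark, which the Encoding Standard recognizes for UTF-8, UTF-16BE, and UTF-16LE. The following is a minimal sketch of that check only; the function name is hypothetical, and it says nothing about how the PR itself will implement inference.

```python
from typing import Optional

# Byte order marks recognized by the Encoding Standard's BOM sniffing step.
BOMS = (
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding indicated by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A `None` result only means no BOM was present; a fuller implementation would then fall back to declared labels or other heuristics.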

#2 @dmsnell
6 weeks ago

  • Description modified (diff)
  • Owner set to dmsnell
  • Status changed from new to assigned

#3 @dmsnell
6 weeks ago

  • Description modified (diff)