Make WordPress Core

Opened 5 months ago

Last modified 5 months ago

#63864 new enhancement

Support RFC 2047 MIME-decoding / improve `wp_iso_descrambler()`

Reported by: dmsnell Owned by:
Milestone: Future Release Priority: low
Severity: normal Version: 6.9
Component: Formatting Keywords: has-patch has-unit-tests
Focuses: Cc:

Description (last modified by dmsnell)

The existing wp_iso_descrambler() supports an extremely limited subset of MIME-encoded data. Specifically, it supports only the Q encoding and directly reads bytes from the encoded string instead of converting those bytes. It was added to fix an issue where subjects from inbound emails were “scrambled.”

While this surely improved the situation in 2004 when many systems were sending latin1 and where the system locale was latin1, it’s pretty insufficient today. WordPress could benefit from improving its support for RFC 2047, enabling proper reading of things like email subjects containing emoji.

Proposal

  • Introduce wp_decode_rfc2047() for focus and clarity around the intention of what is happening. This communicates more clearly to developers and provides more opportunity to test and improve support for the function.
  • Deprecate wp_iso_descrambler() and delegate its responsibility to wp_decode_rfc2047(). The unclear and inaccurate naming and description of this function leaves little room to substantially improve it.
  • Require calling code to indicate how to handle parsing errors for explicit recovery.

RFC 2047 / MIME decoding is not too complicated in the “happy path.” Each encoded word names the character encoding of the escaped bytes and indicates whether the escaping is via replacing certain bytes with their hex equivalent (the “quoted” or Q encoding) or replacing the whole byte sequence with a base64 representation (the “binary” or B encoding).
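As a rough illustration of that happy path, the sketch below splits one encoded word into its character set and raw decoded bytes. The function name `example_decode_encoded_word()` is hypothetical and not part of WordPress or the proposed patch. It performs no character-set conversion and, for brevity, leaves malformed `=XX` sequences in the Q payload untouched rather than rejecting them.

```php
<?php
/**
 * Hypothetical helper (an assumption for illustration only): splits one
 * encoded word into its charset and raw decoded bytes. No charset conversion.
 */
function example_decode_encoded_word( string $word ): ?array {
	// Shape of an encoded word: =?charset?encoding?payload?=
	if ( 1 !== preg_match( '/^=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=$/D', $word, $m ) ) {
		return null;
	}

	list( , $charset, $encoding, $payload ) = $m;

	if ( 'B' === strtoupper( $encoding ) ) {
		// "B" encoding: the payload is base64.
		$bytes = base64_decode( $payload, true );
		if ( false === $bytes ) {
			return null;
		}
	} else {
		// "Q" encoding: "_" stands for a space and "=XX" for the byte 0xXX.
		$bytes = preg_replace_callback(
			'/=([0-9A-F]{2})/',
			static function ( $hex ) {
				return chr( hexdec( $hex[1] ) );
			},
			str_replace( '_', ' ', $payload )
		);
	}

	return array(
		'charset' => strtoupper( $charset ),
		'bytes'   => $bytes,
	);
}
```

The returned bytes are still in the named character set; converting them to UTF-8 is a separate step.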

What remains is uncertainty in the path of invalid encodings. There may be standardized behaviors for handling parse errors and that would be ideal to incorporate into this enhancement.

What about iconv_mime_decode()?

PHP provides iconv_mime_decode() whose purpose is the same as for this ticket. In supported environments it may be useful, though it’s not clear how it resolves parsing errors or what changes exactly are made by its options.

If a PHP implementation is going to be required anyway to support runtimes lacking iconv support, then it makes sense to lean into a custom solution where WordPress can define the error-handling behaviors and change them as is appropriate, retaining full control over the behavior and specification.
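For comparison, a guarded call to `iconv_mime_decode()` might look like the sketch below. The wrapper name `example_iconv_decode()` is hypothetical. The sketch only shows that the function must be feature-detected and that failure is reported as `false`; it does not settle how iconv treats each kind of malformed input.

```php
<?php
/**
 * Hypothetical wrapper (an assumption for illustration only): returns the
 * decoded header as UTF-8, or null when the iconv extension is missing or
 * reports a failure.
 */
function example_iconv_decode( string $header ): ?string {
	if ( ! function_exists( 'iconv_mime_decode' ) ) {
		return null;
	}

	// Mode 0 is the default. ICONV_MIME_DECODE_STRICT and
	// ICONV_MIME_DECODE_CONTINUE_ON_ERROR are the two documented flags.
	$decoded = @iconv_mime_decode( $header, 0, 'UTF-8' );

	return false === $decoded ? null : $decoded;
}

var_dump( example_iconv_decode( '=?UTF-8?B?SGVsbG8=?=' ) );
```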

A short background

For those unfamiliar, email systems were based on 7-bit ASCII interchange. This posed challenges when attempting to communicate between systems which relied on 8-bit or multi-byte encodings. MIME encoding was introduced as a way of incorporating other character sets within the existing supported domain of 7-bit US-ASCII. The syntax was chosen with an attempt to minimize the chance of conflating intended plaintext with encoded text.

  • Certain email headers may contain MIME-encoded strings.
  • Spans of encoded text MUST NOT exceed 75 characters, but a single header may contain multiple sections of encoded text.
  • When unable to decode the spans, it’s permitted to display the raw text of the encoding.
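To make the idea of multiple encoded spans in one header concrete, here is a minimal sketch that lists the spans found in a header value and reports their lengths. The function name `example_find_encoded_words()` and the regular expression are assumptions for illustration; they are not taken from the patch.

```php
<?php
/**
 * Hypothetical helper (an assumption for illustration only): returns every
 * span in a header value that has the shape =?charset?encoding?payload?=.
 */
function example_find_encoded_words( string $header ): array {
	preg_match_all( '/=\?[^?\s]+\?[BbQq]\?[^?\s]*\?=/', $header, $matches );

	return $matches[0];
}

$words = example_find_encoded_words( '=?UTF-8?Q?Caf=C3=A9?= and =?US-ASCII?B?SGVsbG8=?=' );

foreach ( $words as $word ) {
	// Encoders are expected to keep each span within 75 characters.
	printf( "%s (%d characters)\n", $word, strlen( $word ) );
}
```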

Examples

Before

<?php
var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"

var_dump( wp_iso_descrambler( '=?UTF-8?Q?Caf=C3=A9?= and =?US-ASCII?B?SGVsbG8=?=' ) );
string(33) "Café?= and =?US-ASCII?B?SGVsbG8="

var_dump( wp_iso_descrambler( '=?UTF-8?B?4q2QIOKtkA==?=' ) );
string(24) "=?UTF-8?B?4q2QIOKtkA==?="

After

<?php
var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

var_dump( rfc2047_decode( '=?UTF-8?Q?Caf=C3=A9?= and =?US-ASCII?B?SGVsbG8=?=' ) );
string(15) "Café and Hello"

var_dump( rfc2047_decode( '=?UTF-8?B?4q2QIOKtkA==?=' ) );
string(7) "⭐ ⭐"

Change History (2)

This ticket was mentioned in PR #9313 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-63864

## Status

Please feel free to ignore this for now.

## Description

The existing wp_iso_descrambler() was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:

this may only work with iso-8859-1, I'm afraid

But even so, it’s only likely to truly work with US-ASCII, which is rare to find in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (latin1) as their default, but today UTF-8 is the predominant encoding, and because the function returns the bytes exactly as they are encoded, it fails to perform its main function, which is to translate non-ASCII encodings.

https://github.com/user-attachments/assets/0c55c640-0798-462d-bd37-c4015bd9322c

The above image illustrates how the bytes print as an invalid UTF-8 sequence in trunk after decoding. The 0x80 byte was chosen for this demonstration because in latin1 it’s a control character, in cp1252 and in HTML it’s remapped to the Euro sign, and in UTF-8 it’s an invalid sequence.

Without additional conversion, calling code has to know what encoding the running PHP system uses and what other code will perform re-encoding. It’s likely to mess up. Worse, if the encoding is _not_ ISO-8859-1 (latin1) then the decoding is wrong for all character sets.

var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"

---

This patch implements a compliant RFC2047 MIME text decoder, and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants without needing to make any assumptions or inquire about what it gets.

var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

With the same input as above we can see that the default output is now converted from the indicated input encoding. In this example, that decodes to a control character in UTF-8 but that is authentic to the given input. The re-encodings are now invalid because the returned data is already in UTF-8.

https://github.com/user-attachments/assets/a21a0e12-1366-48bc-ac41-024402ab531f

### Supported encodings

This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.

If mb_convert_encoding() is available it will be preferred, followed by iconv(), followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and not addressed in this function.
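The preference order described above could be sketched roughly as follows. The function name `example_to_utf8()` is hypothetical and this is not the code from the patch. In particular, the handling of an unknown character set (suppressing warnings and catching `ValueError`) is an assumption about how one might guard these calls.

```php
<?php
/**
 * Hypothetical helper (an assumption for illustration only): converts bytes
 * in a named charset to UTF-8, trying mbstring, then iconv, then a direct
 * pass-through for US-ASCII and UTF-8. Returns null when nothing works.
 */
function example_to_utf8( string $bytes, string $charset ): ?string {
	$charset = strtoupper( $charset );

	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
			if ( is_string( $converted ) ) {
				return $converted;
			}
		} catch ( \ValueError $e ) {
			// mbstring does not know this charset; fall through to iconv.
		}
	}

	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( false !== $converted ) {
			return $converted;
		}
	}

	// Last resort: byte streams that are already usable as UTF-8.
	if ( 'US-ASCII' === $charset || 'ASCII' === $charset ) {
		return preg_match( '/^[\x00-\x7F]*$/D', $bytes ) ? $bytes : null;
	}

	if ( 'UTF-8' === $charset || 'UTF8' === $charset ) {
		// The "u" flag makes preg_match() fail on invalid UTF-8.
		return preg_match( '//u', $bytes ) ? $bytes : null;
	}

	return null;
}
```

Because each conversion function has its own quirks, the same input may convert slightly differently depending on which extension is available.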

### Error handling

Unfortunately, even where iconv_mime_decode() is available, its error-handling options are limited and unclear. By implementing the decoder in user-space the error cases can be explicitly handled, and this implementation provides configurable error handling:

  • By default, invalid encoded words are preserved as unencoded plain text. This corresponds to the preserve-errors flag. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
  • Passing in replace-errors will remove the entire encoded word and replace it with the replacement character U+FFFD. This discards information from the input, but leaves a placemarker indicating that it was there before.
  • Passing in bail-on-error will cause the function to return early and return null, effectively the same as the strict mode in other decoders.
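
The three modes above can be illustrated with a deliberately tiny decoder. Everything here is an assumption for illustration: the function name `example_decode_with_error_mode()`, passing the mode as a string argument, and the restriction to UTF-8 words in the B encoding are not taken from the patch.

```php
<?php
/**
 * Hypothetical sketch (an assumption for illustration only) of the three
 * error modes. It only understands UTF-8 encoded words in the B encoding
 * and treats everything else as a decoding error.
 */
function example_decode_with_error_mode( string $header, string $on_error = 'preserve-errors' ): ?string {
	$failed = false;

	$output = preg_replace_callback(
		'/=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=/',
		static function ( $m ) use ( $on_error, &$failed ) {
			$bytes = ( 'UTF-8' === strtoupper( $m[1] ) && 'B' === strtoupper( $m[2] ) )
				? base64_decode( $m[3], true )
				: false;

			// Accept the word only if it decoded to valid UTF-8.
			if ( false !== $bytes && 1 === preg_match( '//u', $bytes ) ) {
				return $bytes;
			}

			$failed = true;

			// replace-errors swaps in U+FFFD; the other modes keep the raw text.
			return 'replace-errors' === $on_error ? "\u{FFFD}" : $m[0];
		},
		$header
	);

	// bail-on-error discards the whole result if any word failed.
	if ( $failed && 'bail-on-error' === $on_error ) {
		return null;
	}

	return $output;
}
```

With an input such as `'=?UTF-8?B?@@@?='`, whose payload is not valid base64, the default mode returns the input unchanged, replace-errors returns a single U+FFFD, and bail-on-error returns null.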

There are multiple classes of potential errors and error behavior is not defined in the RFC. This implementation treats all classes in the same way, except for the rule that encoded words must be 75 characters or shorter (as this rule was clearly intended for _encoders_ to make the job of _decoding_ simpler, but otherwise does not speak to the well-formedness of the encoding).

  • Unsupported character sets.
  • Invalid encodings (B and Q are supported).
  • Invalid byte sequences in the quoted-printable encoding, such as =. or =6f (only upper-case hex digits are allowed).
  • Invalid base64-decoding in the binary encoding.
  • Invalid character re-encoding on the decoded byte stream.
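
As one example of these checks, the rule for Q-encoded payloads listed above could be tested with a single pattern. The function name `example_is_valid_q_payload()` is hypothetical, and the sketch checks only that every `=` begins an `=XX` escape with two upper-case hex digits.

```php
<?php
/**
 * Hypothetical helper (an assumption for illustration only): reports whether
 * every "=" in a Q-encoded payload starts an "=XX" escape whose two hex
 * digits are upper-case. Other constraints are not checked.
 */
function example_is_valid_q_payload( string $payload ): bool {
	return 1 === preg_match( '/^(?:[^=?\s]|=[0-9A-F]{2})*$/D', $payload );
}

var_dump( example_is_valid_q_payload( 'Caf=C3=A9' ) ); // bool(true)
var_dump( example_is_valid_q_payload( '=6f' ) );       // bool(false)
```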

Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text and the parser will skip over it to look for the next well-formed encoded word.

#2 @dmsnell
5 months ago

  • Description modified (diff)