Opened 5 months ago
Last modified 5 months ago
#63864 new enhancement
Support RFC 2047 MIME-decoding / improve `wp_iso_descrambler()`
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Future Release | Priority: | low |
| Severity: | normal | Version: | 6.9 |
| Component: | Formatting | Keywords: | has-patch has-unit-tests |
| Focuses: | | Cc: | |
Description (last modified by )
The existing wp_iso_descrambler() supports an extremely limited subset of MIME-encoded data. Specifically, it supports only the Q encoding and directly reads bytes from the encoded string instead of converting those bytes. It was added to fix an issue where subjects from inbound emails were “scrambled.”
While this surely improved the situation in 2004, when many systems were sending latin1 and the system locale was latin1, it’s pretty insufficient today. WordPress could benefit from improving its support for RFC 2047, enabling proper reading of things like email subjects containing emoji.
Proposal
- Introduce `wp_decode_rfc2047()` for focus and clarity around the intention of what is happening. This communicates more clearly to developers and provides more opportunity to test and improve support for the function.
- Deprecate `wp_iso_descrambler()` and delegate its responsibility to `wp_decode_rfc2047()`. The unclear and inaccurate naming and description of this function leaves little room to substantially improve it.
- Require calling-code to indicate how to handle parsing errors for explicit recovery.
RFC 2047 / MIME decoding is not too complicated in the “happy path.” Each encoded word indicates the character set of the escaped bytes and whether the escaping is via replacing certain bytes with their hex equivalent (the “Q” encoding, similar to quoted-printable) or replacing the whole byte sequence with a base64 representation (the “B” encoding).
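To make the happy path concrete, here is a minimal sketch of splitting one encoded word into its parts and undoing the escaping. The helper name `sketch_unescape_encoded_word()` is hypothetical and is not the function proposed in this ticket. It returns the raw bytes in the declared character set and does no charset conversion.

```php
<?php
/**
 * Hypothetical helper, for illustration only.
 *
 * Splits an encoded word of the form =?charset?encoding?encoded-text?=
 * and undoes the Q or B escaping. Returns the declared charset and the
 * raw bytes, or null if the input does not look like an encoded word.
 */
function sketch_unescape_encoded_word( string $word ): ?array {
	if ( 1 !== preg_match( '/^=\?([^?\s]+)\?([BbQq])\?([^?\s]*)\?=$/', $word, $m ) ) {
		return null;
	}

	list( , $charset, $encoding, $text ) = $m;

	if ( 'B' === strtoupper( $encoding ) ) {
		// B encoding: the whole byte sequence is base64.
		$bytes = base64_decode( $text, true );
		if ( false === $bytes ) {
			return null;
		}
	} else {
		// Q encoding: "_" stands for a space and "=XX" for the byte 0xXX.
		$bytes = preg_replace_callback(
			'/=([0-9A-F]{2})/',
			static function ( array $hex ): string {
				return chr( hexdec( $hex[1] ) );
			},
			str_replace( '_', ' ', $text )
		);
	}

	return array(
		'charset' => strtoupper( $charset ),
		'bytes'   => $bytes,
	);
}
```

The returned bytes still need converting from the declared charset before they can be shown as UTF-8, which is the step `wp_iso_descrambler()` skips.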
What remains is uncertainty in the path of invalid encodings. There may be standardized behaviors for handling parse errors and that would be ideal to incorporate into this enhancement.
What about iconv_mime_decode()?
PHP provides iconv_mime_decode() whose purpose is the same as for this ticket. In supported environments it may be useful, though it’s not clear how it resolves parsing errors or what changes exactly are made by its options.
If a PHP implementation is going to be required anyway to support runtimes lacking the iconv support then it makes sense to lean into a custom solution where WordPress can define the error-handling behaviors and change them as is appropriate, retaining full control over the behavior and specification.
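For comparison, a guarded call to the built-in decoder might look like the sketch below. The wrapper name is hypothetical. `iconv_mime_decode()` and its `ICONV_MIME_DECODE_STRICT` and `ICONV_MIME_DECODE_CONTINUE_ON_ERROR` mode flags are part of PHP's iconv extension, but what each mode does with a given malformed input is exactly the open question raised above, so the sketch only reports success or failure.

```php
<?php
/**
 * Hypothetical wrapper, for illustration only.
 *
 * Uses PHP's iconv_mime_decode() when the iconv extension is loaded.
 * Returns null when the function is unavailable or reports failure.
 */
function sketch_try_iconv_mime_decode( string $header, int $mode = 0 ): ?string {
	if ( ! function_exists( 'iconv_mime_decode' ) ) {
		return null;
	}

	// $mode may combine ICONV_MIME_DECODE_STRICT and
	// ICONV_MIME_DECODE_CONTINUE_ON_ERROR; 0 is PHP's default.
	// Notices are suppressed so that failure is seen only as `false`.
	$decoded = @iconv_mime_decode( $header, $mode, 'UTF-8' );

	return false === $decoded ? null : $decoded;
}

var_dump( sketch_try_iconv_mime_decode( '=?UTF-8?B?4q2QIOKtkA==?=' ) );
```

A `null` result here cannot distinguish a missing extension from a parse failure, which is one reason calling code benefits from a decoder whose error behavior WordPress defines itself.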
A short background
For those unfamiliar, email systems were based on 7-bit ASCII interchange. This posed challenges when attempting to communicate between systems which relied on 8-bit or multi-byte encodings. MIME encoding was introduced as a way of incorporating other character sets within the existing supported domain of 7-bit US-ASCII. The syntax was chosen with an attempt to minimize the chance of conflating intended plaintext with encoded text.
- Certain email headers may contain MIME-encoded strings.
- Spans of encoded text MUST NOT exceed 75 characters, including their delimiters, but a single header may contain multiple sections of encoded text.
- When unable to decode the spans, it’s permitted to display the raw text of the encoding.
Examples
Before
<?php
var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"

var_dump( wp_iso_descrambler( '=?UTF-8?Q?Caf=C3=A9?= and =?US-ASCII?B?SGVsbG8=?=' ) );
string(33) "Café?= and =?US-ASCII?B?SGVsbG8="

var_dump( wp_iso_descrambler( '=?UTF-8?B?4q2QIOKtkA==?=' ) );
string(24) "=?UTF-8?B?4q2QIOKtkA==?="
After
<?php
var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

var_dump( rfc2047_decode( '=?UTF-8?Q?Caf=C3=A9?= and =?US-ASCII?B?SGVsbG8=?=' ) );
string(15) "Café and Hello"

var_dump( rfc2047_decode( '=?UTF-8?B?4q2QIOKtkA==?=' ) );
string(7) "⭐ ⭐"
Trac ticket: Core-63864
## Status
Please feel free to ignore this for now.
## Description
The existing `wp_iso_descrambler()` was added in 2004 because certain email subjects were appearing with funny-looking string spans. The following note was left as a comment:

But even so, it’s only likely to truly work with `US-ASCII`, which is rare to find in such a MIME-encoded string. In 2004 it might have been more common for PHP systems to operate on ISO-8859-1 (latin1) as their default, but today UTF-8 is the predominant encoding, and because the function returns the bytes as they are directly encoded, it fails to perform its main function, which is to translate non-ASCII encodings.

The above image illustrates how the bytes print as an invalid UTF-8 sequence in `trunk` after decoding. The 0x80 byte was chosen for this demonstration because in `latin1` it’s a control character, in `cp1252` and in HTML it’s remapped to the Euro sign, and in UTF-8 it’s an invalid sequence.

Without additional conversion, calling code has to know the additional details of what the encoding is of the running PHP system and what other code will perform re-encoding. It’s likely to mess up. Worse, if the encoding is _not_ `ISO-8859-1` (latin1) then the decoding is wrong for all character sets.

var_dump( wp_iso_descrambler( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(4) "��d�"

---
This patch implements a compliant RFC 2047 MIME text decoder and decodes the text into UTF-8. Decoding into a single encoding normalizes the output and gives calling code the freedom to change the encoding if it wants without needing to make any assumptions or inquire about what it gets.
var_dump( rfc2047_decode( '=?ISO-8859-2?Q?=A3=F3d=BC?=' ) );
string(7) "Łódź"

With the same input as above we can see that the default output is now converted from the indicated input encoding. In this example, that decodes to a control character in UTF-8 but that is authentic to the given input. The re-encodings are now invalid because the returned data is already in UTF-8.
### Supported encodings
This implementation attempts to support as many encodings as are practical based on the availability of decoding logic on the running server.
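One way such availability-based conversion could be arranged is sketched below: try mbstring, then iconv, then accept only bytes that are already valid US-ASCII or UTF-8. The helper name `sketch_to_utf8()` is hypothetical and the patch's real logic may differ. `mb_convert_encoding()`, `iconv()`, and `preg_match()` with the `u` modifier are standard PHP.

```php
<?php
/**
 * Hypothetical helper, for illustration only.
 *
 * Converts bytes in $charset to UTF-8 with whichever converter the
 * server provides. Returns null when nothing can handle the charset.
 */
function sketch_to_utf8( string $bytes, string $charset ): ?string {
	// 1. Prefer mbstring when it is loaded.
	if ( function_exists( 'mb_convert_encoding' ) ) {
		try {
			$converted = @mb_convert_encoding( $bytes, 'UTF-8', $charset );
			if ( is_string( $converted ) ) {
				return $converted;
			}
		} catch ( ValueError $e ) {
			// PHP 8 throws for charsets mbstring does not know; try the next option.
		}
	}

	// 2. Fall back to iconv.
	if ( function_exists( 'iconv' ) ) {
		$converted = @iconv( $charset, 'UTF-8', $bytes );
		if ( is_string( $converted ) ) {
			return $converted;
		}
	}

	// 3. No converter: accept only bytes that need no conversion.
	$charset = strtoupper( $charset );
	if ( 'US-ASCII' === $charset && 1 !== preg_match( '/[\x80-\xFF]/', $bytes ) ) {
		return $bytes;
	}
	if ( 'UTF-8' === $charset && 1 === preg_match( '//u', $bytes ) ) {
		return $bytes;
	}

	return null;
}
```

The sketch makes no attempt to smooth over differences between mbstring and iconv, such as how each treats invalid input bytes.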
If `mb_convert_encoding()` is available it will be preferred, followed by `iconv()`, followed by direct conversion from US-ASCII or UTF-8 byte streams. Nuances and peculiarities of the PHP text-encoding functions are left as artifacts of PHP and not addressed in this function.

### Error handling
Unfortunately, even where `iconv_mime_decode()` is available, its error-handling options are limited and unclear. By implementing the decoder in user-space the error cases can be explicitly handled, and this implementation provides configurable error handling:

- `preserve-errors` flag. The input text will appear in the output and look jumbled, but perhaps a human can make sense of the data in it. This is how most decoders handle errors.
- `replace-errors` will remove the entire encoded word and replace it with the replacement character U+FFFD `�`. This discards information from the input, but leaves a placemarker indicating that it was there before.
- `bail-on-error` will cause the function to return early and return `null`, effectively the same as the `strict` mode in other decoders.

There are multiple classes of potential errors and error behavior is not defined in the RFC. This implementation treats all classes in the same way, except for the rule that encoded words must be 75 characters or shorter (as this rule was clearly intended for _encoders_ to make the job of _decoding_ simpler, but otherwise does not speak to the well-formedness of the encoding).
- `B` and `Q` are supported).
- `=`.
- or `=6f` (only upper-case hex digits are allowed).

Of note, the RFC implies no possible syntax errors. Instead, anything which appears as a syntax error indicates that the span of text which looks like an encoded word is actually just plain text and the parser will skip over it to look for the next well-formed encoded word.
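The three error-handling behaviors can be illustrated with a small sketch. The function names, the signature, and the mode strings passed as a plain argument are assumptions made for this example, not the patch's API. To stay self-contained it handles only the UTF-8 and US-ASCII charsets and does not collapse whitespace between adjacent encoded words.

```php
<?php
/**
 * Hypothetical helper: decodes one encoded word's parts, or returns
 * null for an unsupported encoding, a malformed escape, or invalid bytes.
 */
function sketch_decode_word( string $charset, string $encoding, string $text ): ?string {
	$encoding = strtoupper( $encoding );

	if ( 'B' === $encoding ) {
		$bytes = base64_decode( $text, true );
		if ( false === $bytes ) {
			return null;
		}
	} elseif ( 'Q' === $encoding && 1 !== preg_match( '/=(?![0-9A-F]{2})/', $text ) ) {
		// Every "=" is followed by two upper-case hex digits.
		$bytes = preg_replace_callback(
			'/=([0-9A-F]{2})/',
			static function ( array $hex ): string {
				return chr( hexdec( $hex[1] ) );
			},
			str_replace( '_', ' ', $text )
		);
	} else {
		return null;
	}

	// Assumption for brevity: only UTF-8 and US-ASCII are accepted.
	$charset = strtoupper( $charset );
	if ( 'UTF-8' === $charset ) {
		return 1 === preg_match( '//u', $bytes ) ? $bytes : null;
	}
	if ( 'US-ASCII' === $charset ) {
		return 1 !== preg_match( '/[\x80-\xFF]/', $bytes ) ? $bytes : null;
	}
	return null;
}

/**
 * Hypothetical decoder showing the three error modes described above.
 */
function sketch_decode_header( string $header, string $on_error = 'preserve-errors' ): ?string {
	$failed = false;

	$decoded = preg_replace_callback(
		'/=\?([^?\s]+)\?([^?\s]+)\?([^?\s]*)\?=/',
		static function ( array $m ) use ( $on_error, &$failed ): string {
			$bytes = sketch_decode_word( $m[1], $m[2], $m[3] );
			if ( null !== $bytes ) {
				return $bytes;
			}

			$failed = true;

			// replace-errors swaps the whole word for U+FFFD;
			// preserve-errors leaves the raw text in place.
			return 'replace-errors' === $on_error ? "\u{FFFD}" : $m[0];
		},
		$header
	);

	// bail-on-error discards everything if any word failed.
	if ( $failed && 'bail-on-error' === $on_error ) {
		return null;
	}

	return $decoded;
}
```

With the input `'=?UTF-8?Q?=6f?='`, which uses lower-case hex digits, `preserve-errors` returns the input unchanged, `replace-errors` returns a single U+FFFD, and `bail-on-error` returns `null`.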