Ticket #4457 (closed defect (bug): fixed)

Opened 5 years ago

Last modified 3 years ago

WP does not properly encode UTF-8 mail per RFC 2047

Reported by: trauschus Owned by: westi
Priority: high Milestone: 2.7
Component: General Version: 2.2
Severity: critical Keywords: rfc2047 mail has-patch phpmailer 2nd-opinion early
Cc:

Description

RFC 2047, which is MIME Part 3, specifies that non-ASCII information in headers such as the RFC (2)822 Subject header must be properly encoded. WordPress gets it *mostly* right; however, it violates one very important rule (quoted from RFC 2047):

Each 'encoded-word' MUST encode an integral number of octets. The 'encoded-text' in each 'encoded-word' must be well-formed according to the encoding specified; the 'encoded-text' may not be continued in the next 'encoded-word'. (For example, "=?charset?Q?=?= =?charset?Q?AB?=" would be illegal, because the two hex digits "AB" must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.

However, I just received a mail from WordPress with the following subject header:

Subject:

=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2?= =?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=

The ’ (U+2019) is split mid-character, which is incorrect. The octet sequence E2 80 99 cannot be split per the standard, and this causes RFC 2047-compliant mailers such as Evolution to display the subject as transmitted (i.e., in quite an ugly manner).
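To see why a strict reader gives up on this header, each 'encoded-word' can be decoded on its own. The following is a minimal Python sketch, not WordPress or PHPMailer code; the helper names `q_payload_bytes` and `decodes_alone` are made up for illustration. It extracts the octets carried by one Q-encoded word and checks whether they form complete UTF-8 by themselves:

```python
import re

# Matches one Q-encoded 'encoded-word': =?charset?Q?encoded-text?=
_Q_WORD = re.compile(r"=\?([^?]+)\?[Qq]\?([^?]*)\?=")


def q_payload_bytes(encoded_word):
    """Return the raw octets carried by a single Q-encoded 'encoded-word'."""
    match = _Q_WORD.fullmatch(encoded_word)
    if match is None:
        raise ValueError("not a Q encoded-word: %r" % encoded_word)
    text = match.group(2).replace("_", " ")  # "_" stands for a space in Q
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "=":
            # "=XX" is one octet; int() raises ValueError if the hex is cut off
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


def decodes_alone(encoded_word):
    """True if the word's octets are well-formed UTF-8 without its neighbours."""
    try:
        q_payload_bytes(encoded_word).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Both words of the subject above fail this check: the first ends with a lone E2, and the second begins with the continuation octets 80 99.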

I used some code in a C# application to avoid this situation:

    // c is a byte representing one octet of a UTF-8 character.
    if (RetVal.Length > wrapLength) {
        if (((c & 0xC0) == 0xC0) || ((c & 0xC0) == 0x80)) {
            // Do nothing -- we cannot split here.
        } else {
            RetVal += Ending;
            Lines.Add(RetVal);
            RetVal = "\n " + Preamble;
        }
    }

Basically, if the octet ANDed with 0xC0 is equal to 0xC0 (the lead octet of a multi-octet sequence) or 0x80 (a continuation octet), the string should not be split at that location. It should not be terribly hard to express that in PHP as well. This is most likely not a security issue, though it could cause strange behavior in mail user agents (MUAs) that attempt to parse the malformed encoded-words anyway. Evolution follows the standard by choosing not to parse them. From RFC 2047:

6.3. Mail reader handling of incorrectly formed 'encoded-word's

It is possible that an 'encoded-word' that is legal according to the syntax defined in section 2, is incorrectly formed according to the rules for the encoding being used. For example:

(1) An 'encoded-word' which contains characters which are not legal for a particular encoding (for example, a "-" in the "B" encoding, or a SPACE or HTAB in either the "B" or "Q" encoding), is incorrectly formed.

(2) Any 'encoded-word' which encodes a non-integral number of characters or octets is incorrectly formed.

A mail reader need not attempt to display the text associated with an 'encoded-word' that is incorrectly formed. However, a mail reader MUST NOT prevent the display or handling of a message because an 'encoded-word' is incorrectly formed.

I have chosen the pri/sev high/crit because this is a standards-compliance issue, and might in remote situations be a security issue if there are particularly borked MUAs out there that do strange things with the header.
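For reference, the 0xC0 mask test described above can be written out in a few lines. This is a hedged Python sketch rather than the PHP that WordPress would need; `classify_octet` and `safe_split_offsets` are hypothetical names. It treats a position as a legal split point only when the octet that would start the next chunk is not a continuation octet (10xxxxxx):

```python
def classify_octet(octet):
    """Classify one UTF-8 octet by its top bits."""
    if (octet & 0x80) == 0x00:
        return "ascii"          # 0xxxxxxx: a complete one-octet character
    if (octet & 0xC0) == 0x80:
        return "continuation"   # 10xxxxxx: belongs to the preceding lead octet
    return "lead"               # 11xxxxxx: starts a multi-octet character


def safe_split_offsets(data):
    """Offsets i where cutting data into data[:i] and data[i:] keeps every
    character whole, i.e. data[i] does not continue a previous character."""
    return [i for i in range(1, len(data)) if (data[i] & 0xC0) != 0x80]
```

For the octets of "it’s" (69 74 E2 80 99 73) the usable offsets are 1, 2 and 5; offsets 3 and 4 fall inside the ’ and must be skipped.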

Attachments

4457.diff (3.2 KB) - added by takayukister 4 years ago.

Change History

  • Milestone changed from 2.2.1 to 2.2.2

That functionality is being handled by the PHPMailer class: http://phpmailer.sourceforge.net/

Check to see if the latest version of the class from sourceforge's SVN has the same problem.

I'm +1. Japanese users are also reporting the same problem.

I found that mb_encode_mimeheader() handles multi-octet header encoding better than PHPMailer, so I wrote a plugin using that function. Since WP 2.2 was released, many users have tried this plugin, and so far they report that it solves the problem.

It is not a good approach if an encoding other than UTF-8 is used or the mbstring library is not installed, though.

  • Milestone changed from 2.2.2 to 2.2.3
  • Keywords reporter-feedback added

Can someone please test this in practice against PHPMailer SVN? I am not able to do so. It would be great to improve PHPMailer here as well! takayukister, have you tried that? Have you reported this to the PHPMailer devs as well?

I think this problem is related to the fact that WordPress users submit their data as UTF-8 (that is the default WordPress setting), but WordPress then uses PHPMailer. In the PHPMailer documentation and source I did not find a word about support for multi-byte character sets or encodings in the header encoding. Did you? So the strings passed in are treated as ASCII strings.

The point of this issue is not the encoding itself but how encoded strings are split into encoded-word chunks. An 'encoded-word' is defined in RFC 2047 as:

Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between.

And as trauschus quoted from RFC 2047, "a multi-octet character may not be split across adjacent encoded-words". So the header of the mail he received should rightly be:

=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2=80=99?= =?UTF-8?Q?s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=

because '=E2=80=99' encodes a single character (’, U+2019) and cannot be split.

I think this issue would be solved if someone fixed EncodeHeader() in class-phpmailer.php to handle this splitting properly. I am trying, but have not yet succeeded.
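As an illustration of the splitting rule only (this is not the PHPMailer EncodeHeader() code, and it is Python rather than PHP), a Q encoder can avoid the problem by encoding one character at a time and starting a new encoded-word whenever the next character would not fit. The function names, the conservative set of unescaped characters, and the 75-character limit per encoded-word (taken from RFC 2047) are choices made for this sketch:

```python
# Octets left unescaped; everything else becomes "=XX". This conservative
# set is an assumption of the sketch, not what PHPMailer uses.
SAFE_OCTETS = set(
    b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/"
)


def q_encode_char(ch):
    """Q-encode all octets of one character as an indivisible piece."""
    pieces = []
    for octet in ch.encode("utf-8"):
        if octet == 0x20:
            pieces.append("_")              # space is written as "_"
        elif octet in SAFE_OCTETS:
            pieces.append(chr(octet))
        else:
            pieces.append("=%02X" % octet)
    return "".join(pieces)


def q_encode_subject_words(text, max_word_len=75):
    """Return a list of encoded-words, none of which splits a character."""
    prefix, suffix = "=?UTF-8?Q?", "?="
    budget = max_word_len - len(prefix) - len(suffix)
    words, current = [], ""
    for ch in text:
        piece = q_encode_char(ch)
        # Close the current word before adding a character that would overflow.
        if current and len(current) + len(piece) > budget:
            words.append(prefix + current + suffix)
            current = ""
        current += piece
    if current:
        words.append(prefix + current + suffix)
    return words
```

The words would then be joined with a folded line break ("\r\n" followed by a space) to form the header value. Because a character's octets are always appended together, the sequence =E2=80=99 can never straddle two words.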

For easy reproduction of this issue, I prepared two example texts for the mail subject.

1234日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル

This is Japanese text. This will be encoded with Base64 encoding.

[Trausch’s Little Home] Please moderate: "Well, it’s good I don’t use IE…"

And this is trauschus's original subject text (trauschus, sorry for my unauthorized use). This will be encoded with Quoted-printable encoding.

On current WordPress (+phpMailer), you see garbage characters in both of the above examples.

To make testing easy, I wrote a simple plugin for sending mails. Please use it. http://ideasilo.wordpress.com/2007/08/29/wp-mail-tester/
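The Base64 case follows the same rule as the Q case: every encoded-word has to carry whole characters. A hedged Python sketch (the function name and the 75-character limit per encoded-word are assumptions of this example, and it covers UTF-8 only) groups the octets of complete characters into chunks small enough to fit, then Base64-encodes each chunk separately:

```python
import base64


def b_encode_subject_words(text, max_word_len=75):
    """Return B-encoded words, each holding only complete UTF-8 characters."""
    prefix, suffix = "=?UTF-8?B?", "?="
    budget = max_word_len - len(prefix) - len(suffix)
    # Base64 turns every 3 octets into 4 characters (with padding), so this
    # is the largest number of octets whose encoding still fits the budget.
    max_octets = (budget // 4) * 3
    words, chunk = [], b""
    for ch in text:
        octets = ch.encode("utf-8")
        # Flush before adding a character that would overflow the chunk.
        if chunk and len(chunk) + len(octets) > max_octets:
            words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
            chunk = b""
        chunk += octets
    if chunk:
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    return words
```

Each word produced this way decodes to valid UTF-8 on its own, which is what a strict reader such as Evolution requires.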

  • Keywords needs-patch added; reporter-feedback removed
  • Owner changed from anonymous to westi
  • Status changed from new to assigned
  • Milestone changed from 2.2.3 to 2.3

Thank you for the feedback.

I will take a look at this issue.

I finally wrote a patch. It works for me, but more testing is strongly needed.

This fix covers UTF-8 only. Covering more encodings would be better, but I don't know them well.

  • Keywords has-patch phpmailer 2nd-opinion added; needs-patch removed

I think this issue should really be pushed upstream.

I don't think we should deviate too far from upstream.

I'm not sure this will be fixed in time for 2.3.

I don't think there is a phpmailer upstream anymore.

  • Milestone changed from 2.3 to 2.4 (next)
  • Keywords early added
  • Milestone changed from 2.5 to 2.6

Moving to 2.6; too late to change this for 2.5.

The ChangeLog of phpMailer states:

Version 2.0.2 (June 04 2008) ...

  • addressed issue of multibyte characters in subject line and truncating

...

So it should be easy to fix now?

  • Status changed from assigned to closed
  • Resolution set to fixed
  • Milestone changed from 2.9 to 2.7

Trunk already has 2.0.2 in.

Closing as fixed in 2.7; please re-open if issues still exist.
