Make WordPress Core

Opened 18 years ago

Closed 16 years ago

#4457 closed defect (bug) (fixed)

WP does not properly encode UTF-8 mail per RFC 2047

Reported by: trauschus Owned by: westi
Milestone: 2.7 Priority: high
Severity: critical Version: 2.2
Component: General Keywords: rfc2047 mail has-patch phpmailer 2nd-opinion early
Focuses: Cc:

Description

RFC 2047, which is MIME Part Three, specifies that non-ASCII information in headers such as the RFC (2)822 Subject header must be properly encoded. WordPress gets it *mostly* right; however, it violates one very important rule (quoted from RFC 2047):

Each 'encoded-word' MUST encode an integral number of octets. The
'encoded-text' in each 'encoded-word' must be well-formed according
to the encoding specified; the 'encoded-text' may not be continued in
the next 'encoded-word'. (For example, "=?charset?Q?=?=
=?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
must follow the "=" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.

However, I just received a mail from WordPress with the following subject header:

Subject:

=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2?=
=?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=

The ’ (U+2019) is split in mid-character, which is incorrect. The octet sequence E2 80 99 cannot be split per the standard, and this causes RFC 2047-compliant mailers such as Evolution to display the subject as transmitted (i.e., in quite an ugly manner).
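One way to confirm the problem is to decode each encoded-word on its own and see whether its octets form valid text in the declared charset. The following is a minimal sketch in Python (not PHP, and not part of WordPress or PHPMailer); the names `q_decode` and `malformed_words` are hypothetical, and only the "Q" encoding is handled.

```python
import re

# Matches one RFC 2047 encoded-word: =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([QqBb])\?([^?\s]*)\?=")


def q_decode(text):
    """Decode the "Q" encoding: "_" is a space, "=XX" is one octet."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "_":
            out.append(0x20)
            i += 1
        elif ch == "=":
            # Raises ValueError if fewer than two hex digits follow.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(ch))
            i += 1
    return bytes(out)


def malformed_words(header):
    """Return the Q-encoded words whose octets do not decode on their own."""
    bad = []
    for match in ENCODED_WORD.finditer(header):
        charset, encoding, text = match.groups()
        if encoding.upper() != "Q":
            continue  # the "B" encoding is out of scope for this sketch
        try:
            q_decode(text).decode(charset)
        except (UnicodeDecodeError, ValueError, LookupError):
            bad.append(match.group(0))
    return bad


reported = (
    '=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2?=\n'
    ' =?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?='
)
for word in malformed_words(reported):
    print("malformed:", word)
```

For the header quoted above, both encoded-words are flagged: the first ends with a lone E2, and the second begins with the continuation octets 80 99.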

I used some code in a C# application to avoid this situation:

// c is a byte representing one octet of a UTF-8 character.
if (RetVal.Length > wrapLength) {
    if (((c & 0xC0) == 0xC0) ||
        ((c & 0xC0) == 0x80)) {
        // Do nothing -- we cannot split here.
    } else {
        RetVal += Ending;
        Lines.Add(RetVal);
        RetVal = "\n " + Preamble;
    }
}

Basically, if the octet ANDed with 0xC0 equals 0xC0 or 0x80, the string should not be split at that location. It should not be terribly hard to express that in PHP as well. This is most likely not a security issue, though it could cause strange behavior in mail user agents (MUAs) that attempt to parse the encoded-words anyway. Evolution is following the standard by choosing not to parse the encoded-words. From RFC 2047:

6.3. Mail reader handling of incorrectly formed 'encoded-word's

It is possible that an 'encoded-word' that is legal according to the
syntax defined in section 2, is incorrectly formed according to the
rules for the encoding being used. For example:

(1) An 'encoded-word' which contains characters which are not legal
    for a particular encoding (for example, a "-" in the "B"
    encoding, or a SPACE or HTAB in either the "B" or "Q" encoding),
    is incorrectly formed.

(2) Any 'encoded-word' which encodes a non-integral number of
    characters or octets is incorrectly formed.

A mail reader need not attempt to display the text associated with an
'encoded-word' that is incorrectly formed. However, a mail reader
MUST NOT prevent the display or handling of a message because an
'encoded-word' is incorrectly formed.

I have chosen the pri/sev high/crit because this is a standards-compliance issue, and might in remote situations be a security issue if there are particularly borked MUAs out there that do strange things with the header.
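The 0xC0 mask test described above can be sketched in Python for illustration. This is a variation rather than a port of the C# fragment: it inspects the octet that would begin the next chunk instead of the octet just appended, and it assumes the input is already valid UTF-8. The function names are hypothetical.

```python
def is_lead_octet(c):
    """True for the first octet of a multi-octet UTF-8 character (11xxxxxx)."""
    return (c & 0xC0) == 0xC0


def is_continuation_octet(c):
    """True for a trailing octet of a multi-octet character (10xxxxxx)."""
    return (c & 0xC0) == 0x80


def safe_split_points(data):
    """Indexes i where data[:i] and data[i:] both hold only whole characters.

    Assumes data is valid UTF-8: a split is safe whenever the octet that
    would start the second half is not a continuation octet.
    """
    return [i for i in range(1, len(data)) if not is_continuation_octet(data[i])]


octets = "it’s".encode("utf-8")  # the octets 69 74 E2 80 99 73
print(safe_split_points(octets))
```

In this example the positions inside E2 80 99 are excluded, so a splitter that only cuts at the returned indexes cannot separate the three octets of ’.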

Attachments (1)

4457.diff (3.2 KB) - added by takayukister 17 years ago.


Change History (17)

#1 @foolswisdom
18 years ago

  • Milestone changed from 2.2.1 to 2.2.2

#2 @Otto42
18 years ago

That functionality is being handled by the PHPMailer class: http://phpmailer.sourceforge.net/

Check to see if the latest version of the class from sourceforge's SVN has the same problem.

#3 @takayukister
18 years ago

I'm +1. Japanese users are also reporting the same problem.

I found that mb_encode_mimeheader() handles multi-octet header encoding better than PHPMailer, so I wrote a plugin using that function. Since WP 2.2 was released, many users have tried this plugin, and so far they report that it solves the problem.

It's not a good approach if an encoding other than UTF-8 is used or if the mbstring library is not installed, though.
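As a point of comparison outside PHP, the same "let a library do the splitting" idea can be sketched with Python's standard `email.header` module. This is only an illustration of the approach, not the plugin's code; the helper `words_decode_alone` is hypothetical, and the subject reuses the Japanese sample text from comment #8.

```python
from email.header import Header, decode_header, make_header

# Long enough that the encoded form needs more than one encoded-word.
subject = "1234" + "日本語テキストのサンプル" * 4

# Encode the subject as RFC 2047 encoded-words, folded across lines.
encoded = Header(subject, charset="utf-8").encode()

# Decode it again to confirm nothing was lost in the round trip.
decoded = str(make_header(decode_header(encoded)))


def words_decode_alone(header_value):
    """Check that every encoded-word is valid UTF-8 when decoded by itself."""
    for word in header_value.split():
        raw, _charset = decode_header(word)[0]
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


print(encoded)
print(decoded == subject, words_decode_alone(encoded))
```

The per-word check is the property this ticket is about: a header can survive a round trip even if a character straddles two encoded-words, so the round trip alone does not show RFC 2047 compliance.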

#4 @foolswisdom
18 years ago

  • Milestone changed from 2.2.2 to 2.2.3

#5 @Nazgul
18 years ago

  • Keywords reporter-feedback added

#6 @hakre
17 years ago

Can someone please check this in practice against PHPMailer SVN? I am not able to do so. It would be great to improve PHPMailer here as well! takayukister, have you tried that? Have you reported feedback to the PHPMailer devs as well?

I think this problem is related to the fact that your WordPress users submit their data as UTF-8 (that is the standard WordPress setting), but WordPress then uses PHPMailer. In the PHPMailer documentation and source I did not find a word saying that its header encoding supports multi-byte character sets or encodings. Did you? So the strings passed in are encoded as if they were ASCII strings.

#7 @takayukister
17 years ago

The point of this issue is not the encoding itself but how encoded strings are split into encoded-word chunks. An 'encoded-word' is defined in RFC 2047 as:

Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between.

And as trauschus quoted from RFC 2047, "a multi-octet character may not be split across adjacent encoded-words". So the header of the mail he received should rightly be:

=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2=80=99?= =?UTF-8?Q?s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=

because '=E2=80=99' is a single character (’, U+2019) and cannot be split.
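A splitter that respects this rule can build each encoded-word one character at a time, so that all octets of a character land in the same word. The following is a minimal sketch in Python under stated assumptions, not the PHPMailer EncodeHeader() logic: the names `q_encode_char` and `encode_subject` are hypothetical, the charset is fixed to UTF-8, and it uses the strictest "Q" safe set from RFC 2047 (letters, digits, and `!*+-/`), so it encodes characters such as `[` and `"` that the header above leaves literal. The 75-character limit per encoded-word comes from RFC 2047.

```python
# Octets that may appear literally in "Q" encoded-text (strictest RFC 2047 set).
SAFE = set(b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/")


def q_encode_char(ch):
    """Q-encode one character; all of its octets stay together."""
    out = []
    for octet in ch.encode("utf-8"):
        if octet == 0x20:
            out.append("_")
        elif octet in SAFE:
            out.append(chr(octet))
        else:
            out.append("=%02X" % octet)
    return "".join(out)


def encode_subject(text, max_word_len=75):
    """Return a list of encoded-words, each holding only whole characters."""
    prefix, suffix = "=?UTF-8?Q?", "?="
    budget = max_word_len - len(prefix) - len(suffix)
    words, current = [], ""
    for ch in text:
        piece = q_encode_char(ch)
        # Start a new encoded-word before the character, never inside it.
        if current and len(current) + len(piece) > budget:
            words.append(prefix + current + suffix)
            current = ""
        current += piece
    if current:
        words.append(prefix + current + suffix)
    return words


subject = '[Trausch’s Little Home] Please moderate: "Well, it’s good I don’t use IE…"'
for word in encode_subject(subject):
    print(word)
```

In a real header the returned words would be joined with a line break followed by a space to fold the Subject line.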

I think this issue would be solved if someone fixed EncodeHeader() in class-phpmailer.php to handle this splitting properly. I am trying, but have not succeeded yet.

#8 @takayukister
17 years ago

To make this issue easy to reproduce, I have prepared example text for the mail subject.

1234日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル

This is Japanese text. This will be encoded with Base64 encoding.

[Trausch’s Little Home] Please moderate: "Well, it’s good I don’t use IE…"

And this is trauschus's original subject text (trauschus, sorry for using it without permission). This will be encoded with Quoted-printable encoding.

With current WordPress (+ PHPMailer), you see garbage characters in both of the examples above.

To make testing easy, I wrote a simple plugin for sending mails. Please use it.
http://ideasilo.wordpress.com/2007/08/29/wp-mail-tester/

#9 @westi
17 years ago

  • Keywords needs-patch added; reporter-feedback removed
  • Milestone changed from 2.2.3 to 2.3
  • Owner changed from anonymous to westi
  • Status changed from new to assigned

Thank you for the feedback.

I will take a look at this issue.

@takayukister
17 years ago

#10 @takayukister
17 years ago

I finally wrote a patch. It works for me, but more testing is strongly needed.

This fix covers UTF-8 only. Covering more encodings would be better, but I don't know them well.

#11 @westi
17 years ago

  • Keywords has-patch phpmailer 2nd-opinion added; needs-patch removed

I think this issue should really be pushed upstream.

I don't think we should deviate too far from upstream.

I'm not sure this will be fixed in time for 2.3.

#12 @ryan
17 years ago

I don't think there is a phpmailer upstream anymore.

#13 @ryan
17 years ago

  • Milestone changed from 2.3 to 2.4 (next)

#14 @westi
17 years ago

  • Keywords early added
  • Milestone changed from 2.5 to 2.6

Moving to 2.6; it is too late to change this for 2.5.

#15 @thenlich
16 years ago

The ChangeLog of phpMailer states:

Version 2.0.2 (June 04 2008)
...

  • addressed issue of multibyte characters in subject line and truncating

...

So it should be easy to fix now?

#16 @westi
16 years ago

  • Milestone changed from 2.9 to 2.7
  • Resolution set to fixed
  • Status changed from assigned to closed

Trunk already has 2.0.2 in.

Closing as fixed in 2.7. Please re-open if issues still exist.
