Opened 18 years ago
Closed 16 years ago
#4457 closed defect (bug) (fixed)
WP does not properly encode UTF-8 mail per RFC 2047
Reported by: | | Owned by: | |
---|---|---|---|
Milestone: | 2.7 | Priority: | high |
Severity: | critical | Version: | 2.2 |
Component: | General | Keywords: | rfc2047 mail has-patch phpmailer 2nd-opinion early |
Focuses: | Cc: |
Description
RFC 2047, which is MIME Part 3, specifies that non-ASCII information in headers such as the RFC (2)822 Subject header must be properly encoded. WordPress gets it *mostly* right; however, it violates one very important rule (quoted from RFC 2047):
Each 'encoded-word' MUST encode an integral number of octets. The
'encoded-text' in each 'encoded-word' must be well-formed according
to the encoding specified; the 'encoded-text' may not be continued in
the next 'encoded-word'. (For example, "=?charset?Q?=?=
=?charset?Q?AB?=" would be illegal, because the two hex digits "AB"
must follow the "=" in the same 'encoded-word'.)
Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.
However, I just received a mail from WordPress with the following subject header:
Subject:
=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2?=
=?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=
The ’ (U+2019) is split in mid-character, which is incorrect. The octet sequence E2 80 99 cannot be split per the standard, and this causes RFC 2047-compliant mailers such as Evolution to display the subject as transmitted (i.e., as raw encoded text, which is quite ugly).
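To make the failure concrete, here is a minimal sketch in Python (used purely for illustration; `q_decode` and `is_well_formed` are hypothetical helper names, not WordPress or PHPMailer functions). It decodes the encoded-text of each encoded-word from the header above on its own and checks whether the result is valid UTF-8, which is what a strict mail reader effectively has to do.

```python
def q_decode(encoded_text: str) -> bytes:
    """Decode the encoded-text of an RFC 2047 "Q" encoded-word into octets."""
    out = bytearray()
    i = 0
    while i < len(encoded_text):
        ch = encoded_text[i]
        if ch == "=":
            # "=XX" is one octet written as two hex digits.
            out.append(int(encoded_text[i + 1:i + 3], 16))
            i += 3
        elif ch == "_":
            # "_" stands for a space in the Q encoding.
            out.append(0x20)
            i += 1
        else:
            out.append(ord(ch))
            i += 1
    return bytes(out)


def is_well_formed(encoded_text: str) -> bool:
    """True if this encoded-text decodes to whole UTF-8 characters by itself."""
    try:
        q_decode(encoded_text).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Encoded-text of the two encoded-words from the Subject header above.
first = '[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2'
second = '=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6"'

print(is_well_formed(first))   # False: ends with a lone lead octet E2
print(is_well_formed(second))  # False: starts with continuation octets 80 99
```

Neither encoded-word is decodable in isolation, so a reader that follows section 6.3 of RFC 2047 may decline to display either of them.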
I used some code in a C# application to avoid this situation:
// c is a byte representing an octet of a UTF-8 character.
if (RetVal.Length > wrapLength) {
    if (((c & 0xC0) == 0xC0) || ((c & 0xC0) == 0x80)) {
        // Do nothing: we cannot split here.
    } else {
        RetVal += Ending;
        Lines.Add(RetVal);
        RetVal = "\n " + Preamble;
    }
}
Basically, if the octet ANDed with 0xC0 equals 0xC0 or 0x80, the string should not be split at that location. It should not be terribly hard to express that in PHP as well. This is most likely not a security issue, though it could cause strange behavior in mail user agents (MUAs) that attempt to parse the malformed encoded-words anyway. Evolution is following the standard by choosing not to parse the encoded-words. From RFC 2047:
6.3. Mail reader handling of incorrectly formed 'encoded-word's
It is possible that an 'encoded-word' that is legal according to the
syntax defined in section 2, is incorrectly formed according to the
rules for the encoding being used. For example:
(1) An 'encoded-word' which contains characters which are not legal
for a particular encoding (for example, a "-" in the "B"
encoding, or a SPACE or HTAB in either the "B" or "Q" encoding),
is incorrectly formed.
(2) Any 'encoded-word' which encodes a non-integral number of
characters or octets is incorrectly formed.
A mail reader need not attempt to display the text associated with an
'encoded-word' that is incorrectly formed. However, a mail reader
MUST NOT prevent the display or handling of a message because an
'encoded-word' is incorrectly formed.
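The octet test from the C# fragment earlier in this description can be narrowed slightly. As a sketch in Python (the name `safe_split_points` is made up for illustration, and the input is assumed to be valid UTF-8): only continuation octets have the bit pattern 10xxxxxx, so a split is unsafe exactly when the octet that would begin the next encoded-word satisfies `(c & 0xC0) == 0x80`.

```python
from typing import List


def safe_split_points(data: bytes) -> List[int]:
    """Return every index i where data[:i] and data[i:] both hold whole characters.

    Assumes `data` is valid UTF-8. A split at i is safe when data[i] is not a
    continuation octet (10xxxxxx), i.e. when it begins a new character.
    """
    return [i for i in range(1, len(data)) if (data[i] & 0xC0) != 0x80]


octets = "it\u2019s".encode("utf-8")  # b'it\xe2\x80\x99s'
print(safe_split_points(octets))      # [1, 2, 5]
```

Indexes 3 and 4 are missing from the result because they fall inside the three-octet sequence E2 80 99.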
I have chosen the pri/sev high/crit because this is a standards-compliance issue, and might in remote situations be a security issue if there are particularly borked MUAs out there that do strange things with the header.
Attachments (1)
Change History (17)
#3
@
18 years ago
I'm +1. Japanese users are also reporting the same problem.
I found that mb_encode_mimeheader() handles multi-octet header encoding better than PHPMailer does, so I wrote a plugin that uses the function. Since WP 2.2 was released, many users have tried this plugin, and so far they report that it solves the problem.
It's not a good approach if an encoding other than UTF-8 is used or if the mbstring library is not installed, though.
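For comparison only, and not part of WordPress or PHPMailer: Python's standard library has a header encoder with the same goal as mb_encode_mimeheader(), `email.header.Header`. The sketch below assumes UTF-8 input; `encode_subject` is a hypothetical wrapper name. In CPython this encoder appears to build each encoded-word character by character, so every encoded-word it emits should be decodable on its own.

```python
from email.header import Header, decode_header


def encode_subject(subject: str, max_line_len: int = 76) -> str:
    """Encode a Subject value as folded RFC 2047 encoded-words (UTF-8)."""
    return Header(subject, charset="utf-8", maxlinelen=max_line_len).encode()


subject = "1234" + "日本語テキストのサンプル" * 3
encoded = encode_subject(subject)
print(encoded)

# Decode every encoded-word independently; a split character would raise here.
for word in encoded.split():
    (payload, _charset), = decode_header(word)
    payload.decode("utf-8")
```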
#6
@
17 years ago
Can someone please test this against PHPMailer SVN? I am not able to do so. It would be great to improve PHPMailer here as well. takayukister, have you tried that? Have you reported this to the PHPMailer developers as well?
- http://phpmailer.sourceforge.net/
- http://phpmailer.svn.sourceforge.net/viewvc/phpmailer/trunk/phpmailer/class.phpmailer.php?view=markup
I think this problem is related to the fact that your WordPress users submit their data as UTF-8 (that is the standard WordPress setting), but WordPress then uses PHPMailer. In the PHPMailer documentation and source I did not find a word saying that it supports multi-byte character sets or encodings in header encoding. Did you? So the strings passed in are encoded as if they were ASCII strings.
#7
@
17 years ago
The point of this issue is not the encoding itself but how encoded strings are split into encoded-word chunks. 'encoded-word' is defined in RFC 2047 as:
Generally, an "encoded-word" is a sequence of printable ASCII characters that begins with "=?", ends with "?=", and has two "?"s in between.
And as trauschus quoted from RFC 2047, "a multi-octet character may not be split across adjacent encoded-words". So the header of the mail he received should rightly be:
=?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_"Well,_it=E2=80=99?=
=?UTF-8?Q?s_good_I_don=E2=80=99t_use_IE=E2=80=A6"?=
because '=E2=80=99' encodes a single character (’, U+2019) and cannot be split.
I think this issue will be solved if someone fixes EncodeHeader() in class-phpmailer.php to handle this splitting properly. I'm trying, but I have not succeeded yet.
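As a rough sketch of the splitting behavior described in this comment (written in Python for illustration, not the PHPMailer code and not the eventual patch; `q_encode_char` and `fold_q` are hypothetical names), the idea is to Q-encode one whole character at a time and start a new encoded-word whenever the next character would not fit. The sketch assumes UTF-8 and encodes more punctuation than the header above does, which is allowed but makes the output longer.

```python
from typing import List

# Characters left unencoded; everything else becomes "=XX" (space becomes "_").
SAFE = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789!*+-/"
)


def q_encode_char(ch: str) -> str:
    """Q-encode a single character; all of its octets stay in one token."""
    if ch == " ":
        return "_"
    if ch in SAFE:
        return ch
    return "".join("=%02X" % b for b in ch.encode("utf-8"))


def fold_q(subject: str, max_word_len: int = 75) -> List[str]:
    """Split a subject into Q encoded-words without breaking any character."""
    prefix, suffix = "=?UTF-8?Q?", "?="
    budget = max_word_len - len(prefix) - len(suffix)
    words, current = [], ""
    for ch in subject:
        token = q_encode_char(ch)
        # Close the current encoded-word if this whole character will not fit.
        if current and len(current) + len(token) > budget:
            words.append(prefix + current + suffix)
            current = ""
        current += token
    if current:
        words.append(prefix + current + suffix)
    return words


subject = '[Trausch’s Little Home] Please moderate: "Well, it’s good I don’t use IE…"'
print("\n ".join(fold_q(subject)))
```

Because the unit of work is a character rather than an octet, a sequence such as =E2=80=99 can never straddle two encoded-words.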
#8
@
17 years ago
To make this issue easy to reproduce, I prepared two example mail subjects.
1234日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル日本語テキストのサンプル
This is Japanese text. This will be encoded with Base64 encoding.
[Trausch’s Little Home] Please moderate: "Well, it’s good I don’t use IE…"
And this is trauschus's original subject text (trauschus, sorry for using it without permission). This will be encoded with quoted-printable encoding.
On current WordPress (with PHPMailer), you see garbage characters with both of the examples above.
To make testing easy, I wrote a simple plugin for sending mail. Please use it.
http://ideasilo.wordpress.com/2007/08/29/wp-mail-tester/
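The Base64 case can be illustrated the same way. The following Python sketch (an illustration only; `fold_b` is a hypothetical name and UTF-8 is assumed) groups whole characters into chunks before Base64-encoding them, so that each encoded-word carries an integral number of characters, and then checks the Japanese sample text above by decoding every encoded-word separately.

```python
import base64
from typing import List


def fold_b(subject: str, max_word_len: int = 75) -> List[str]:
    """Split a subject into Base64 encoded-words on character boundaries."""
    prefix, suffix = "=?UTF-8?B?", "?="
    budget = max_word_len - len(prefix) - len(suffix)
    # Base64 turns every 3 octets into 4 characters, so cap the octets per word.
    max_octets = (budget // 4) * 3
    words, chunk = [], b""
    for ch in subject:
        encoded = ch.encode("utf-8")
        # Start a new encoded-word rather than cut this character in two.
        if chunk and len(chunk) + len(encoded) > max_octets:
            words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
            chunk = b""
        chunk += encoded
    if chunk:
        words.append(prefix + base64.b64encode(chunk).decode("ascii") + suffix)
    return words


subject = "1234" + "日本語テキストのサンプル" * 4
for word in fold_b(subject):
    payload = word[len("=?UTF-8?B?"):-2]
    # Raises UnicodeDecodeError if a character had been split across words.
    print(word, "->", base64.b64decode(payload).decode("utf-8"))
```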
#9
@
17 years ago
- Keywords needs-patch added; reporter-feedback removed
- Milestone changed from 2.2.3 to 2.3
- Owner changed from anonymous to westi
- Status changed from new to assigned
Thank you for the feedback.
I will take a look at this issue.
#10
@
17 years ago
I finally wrote a patch. It works for me, but more testing is strongly needed.
This fix covers UTF-8 only. Covering more encodings would be better, but I don't know them well.
#11
@
17 years ago
- Keywords has-patch phpmailer 2nd-opinion added; needs-patch removed
I think this issue should really be pushed upstream.
I don't think we should deviate too far from upstream.
I'm not sure this will be fixed in time for 2.3.
#14
@
17 years ago
- Keywords early added
- Milestone changed from 2.5 to 2.6
Moving to 2.6; it is too late to change this for 2.5.
That functionality is being handled by the PHPMailer class: http://phpmailer.sourceforge.net/
Check to see if the latest version of the class from sourceforge's SVN has the same problem.