id,summary,reporter,owner,description,type,status,priority,milestone,component,version,severity,resolution,keywords,cc,focuses
4457,WP does not properly encode UTF-8 mail per RFC 2047,trauschus,westi,"RFC 2047, which is MIME Part 3, specifies that non-ASCII information in headers such as the RFC (2)822 Subject header must be properly encoded. WordPress gets it *mostly* right; however, it violates one very important rule (quoted from RFC 2047):

Each 'encoded-word' MUST encode an integral number of octets. The 'encoded-text' in each 'encoded-word' must be well-formed according to the encoding specified; the 'encoded-text' may not be continued in the next 'encoded-word'. (For example, ""=?charset?Q?=?= =?charset?Q?AB?="" would be illegal, because the two hex digits ""AB"" must follow the ""="" in the same 'encoded-word'.)

Each 'encoded-word' MUST represent an integral number of characters. A multi-octet character may not be split across adjacent 'encoded-word's.

However, I just received a mail from WordPress with the following Subject header:

Subject: =?UTF-8?Q?[Trausch=E2=80=99s_Little_Home]_Please_moderate:_""Well,_it=E2?=
 =?UTF-8?Q?=80=99s_good_I_don=E2=80=99t_use_IE=E2=80=A6""?=

The ’ (U+2019) is split mid-character, which is incorrect. The octet sequence E2 80 99 cannot be split per the standard, and splitting it causes RFC 2047-compliant mailers such as Evolution to display the subject as transmitted (i.e., in quite an ugly manner).

I used some code in a C# application to avoid this situation:

    // c is a byte representing an octet of a UTF-8 character.
    if (RetVal.Length > wrapLength) {
        if (((c & 0xC0) == 0xC0) || ((c & 0xC0) == 0x80)) {
            // Do nothing -- we cannot split here.
        } else {
            RetVal += Ending;
            Lines.Add(RetVal);
            RetVal = ""\n "" + Preamble;
        }
    }

Basically, if the octet ANDed with 0xC0 equals 0xC0 (the lead byte of a multi-octet character) or 0x80 (a continuation byte), the string should not be split at that location. It should not be terribly hard to express that in PHP as well.
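
For illustration only, here is a minimal sketch of the same idea in Python rather than PHP. It is not the WordPress or PHPMailer code: the names SAFE, q_encode_char and encode_words are hypothetical, and the 75-character cap is the RFC 2047 limit on a single 'encoded-word'. Instead of testing individual octets, it Q-encodes one character at a time and only starts a new 'encoded-word' between characters, so a multi-octet sequence can never be split.

```python
# Octets that may appear literally in Q-encoded text (a conservative set).
SAFE = frozenset(b'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!*+-/')


def q_encode_char(ch):
    # Q-encode every UTF-8 octet of a single character as one unit.
    out = []
    for byte in ch.encode('utf-8'):
        if byte == 0x20:
            out.append('_')            # space is written as underscore
        elif byte in SAFE:
            out.append(chr(byte))
        else:
            out.append('=%02X' % byte)
    return ''.join(out)


def encode_words(text, max_len=75):
    # Split text into encoded-words, breaking only between whole characters.
    # Assumption: max_len leaves room for at least one encoded character.
    prefix, suffix = '=?UTF-8?Q?', '?='
    budget = max_len - len(prefix) - len(suffix)
    words, current = [], ''
    for ch in text:
        piece = q_encode_char(ch)
        if current and len(current) + len(piece) > budget:
            words.append(prefix + current + suffix)
            current = ''
        current += piece
    if current:
        words.append(prefix + current + suffix)
    return words


# A small max_len forces several splits on part of the subject above.
for word in encode_words('it’s good I don’t use IE…', max_len=30):
    print(word)
# =?UTF-8?Q?it=E2=80=99s_good_?=
# =?UTF-8?Q?I_don=E2=80=99t_us?=
# =?UTF-8?Q?e_IE=E2=80=A6?=
```

When folding a real header, the words would be joined with a line break followed by a space, and the first line would also need room for the header name, which this sketch leaves out.
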

This is most likely not a security issue, though it could cause strange behavior in mail user agents (MUAs) that attempt to parse the malformed 'encoded-word's anyway. Evolution follows the standard by choosing not to parse them. From RFC 2047:

6.3. Mail reader handling of incorrectly formed 'encoded-word's

It is possible that an 'encoded-word' that is legal according to the syntax defined in section 2, is incorrectly formed according to the rules for the encoding being used. For example:

(1) An 'encoded-word' which contains characters which are not legal for a particular encoding (for example, a ""-"" in the ""B"" encoding, or a SPACE or HTAB in either the ""B"" or ""Q"" encoding), is incorrectly formed.

(2) Any 'encoded-word' which encodes a non-integral number of characters or octets is incorrectly formed.

A mail reader need not attempt to display the text associated with an 'encoded-word' that is incorrectly formed. However, a mail reader MUST NOT prevent the display or handling of a message because an 'encoded-word' is incorrectly formed.

I have chosen priority/severity high/critical because this is a standards-compliance issue, and it might in rare situations be a security issue if there are particularly borked MUAs out there that do strange things with the header.",defect (bug),closed,high,2.7,General,2.2,critical,fixed,rfc2047 mail has-patch phpmailer 2nd-opinion early,,