Opened 5 months ago
Last modified 5 months ago
#63974 new enhancement
.mo file loaded as UTF-8 by default - non-standard and ignoring Content-Type headers
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | I18N | Keywords: | |
| Focuses: | | Cc: | |
Description
In `wp-includes/l10n/class-wp-translation-file-mo.php`:

```
protected function parse_file(): bool {
```

Create `a.po` with the content:

```
msgid ""
msgstr ""

msgid "Foo bar"
msgstr "Foo bär"
```

Run `msgfmt --check-format --no-hash a.po -o a.mo`.

Now read `a.mo` with `parse_file()`: you'll get "Foo bär". This is wrong, since there is no `Content-Type: text/plain; charset=UTF-8\n` header in the file, so the .mo should not be read as if it were UTF-8.
This can be seen with:
`msgunfmt --no-wrap a.mo -o b.po`
Not only will you get an error for "invalid multibyte sequence" here, that invalid multibyte sequence will also be removed.
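For reference, the charset declaration lives in the .mo file's header entry (the translation of the empty msgid). The following is a minimal sketch in Python, not WordPress's PHP parser, of extracting that declaration, assuming the GNU gettext binary layout documented in the manual; the function name `mo_charset` is made up for illustration:

```python
import re
import struct

def mo_charset(data: bytes):
    """Return the charset declared in a .mo file's header entry, or None."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0x950412DE:          # little-endian file
        u32 = lambda off: struct.unpack_from("<I", data, off)[0]
    elif magic == 0xDE120495:        # big-endian file
        u32 = lambda off: struct.unpack_from(">I", data, off)[0]
    else:
        raise ValueError("not a .mo file")
    count, orig_tab, trans_tab = u32(8), u32(12), u32(16)
    for i in range(count):
        o_len, o_off = u32(orig_tab + 8 * i), u32(orig_tab + 8 * i + 4)
        if data[o_off:o_off + o_len] != b"":   # header entry has empty msgid
            continue
        t_len, t_off = u32(trans_tab + 8 * i), u32(trans_tab + 8 * i + 4)
        header = data[t_off:t_off + t_len]
        m = re.search(rb"charset=([^\s\\]+)", header, re.IGNORECASE)
        return m.group(1).decode("ascii") if m else None
    return None
```

A parser along these lines could distinguish "charset declared as UTF-8", "other charset declared", and "no declaration at all", which is exactly the distinction this ticket is about.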
If user data ends up in a .mo (which it does, since some multilingual plugins will save translatable strings, including customer reviews, in .mo files), this could potentially open an attack vector.
Otherwise, this behavior at least leads to non-interoperable .mo files being silently accepted (e.g. plugins providing .mo files that cannot be converted back to .po files with basic unix tools, since any multibyte character will make msgunfmt fail).
Change History (11)
#2
follow-up:
↓ 5
@
5 months ago
Mind sharing a link to the spec where the Content-Type header is discussed?
e.g. https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html
The character encoding of the strings can be any standard ASCII-compatible encoding, such as UTF-8, ISO-8859-1, EUC-JP, etc., as long as the encoding’s name is stated in the header entry (see Filling in the Header Entry)
This is not specific to .mo files, though; in general, text/plain content files by default always have ANSI encoding unless otherwise specified.
Are you talking about downloading user generated files? Because many are read from the file system which contain no headers.
What do you mean?
That class doesn’t read in UTF-8 so much as it reads in bytes without performing any kind of character decoding. What are you suggesting the input bytes should be if not UTF-8? How should one represent an inherently multibyte character such as one of the thousands of CJK characters?
If no Content-Encoding header is specified, it should be treated as ANSI. Since ANSI does not support multibyte characters, this means those should be removed. This is how msgunfmt handles it.
Since there can be all kinds of encodings, I'd suggest to essentially limit support to UTF-8 (which is what 99.99% is anyway) and treat everything else as ANSI (possibly with a doing_it_wrong if there is a Content-Type header specified, which WP doesn't support)
Also, I’m not sure if you were meaning to write non-UTF-8 in your examples, but it could help if you specify the byte values you are writing instead of the text, which here will be UTF-8 by nature of the browser you used to write it.
No, I meant those literally as UTF-8. Just copy/paste them for repro.
#3
@
5 months ago
- Keywords reporter-feedback added
@dmsnell It's not about HTTP headers, it's about the Content-Type header line in .mo translation files.
@kkmuffme This sounds all a bit theoretical to me. Can you help me clarify a few things:
- Is the behavior the same for `WP_Translation_File_MO` and the "old" `MO` class? I assume yes, because I don't see any UTF-8 references in that code.
- "some multilingual plugins will save translatable strings" -> are any of them saving files not as UTF-8?
- "leads to non-interoperable .mo files being silently accepted" -> what's the downside of being a bit less strict?
If your suggestion is to reject any MO files with Content-Type other than UTF-8, then I'd classify this ticket as an enhancement.
#4
@
5 months ago
- Keywords reporter-feedback removed
Thanks for the link. I thought you were talking about HTTP headers. I don’t know the .mo format that well, other than just last week I happened to rewrite the parser locally to avoid CPU cache thrashing it currently does when reading the strings.
in general text/plain content files by default always have ANSI encoding unless otherwise specified
Since .mo files contain headers I guess we can talk about this, but as alluded to in my earlier comment, files are just files and carry no encoding metadata, headers, or content type. Regardless, most of what I’ve seen across various software ecosystems is an assumption of UTF-8, not of US-ASCII; at least, in over a decade nothing other than UTF-8 has been the assumption.
Also one glaring break from this recommendation to use UTF-8 as the sole encoding of interchange is the way Excel handles CSV files, in which case it assumes that the CSV file was encoded with whatever was the default system encoding on the platform Excel is running on during the 90s or early 00s. That’s another story though.
What do you mean?
This was a typo I have since corrected; I was asking if your report was in context of downloading files where HTTP headers would be sent alongside the file data.
If no Content-Encoding header is specified, it should be treated as ANSI. Since ANSI does not support multibyte characters, this means those should be removed. This is how msgunfmt handles it.
I’d love to hear some varied opinions on this. At a minimum we could check for an encoding indication, and we could check if the strings are valid as UTF-8. It looks like gettext started defaulting to UTF-8 production in mid-2023, which feels recent. Here is why I propose this instead:
- Most tools which are unaware of character encodings (which seems to be most that I encounter) actually produce or assume UTF-8.
- If it validates as UTF-8, it's highly improbable that it's any other encoding.
- UTF-8 is US-ASCII compatible, so treating something as UTF-8 that was written with bytes 0x00–0x7F decodes into literally the same text.
Also, as @swissspidy points out, there's another complication, which is that WordPress is largely encoding-agnostic. There's a very good chance that if we get a .mo whose bytes are not valid according to the declared header, they may still be relevant. I am not a huge fan of this, but it's part of the legacy.
I think at a minimum though we could attempt to answer the questions:
- Do we read an explicit encoding?
- Do the bytes validate in that encoding?
- Are we able to convert that encoding into UTF-8?
That might be a nice enhancement.
#5
in reply to:
↑ 2
;
follow-up:
↓ 7
@
5 months ago
Replying to kkmuffme:
If no Content-Encoding header is specified, it should be treated as ANSI. Since ANSI does not support multibyte characters, this means those should be removed.
This seems like a bad idea to me, since silently deleting (non-)characters often leads to security vulnerabilities:
https://www.unicode.org/reports/tr36/tr36-15.html#Deletion_of_Noncharacters
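The hazard can be shown in a few lines of illustrative Python: deleting invalid bytes can splice a dangerous token back together, while replacement keeps it broken.

```python
raw = b"<scr\xc2ipt>"  # a stray UTF-8 lead byte splits the word "script"

# Deletion (what stripping bytes >= 0x80, i.e. treating the file as
# "ANSI", amounts to) reassembles the dangerous token:
deleted = bytes(b for b in raw if b < 0x80).decode("ascii")
print(deleted)   # <script>

# Replacement with U+FFFD keeps the token broken:
replaced = raw.decode("utf-8", errors="replace")
print(replaced)  # <scr�ipt>
```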
#6
follow-up:
↓ 10
@
5 months ago
are any of them saving files not as UTF-8?
What do you mean? A file does not have an inherent encoding, which is why the "Content-Type" header should be used.
what's the downside of being a bit less strict?
One (and the biggest pain point for me) is that .mo files aren't valid and can't be converted back to .po files with basic msgunfmt tool (since all UTF-8 will get stripped).
If your suggestion is to reject any MO files with Content-Type other than UTF-8, then I'd classify this ticket as an enhancement.
The suggestion is:
- if no Content-Type header with charset=UTF-8 is provided, treat the file as ANSI (= remove all multibyte)
- if Content-Type header with charset=UTF-8 is provided, no change
- any other charset= provided, add a doing_it_wrong that WP does not currently support this charset, and will instead treat this file as UTF-8
This change won't affect most sites at all, and for those plugins that are affected (i.e. they provide .mo files with the Content-Type header missing) it's a minimal, simple fix in their release pipeline which ensures that their .mo files adhere to the standard more closely and prevents encoding issues.
#7
in reply to:
↑ 5
;
follow-up:
↓ 9
@
5 months ago
Replying to siliconforks:
Replying to kkmuffme:
If no Content-Encoding header is specified, it should be treated as ANSI. Since ANSI does not support multibyte characters, this means those should be removed.
This seems like a bad idea to me, since silently deleting (non-)characters often leads to security vulnerabilities:
https://www.unicode.org/reports/tr36/tr36-15.html#Deletion_of_Noncharacters
Good point! What do you suggest?
The current behavior allows for e.g. https://www.unicode.org/reports/tr36/tr36-15.html#Security_Levels_and_Alerts
For .mo files, for obvious performance reasons, all sanitizing etc. happens upon creation, so that the read/access process is as fast as possible (hash tables, etc.). This is why, when reading, one should adhere more strictly to the standard.
See also e.g. https://www.gnu.org/software/gettext/manual/html_node/msgfmt-Invocation.html
By default, messages are converted to UTF-8 encoding before being stored in a MO file; this helps avoiding conversions at run time, since nowadays most locales use the UTF-8 encoding.
#8
@
5 months ago
If we want to avoid all security issues we can use the upcoming mb_scrub_utf8() being prepared in #63863
Good point! What do you suggest?
If it validates as UTF-8 it’s probably UTF-8, even if it reports another encoding (at least this has been the case for the top 300,000 domains I scanned on the Internet).
If it contains no header, we can call mb_scrub_utf8(). If it contains a header and we can understand the encoding and validate it, then we can convert.
Otherwise a _doing_it_wrong() sounds great, and we can mb_scrub_utf8() the data to ensure it doesn’t introduce any invalid or malicious content.
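The proposed mb_scrub_utf8() is still being prepared in PHP core, so its exact semantics are an assumption here, but the conventional meaning of "scrubbing", replacing each invalid sequence with U+FFFD rather than deleting it, can be approximated in illustrative Python:

```python
def scrub_utf8(raw: bytes) -> str:
    # Replace invalid sequences with U+FFFD instead of deleting them,
    # avoiding the token-reassembly hazard described in UTR #36.
    return raw.decode("utf-8", errors="replace")
```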
It may be helpful to avoid the term “ANSI” in the context of text encoding. Most common encodings are US-ASCII compatible (bytes 0x00–0x7F all mean the same thing), but the upper range (bytes 0x80–0xFF) is mutually incompatible across the family of encodings commonly referred to as “ANSI”.
#9
in reply to:
↑ 7
@
5 months ago
Replying to kkmuffme:
Good point! What do you suggest?
I don't know - if I were writing a brand new function today to parse .mo files, I would probably just return false if there is no Content-Type header. That would effectively force everyone to use a Content-Type header. But changing an existing function would break backward compatibility and cause problems for anyone currently using .mo files without Content-Type.
It might be reasonable to add a _doing_it_wrong() warning.
#10
in reply to:
↑ 6
@
5 months ago
Replying to kkmuffme:
One (and the biggest pain point for me) is that .mo files aren't valid and can't be converted back to .po files with basic msgunfmt tool (since all UTF-8 will get stripped).
Note that you can use msgunfmt --escape to output the bytes as C escape characters.
#11
@
5 months ago
- Type changed from defect (bug) to enhancement
are any of them saving files not as UTF-8?
What do you mean? A file does not have an inherent encoding, which is why the "Content-Type" header should be used.
This change won't affect most sites at all and for those plugins that are affected (= they provide .mo files with the Content-Type header missing) it's a minimal, simple fix in their release pipeline which ensures that their .mo files adhere to the standard more exactly and prevent encoding issues.
You keep mentioning plugins that are doing it wrong by not including this header. Can you name some of these affected plugins? It would be great to have some examples to gauge impact and use for testing and outreach.