Opened 4 months ago
Last modified 4 months ago
#64427 assigned enhancement
Implement WHATWG MIME Sniffing
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | Awaiting Review | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | General | Keywords: | has-patch |
| Focuses: | Cc: |
Description (last modified by )
WordPress interacts with MIME media types in a number of places in ad-hoc ways. These tend to perform unique parsing, which means they lack consistency and correspondence with how browsers parse MIME types.
For reference, the MIME media type is often derived from the HTTP Content-Type header, which contains a type, a subtype, and an optional list of parameters. A common example is text/html indicating HTML, image/png indicating a PNG image, and application/xhtml+xml indicating the XML serialization of HTML.
Discrepancies arise when the supplied type string doesn’t match exactly the anticipated forms. For example, with parameters or whitespaces:
text/html; charset=utf8text/plain ; charset="utf-8;iso-2022-jp
For reliable, consistent, and secure parsing, WordPress should implement the WHATWG MIME Sniffing specification, which the browsers will implement. This will ensure agreement between the server and clients on what content type strings refer to which media types.
The MIME Sniffing specification governs Content-Type values and also a limited number of media types from files based on byte sequences in the “resource header” — the first 1445 bytes of the binary data. These binary sniffs are mostly limited to media types relevant to browsers.
Ad-hoc parsers
wp_finalize_template_enhancement_output_bufferintemplate.phpwp_mailinpluggable.php- Global code in
wp-mail.php wp_staticize_emoji_for_emailinformatting.phpdownload_urlinwp-admin/includes/file.phpdiscover_pingback_server_uriincomment.phpdo_encloseinfunctions.phpget_file_extension_by_mime_typein
Related tickets
Change History (8)
This ticket was mentioned in PR #10640 on WordPress/wordpress-develop by @dmsnell.
4 months ago
#2
- Keywords has-patch added
@dmsnell commented on PR #10640:
4 months ago
#5
Once again WPCS thinks vertical alignment is a crime…
Thoughts after letting this stew for a bit:
- Perhaps we rename this to something more specific, like
WP_MIME_Sniffer?
@dmsnell commented on PR #10640:
4 months ago
#6
I’ve made some large updates to the interface and I’m overall much happier with the new primary methods. There are still some parts of the sniffing algorithm left to implement (webm and mp3 without id3) but those are trivial to add.
Still need to think about essence() as a term, because while it’s specific and matches the specification, it may not be widely recognized or understood. Still also need to think about levels of confidence and the relationship with the detected Apache bug.
@dmsnell commented on PR #10640:
4 months ago
#7
Finishing up today’s work I have made substantial-enough changes that I am marking this ready for review:
- _All_ of the
web-platform-testsMIME sniffing tests pass for parsing of _supplied MIME types_. These are things like HTTPContent-Typeheaders, email (RFC822) headers, and certain HTML attributes. - The
from_content_type()method is something I added when I thought I was misunderstanding the spec and overlooked a typo. It goes beyond what is required by the MIMESNIFF spec but could be useful, especially since we occasionally see multipleContent-Typeheaders and are likely to ingest malicious values. It has no tests currently but there might be tests covering the “getting, decoding, and splitting” part of the algorithm in the HTTP Structured Fields spec. - Minimization needs a better name, such as
get_privacy_sensitive_type_string(). It’s specifically designed in response toPerformanceResourceTimingaddingContent-Typeand meant to strip away things that might fingerprint users. essence()is something I don’t understand yet. I think it’s only relevant within the spec and may not be something outside users need or want. It’s effectively the _MIME type string_ returned fromserialize()when all parameters are stripped away. Unfortunately we cannot call itget_mime_type_string()because those _may_ contain parameters.- The parsing from binaries isn’t complete and I don’t know if it needs to have a place in 7.0.0. We _can_ of course finish it and verify it with tests, but the
web-platform-testssuite isn’t bulky for them: I think they have one test file per type. They _could_ provide a nice security uplift for server-side content/media type sniffing though, far better than file-extension-based methods.
The tests cover a lot of edge cases.
OK (987 tests, 1676 assertions)
A few related specifications integrate here and it makes me think about creating a WHATWG spec family inside wp-includes/whatwg, though it’s not limited to WHATWG in this case.
- The Encoding spec brings the concept of names and labels for character encoding schemes (i.e. “charsets”) and I think that could be very helpful for the HTML API and other places where Core attempts to make sense of user-supplied encodings. Curiously,
ascii,latin1, andiso-8859-1converge into the canonicalwindows-1252enshrining the reuse of the C1 controls even when that should cause a fatal. - The Fetch spec is full of HTTP semantics and I’d like to read more into it; I’m guessing there are valuable insights in there for WordPress.
- Fetch of course leans on RFC9651 for Structured Field Parsing, which should answer all of our questions on how to handle unexpected HTTP headers, duplicates, line-wrapping, comma-lists, etc… I find this spec a little less clear on the byte-level parsing of HTTP headers but it probably just requires more concentrated reading.
## What should be reviewed?
Please focus on _design review_: how will this be used? how _should_ it be used? Are there better names? Are there different constructions that will lead to more responsible use?
For example, I renamed from_string() to from_declaration() for two reasons:
- There are multiple conflicting contexts for how to sniff a MIME type. This involves the elephant in the PHProom that binary data is a
string!from_string()doesn’t help anyone understand _when_ they should call it or _with what kind of data_. - The “declaration” terminology hints more at the idea that some external source or metadata has made a _claim_ about a MIME type, yet that claim requires parsing.
Another example is the split in from_binary() where I wanted to emphasize the primacy of from_file() which only reads in up to 1,445 bytes rather than reading in a file which may potentially be multiple gigabytes in size. If I were looking for a method to sniff a type of a file I would start with ->file and see what auto-completes. Renaming from_binary() to from_binary_file_contents() made sense to me that it would convey the sense that it’s operating on file data _and_ hint that there could be a simpler function, from_file(), to reach for.
Anyway, this is a bit of rambling but there is a risk in implementing specs like this that things are simpler than they appear. The spec provides us instructions on how to look at something and give the same answer that a browser would to the question, “what is this?” But the specs are general and not built for our unique or special use cases. We may choose to diverge where something better-meets the needs of WordPress developers, as we collectively see fit, as long as it doesn’t violate spec-compliance.
Thank you for your time and consideration on this!
@westonruter commented on PR #10640:
4 months ago
#8
how will this be used? how _should_ it be used?
Could this PR (or a new sub-PR) implement this new MIME type parser to replace the ad-hoc parsers listed in the ticket? This would more easily demonstrate that it satisfies the current use cases.
Trac ticket: Core-64427
Introduces the
WP_Mime_Typeclass for parsing MIME types from sources such as HTTPContent-Typeheaders.