WordPress.org

Make WordPress Core

Opened 4 weeks ago

Last modified 28 hours ago

#39963 new enhancement

MIME Alias Handling

Reported by: blobfolio Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: Media Keywords:
Focuses: Cc:

Description

WordPress currently only supports a single MIME type for a given extension. This is good enough in many cases, but begins to cause problems like #39550 when a library references an alternative MIME type (like audio/mpeg vs audio/mp3).

We probably don't want to introduce changes to the basic MIME functions as that could have crazy-far-reaching consequences, but we could add a wp_check_mime_alias($filename, $mime, $allowed_mimes) function, and query it in places where validation-by-MIME is happening.

This would require a comprehensive collection of MIMEs, past, present, vernacular, etc. I am actually already maintaining such a database for another PHP library that combines results from Nginx, Apache, freedesktop.org, and IANA. So let's start from the assumption that the necessary data already exists for retrieving all possible MIME types for a given extension.

I can whip up a patch with the added functionality, but wanted to first get some feedback on where/how the data should be incorporated (it is a rather big list, a PHP array would go on for miles.. I'm keeping it in a JSON file currently), and what existing areas might need to plug into that.

@joemcgill, do you have any initial thoughts?

Attachments (2)

mimes.diff (145.2 KB) - added by blobfolio 6 days ago.
proof of concept
mimes.2.diff (1.0 MB) - added by blobfolio 28 hours ago.
handle application/CDFV2* types, implicit match of vnd.* variations, unit tests for variable XLS files

Download all attachments as: .zip

Change History (18)

#1 @blobfolio
3 weeks ago

To get the ball rolling, here's an example of what I was thinking: https://github.com/Blobfolio/blob-mimes/blob/master/build/WordPress/media-mimes.php

This script includes all the necessary data, and two functions.

First, a function to return all known MIME types for a given file extension:

<?php
/**
 * Return MIME aliases for a particular file extension.
 *
 * @since xxx
 *
 * @see {https://www.iana.org/assignments/media-types}
 * @see {https://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/mime.types}
 * @see {http://hg.nginx.org/nginx/raw-file/default/conf/mime.types}
 * @see {https://cgit.freedesktop.org/xdg/shared-mime-info/plain/freedesktop.org.xml.in}
 * @see {https://github.com/Blobfolio/blob-mimes}
 *
 * @param string $ext File extension.
 * @return array|bool MIME types. False on failure.
 */
function wp_get_mime_aliases( $ext = '' ) { ... }

And second, a function to check whether an arbitrary file extension/MIME pairing is valid. This also automatically checks for both type/subtype and type/x-subtype variants, the x- prefix being something commonly used during a file type's transformative years, but possibly missing from any official database.

<?php
/**
 * Check extension and MIME pairing.
 *
 * @since xxx
 *
 * @param string $ext File extension.
 * @param string $mime MIME type.
 * @return bool True/false.
 */
function wp_check_mime_alias( $ext = '', $mime = '' ) { ... }

This second function could be added to places like wp_check_filetype_and_ext where MIME data is pulled from somewhere variable (e.g. finfo). Rather than comparing $real_mime to the results of wp_check_filetype(), we can evaluate with wp_check_mime_alias( $type, $real_mime ).

This ticket was mentioned in Slack in #core-media by blobfolio. View the logs.


3 weeks ago

This ticket was mentioned in Slack in #core-media by blobfolio. View the logs.


3 weeks ago

#4 @blobfolio
2 weeks ago

The MIME list has been extended to include more results:

-- MacOS file types, namely those Apple didn't register with IANA)
-- parent classes (like application/ogg, which some systems return instead of specific types like video/ogg)

There's now a third function:

<?php
/**
 * Retrieve the "real" file type from the file.
 *
 * This extends `wp_check_filetype()` to additionally
 * consider content-based indicators of a file's
 * true type.
 *
 * The content-based type will override the name-based
 * type if available and included in the $mimes list.
 *
 * @since xxx
 *
 * @see wp_check_filetype()
 * @see wp_check_filetype_and_ext()
 *
 * @param string $file Full path to the file.
 * @param string $filename The name of the file (may differ from $file due to $file being in a tmp directory).
 * @param array $mimes Optional. Key is the file extension with value as the mime type.
 * @return array Values with extension first and mime type.
 */
function wp_check_real_filetype( $file, $filename = null, $mimes = null ) { ... }

The idea is to abstract the type-checking from other sanitizing/handling being performed in wp_check_filetype_and_ext() so that it can be called from other functions that might need a more robust type determination. wp_check_filetype_and_ext() could then be reduced so that it calls this function, then handles its specific validation checks.

#5 @blobfolio
13 days ago

Even with the patch that landed in 4.7.3 (#39550) a lot of users are still running into validation issues. When fileinfo returns an incorrect/old type, chances are it will be application/*, which wp_check_filetype_and_ext() subjects to the extra testing.

Thus landing MIME alias handling as a mitigation sooner rather than later would be good.

The database in wp_get_mime_aliases() is refreshed at least once monthly, and by its nature will need to be re-pushed to the WP Core indefinitely (gotta keep pace!). So any edge cases that might still be missing now should not be a blocking concern; with 1751 MIMEs at the time of this writing, it already addresses most problems people are likely to hit.

What I see as blocking:

(1) We need eyes on the functions themselves. Does anyone see any issues in format, structure, technique?

(2) The quickie patch at #40017 needs to land so exif fallbacks (for environments without fileinfo) can be added to wp_check_real_filetype().

(3) Layout feedback. Where should these functions ultimately go? I suggest we place the main payload function (wp_get_mime_aliases()) in its own file because of its size, while moving the others to wp-includes/functions.php. Those two functions can require_once the standalone MIME file so we aren't reading such a large file into memory unnecessarily.

With #1-3 out of the way, we can:

(4) Get a patch together, including a reduction of wp_check_filetype_and_ext() to remove duplicate behaviors, utilize alias checking.

(5) Add unit tests for the core.

(6) Profit!

This ticket was mentioned in Slack in #core-media by blobfolio. View the logs.


13 days ago

#7 follow-up: @eddiemcham
9 days ago

Would this also cover web fonts (x-font-woff => font-woff)? That would really help my team.

~ thx, Eddie

#8 in reply to: ↑ 7 @blobfolio
9 days ago

Replying to eddiemcham:

Would this also cover web fonts (x-font-woff => font-woff)? That would really help my team.

Yup, although application/font-woff isn't technically correct either (see e.g. https://www.iana.org/assignments/media-types/font/woff).

In a nutshell, the point of this enhancement is to deprovincialize WordPress' singular concept of MIMEs (in both time and space). Aside from increasing the explicit MIME relational data by over 17x what WordPress has currently (making it the largest single collection on the planet, haha), it also has automatic detection for whatever/x-variants (e.g. your x-font-woff vs font-woff, non-MIME parent class nonsense (like application/ms-office, which is not, nor has never been, a valid media type), and won't, by default, penalize generic application/octet-stream associations (which, while it can be a valid type, is more often just fileinfo's equivalent of a shrug).

WordPress isn't likely going to rollback or narrow the security fix that is causing all the collateral damage, but at least with MIME alias support, we can seriously mitigate almost all of the incorrect identifications people are seeing on various server environments (while, crucially, not defeating the purpose of the security fix in the first place!).

If you need a fix sooner rather than later, shoot me a message on Slack or email me (contact info can be found at the URL linked in my profile; it gets stripped if I try to post it directly here) and I can package this up as a quickie plugin for you. One of my clients is a font foundry and has had no problem uploading WOFF, WOFF2, TTF, OTF, SVG, etc., files to their site with this workaround.

I know you said you didn't want to add more overhead to your sites, which is understandable, but a surgical workaround is going to be a lot safer than allowing unfiltered uploads or disabling the security checks altogether. ;)

#9 @eddiemcham
8 days ago

Thanks, @blobfolio, for the offer. Let me talk this over with my colleagues and team leads before I dig any deeper. They (and/or our IT dept) may have additional ideas. If they give me the go-ahead, I'll reach out to you via Slack or some other means. :)

Last edited 8 days ago by eddiemcham (previous) (diff)

#10 follow-up: @eddiemcham
8 days ago

@blobfolio,

I'm testing a workaround where colleagues upload their WOFFs to an FTP folder I've created for them. Then I'll modify our CSS rendering code to look there for the WOFFs instead of the Media Library. No more MIME types, (super)admin lockouts, or security error messages.

Thanks for all your help, though! :)

#11 in reply to: ↑ 10 @blobfolio
8 days ago

Replying to eddiemcham:

@blobfolio,

I'm testing a workaround where colleagues upload their WOFFs to an FTP folder I've created for them. Then I'll modify our CSS rendering code to look there for the WOFFs instead of the Media Library. No more MIME types, (super)admin lockouts, or security error messages.

Glad to hear you found your own workaround! The only caution I'd throw out there is that unbridled FTP access could pose an even greater danger than anything conducted through the WordPress UX. If possible, I recommend using SSH keys instead of passwords for your users and jailing access to that one directory. :)

#12 @eddiemcham
8 days ago

Thanks for the heads-up. I'll note the SSH part.

This ticket was mentioned in Slack in #core-media by blobfolio. View the logs.


7 days ago

@blobfolio
6 days ago

proof of concept

#14 @blobfolio
6 days ago

I put together a proof of concept patch:
wp-includes/functions.php now contains:

-- wp_check_real_filetype() (function and filter)
-- wp_check_mime_alias() (function and filter)
-- wp_check_application_octet_stream (filter)
-- updated wp_check_filetype_and_ext() (see below)

wp-includes/media-mimes.php

-- wp_get_mime_aliases() (function and filter)

tests/phpunit/tests/functions.php

-- updated big5.txt test

This demonstrates the benefits of MIME alias handling by allowing for more robust type matching, increased upload file validation (ALL files are subject to type evaluation when possible), and provides some UX improvements (ALL incorrectly named files, if otherwise valid, are renamed with the correct extension).

wp_check_real_filetype() begins with a name-based approach (i.e. wp_check_filetype()). If that fails, the failure is passed on. If it succeeds, it attempts to evaluate the "real" type using EXIF (not yet implemented, waiting on #40017), or that failing, FILEINFO. If either evaluation succeeds, the "real" type is compared against the known aliases for the file extension. If the alias is good, the name-based type is returned (i.e. WordPress' hardcoded definitions take priority). If the real MIME does not match the extension, but it is whitelisted, that MIME and the *correct* extension are returned. If the "real" MIME is not whitelisted, false is returned. If no content-based evaluation can be performed, the name-based results are returned.

wp_check_mime_alias() has automatic handling for temporary x-subtype/subtype variations (e.g. application/font-woff and application/x-font-woff are considered equivalent). By default it also soft-matches application/octet-stream against any extension, as that tends to be the response returned by a server when it doesn't know what a file is. That behavior can be overridden using the wp_check_application_octet_stream filter. All checks are case-insensitive and strip out invalid characters.

wp_check_filetype_and_ext() is updated to call wp_check_real_filetype() instead of wp_check_filetype(). This covers whitelist checks and type-based evaluation. Renaming is now applied to all files, not just the small subset of image types in the original version. image/* and application/* checks are removed as unnecessary (all content is evaluated now). The ultimate determinations are still filterable as before.

All original PHPUnit tests pass, with the exception of test which passes big5.txt as a JPEG; because of the improvements, the result is correctly identified and accepted as a text/plain file. ;) This patch updates the test accordingly.

$ phpunit tests/phpunit/tests/functions.php 
Installing...
Running as single site... To run multisite, use -c tests/phpunit/multisite.xml
Not running ajax tests. To execute these, use --group ajax.
Not running ms-files tests. To execute these, use --group ms-files.
Not running external-http tests. To execute these, use --group external-http.
PHPUnit 5.4.6 by Sebastian Bergmann and contributors.

................................................................. 65 / 83 ( 78%)
..................                                                83 / 83 (100%)

Time: 918 ms, Memory: 24.00Mb

OK (83 tests, 565 assertions)

It is a lot to digest, I know. But the benefits are numerous. It mitigates the issues in #40175, but happens to do so in a way that increases upload security, and file integrity more generally (for example, users won't accidentally hear a screeching "MP3" because that MP3 is really an OGG).

The MIME database (media-mimes.php) will need to be indefinitely maintained, as MIME data is always changing. That data, however, is automatically, regularly re-built independently of WordPress. I would propose we aim to update the data once per major release, a task I am more than happy to adopt.

Speaking of, the MIME data in this patch is at roughly 1750 entries. While that data will never be 100% complete, as-is it already improves WordPress' chances of correctly identifying a file by over 2000%.

This ticket was mentioned in Slack in #core-media by blobfolio. View the logs.


6 days ago

@blobfolio
28 hours ago

handle application/CDFV2* types, implicit match of vnd.* variations, unit tests for variable XLS files

#16 @blobfolio
28 hours ago

New patch up! Changes include:

-- Rebuilt against current trunk
-- Catch generic application/CDFV2* MS types https://core.trac.wordpress.org/ticket/40175#comment:8
-- Implicit type checking now includes vnd.* variants
-- Added wp_check_mime_alias unit tests covering all the different types of XLS files found to date (now up to *4* different MIME types returned by any single instance of fileinfo, haha)

Note: See TracTickets for help on using tickets.