WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 2 years ago

#20057 new defect (bug)

Media upload for multi-webserver setups introduces a nasty race condition that could corrupt uploaded files

Reported by: archon810 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: major Version: 3.3.1
Component: Media Keywords:
Focuses: Cc:

Description

I am in the process of scaling a Wordpress blog with several million monthly pageviews, and I have designed a plan to scale it in the highest availability manner possible.

The setup
Each web server that hosts HTML/PHP, static files, and uploads will be replicated via rsync from time to time and sitting behind a load balancer. Rsync will run every 5min to 1 hour, and to mitigate the 404s in the uploads, I put together an nginx setup that automatically tries a different upstream server in its configuration when it encounters 404s. This allows any web server to go down at any time, and the system to run as if nothing happened. This also gives me freedom for rsyncing periodically rather than immediately and avoids 404s completely.

The problem
Now, the bug (note, I'm using Windows Live Writer which automatically names uploads image.png, but I could see this potentially happen without WLW too since WP seems to automatically name files on disk in case of collisions).

Let's say we have server A and B. The site name is foo.com. Let's also say B is out of date by an hour and a bunch of files got uploaded to A that aren't on B. Say, A has image.png and image2.png and B has only image.png.

Now, the issue is that if the load balancer directs the new post uploader (the new post contains a single image) to server B, the file that it will create on disk will be named image2.png rather than image3.png. So now B will have a file that's different from A's but is named the same way.

The main problem is that the file name is given based on what's available on the disk rather than according to the database. It's easy for this race condition (between rsyncs) to destroy the integrity of the files. Furthermore, and I've experienced this first hand), if you delete the attachment from the Wordpress UI, it could actually delete the wrong file from the wrong server as a result.

Seeing this, I can't continue with my scaling plan until file names are assigned by the database rather than the file system or I figure out how to mitigate that. I don't want to force all writers to use only server A for uploads using its direct IP, as I want HA (high availability).

Thank you.

Change History (20)

comment:1 archon8102 years ago

  • Cc admin@… added

comment:2 follow-up: dd322 years ago

I'm not sure checking the database for the existing files is more efficient than simply checking that the file exists on disk.

There's a filter on the unique filename handler (wp_unique_filename() in wp-admin/includes/file.php from memory) which would allow you to check the database for filename collisions (the filepath should be stored as a meta field for each attachment).

I'm not sure that it's something that is absolutely required for core itself, simply due to the unique nature of your setup (Not many others would expect a 404 failover, infact, most doing your style HA setup, would have static files served from a separate server to limit the load on the PHP servers, or utilise a shared network filestore).

comment:3 archon8102 years ago

@dd32, I think you're misunderstanding the issue. The problem is not that there's a collision, but that depending on the state of the rsync between the 2 servers, the file name that is going to be picked as the result of the collision detection is based on what's on the current server's disk. So, if it's missing a file (or 10), the name may get selected that was already assigned on server A, thus kind of poisoning the integrity of the uploads.

A file with the same name will then exist on both servers but will represent different images.

In fact, upon the next rsync, one of them will win and eat the other, depending on the direction of the sync. This is why keeping track of file names in the database and using the centralized database for name conflict resolution is crucial. There's only 1 database and many servers.

I think this request is very reasonable and has merits of a true HA setup. A single separate server becomes both your point of failure and point of scaling issues (in case it alone can't handle the traffic). Having a network filestore introduces high complexity and goes against KISS that I'm trying to use here (a good old fashioned proven rsync vs a network file system or a file system that replicates have a lot of idiosyncrasies and pitfalls. Debugging them could turn into a nightmare). I don't have the necessary means to set up a SAN or NAS in the colo, and file systems like gluster are prone to getting out of sync the same way rsync would, resulting in the same issue.

I also don't want to rely on plugins, like W3TC, to upload all files to a push CDN, since that makes me 100% reliant on that plugin and potentially destroys the site infrastructure if I ever need to switch it off.

So with that in mind, I first need to make sure you guys understand what the core issue is, and then urge you to consider a solution in the core.

As for the solution, naming uploaded files on disk based on the same algorithm as the post-name column rather than the file on disk would be ideal. For example, a recent upload I am looking at has post-name in the database "image-png-7096" while its on-disk name is uploads/2012/02/image75.png. If it were named image-7096.png or image-png-7096.png, it'd solve all our problems (so long as there's no race condition between querying the database and adding the new entry into it, in case 2 posts are getting uploaded at the same time - but I think it's already a solved problem).

comment:4 dd322 years ago

I completely understand your issue, and I completely understand that 99.95% of users will never need such a function, While the performance benefits of simply checking the disk are minimal over checking the database, checking the filesystem is more foolproof for the majority of users.

You can't get away from relying on a plugin or two(..or 5) in WordPress, Unless a feature is required for a large percentage of Installations, it's plugin material. If you don't want to rely on a 3rd party plugin, roll your own to serve your purpose, it's a trivial matter to add a filter to the unique filename checking procedure to also check the database, or to check with the other servers in the cluster to see if they have that file.

comment:5 archon8102 years ago

OK, thanks, I understand your stance. Other than using the filter, what are your top recommendations of how I should set this up? Perhaps you know of fool-proof plugins (preferably single-purposed) that could help, or have a recommendation on setting up a pure-software network file system that wouldn't get affected by this issue if it gets out of sync? I know you guys run wp.com, so this problem is already solved to a degree. Appreciate any hints, I've spent many months on crafting a solution, and I'm still not quite there yet.

Thanks again.

comment:6 in reply to: ↑ 2 SergeyBiryukov2 years ago

Replying to dd32:

There's a filter on the unique filename handler (wp_unique_filename() in wp-admin/includes/file.php from memory) which would allow you to check the database for filename collisions (the filepath should be stored as a meta field for each attachment).

wp_unique_filename() doesn't seem to have a filter: #19121

comment:7 dd322 years ago

wp_unique_filename() doesn't seem to have a filter: #19121

You're right, I thought you could specify $unique_filename_callback via a filter in wp_handle_upload() but seems you can't.

As an alternative, you can filter $file['name'] via the `wp_handle_upload_prefilter` filter, I would probably just prefix the files with the server it was uploaded on, but you should be able to rename the file to whatever you please at that point.

comment:8 archon8102 years ago

Finally got around to doing it - great idea about the server name.

However, my code doesn't appear to be doing anything - I believe it could be due to using xmlrpc to post/upload. If so, why is there such a discrepancy and how can it be avoided/worked around?

My code:

function my_upload_prefilter($file) {
  $file['name'] = "server." . $file['name'];
  return $file;
}
add_filter('wp_handle_upload_prefilter', 'my_upload_prefilter');

comment:9 archon8102 years ago

Looked around class-wp-xmlrpc-server.php, specifically in mw_newMediaObject(), but haven't spotted a particularly obvious way to accomplish this that would also stay in sync with the regular upload function - maintaining two versions of the file-changing code is cringe-worthy at best, and if two places are able to upload files without common bits, I'm getting nervous that there's a third one I'm not realizing at the moment.

comment:10 archon8102 years ago

Sorry to bug you guys, but is there a suggested workaround for this?

comment:11 archon8102 years ago

Are there missing filters that need to be added to the core? Please respond - this is still very much needed functionality.

comment:12 SergeyBiryukov2 years ago

Indeed, mw_newMediaObject() doesn't use wp_handle_upload_prefilter.

I'd suggest trying the sanitize_file_name filter.

comment:13 archon8102 years ago

Just tried sanitize_file_name, but it fires twice (at least during the xmlrpc method). I also tried to use the filter from https://core.trac.wordpress.org/attachment/ticket/19121/19121.2.patch, but that results in the same file being overwritten over and over (much worse) because the file name is modified too late and doesn't trigger a uniqueness check. My head is starting to spin from all the inconsistencies between the two upload functions (though it's probably three - there's one in atom-server too) and the side effects.

Any further suggestions, Sergey?

comment:14 archon8102 years ago

Upon further research, the second call to sanitize_file_name() happens via mw_newMediaObject() -> wp_upload_bits() -> wp_unique_filename() -> sanitize_file_name().

The first call is in mw_newMediaObject() -> sanitize_file_name().

Because of this, as I mentioned, sanitize_file_name() fires twice. I guess I could do a check on the prefix in my filter and skip applying it if it already exists, but it still feels a bit dirty and wrong to me.

comment:15 archon8102 years ago

Alright, this seems to work, I guess. I tested with xmlrpc and regular web uploads.

/**
* Prepend the server name to file uploads
*
* @param string $filename
* @param string $filename_raw
*/
function my_sanitize_file_name($filename, $filename_raw) {
  $prefix = trim(`hostname`)."_";

  if(!preg_match("/^$prefix/", $filename))
    $filename = $prefix . $filename;

  return $filename;
}
add_filter('sanitize_file_name', 'my_sanitize_file_name', 10, 2);

comment:16 markoheijnen2 years ago

Ticket #21292 will help this issue. The XML-RPC should use wp_handle_upload() instead of wp_upload_bits().
At this moment there are more issues in it since a lot of error checking is missing.

comment:17 follow-up: markoheijnen2 years ago

Looked into the issue more and wp_upload_bits is the right method to use and obvious it doesn't has wp_handle_upload_prefilter. It does has the filter: wp_upload_bits.

#19121 should fix this issue the correct way: add a filter to wp_unique_filename.

Maybe we should remove sanitize_file_name in the mw_newMediaObject. So it doesn't fire up twice.

comment:18 in reply to: ↑ 17 archon8102 years ago

Replying to markoheijnen:

Looked into the issue more and wp_upload_bits is the right method to use and obvious it doesn't has wp_handle_upload_prefilter. It does has the filter: wp_upload_bits.

#19121 should fix this issue the correct way: add a filter to wp_unique_filename.

Maybe we should remove sanitize_file_name in the mw_newMediaObject. So it doesn't fire up twice.

Great care needs to be taken there, see what I wrote above:

"I also tried to use the filter from https://core.trac.wordpress.org/attachment/ticket/19121/19121.2.patch, but that results in the same file being overwritten over and over (much worse) because the file name is modified too late and doesn't trigger a uniqueness check."

When I used that filter to prepend servername and uploaded 2 files with the same name (just using Windows Live Writer and not specifying a name, which results in "image.png"), I ended up with servername_image.png and that's it. It was overwritten by the 2nd upload. This is very bad and dangerous.

Ideally, both upload functions (xmlrpc and WP web) would fire the same logic and the same filter early enough to modify the file name.

comment:19 markoheijnen2 years ago

If you read my comment on #19121 you would see that the filter shouldn't be like that and that there should be a filter to define a default callback for unique_filename_callback.

Most of the functionality is the same. Missing a good filter for wp_unique_filename() is a pain in your case.

comment:20 archon8102 years ago

OK, fair enough. Looking forward to a clean solution.

Note: See TracTickets for help on using tickets.