WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 5 months ago

#42437 new defect (bug)

Thumbnails can overwrite other uploads if filename matches

Reported by: Viper007Bond Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.8.3
Component: Upload Keywords: needs-patch
Focuses: Cc:
PR Number:

Description

Imagine you're browsing some WordPress site and you find an image you want to steal you like and want to upload to your own site. Not knowing any better, you download and save a thumbnail, image-1024x768.png. Later on you upload a different image called image.png to your site. Assuming you haven't changed thumbnail sizes, the large thumbnail of the second image will overwrite the original first image.

I've attached two images that you can use to test. Notice that image-1024x768.png will become the red image instead of staying green.

One possible solution to this is to add a flag to the thumbnail generation function that toggles overwriting existing files. It should default to overwriting for backwards compatibility but not overwrite on initial upload. If not overwriting, then -2 could be added somewhere in the filename. The filename doesn't need to be predictable because it's stored in the metadata.

Attachments (3)

image-1024x768.png (3.9 KB) - added by Viper007Bond 2 years ago.
image.png (12.5 KB) - added by Viper007Bond 2 years ago.
42437.diff (1.2 KB) - added by pbiron 17 months ago.
proof-of-concept patch

Download all attachments as: .zip

Change History (28)

@Viper007Bond
2 years ago

#1 @Viper007Bond
2 years ago

Hmm, I didn't intend for image previews to show up.

This ticket was mentioned in Slack in #core-media by antpb. View the logs.


21 months ago

#3 follow-ups: @azaozz
21 months ago

Thinking we should be stripping things like -1024x768 from the ends of image names on upload. That will allow "proper" naming with -2, -3, etc. for duplicates and will prevent sub-sizes name conflicts.

Last edited 21 months ago by azaozz (previous) (diff)

#4 in reply to: ↑ 3 ; follow-up: @blobfolio
21 months ago

Replying to azaozz:

Thinking we should be stripping things like -1024x768 from the ends of image names on upload. That will allow "proper" naming with -2, -3, etc. for duplicates and will prevent sub-sizes name conflicts.

Having seen some absurdly long user filenames in my day, I like the idea of stripping trailing dimensions from new files before storing them.

Temporarily pop off the extension, then something like the following:
$filename = preg_replace('/(\-\d+x\d+)+$/', '', $filename);

Then make sure the filename isn't empty, add the extension back, and run the usual uniqueness filters.

That'll help things going forward, but won't stop collisions from media uploaded before the change. For that, we'd need to loop through all possible thumbnail names to check for collisions before settling on a source name.

#5 in reply to: ↑ 4 @azaozz
21 months ago

  • Milestone changed from Awaiting Review to Future Release

Replying to blobfolio:

Yep, thinking that will work for this edge case. Alternatively (as discussed in #core-media on Slack) we can have a more stringent name check that looks into the names for sub-sizes that will be generated from the uploaded image. It would add -1, -2, etc. to the original image name if a possible conflict with the names of the sub-sizes is detected.

This ticket was mentioned in Slack in #core-media by nosilver4u. View the logs.


21 months ago

#7 @desrosj
21 months ago

#43246 was marked as a duplicate.

This ticket was mentioned in Slack in #core-media by desrosj. View the logs.


21 months ago

This ticket was mentioned in Slack in #core-media by desrosj. View the logs.


21 months ago

#10 @desrosj
18 months ago

#44095 was marked as a duplicate.

#11 in reply to: ↑ 3 ; follow-ups: @pbiron
18 months ago

Replying to azaozz:

Thinking we should be stripping things like -1024x768 from the ends of image names on upload. That will allow "proper" naming with -2, -3, etc. for duplicates and will prevent sub-sizes name conflicts.

While stripping dimensions from filenames on upload would probably be the easiest technical solution, doing so might cause UX disconnects for some users.

For example, in the art world, it's quite common for artists to put the dimensions of their work into their filenames, e.g., a painter does an oil painting of a sunset on a 150cmx150cm canvas, takes a photo of the artwork and names it sunset-150x150.jpg (where, in their mind, the -150x150 suffix is the canvas size and not the dimensions of the photo of that painting). They upload this photo. Sometime later they go to /wp-admin/uploads.php and try to search for the photo, but they can't remember exactly what they named the file...was it sunset-150x150.jpg, sundown-150x150.jpg or here-comes-the-night-150x150.jpg (and nor what the caption was)? All they remember is that they put -150x150 in the filename and when they search for 150x150, they get the dreaded "No media files found." result...and think their image has somehow gotten deleted.

note: this example is based on the one in #44095, which I opened before being directed to this ticket.

Granted, that UX issue isn't a show stopper, but it's something to consider.

#12 in reply to: ↑ 11 @pbiron
18 months ago

I bumped into the problem behind this ticket yesterday while debugging a func I wrote to hook into attachment_url_to_postid, in an attempt to get around the problem of attachment_url_to_postid() not finding the attachment ID if the URL is for an intermediate size of the upload (I starting writing that func since there has been no movement on #41816, my attempt to fix that problem...which I opened 8 mos ago).

In particular, after uploading foo.jpg, I did

function my_attachment_url_to_postid_filter( $url ) {
    if ( $post_id ) {
        // attachment already found, so return it
        return $post_id;
    }

    // the following is simplified from my actual func, for clarity

    // strip any dimensions from the URL and try attachment_url_to_postid() again
    $modified_url = preg_replace( '/-\d+-\d+(?=\..*$)'/, '', $url );

    remove_filter( 'attachment_url_to_postid', __FUNCTION__, 10 );
    $post_Id = attachment_url_to_postid( $modified_url );
    add_filter( 'attachment_url_to_postid', __FUNCTION__, 10, 2 );

    return $post_id;
}
add_filter( 'attachment_url_to_postid', 'my_attachment_url_to_postid_filter', 10, 2 );

$id = attachment_url_to_postid( 'http://host/wp-content/uploads/foo.jpg' );
$id_for_intermediate = attachment_url_to_postid( 'http://host/wp-content/uploads/foo-150x150.jpg' );
// $id should now be === $id_for_intermediate

only to find out that $id_for_intermediate was the ID for a previously uploaded file named foo-150x150.jpg. That is, the if ( $post_id ) {} test at the top of my filter func was succeeding because there was an unknown to me attachment with _wp_attached_file postmeta with meta_value = 'foo-150x150.jpg') :-(

So, in addition to the serious data loss issue of previous uploads being overwritten identified in this ticket, the name collision problems between original image filenames containing "dimension"-like sub-strings and generated intermediate size filenames can have other undesirable "downstream" effects.

Last edited 18 months ago by pbiron (previous) (diff)

#13 in reply to: ↑ 3 @pbiron
18 months ago

Replying to azaozz:

Thinking we should be stripping things like -1024x768 from the ends of image names on upload. That will allow "proper" naming with -2, -3, etc. for duplicates and will prevent sub-sizes name conflicts.

Looking back through the Slack thread, I see this comment by @desrosj

I would not be in favor of this. While WordPress may try to use that filename, that could be a very purposeful filename.

with which I agree.

Another example of a "purposeful" filename that contains a "dimension"-like sub-string that is completely unrelated to the image dimensions is our-store-is-open-24x7.jpg :-)

#14 follow-up: @joemcgill
18 months ago

The best way I can think of avoiding these kinds of conflicts is to add a short hash to filenames of intermediate image sizes to ensure that this is unique. The main question would be whether it would be whether we needed to build that hash from something that is unchanging (e.g. the original file name) rather than something that could change (e.g., current server time) so that regenerating images would result in consistent image file names as ones already used in post content.

#15 in reply to: ↑ 14 @pbiron
18 months ago

Replying to joemcgill:

The best way I can think of avoiding these kinds of conflicts is to add a short hash to filenames of intermediate image sizes to ensure that this is unique.

How would such a hash be any better than the -1, -2, etc that wp_unique_filename() adds?

[just seeking clarification, not criticizing the idea]

#16 in reply to: ↑ 11 ; follow-up: @azaozz
17 months ago

Replying to pbiron:

...doing so might cause UX disconnects for some users.

Right, and generally there is no point in fixing one rare edge case and possibly introducing another.

The best way I can think of avoiding these kinds of conflicts is to add a short hash to filenames of intermediate image sizes

We already make the file names of intermediate image sizes longer. As far as I remember there was a problem with exceeding the allowed file name length for certain filesystems. Adding another (possibly long) string to the file names may trigger that.

How about my other idea: if the original file name ends with something like -600x800.jpg rename it and append -1, -2, etc. the same way we do for all other naming conflicts in WP.

The (basic) code to implement this would be:

if ( preg_match( '/.+-\n+x\n+$/', $file_name ) ) {
  // Always append `-1` to file names that can potentially match other subsize file names.
  $i = 1;
  $full_name = $file_name . "-$i" . $file_extension;

  while ( is_file( $uploads_dir . $full_name ) ) {
    $i++;
    $full_name = $file_name . "-$i" . $file_extension;
  }
}

This will give us file names like sunset-150x150-1.jpg, sundown-150x150-2.jpg or here-comes-the-night-150x150-1.jpg (the same names you will get if you upload the same image several times). These names are a lot more "readable" and close to the original than something like sunset-27624fea676d86f7-150x150.jpg for subsizes.

Last edited 17 months ago by azaozz (previous) (diff)

#17 in reply to: ↑ 16 @pbiron
17 months ago

Replying to azaozz:

How about my other idea: if the original file name ends with something like -600x800.jpg rename it and append -1, -2, etc. the same way we do for all other naming conflicts in WP.

This will give us file names like sunset-150x150-1.jpg, sundown-150x150-2.jpg or here-comes-the-night-150x150-1.jpg (the same names you will get if you upload the same image several times). These names are a lot more "readable" and close to the original than something like sunset-27624fea676d86f7-150x150.jpg for subsizes.

I just did a few tests on pathological cases with this idea implemented and it seems to work great. Look for a proof-of-concept patch shortly.

@pbiron
17 months ago

proof-of-concept patch

#18 follow-up: @blobfolio
17 months ago

Unfortunately new source attachments with a dimension-like suffix naming structures are only one small part of the issue. The full scope of the problem is that any thumbnail for an attachment might collide with another file in the uploads folder.

For example, say a site user uploaded an image called image-123x123.jpg a year ago. (No patch can fix the past. Haha.) Then today, they decide to upload a file called image.jpg. If we are only looking at the source file, no collision would be detected, however once WP crunches out the 123x123 thumbnail size, a collision will happen (the original file from a year ago will be blown away).

Beyond suffixes, we also have to deal with the new thumbnails WordPress is generating for non-image attachments like PDFs, as well as all the other unregistered files that end up in a site's uploads folder (manually edited thumbnails, files from plugins, etc.)

For a more complete fix, WordPress needs to pre-generate a list of possible thumbnail paths and run its collision checks in a loop. If any of the proposed files collide, -1 the source name, pre-generate a new list, and check again. Rinse and repeat until all 5 million files are proved unique. Haha.

#19 @pbiron
17 months ago

The patch I just added is just "proof-of-concept", and is clearly not ready for merge!

It implementis the idea in @azaozz's comment.

Additionally, it also appends -1, etc to original filenames that no not contain dimension-like sub-strings if their intermediate size filenames would conflict with original filenames that did contain such sub-strings that were uploaded prior to the patch being applied. I have no idea how to write unit tests for that part of it! There's got to be a better way of implementing this part than using glob() but I couldn't think of it off the top of my head.

Unfortunately, my dev environment is incapable of generating thumbnails for PDF uploads, so I can't test whether the patch causes problems with the fix in #39875.

I'd also note that with this patch applied my attachment_url_to_postid filter correctly resolves attachment IDs for URLs for intermediate sized images (provided the attachment was uploaded after the patch is applied).

#20 in reply to: ↑ 18 @pbiron
17 months ago

Replying to blobfolio:

For example, say a site user uploaded an image called image-123x123.jpg a year ago. (No patch can fix the past. Haha.) Then today, they decide to upload a file called image.jpg. If we are only looking at the source file, no collision would be detected, however once WP crunches out the 123x123 thumbnail size, a collision will happen (the original file from a year ago will be blown away).

Actually, the patch I just added does handle this case...although, as I mentioned, the way it does so is sub-optimal.

#21 follow-up: @joemcgill
17 months ago

How would such a hash be any better than the -1, -2, etc that wp_unique_filename() adds?

For intermediate image file names, I think it makes sense to try to keep the names for all images in a set consistent. In this case, the issue is that one of the intermediate size names generated by WP is conflicting with the name of an original file that was uploaded to the server. We could append a -1 only to the intermediate size that has the conflict, but then you end up with a set of images like:

  • filename.jpg
  • filename-150x150.jpg
  • filename-300x400-1.jpg
  • filename-600x800.jpg
  • etc.

If we wanted consistent names, we would first need to loop through all names that are going to be created and check if any of those files already exist on the server before we generate the intermediate image files.

How about my other idea: if the original file name ends with something like -600x800.jpg rename it and append -1, -2, etc. the same way we do for all other naming conflicts in WP.

My main concern with this approach is that there are too many unpredictable reasons people might have for intentionally naming files which match this pattern, and renaming files, as opposed to appending a -1, -2, etc. seems more destructive.

I recommend a hash (or similar) approach for intermediate image file names because it would fulfill the following requirements:

  1. Modify intermediate image file names rather than original file names to avoid collisions, since those files are generated by WP.
  2. Ensure originality without the need to calculate and check for collisions during the process of generating intermediate images (a process that is already long-running and resource intensive on shared hosting, which can cause failures. See #40439).
  3. Keep names of all intermediate image sizes in a set consistent so we can use pattern matching to detect the difference between an image that is part of an autogenerated set and ones that have been manually cropped in the WP media editor (important when calculating sources for srcset).
  4. Use a naming convention that is predictable so the same file names will be created when someone regenerates image sizes as would be created during the original upload process.

It's also worth noting that WordPress already uses hashes in filenames if someone has the IMAGE_EDIT_OVERWRITE constant set to false. See wp_save_image().

#22 in reply to: ↑ 21 @azaozz
17 months ago

Replying to joemcgill:

We could append a -1 only to the intermediate size that has the conflict, but then

Yeah, don't think this is a good idea. All sub-sizes should be named consistently.

If we wanted consistent names, we would first need to loop through all names that are going to be created and check if any of those files already exist on the server before we generate the intermediate image files.

How about my other idea: if the original file name ends with something like -600x800.jpg rename it and append -1, -2, etc. the same way we do for all other naming conflicts in WP.

My main concern with this approach is that there are too many unpredictable reasons people might have for intentionally naming files which match this pattern, and renaming files, as opposed to appending a -1, -2, etc. seems more destructive.

Sorry but I'm not sure I understand what you wanted to say here:

  • "intentionally naming files which match this pattern". Assuming you mean the filemane-600x800.jpg pattern.
  • "and renaming files, as opposed to appending a -1, -2, etc. seems more destructive". What seems more destructive? Appending -1, -2, etc. is renaming?

I recommend a hash (or similar) approach for intermediate image file names because it would fulfill the following requirements:

  1. Modify intermediate image file names rather than original file names to avoid collisions, since those files are generated by WP.

Few things:

  • We already modify the original file name when it collides with previously uploaded file. Try uploading the same image few times and watch your uploads dir.
  • The idea to append -1, -2, etc. to the file name extends that original functionality. Introducing yet another "naming convention" to handle pretty similar circumstances seems like not so good.
  • Adding a hash to the sub-size names makes pretty ugly URLs. I'm not sure what the final impact may be but readability of URLs is a major consideration for all sites, not only for SEO.
  1. Ensure originality without the need to calculate and check for collisions during the process of generating intermediate images (a process that is already long-running and resource intensive on shared hosting, which can cause failures. See #40439).

Right. No matter how we modify these names there will always be a tiny chance of name collision. We may minimize it by adding hashes, but it can never be 0, there will always be something like 0.00001%. In WP terms (tens of millions of installs) this is not such a small number :)

  1. Keep names of all intermediate image sizes in a set consistent so we can use pattern matching to detect the difference between an image that is part of an autogenerated set and ones that have been manually cropped in the WP media editor (important when calculating sources for srcset).

Using pattern matching is actually easier done when the original filename matches most of the sub-size filename, like it is at the moment. Adding a hash makes that harder/slower.

It's also worth noting that WordPress already uses hashes in filenames if someone has the IMAGE_EDIT_OVERWRITE constant set to false. See wp_save_image().

That's not exactly the case. When an image is edited, a new image is created that has a hash in the filename, then all the sub-sizes are created from that new image. If we add another hash for sub-sizes, that will mean edited images sub-sizes will have double hashes, i.e. their filenames will be something like 20170729_142638-02332t348y5767374-e1526551747486-206x300.jpg :)

#23 @azaozz
17 months ago

Ugh, sorry about the long post :)

The tl;dr version:

  • If we don't check for collisions for all file names when uploading a file (original file name plus all sub-sizes file names), we will never eliminate these collisions 100%.
  • Adding hashes to all sub-size file names makes for ugly URLs to them. "Pretty" URLs are quite important for readability and SEO.

This ticket was mentioned in Slack in #core-media by pbiron. View the logs.


9 months ago

This ticket was mentioned in Slack in #core-media by pbiron. View the logs.


5 months ago

Note: See TracTickets for help on using tickets.