WordPress.org

Make WordPress Core

Opened 18 months ago

Last modified 6 weeks ago

#49412 assigned feature request

Store filesize media metadata

Reported by: Cybr Owned by: johnwatkins0
Milestone: 5.9 Priority: normal
Severity: normal Version:
Component: Media Keywords: has-patch has-unit-tests needs-refresh
Focuses: performance Cc:

Description

Previous: #33214, comment:6:ticket:47459

When storing media files, I think it'd be helpful that the result of filesize( $file ) is stored with the media's metadata. Whereafter we can retrieve it via wp_get_attachment_metadata().

This way, we no longer have to touch the file and can work from the database-provided metadata instead. There's performance overhead involved when media is stored on slow/shared storage or even on another server.

Attachments (2)

49412.diff (12.2 KB) - added by johnwatkins0 17 months ago.
49412.2.diff (12.2 KB) - added by johnwatkins0 17 months ago.
Removed a line of debug code

Download all attachments as: .zip

Change History (34)

#1 @SergeyBiryukov
18 months ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to Future Release

#2 @SergeyBiryukov
18 months ago

For some context, WordPress does read the size from the filesize meta value in a few places, but it's only set for audio and video files from the data extracted by getID3() library. Related: comment:4:ticket:33214.

Some history:

It would indeed be beneficial to store the file size in metadata for any attachment.

#3 @johnwatkins0
17 months ago

I've been working on this. I agree it's a useful idea. I've seen and created plenty of workarounds for the fact that filesize data isn't already saved with images. It's especially tricky where files are stored remotely.

The only question I have is whether we think it would be useful to have the file size for all intermediate image sizes as well as the full-sized image, or just for the full-sized image?

I think it makes sense to store all of them, since the main reasons I can think of for having this data involve comparing filesizes -- e.g., when looping through images to find the largest one under a limit.

The drawbacks to storing all are (1) more filesize calls at the time an image is uploaded, and (2) inflated metadata.

I'm going ahead with the assumption that we want all sizes.

Last edited 17 months ago by johnwatkins0 (previous) (diff)

#4 @johnwatkins0
17 months ago

Attaching a diff that adds "filesize" to the metadata for all attachments. For images, the field is at the top level as well as in all of the sizes in the sizes array -- e.g.:

Array
(
    [width] => 426
    [height] => 640
    [file] => 2020/02/2004-07-22-DSC_0007.jpg
    [filesize] => 87348
    [sizes] => Array
        (
            [medium] => Array
                (
                    [file] => 2004-07-22-DSC_0007-200x300.jpg
                    [filesize] => 26505
                    [width] => 200
                    [height] => 300
                    [mime-type] => image/jpeg
                )

            [thumbnail] => Array
                (
                    [file] => 2004-07-22-DSC_0007-150x150.jpg
                    [filesize] => 19138
                    [width] => 150
                    [height] => 150
                    [mime-type] => image/jpeg
                )

        )

    [image_meta] => Array
        (
            ... etc.
        )

)

_wp_attachment_metadata will be created as needed -- e.g.:

Array
(
    [_wp_attached_file] => Array
        (
            [0] => 2020/02/2019-my-file.pptx
        )

    [_wp_attachment_metadata] => Array
        (
            [0] => a:1:{s:8:"filesize";i:8419858;}
        )

    [_edit_lock] => Array
        (
            [0] => 1582035593:1
        )

)

Video and audio files will still have their filesize read by getID3() if possible, but there is a fallback if for whatever reason getID3() doesn't provide the filesize.

Question

The key in the meta array should be carefully considered. I'm using filesize because it matches the PHP function and the convention I've seen in plugins and themes that store file size in meta. There's no reason it has to be this, though. If there's a more semantic key -- or perhaps one that is sure to be unique and not conflict with plugins that may hook into attachment metadata -- then it could easily be changed. Any thoughts?

Last edited 17 months ago by johnwatkins0 (previous) (diff)

@johnwatkins0
17 months ago

@johnwatkins0
17 months ago

Removed a line of debug code

#5 @johnwatkins0
17 months ago

  • Keywords has-patch added; needs-patch removed

#6 @SergeyBiryukov
9 months ago

#51650 was marked as a duplicate.

#7 @spacedmonkey
9 months ago

I just reviewed @johnwatkins0 patch. It look great to me.

The place that I feel was missed was the _wp_image_meta_replace_original function.

Otherwise, this patch is good to me.

As I noted in #51650

Many popular plugins like Yoast, ACF and Jetpack, use the function filesize to get the file size of a attachment.

So I would love to see this land in core.

#8 @swissspidy
9 months ago

since filesize can return false, any objections to casting the return value to an int (so false becomes 0)?

#9 @spacedmonkey
9 months ago

since filesize can return false, any objections to casting the return value to an int (so false becomes 0)?

+1 to this.

Also, this change takes effect on new / updated attachments, should we try to backfill this data on existing attachments?

#10 @johnwatkins0
9 months ago

Good suggestions. I'll make an update this week.

This ticket was mentioned in Slack in #core-media by spacedmonkey. View the logs.


9 months ago

This ticket was mentioned in PR #671 on WordPress/wordpress-develop by johnwatkins0.


9 months ago

  • Keywords has-unit-tests added

#13 @prbot
9 months ago

johnwatkins0 commented on PR #671:

The remaining test failures seem to be unrelated to my change. Will check back tomorrow.

#14 @johnwatkins0
9 months ago

@spacedmonkey @swissspidy I have updated via the PR linked above. Unexpectedly, I discovered that filesize was returning slightly different results in the different GHA test environments. I struggled with this a bit and ended up creating a test utility function to round file sizes in the tests to the nearest thousand bytes. I think this is okay in principle because we don't care about the exact sizes of the test files as long as the actual and expected results generally match, but I still wonder if there might be a simpler approach. LMK if you have any suggestions.

There remain some test failures on my PR, but I see the same failures on the latest commit to master. I'll check back on this tomorrow.

#15 @Cybr
9 months ago

Unexpectedly, I discovered that filesize was returning slightly different results in the different GHA test environments.

Perhaps clearstatcache will do the trick before obtaining the filesize; although, I'm not sure if this is needed outside a testing environment.

#16 @johnwatkins0
9 months ago

On revisiting, I was able to work around that whole issue by simply recalculating all the file sizes on the fly within the tests: https://github.com/WordPress/wordpress-develop/pull/671/commits/1773075b444655be65022d6371c9a2beafa180c0
No need for rounding.

Last edited 9 months ago by johnwatkins0 (previous) (diff)

#17 @prbot
9 months ago

spacedmonkey commented on PR #671:

💡 Idea

How about we add a new function called wp_filesize

function wp_filesize( $path ) {
     $size = (int) filesize( $path );
     return apply_filters( 'wp_filesize', $size, $path );
}

This has a number of benefits.

  • Saves doing casting to int everywhere.
  • Has a filter.
  • Improves support for plugins like S3 plugins that outsource files to third party file store.
  • We can add a file_exists check in this function to stop errors.

Props for the original idea @garretthyder

### Existing uses of key in core.

It is also worth noting that a couple of places in core already expect this key to exists.
See
https://github.com/WordPress/wordpress-develop/blob/7303470ecd4f736d13e354998ec885689fe5ec2e/src/wp-admin/includes/media.php#L3100
https://github.com/WordPress/wordpress-develop/blob/33caf61b8ba988da73c2e210309566f9fdb039be/src/wp-includes/media.php#L3271

These are two places we are going to have to test. But I wonder if we should trigger a meta data regeneration if the meta key doesn't exist in these places, so the next time, this screen is called, it has a "cached" version in post meta.

#18 @spacedmonkey
9 months ago

@johnwatkins0 PR is looking great. Added my thoughts here.

#19 @spacedmonkey
9 months ago

  • Owner set to johnwatkins0
  • Status changed from new to assigned

#20 @prbot
9 months ago

johnwatkins0 commented on PR #671:

How about we add a new function called wp_filesize

Great idea, @spacedmonkey. I'll add this later today.

a couple of places in core already expect this key to exists.

Interesting -- this makes me curious why those are there. I'll take a closer look.

#21 @prbot
9 months ago

johnwatkins0 commented on PR #671:

Hey, @spacedmonkey, I've pushed an update implementing your suggestion, let me know what you think.

This ticket was mentioned in Slack in #core-media by spacedmonkey. View the logs.


3 months ago

#23 @desrosj
2 months ago

Would it be beneficial to add caching for wp_filesize()? I don't think the metadata will be backfilled to attachments unless wp_generate_attachment_metadata() is called for that file, or they are run through thumbnail regeneration. having wp_cache_get()/wp_cache_add() could prevent the need for filesize() calls there.

#24 follow-up: @joemcgill
2 months ago

Good seeing this effort move forward. For context, this idea has come up before as part of #36477 and also as a part of @azaozz's general optimization efforts (see Slack conversation).

One of the questions that has come up in the past (which I can't find record of, for some reason) is how to sites that rely on plugins or services that optimize images
in the background after they're uploaded so that WordPress isn't relying on inaccurate filesize data saved in metadata.

Reviewing progress here, I like the idea of a wp_fileszie() wrapper and agree with @desrosj that adding caching to this function could help reduce the load on direct reads of the filesystem. For the benefit of sites that don't have a persistent object cache in place, I wonder if using transients for this would be useful and if doing so would reduce the need for explicit attachment metadata since wp_filesize() could check for a cached value before executing a filesize read?

Related: https://core.trac.wordpress.org/ticket/36477

#25 in reply to: ↑ 24 @azaozz
2 months ago

  • Keywords needs-refresh added

Replying to joemcgill:

One of the questions that has come up in the past (which I can't find record of, for some reason) is how to sites that rely on plugins or services that optimize images in the background after they're uploaded so that WordPress isn't relying on inaccurate filesize data saved in metadata.

Yeah that's the main drawback. Seems the best option is to rely on these plugins to do-the-right-thing and update the file size after post-processing.

I wonder if using transients for this (caching of filesize) would be useful and if doing so would reduce the need for explicit attachment metadata since wp_filesize() could check for a cached value before executing a filesize read?

That will solve the problem with plugins not updating image meta after post-processing images, and with missing filesize. However transients should not be saved/updated from the front-end. There is the rule to never write to the DB from a front-end request for unidentified users. That can bring down busy sites quite easily.

There are other cases where images are stored remotely. Then updating the meta after post-processing or trying to get the file size at runtime may not work.

As far as I see the "assumptions" for the filesize meta would be that:

  1. It's optional, does not exist for older files.
  2. May be incorrect.

Would it be useful to have it? Perhaps yes, depends... :) One use might be to drop image sub-sizes when generating srcset. Another is to speed up loading of the media modal/library.

Any other usage that would still be acceptable after considering the above two points?

#26 @joemcgill
2 months ago

However transients should not be saved/updated from the front-end

Agreed. My assumption was that we would only be priming this cache during image manipulation or via a background task, but good to make this expectation explicit.

#27 @spacedmonkey
2 months ago

It is worth noting my personal interest in this ticket.

I noticed that in my logs, I was getting lots of calls S3 api. It turns out when looking at assets in the media library, the filesize lookup, resulted in an API request for every file. As filesize is displayed in media library in multiple places, these request rack up. It is also worth noting, if assets on being offloaded in other ways like jetpack, it is likely that this will also be an issue. Even assets stored "Local" on a network share, but have a slow response to get filesize call.

Add a filter to get filesize would be extremely useful for anyone loading assets. I think that wp_fileszie() maybe useful in other parts of core, even if it was just used to help mocking in unit tests.

As for if the data is incorrect, that is up to developer to fix, but if they not fix filesize after processing, it likely that other metadata is wrong as well.

#29 @azaozz
2 months ago

After more discussion in Slack (https://wordpress.slack.com/archives/C02SX62S6/p1620500571129800), the requirements/expectations for filesize seem to be:

  • It will only be added for the "full" image size. This may not be the uploaded/original image file if the image was resized on uploading.
  • It may be missing or unreliable (if the image was optimized by a plugin after uploading), and will be used primarily as cache when displaying image info in wp_admin.
  • May be good to have some sort of a "backfill" functionality when filesize is missing. However that should happen "very sensibly". DB writes should be optimized (single write per page load) and happen only from wp-admin.
Last edited 2 months ago by azaozz (previous) (diff)

This ticket was mentioned in Slack in #core-media by spacedmonkey. View the logs.


6 weeks ago

#32 @antpb
6 weeks ago

  • Milestone changed from Future Release to 5.9

5.9 seems a great target to work towards this change. Adding to the milestone so we keep track of it in regular meetings.

Note: See TracTickets for help on using tickets.