Make WordPress Core

Opened 8 years ago

Last modified 11 months ago

#39358 reopened defect (bug)

Media search speed has been dramatically reduced

Reported by: merts's profile merts Owned by: joemcgill's profile joemcgill
Milestone: Future Release Priority: normal
Severity: normal Version: 4.7
Component: Media Keywords: needs-patch phase-3-media-triage
Focuses: performance Cc:

Description (last modified by dd32)

In the ajax-actions.php file there is function called wp_ajax_query_attachments. This function is responsible for searching images in the media library.

We see that in Wordpress 4.7, someone added a new filter to this function:

	// Filter query clauses to include filenames.
	if ( isset( $query['s'] ) ) {
		add_filter( 'posts_clauses', '_filter_query_attachment_filenames' );
	}

This filter is a performance killer and takes forever to output a result in large databases. We have tested this with db that has over 500000 posts.

Change History (22)

#1 @swissspidy
8 years ago

Introduced in [38625] / #22744

#2 @dd32
8 years ago

  • Description modified (diff)

#3 follow-up: @joemcgill
8 years ago

  • Keywords 2nd-opinion added
  • Owner set to joemcgill
  • Status changed from new to reviewing

Hi @merts,

Welcome to Trac. As @swissspidy noted, this change was an intentional one in order to allow people to search for media in the media library by file name. I noted performance concerns in comments on #22744, which is partially why this change was applied narrowly to only times when someone is searching attachments. Adding this change as a filter also gives sites the ability to disable this change by removing the filter after it has been added.

Here's one approach to disabling the filter if you so choose:

add_action( 'pre_get_posts', 'remove_query_attachment_filename_filter' );

function remove_query_attachment_filename_filter() {
	remove_filter( 'posts_clauses', '_filter_query_attachment_filenames' );
}

We also might need to experiment with adding an index for post meta that targets the _wp_attached_file key as a way of reducing the query performance concerns. I'd appreciate insight from @dd32 or @pento on possibilities there.

Last edited 8 years ago by joemcgill (previous) (diff)

#4 @joemcgill
8 years ago

Related: Addressing #38911 could help some of the perceived slowness by reducing the amount of queries that hit the DB whenever someone is doing an AJAX search from the media library.

#5 follow-up: @merts
8 years ago

@joemcgill Reducing the amount of queries can be effective but the underlying problem is the resulting sql produced by this function. If you dump the sql into to error_log, you will see that it joins two tables (which can be quite large in our case) with bunch of "or" statements. These new "or" statements come from the filter I wrote above.

Please consider the fact that the previous Wordpress was finishing ajax search pretty fast in our db even though we could keep typing in the search bar.

#6 in reply to: ↑ 5 @joemcgill
8 years ago

Replying to merts:

That's correct. We are in agreement over the cause of the increased query time for media search queries and I shared a workaround you can use to revert your site to pre-4.7 behavior. My mentioning #38911 is intended mainly as an FYI for anyone looking at this issue.

#7 @dd32
8 years ago

  • Focuses performance added
  • Keywords 2nd-opinion removed

The issue we have here, is quite simple - You cannot search postmeta on large sites, there's no way around that, it's just not possible to do it in a performant way.

The user experience gained by this change is good however, it works as an end-user may intend the search to.

IMHO there's a few options of varying usefulness
a) Drop it entirely
b) Drop it for "large sites" (ie. more than x number of post) - The UX of this is bad
c) Limit it to exact matches only for the postmeta, which with the help of a new DB index would make it much faster.
d) Store the attachment data in another way.

For reference, this is what the SQL ends up looking like:

SELECT SQL_CALC_FOUND_ROWS wp_posts.id
FROM   wp_posts
       LEFT JOIN wp_postmeta AS sq1
              ON ( wp_posts.id = sq1.post_id
                   AND sq1.meta_key = '_wp_attached_file' )
WHERE  1 = 1
       AND (( ( wp_posts.post_title LIKE '%08W0zJv.jpg%' )
               OR ( wp_posts.post_excerpt LIKE '%08W0zJv.jpg%' )
               OR ( wp_posts.post_content LIKE '%08W0zJv.jpg%' )
               OR ( sq1.meta_value LIKE '%08W0zJv.jpg%' ) ))
       AND wp_posts.post_type = 'attachment'
       AND (( wp_posts.post_status = 'inherit'
               OR wp_posts.post_status = 'private' ))
GROUP  BY wp_posts.id
ORDER  BY wp_posts.post_title LIKE '%08W0zJv.jpg%' DESC,
          wp_posts.post_date DESC
LIMIT  0, 10  
Last edited 8 years ago by dd32 (previous) (diff)

#8 @dd32
8 years ago

I should add, that I never really liked this as a standalone filter, it's likely to break at some point.

A fifth option would also be to only add the postmeta searching when the search *looked* like a filename, so ABC.def but not ABC DEF GHI (which isn't entirely accurate, but would help)

#9 @dd32
8 years ago

c) Limit it to exact matches only for the postmeta, which with the help of a new DB index would make it much faster.

To follow that up, it'd actually have to be done with the help of a new meta key.. The reason being, is that the meta values are currently stored like so: 2015/01/08W0zJv.jpg OR /home/someuser/public_html/wp-content/uploads/08W0zJv.jpg. Exact-matching and/or indexing isn't going to help here without a rethink of where the filename is available.

We could use the fact that the post_name field is going contain the basename of the filename, ie. 08w0zjv in the case of 08W0zJv.jpg, combined with the fact that the Guid field currently contains the filename. That would at least cause it to hit indexes and not require a table join, for example (note the 4th search OR in both the previous and this example)

SELECT SQL_CALC_FOUND_ROWS wp_posts.id
FROM   wp_posts

WHERE  1 = 1
       AND (( ( wp_posts.post_title LIKE '%08W0zJv.jpg%' )
               OR ( wp_posts.post_excerpt LIKE '%08W0zJv.jpg%' )
               OR ( wp_posts.post_content LIKE '%08W0zJv.jpg%' )
               OR ( wp_posts.post_name = '08W0zjv' AND wp_posts.guid LIKE '%08W0zJv.jpg%' )
        ))
       AND wp_posts.post_type = 'attachment'
       AND (( wp_posts.post_status = 'inherit'
               OR wp_posts.post_status = 'private' ))
GROUP  BY wp_posts.id
ORDER  BY wp_posts.post_title LIKE '%08W0zJv.jpg%' DESC,
          wp_posts.post_date DESC
LIMIT  0, 10 
Last edited 8 years ago by dd32 (previous) (diff)

#10 follow-up: @merts
8 years ago

@dd32 Maybe I can suggest another option:

The size of the table "Post_meta" is not just directly related to Wordpress's standard usage. Many plugins such as Advanced Custom Fields (highly used by the community) can increase the size of this table dramatically. An ordinary user may not be aware of this and he/she is probably using a lot of plugins like ACF anyway. As you know, even if you remove a plugin you may still find records in the table.

A non optimized query like the one above can result degradation in performance but it is okay because we all make mistakes. However, I believe isolating this table in the first place could at least improve the performance. Personally, I don't like any plugins to touch post_meta table.

#11 in reply to: ↑ 10 @dd32
8 years ago

Replying to merts:

@dd32 Maybe I can suggest another option:

A table per field will always be faster, but where do you draw the line? Bluntly, we're not going to take that route.
Metadata stored in postmeta is fine and having a thousand post metas per post isn't an issue - it just needs to be recognised that you can't query globally by postmeta, and it should only be used to narrow an already reduced data set.

Version 0, edited 8 years ago by dd32 (next)

#12 @joemcgill
8 years ago

@dd32 Thanks for the detailed and thoughtful feedback here. You have rightly (in my view) identified the main challenge presented by #22744, which is how to balance the desire to improve the user experience for searching the media library with the fact that doing so requires non-optimal queries of postmeta.

Before discussing the merits/downsides of each option you've proposed, I think it's helpful to keep in mind a specific use case that the change tries to address. Assume someone uploads a file with a name like "large-monarch.jpg" but then renames the title of the attachment to something like "A Butterfly". Later, someone searches for a photo of a monarch in the media library but doesn't find the file because WordPress only searched the title and content fields. This is a simplified example, but imagine a scenario where an organization uses file name conventions as an organizational scheme, where the file name maps to the date, photographer, job ID, and photo number (i.e. 20161222_jdm_5555_32.jpg). This is a case where you may need to search by parts of file names that have no relationship to the post title assigned within the library.

Now, onto the proposals:

a) If it's agreed that the cost of this change is not worth the performance hit, I have no reservations in reverting. However I would strongly prefer we explore other options first.

b) I think that the current implementation (through a filter) allows large sites to turn this off if they so choose. That said, I'm more than happy to discuss ways the developer UX could be improved (e.g., moving away from the filter approach to some other configuration). In fact, we should pursue improvement here regardless.

c) I don't think exact matches are helpful here for reasons illustrated by the use case above. Often, it's not that someone knows the exact name of a file, but needs to search by partial information included in a file name. In my mind, this is the least applicable option.

d) Storing the information in another way seems like the best option, as your example using GUID demonstrates. I was hesitant in using GUID because we've not relied on the GUID for file names since #7622 and I assume some plugins modify the filename as postmeta while ignoring the GUID altogether. Perhaps it is worth sacrificing those edge cases for the performance benefits? Here it's worth noting that there is a request to also make image alt attributes searchable from the media library, and alt information is similarly stored as postmeta. This cannot as easily be searched from our current post schema.

Last edited 8 years ago by joemcgill (previous) (diff)

#13 @dd32
8 years ago

I think it's helpful to keep in mind a specific use case that the change tries to address. Assume someone uploads a file with a name like "large-monarch.jpg" but then renames the title of the attachment to something like "A Butterfly"

I definitely understand the use-case and want to keep it, the options I presented were in a least-desired to most-desired order. Ditching it, or disabling it under certain conditions, is only really a last-ditch effort if there's no way for it to be performant - sorry if that wasn't clear from reading :)

b) I think that the current implementation (through a filter) allows large sites to turn this off if they so choose. That said, I'm more than happy to discuss ways the developer UX could be improved (e.g., moving away from the filter approach to some other configuration). In fact, we should pursue improvement here regardless.

I don't think the ability to disable something through a filter solves this at all, if it was purely a developer thing, or something people ran into with plugins, it might be okay. But we're talking about the speed and functionality of the end-users interface - a slow interface is a bad experience.

c) I don't think exact matches are helpful here for reasons illustrated by the use case above.

I hadn't considered that you were after wildcard matching within filenames, that makes it even harder as wildcard matches don't tend to utilise indexes. That suggests that we do indeed need to filter down the data using something stored within the posts table before it gets matched against postmeta.

The suggestion of using the guid was never really going to fly, but was more to encourage thinking of how to go about it.
An option that might be more realistic would be to still use the join, but to only match against a subsection of postmetas based on filtering on the posts table, for example the following would only hit a small percentage of postmeta:
( wp_posts.post_name LIKE '%zJv%' AND sq1.meta_value LIKE '%08W0zJv.jpg%' )

#14 in reply to: ↑ 3 @timbunch
8 years ago

  • Resolution set to worksforme
  • Status changed from reviewing to closed

Thank you! This has been an issue since we upgraded to 4.7. It thought my DB was tanking on me at first with CPU spiking to 100%. With over 250k images, I know we have more than the average user...

Replying to joemcgill:

Hi @merts,

Here's one approach to disabling the filter if you so choose:

add_action( 'pre_get_posts', 'remove_query_attachment_filename_filter' );

function remove_query_attachment_filename_filter() {
	remove_filter( 'posts_clauses', '_filter_query_attachment_filenames' );
}

We also might need to experiment with adding an index for post meta that targets the _wp_attached_file key as a way of reducing the query performance concerns. I'd appreciate insight from @dd32 or @pento on possibilities there.

#15 @joemcgill
8 years ago

  • Keywords needs-patch added
  • Resolution worksforme deleted
  • Status changed from closed to reopened

Glad the filter approach helps @timbunch. Let's keep this one open and evaluate other approaches of speeding up the queries if possible.

This ticket was mentioned in Slack in #core-media by joemcgill. View the logs.


8 years ago

#17 @joemcgill
8 years ago

  • Milestone changed from Awaiting Review to 4.8

Moving this to 4.8

This ticket was mentioned in Slack in #core by flixos90. View the logs.


7 years ago

#19 @flixos90
7 years ago

  • Milestone changed from 4.8 to Future Release

This needs more investigation for a proper solution.

#20 @Krstarica
3 years ago

On WordPress 5.7 media search is still unbearably slow on large websites.

# Query_time: 12.417980 Lock_time: 0.000105 Rows_sent: 20 Rows_examined: 942798

SELECT 
  SQL_CALC_FOUND_ROWS wp_posts.ID 
FROM 
  wp_posts 
  LEFT JOIN wp_postmeta AS sq1 ON (
    wp_posts.ID = sq1.post_id 
    AND sq1.meta_key = '_wp_attached_file'
  ) 
WHERE 
  1 = 1 
  AND (
    (
      (
        wp_posts.post_title LIKE '%test%'
      ) 
      OR (
        wp_posts.post_excerpt LIKE '%test%'
      ) 
      OR (
        wp_posts.post_content LIKE '%test%'
      ) 
      OR (sq1.meta_value LIKE '%test%')
    )
  ) 
  AND wp_posts.post_type = 'attachment' 
  AND (
    (
      wp_posts.post_status = 'inherit' 
      OR wp_posts.post_status = 'private'
    )
  ) 
GROUP BY 
  wp_posts.ID 
ORDER BY 
  wp_posts.post_title LIKE '%test%' DESC, 
  wp_posts.post_date DESC 
LIMIT 
  0, 20;

EXPLAIN says:

+------+-------------+----------+------+------------------+------------------+---------+-------------------------+--------+---------------------------------------------------------------------+
| id   | select_type | table    | type | possible_keys    | key              | key_len | ref                     | rows   | Extra                                                               |
+------+-------------+----------+------+------------------+------------------+---------+-------------------------+--------+---------------------------------------------------------------------+
|    1 | SIMPLE      | wp_posts | ref  | type_status_date | type_status_date | 82      | const                   | 366787 | Using index condition; Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | sq1      | ref  | post_id,meta_key | post_id          | 8       | wordpress.wp_posts.ID   |    2   | Using where                                                         |
+------+-------------+----------+------+------------------+------------------+---------+-------------------------+--------+---------------------------------------------------------------------+
Last edited 3 years ago by Krstarica (previous) (diff)

#21 @3r3ct0r
2 years ago

Hi there.

Same problem on WordPress 5.9.3.
Very slow request, and exploding CPU load (+80,000 photos for almost 100,000 blog posts).

#22 @joedolson
11 months ago

  • Keywords phase-3-media-triage added
Note: See TracTickets for help on using tickets.