Make WordPress Core

Opened 2 years ago

Closed 12 months ago

Last modified 12 months ago

#56598 closed enhancement (duplicate)

Performance optimization in _update_post_term_count()

Reported by: dimitrisv's profile dimitrisv Owned by:
Milestone: Priority: normal
Severity: normal Version: 6.0.2
Component: Taxonomy Keywords: needs-patch reporter-feedback
Focuses: performance Cc:

Description

<?php
/**
         * Filters the post statuses for updating the term count.
         *
         * @since 5.7.0
         *
         * @param string[]    $post_statuses List of post statuses to include in the count. Default is 'publish'.
         * @param WP_Taxonomy $taxonomy      Current taxonomy object.
         */
        $post_statuses = esc_sql( apply_filters( 'update_post_term_count_statuses', $post_statuses, $taxonomy ) );

        foreach ( (array) $terms as $term ) {
                $count = 0;

                // Attachments can be 'inherit' status, we need to base count off the parent's status if so.
                if ( $check_attachments ) {
                        // phpcs:ignore WordPress.DB.PreparedSQLPlaceholders.QuotedDynamicPlaceholderGeneration
                        $count += (int) $wpdb->get_var( $wpdb->prepare( "SELECT COUNT(*) FROM $wpdb->term_relationships, $wpdb->posts p1 WHERE p1.ID = $wpdb->term_relationships.object_id AND ( post_status IN ('" . implode( "', '", $post_statuses ) . "') OR ( post_status = 'inherit' AND post_parent > 0 AND ( SELECT post_status FROM $wpdb->posts WHERE ID = p1.post_parent ) IN ('" . implode( "', '", $post_statuses ) . "') ) ) AND post_type = 'attachment' AND term_taxonomy_id = %d", $term ) );
                }

                if ( $object_types ) {
                        // phpcs:ignore WordPress.DB.PreparedSQLPlaceholders.QuotedDynamicPlaceholderGeneration
                        $count += (int) $wpdb->get_var( $wpdb->prepare( "SELECT COUNT(*) FROM $wpdb->term_relationships, $wpdb->posts WHERE $wpdb->posts.ID = $wpdb->term_relationships.object_id AND post_status IN ('" . implode( "', '", $post_statuses ) . "') AND post_type IN ('" . implode( "', '", $object_types ) . "') AND term_taxonomy_id = %d", $term ) );
                }

As I am running a rather large news-site (20Million rows in wp_postmeta, 600k wp_posts, 500k wp_term_relationships, I noticed that the query:

SELECT COUNT(*) FROM wp_term_relationships, wp_posts WHERE wp_posts.ID = wp_term_relationships.object_id AND post_status IN (?) AND post_type IN (?) AND term_taxonomy_id = ?

Can take a lot of time.

Although the table is indexed Count(*) is not actually working. It needs to be made to Count(<pick your indexed fields of choice>) so that we can increase its speed 100fold (if not more).

No image "<img src="https://db3pap004files.storage.live.com/y4mxBsrn4geeB-TrkH6Siuaw5ta6jzDtEDxGoZyKPu_pF-VQuZz4wjiCeMgQ1GCgrcEhq1-zLkusPu_rGciTGtH9WMD6c8n6xr1qx3vWyz6gY0-WrR22cfEnkCsWD1fitHeYhmImTxHc1TSYaT-OQxNKyUvK_XYw-PEiJqQdcX9HVWyfHZqHVUNxkvaEp22GKJ4?width=1389&height=693&cropmode=none" width="1389" height="693" />" attached to Ticket #56598

(you will see that typical times are from 8 to 13 seconds)

Thank you.

Attachments (1)

taxonomy.png (100.0 KB) - added by dimitrisv 2 years ago.

Download all attachments as: .zip

Change History (15)

@dimitrisv
2 years ago

#1 @SergeyBiryukov
2 years ago

  • Focuses performance added
  • Keywords needs-patch added; changes-requested needs-refresh removed
  • Summary changed from Performance Optimization in file wp-includes\taxonomy.php to Performance optimization in _update_post_term_count()

Hi there, welcome to WordPress Trac! Thanks for the ticket.

Adjusting the keywords per Trac Workflow Keywords.

Last edited 2 years ago by SergeyBiryukov (previous) (diff)

#2 @dimitrisv
2 years ago

Cheers Sergey!

Looking forward to its porting!

Best regards,

Dimitris.

#3 @Hareesh Pillai
14 months ago

  • Milestone changed from Awaiting Review to 6.4

Seems like a nice performance win. We should plan to include this in the next milestone.

#4 @peterwilsoncc
14 months ago

@dimitrisv Are you able to run the current and proposed change with EXPLAIN FORMAT=TREE?

Running it locally on InnoDB/MySQL 8, I'm not seeing any cost benefit but it would be more helpful to see data from a large table such as yours.

Please include the DB version and engine you're running, it can make quite a difference.

For the record, here are the results I am seeing with a sparsely populated table:

EXPLAIN FORMAT=TREE
SELECT COUNT(*) FROM wp_term_relationships, wp_posts
WHERE wp_posts.ID = wp_term_relationships.object_id 
  AND post_status IN ('publish') 
  AND post_type IN ('post', 'page') 
  AND term_taxonomy_id = 95;
-> Aggregate: count(0)  (cost=25.60 rows=1)
    -> Nested loop inner join  (cost=22.79 rows=28)
        -> Covering index lookup on wp_term_relationships using term_taxonomy_id (term_taxonomy_id=95)  (cost=5.29 rows=50)
        -> Filter: ((wp_posts.post_status = 'publish') and (wp_posts.post_type in ('post','page')))  (cost=0.25 rows=1)
            -> Single-row index lookup on wp_posts using PRIMARY (ID=wp_term_relationships.object_id)  (cost=0.25 rows=1)

EXPLAIN FORMAT=TREE
SELECT COUNT(wp_term_relationships.term_taxonomy_id) FROM wp_term_relationships, wp_posts
WHERE wp_posts.ID = wp_term_relationships.object_id 
  AND post_status IN ('publish') 
  AND post_type IN ('post', 'page') 
  AND term_taxonomy_id = 95;
-> Aggregate: count(wp_term_relationships.term_taxonomy_id)  (cost=25.60 rows=1)
    -> Nested loop inner join  (cost=22.79 rows=28)
        -> Covering index lookup on wp_term_relationships using term_taxonomy_id (term_taxonomy_id=95)  (cost=5.29 rows=50)
        -> Filter: ((wp_posts.post_status = 'publish') and (wp_posts.post_type in ('post','page')))  (cost=0.25 rows=1)
            -> Single-row index lookup on wp_posts using PRIMARY (ID=wp_term_relationships.object_id)  (cost=0.25 rows=1)

#5 @peterwilsoncc
14 months ago

In #40351, a switch to partial term counts (using increment and decrement) was attempted but I ended up reverting my patch because it broke plugins using hooks to set additional terms.

The performance of term counting IS very bad so there's a lot of room to improve the performance by avoiding the need for scanning multiple tables, etc.

(Sorry for the double notification, premature send)

#6 follow-up: @oglekler
13 months ago

Hi @dimitrisv, possibly you've missed it. Could you please follow up on the comment:4?

#7 in reply to: ↑ 6 @dimitrisv
13 months ago

Replying to oglekler:

Hi @dimitrisv, possibly you've missed it. Could you please follow up on the comment:4?

Oups!

Indeed I will do... Please bear with me

This ticket was mentioned in Slack in #core-performance by joemcgill. View the logs.


13 months ago

#9 follow-up: @joemcgill
13 months ago

  • Keywords reporter-feedback added
  • Severity changed from critical to normal

This seems like a useful optimization, but we'll need a PR (or patch) for it to make this release. @peterwilsoncc are you interested in owning this one?

#10 in reply to: ↑ 9 @peterwilsoncc
13 months ago

Replying to joemcgill:

This seems like a useful optimization, but we'll need a PR (or patch) for it to make this release. @peterwilsoncc are you interested in owning this one?

I'll need to see some EXPALINs showing the cost goes down. count(*) is DB speak for "count the number of rows the WHERE clause returns", count(field) is DB speak for "count the number of rows the WHERE clause returns in which field is a non-null value" -- so it's possible there will be a performance hit.

While it might be possible to improve the existing queries, I think it's papering over the cracks rather than fixing the underlying issue that term counting queries the entire table rather than incrementing or decrementing the count as posts are published or unpublished or terms are added or removed.

The reverted patch on #40351 got very close to implementing partial counts, it just needs some improvement to account for a few edge cases. WPVIP use partial counts with some success.

#11 @joemcgill
13 months ago

Thanks @peterwilsoncc. In your view, is there a targeted optimization worth pursuing here, or should we close this as a duplicate of #40351 and address this performance problem in one place?

#12 @dimitrisv
13 months ago

Hello,

I am sorry for the delay - my apologies.

When I started this thread I was dealing with a WP installation of a news site, with over 70million rows in their post-meta table. Since then I have delivered the project I have only a backup of a "mere" 20million rows in postmeta, 350k in posts.

It did make a lot of difference especially, given that the default indexes were inadequate. Theoretically speaking it is the right thing to do.

Here is what chat GPT said about the comment by @peterwilsoncc

Based on your description, it seems you're discussing a database-related query optimization topic.

In SQL, `count(*)` and `count(field)` are two different ways to count rows based on certain criteria:

1. `count(*)`: This counts all rows irrespective of whether they have null values or not. It's used when you just want to know the number of rows in a result set.
   
2. `count(field)`: This counts only the rows where the specified `field` has a non-null value.


Factors that might influence the cost between `count(*)` and `count(field)`:

1. **Table Structure**: If a table doesn't have any nullable columns, then `count(*)` and `count(any_field)` would theoretically have similar costs. 
   
2. **Indexes**: If a field/column you're using in `count(field)` is indexed, the database might retrieve the count more efficiently compared to counting all rows with `count(*)`.
   
3. **Database Optimization**: Modern databases might have optimizations in place for certain operations. For instance, some databases might have a quick way to determine the row count for a table without scanning all rows.

4. **Storage Engine**: Some database storage engines might have optimizations that make one type of count faster than the other.

How to analyze the costs:
For example, in MySQL, you can prefix your query with `EXPLAIN` to see an execution plan:

```sql
EXPLAIN SELECT count(*) FROM your_table WHERE some_condition;
EXPLAIN SELECT count(field) FROM your_table WHERE some_condition;
```

... 

It's worth noting that the actual difference in execution cost between `count(*)` and `count(field)` may vary depending on the specific database, the data distribution, and other factors. So it's always a good idea to test in your own environment and use tools like EXPLAIN to understand what's going on under the hood.
 

What might have made the difference so noticeable in this case is the usage of many Advanced Custom Fields (with a lot of postmeta entries - most are empty or null).

Since then I have migrated them to pods (using "real" table storage outside postmeta) so the difference is not pronounced.

I hope this helps.

#13 @peterwilsoncc
12 months ago

  • Milestone 6.4 deleted
  • Resolution set to duplicate
  • Status changed from new to closed

@joemcgill @dimitrisv I think I'll close this as a duplicate of #40351, which also aims to optimize the term counting queries.

I don't think changing how (or what) is counted will improve the queries in any meaningful way as the fundamental problem comes from two things:

  • the WHERE clauses are inefficient on tables of any scale, and should be replaced with incrementing or decrementing queries
  • the inefficient queries run often

On some database engines modifying the count MAY improve the performance fractionally, but on others modifying the count MAY degrade performance fractionally. As term_taxonomy_id can't be null, it's likely that the SQL optimizer is counting either the rows or the primary key as it deems most efficient.

To improve the performance of every site, no matter the table size, I think it's best to figure out the edge cases hit in #40351 and fix the core problem.

#14 @joemcgill
12 months ago

Thanks for the update, @peterwilsoncc

Note: See TracTickets for help on using tickets.