Make WordPress Core

Opened 14 years ago

Closed 13 years ago

#14222 closed defect (bug) (fixed)

Improve dashboard recent comments widget performance by not fetching spam comments

Reported by: viper007bond's profile Viper007Bond Owned by: nacin's profile nacin
Milestone: 3.4 Priority: normal
Severity: normal Version: 3.0
Component: Performance Keywords: has-patch commit already
Focuses: Cc:

Description

Currently we fetch spam comments and exclude them using PHP. It's much better to just not fetch them in the the first place.

If a blog has a TON of spam comments for example, the while() will take quite a long time to find enough legit comments.

Attachments (6)

14222.patch (1.4 KB) - added by Viper007Bond 14 years ago.
Move the allowed states check from PHP into MySQL
14222.2.patch (1.5 KB) - added by Viper007Bond 14 years ago.
Remove comment, move comment_approved condition earlier (as it's most likely to exclude an item), and make sure the while() returns enough comments when done
14222.3.patch (1023 bytes) - added by SergeyBiryukov 13 years ago.
14222.4.patch (1.6 KB) - added by SergeyBiryukov 13 years ago.
14222.diff (2.6 KB) - added by nacin 13 years ago.
14222.2.diff (2.6 KB) - added by nacin 13 years ago.
Account for LIMIT logic.

Download all attachments as: .zip

Change History (34)

@Viper007Bond
14 years ago

Move the allowed states check from PHP into MySQL

#1 @Viper007Bond
14 years ago

I just noticed the inline comment. It looks like this was done on purpose for query performance reasons.

However on WordPress.com we ran into an issue where a blog's most recent 25000 comments or so were all spam. That means the while() had to run a few hundred queries before it managed to find enough non-spam comments. I would not be surprised if self-hosted blogs ran into this exact same issue.

I think it's better to have a little bit slower query rather than having to do potentially many, many queries.

@Viper007Bond
14 years ago

Remove comment, move comment_approved condition earlier (as it's most likely to exclude an item), and make sure the while() returns enough comments when done

#2 @mrmist
14 years ago

+1 to this. Find it hard to see how an IN (0,1) would be faster than an = 1 in any event, plus it's best not to transfer the duff data from sql server to web server.

#3 @azaozz
14 years ago

As far as I remember the performance hit of selecting only non-spam comments was quite significant. And this will run on each loading of the dashboard, millions of times more often than finding a site where thousands of the recent comments are all spam.

Perhaps we could limit the loop to running only 5-10 times and if there aren't enough non-spam comments we can output some kind of notice that all the recent comments are in the spam queue.

#4 follow-up: @nacin
14 years ago

Maybe we can fall back to the other query if we still don't have valid comments after X number of iterations. Again, that depends on how poor in performance this other query is.

#5 in reply to: ↑ 4 ; follow-up: @azaozz
14 years ago

Replying to nacin:
This sounds like a good idea. Perhaps we can loop 5 times and if there aren't 5 non-spam comments, can run a query to select only non-spam comments. Primarily this was affecting larger sites with many comments so the testing should be done on a comments table with at least few hundred thousands rows.

#6 in reply to: ↑ 5 @nacin
14 years ago

Replying to azaozz:

Replying to nacin:
This sounds like a good idea. Perhaps we can loop 5 times and if there aren't 5 non-spam comments, can run a query to select only non-spam comments. Primarily this was affecting larger sites with many comments so the testing should be done on a comments table with at least few hundred thousands rows.

Perhaps we can also progressively increase the limit during each iteration, to keep sites that 50 is enough running the same, but lower the chances we get exposed to the query we're dreading. i.e. first 50, then next 100, then next 150, then next 200, then next 250. We then go from checking 250 comments to three times that.

Also, we should increase the # we pull based on the setting by a factor of ten. In 3.0 we enabled the widget to be customizable up to 30. If 5, then 50, if 10, then 100, if 18, then 180. Given that Akismet says 83% of all comments are spam, this seems reasonable.

Just trying to get creative here.

#7 @westi
14 years ago

  • Cc westi added

We should be able to get the database to do this faster than php.

I guess we need to dig in more as to why the query is so slow and see how we can speed it up.

We could also cache the information so it is not request from the db constantly when people visit the dashboard even if it hasn't changed.

#8 @t31os_
14 years ago

Changeset that introduced the current query, simply for reference.

http://core.trac.wordpress.org/changeset/10079

#9 @nacin
14 years ago

  • Milestone changed from Awaiting Review to Future Release

#10 follow-up: @nacin
14 years ago

  • Milestone changed from Future Release to 3.2

New take on this, after seeing the database dump on #16846:

It was decided originally that excluding in PHP is better for performance. That makes sense in theory.

However, when trash was implemented, we added post_status != 'trash' to the query, probably without regard to the original decision. I don't think the original decision was != versus NOT IN(), it was just having the check to begin with.

Now with that said -- I haven't seen any complaints with regards to performance for the != trash.

So we should either copy or move that to != 'spam', or NOT IN for both trash and spam, which shouldn't be much slower. There's going to be more spam comments than trash comments on pretty much every site, as spamming is going to be automated (and in far greater volumes), so I'm strongly inclined to deal with this one way or another now.

I've also opined here that increasing the limits (both initial, and the jump) on successive queries makes a lot of sense. We should do that improvement as well.

Moving to 3.2. This is one of those lighter/faster tasks.

Last edited 14 years ago by nacin (previous) (diff)

#11 in reply to: ↑ 10 @technosailor
14 years ago

Replying to nacin:

So we should either copy or move that to != 'spam', or NOT IN for both trash and spam, which shouldn't be much slower. There's going to be more spam comments than trash comments on pretty much every site, as spamming is going to be automated (and in far greater volumes), so I'm strongly inclined to deal with this one way or another now.

I just ran benchmarks on a blog running locally (pristine trunk) having 46508 rows in the comments table. I've disabled the query cache just to be safe.

With the current query, it took 18.8 ms:

SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;

Changing the != to NOT IN() resulted in 19.2ms:

SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status NOT IN ('trash') ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;

Adding 'spam' in the NOT IN() results in 18.8ms:

SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status NOT IN ('trash','spam') ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;

I don't really think we're gaining any major advantage from this but happy to be wrong.

Last edited 14 years ago by technosailor (previous) (diff)

#12 @nacin
14 years ago

  • Keywords commit added; needs-testing removed

No complains for != trash since 2.9. Time to run with this.

I'd like to test it on an even larger table and also with some explains just to confirm the numbers.

#13 @markkelnar
14 years ago

I've tested the 2.patch on a site with 25k+ spam comments. It pans down the number of sql queries to get the 5 most recent allowed comments from 500 to 1.

I agree that the most efficient way to filter out spam comments is to have the comment_approved state in the database query instead of post processing that in PHP.

#14 @westi
14 years ago

  • Keywords needs-refresh dev-feedback added; commit removed

@nacin: It is not clear whether you think we should commit the .2. patch or you are referring to the queries in the comment from technosailor.

Not sure why the third query checks post_status = 'spam' either?

I think checking for comment_status != 'spam' is the best fix but it needs some explain'ing to check.

#15 @nacin
14 years ago

  • Milestone changed from 3.2 to Future Release

I was mis-reading the queries earlier. Punt.

#16 @chmac
13 years ago

I just hit this problem on a self hosted site with >30k spam comments. We get a lot of spam volume, and hitting the dashboard admin was putting PHP into a near-perpetual loop looking for enough comments. Very, very, very bad for the server. Somehow PHP was getting around the 30 second max execution time and so after hitting the page a few times, the whole server ground to a halt. :-(

Having just read the history (briefly), unless there's a major performance hit to put != spam into the query, I think that makes the most sense.

#17 @SergeyBiryukov
13 years ago

  • Keywords needs-refresh removed

One problem here is that current query (introduced in [11749]) fetches not only comments, but also full rows of parent posts, which seems unnecessary for this widget. With a lot of spam comments to skip, this makes the queries about 3 times slower. 14222.3.patch fixes just that.

14222.4.patch is a refresh of 14222.2.patch with a couple of changes:

  1. comment_approved values need to be compared as strings, not integers.
  2. Includes 14222.3.patch.

Test 1: 10,303 comments total (3 approved + 10,000 spam + 300 trash)

WP 3.2.1
Total Queries: 234, Total query time: 347,613.0 ms
SELECT * FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50
0.7350 sec

14222.3.patch
Total Queries: 234, Total query time: 107,065.7 ms
SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50
0.1640 sec

14222.4.patch
Total Queries: 27, Total query time: 38.1 ms
SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE c.comment_approved IN ('0','1') AND p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 5
0.0006 sec

Test 2: 10,303 comments total (10,003 approved + 300 trash)

WP 3.2.1
Total Queries: 38, Total query time: 1,994.9 ms
SELECT * FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50
0.8101 sec

14222.3.patch
Total Queries: 38, Total query time: 437.3 ms
SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50
0.1614 sec

14222.4.patch
Total Queries: 38, Total query time: 438.2 ms
SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE c.comment_approved IN ('0','1') AND p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 5
0.1740 sec

#18 @SergeyBiryukov
13 years ago

  • Milestone changed from Future Release to 3.4

#19 @SergeyBiryukov
13 years ago

Closed #19664 as a duplicate.

#20 @joehoyle
13 years ago

I recently ran into this issue, there was about 5000 spam comments to loop through 50 at a time, which was taking over 20 seconds, which made it pretty unusable.

+1 for SergeyBiryukov's patch, if I understand correctly, it's faster than the current implementation, and fixes the issue of handling it in PHP. No brainer?

#21 @ramiy
13 years ago

Related: #17275

#22 @mbijon
13 years ago

  • Cc mike@… added

#23 @nacin
13 years ago

Okay. It dawned on me. We never needed the join to begin with. When a post gets trashed, it gets assigned a post-trashed status to specifically avoid the need to do queries like these.

I suggest we drop the join, then add in the comment_status changes. Going to drop the join now and then do some further testing.

@nacin
13 years ago

#24 @nacin
13 years ago

  • Component changed from Administration to Performance
  • Keywords commit already added; dev-feedback removed

14222.diff introduces quite a bit of speedup.

  • We're no longer checking the post status.
  • We're now only fetching comment fields, no post fields.
  • Even better, we're not even joining the posts table.
  • We're using get_comments(), which builds the most optimized SQL possible for these cases anyway — nice, no raw SQL at no cost. get_comments() is hypothetically cached, as well.
  • We're not selecting 50 at a time, we're selecting 10 * $items at a time (the first time is 5 * $items). By default, this is 50 to show 5 comments, but 25 on the first query. With the box set to 30, this is 150 comments the first time, 300 additional times. The only way a comment does not get included is if current_user_can('read_post') fails -- as in, if the post is private and they don't have access. It's edge. We'll find enough comments very quickly.
  • We're doing a break 2, to back out of the foreach and while, once we find the number of comments we're looking for.
  • Cleans up some PHP-HTML context switching.

What is amazing is that we added a join plus a check of post_status = 'trash' — despite it not being needed — and then waited this long to A) notice, and B) realize that if we're going to do that, we might as well check comment_approved as well. Checking comment_approved and avoiding the JOIN is, obviously, quite the performance improvement.

On a giant table (I was testing with 90,000 comments, 85,000 of them are spam, all inserted in random order), the single query often takes under 10ms even under SQL_NO_CACHE. On a tiny table, the query isn't any noticeable difference. The EXPLAINS are VERY reasonable. No JOIN, a SIMPLE query, and a key is even used (though by no means necessary, per my testing).

EXPLAIN SELECT * FROM blank_comments WHERE ( comment_approved = '0' OR comment_approved = '1' ) ORDER BY comment_date_gmt DESC LIMIT 150;
+----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+
| id | select_type | table          | type  | possible_keys             | key                       | key_len | ref  | rows | Extra                       |
+----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+
|  1 | SIMPLE      | blank_comments | range | comment_approved_date_gmt | comment_approved_date_gmt | 62      | NULL | 2583 | Using where; Using filesort |
+----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+

If we were to do the comment_approved calculations in PHP, sure, we get a slightly better query, but it doesn't end up being any quicker on small tables, and on large tables we are very likely to end up making dozens of these queries (at LIMIT 50).

EXPLAIN SELECT * FROM blank_comments ORDER BY comment_date_gmt DESC LIMIT 150;
+----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+
| id | select_type | table          | type  | possible_keys | key              | key_len | ref  | rows | Extra |
+----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+
|  1 | SIMPLE      | blank_comments | index | NULL          | comment_date_gmt | 8       | NULL |  150 |       |
+----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+

After looking at this for a year, I am strongly confident in this.

@nacin
13 years ago

Account for LIMIT logic.

#25 follow-up: @SergeyBiryukov
13 years ago

There's an odd return in line 621 of 14222.2.diff.

#26 in reply to: ↑ 25 @nacin
13 years ago

Replying to SergeyBiryukov:

There's an odd return in line 621 of 14222.2.diff.

Just a piece of debug. Thanks.

#27 @nacin
13 years ago

Current query:

explain extended SELECT * FROM blank_comments c LEFT JOIN blank_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;
+----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+
| id | select_type | table | type   | possible_keys   | key              | key_len | ref                         | rows | filtered  | Extra       |
+----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+
|  1 | SIMPLE      | c     | index  | comment_post_ID | comment_date_gmt | 8       | NULL                        |   50 | 184620.00 |             |
|  1 | SIMPLE      | p     | eq_ref | PRIMARY         | PRIMARY          | 8       | wordpress.c.comment_post_ID |    1 |    100.00 | Using where |
+----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+

#28 @nacin
13 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from new to closed

In [20609]:

Make the Recent Comments dashboard widget more performant on sites with large amounts of comments, in particular with a heavy spam ratio.

Simplifies the query by avoiding a join, and leverages the API now rather than a direct query.

fixes #14222.

Note: See TracTickets for help on using tickets.