Opened 14 years ago
Closed 13 years ago
#14222 closed defect (bug) (fixed)
Improve dashboard recent comments widget performance by not fetching spam comments
Reported by: | Viper007Bond | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.4 | Priority: | normal |
Severity: | normal | Version: | 3.0 |
Component: | Performance | Keywords: | has-patch commit already |
Focuses: | Cc: |
Description
Currently we fetch spam comments and exclude them using PHP. It's much better to just not fetch them in the the first place.
If a blog has a TON of spam comments for example, the while()
will take quite a long time to find enough legit comments.
Attachments (6)
Change History (34)
#1
@
14 years ago
I just noticed the inline comment. It looks like this was done on purpose for query performance reasons.
However on WordPress.com we ran into an issue where a blog's most recent 25000 comments or so were all spam. That means the while()
had to run a few hundred queries before it managed to find enough non-spam comments. I would not be surprised if self-hosted blogs ran into this exact same issue.
I think it's better to have a little bit slower query rather than having to do potentially many, many queries.
@
14 years ago
Remove comment, move comment_approved condition earlier (as it's most likely to exclude an item), and make sure the while() returns enough comments when done
#2
@
14 years ago
+1 to this. Find it hard to see how an IN (0,1) would be faster than an = 1 in any event, plus it's best not to transfer the duff data from sql server to web server.
#3
@
14 years ago
As far as I remember the performance hit of selecting only non-spam comments was quite significant. And this will run on each loading of the dashboard, millions of times more often than finding a site where thousands of the recent comments are all spam.
Perhaps we could limit the loop to running only 5-10 times and if there aren't enough non-spam comments we can output some kind of notice that all the recent comments are in the spam queue.
#4
follow-up:
↓ 5
@
14 years ago
Maybe we can fall back to the other query if we still don't have valid comments after X number of iterations. Again, that depends on how poor in performance this other query is.
#5
in reply to:
↑ 4
;
follow-up:
↓ 6
@
14 years ago
Replying to nacin:
This sounds like a good idea. Perhaps we can loop 5 times and if there aren't 5 non-spam comments, can run a query to select only non-spam comments. Primarily this was affecting larger sites with many comments so the testing should be done on a comments table with at least few hundred thousands rows.
#6
in reply to:
↑ 5
@
14 years ago
Replying to azaozz:
Replying to nacin:
This sounds like a good idea. Perhaps we can loop 5 times and if there aren't 5 non-spam comments, can run a query to select only non-spam comments. Primarily this was affecting larger sites with many comments so the testing should be done on a comments table with at least few hundred thousands rows.
Perhaps we can also progressively increase the limit during each iteration, to keep sites that 50 is enough running the same, but lower the chances we get exposed to the query we're dreading. i.e. first 50, then next 100, then next 150, then next 200, then next 250. We then go from checking 250 comments to three times that.
Also, we should increase the # we pull based on the setting by a factor of ten. In 3.0 we enabled the widget to be customizable up to 30. If 5, then 50, if 10, then 100, if 18, then 180. Given that Akismet says 83% of all comments are spam, this seems reasonable.
Just trying to get creative here.
#7
@
14 years ago
- Cc westi added
We should be able to get the database to do this faster than php.
I guess we need to dig in more as to why the query is so slow and see how we can speed it up.
We could also cache the information so it is not request from the db constantly when people visit the dashboard even if it hasn't changed.
#10
follow-up:
↓ 11
@
14 years ago
- Milestone changed from Future Release to 3.2
New take on this, after seeing the database dump on #16846:
It was decided originally that excluding in PHP is better for performance. That makes sense in theory.
However, when trash was implemented, we added post_status != 'trash'
to the query, probably without regard to the original decision. I don't think the original decision was != versus NOT IN(), it was just having the check to begin with.
Now with that said -- I haven't seen any complaints with regards to performance for the != trash.
So we should either copy or move that to != 'spam', or NOT IN for both trash and spam, which shouldn't be much slower. There's going to be more spam comments than trash comments on pretty much every site, as spamming is going to be automated (and in far greater volumes), so I'm strongly inclined to deal with this one way or another now.
I've also opined here that increasing the limits (both initial, and the jump) on successive queries makes a lot of sense. We should do that improvement as well.
Moving to 3.2. This is one of those lighter/faster tasks.
#11
in reply to:
↑ 10
@
14 years ago
Replying to nacin:
So we should either copy or move that to != 'spam', or NOT IN for both trash and spam, which shouldn't be much slower. There's going to be more spam comments than trash comments on pretty much every site, as spamming is going to be automated (and in far greater volumes), so I'm strongly inclined to deal with this one way or another now.
I just ran benchmarks on a blog running locally (pristine trunk) having 46508 rows in the comments table. I've disabled the query cache just to be safe.
With the current query, it took 18.8 ms:
SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;
Changing the != to NOT IN() resulted in 19.2ms:
SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status NOT IN ('trash') ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;
Adding 'spam' in the NOT IN() results in 18.8ms:
SELECT * FROM wp_comments c LEFT JOIN wp_posts p ON c.comment_post_ID = p.ID WHERE p.post_status NOT IN ('trash','spam') ORDER BY c.comment_date_gmt DESC LIMIT 0, 50;
I don't really think we're gaining any major advantage from this but happy to be wrong.
#12
@
14 years ago
- Keywords commit added; needs-testing removed
No complains for != trash since 2.9. Time to run with this.
I'd like to test it on an even larger table and also with some explains just to confirm the numbers.
#13
@
14 years ago
I've tested the 2.patch on a site with 25k+ spam comments. It pans down the number of sql queries to get the 5 most recent allowed comments from 500 to 1.
I agree that the most efficient way to filter out spam comments is to have the comment_approved state in the database query instead of post processing that in PHP.
#14
@
14 years ago
- Keywords needs-refresh dev-feedback added; commit removed
@nacin: It is not clear whether you think we should commit the .2. patch or you are referring to the queries in the comment from technosailor.
Not sure why the third query checks post_status = 'spam' either?
I think checking for comment_status != 'spam' is the best fix but it needs some explain'ing to check.
#15
@
14 years ago
- Milestone changed from 3.2 to Future Release
I was mis-reading the queries earlier. Punt.
#16
@
13 years ago
I just hit this problem on a self hosted site with >30k spam comments. We get a lot of spam volume, and hitting the dashboard admin was putting PHP into a near-perpetual loop looking for enough comments. Very, very, very bad for the server. Somehow PHP was getting around the 30 second max execution time and so after hitting the page a few times, the whole server ground to a halt. :-(
Having just read the history (briefly), unless there's a major performance hit to put != spam into the query, I think that makes the most sense.
#17
@
13 years ago
- Keywords needs-refresh removed
One problem here is that current query (introduced in [11749]) fetches not only comments, but also full rows of parent posts, which seems unnecessary for this widget. With a lot of spam comments to skip, this makes the queries about 3 times slower. 14222.3.patch fixes just that.
14222.4.patch is a refresh of 14222.2.patch with a couple of changes:
comment_approved
values need to be compared as strings, not integers.- Includes 14222.3.patch.
Test 1: 10,303 comments total (3 approved + 10,000 spam + 300 trash)
WP 3.2.1 Total Queries: 234, Total query time: 347,613.0 ms SELECT * FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50 0.7350 sec 14222.3.patch Total Queries: 234, Total query time: 107,065.7 ms SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50 0.1640 sec 14222.4.patch Total Queries: 27, Total query time: 38.1 ms SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE c.comment_approved IN ('0','1') AND p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 5 0.0006 sec
Test 2: 10,303 comments total (10,003 approved + 300 trash)
WP 3.2.1 Total Queries: 38, Total query time: 1,994.9 ms SELECT * FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50 0.8101 sec 14222.3.patch Total Queries: 38, Total query time: 437.3 ms SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0 , 50 0.1614 sec 14222.4.patch Total Queries: 38, Total query time: 438.2 ms SELECT c.* FROM trunk_comments c LEFT JOIN trunk_posts p ON c.comment_post_ID = p.ID WHERE c.comment_approved IN ('0','1') AND p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 5 0.1740 sec
#20
@
13 years ago
I recently ran into this issue, there was about 5000 spam comments to loop through 50 at a time, which was taking over 20 seconds, which made it pretty unusable.
+1 for SergeyBiryukov's patch, if I understand correctly, it's faster than the current implementation, and fixes the issue of handling it in PHP. No brainer?
#23
@
13 years ago
Okay. It dawned on me. We never needed the join to begin with. When a post gets trashed, it gets assigned a post-trashed status to specifically avoid the need to do queries like these.
I suggest we drop the join, then add in the comment_status changes. Going to drop the join now and then do some further testing.
#24
@
13 years ago
- Component changed from Administration to Performance
- Keywords commit already added; dev-feedback removed
14222.diff introduces quite a bit of speedup.
- We're no longer checking the post status.
- We're now only fetching comment fields, no post fields.
- Even better, we're not even joining the posts table.
- We're using get_comments(), which builds the most optimized SQL possible for these cases anyway — nice, no raw SQL at no cost. get_comments() is hypothetically cached, as well.
- We're not selecting 50 at a time, we're selecting
10 * $items
at a time (the first time is5 * $items
). By default, this is 50 to show 5 comments, but 25 on the first query. With the box set to 30, this is 150 comments the first time, 300 additional times. The only way a comment does not get included is if current_user_can('read_post') fails -- as in, if the post is private and they don't have access. It's edge. We'll find enough comments very quickly. - We're doing a break 2, to back out of the foreach and while, once we find the number of comments we're looking for.
- Cleans up some PHP-HTML context switching.
What is amazing is that we added a join plus a check of post_status = 'trash' — despite it not being needed — and then waited this long to A) notice, and B) realize that if we're going to do that, we might as well check comment_approved as well. Checking comment_approved and avoiding the JOIN is, obviously, quite the performance improvement.
On a giant table (I was testing with 90,000 comments, 85,000 of them are spam, all inserted in random order), the single query often takes under 10ms even under SQL_NO_CACHE. On a tiny table, the query isn't any noticeable difference. The EXPLAINS are VERY reasonable. No JOIN, a SIMPLE query, and a key is even used (though by no means necessary, per my testing).
EXPLAIN SELECT * FROM blank_comments WHERE ( comment_approved = '0' OR comment_approved = '1' ) ORDER BY comment_date_gmt DESC LIMIT 150; +----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+ | 1 | SIMPLE | blank_comments | range | comment_approved_date_gmt | comment_approved_date_gmt | 62 | NULL | 2583 | Using where; Using filesort | +----+-------------+----------------+-------+---------------------------+---------------------------+---------+------+------+-----------------------------+
If we were to do the comment_approved calculations in PHP, sure, we get a slightly better query, but it doesn't end up being any quicker on small tables, and on large tables we are very likely to end up making dozens of these queries (at LIMIT 50).
EXPLAIN SELECT * FROM blank_comments ORDER BY comment_date_gmt DESC LIMIT 150; +----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+ | 1 | SIMPLE | blank_comments | index | NULL | comment_date_gmt | 8 | NULL | 150 | | +----+-------------+----------------+-------+---------------+------------------+---------+------+------+-------+
After looking at this for a year, I am strongly confident in this.
#26
in reply to:
↑ 25
@
13 years ago
Replying to SergeyBiryukov:
There's an odd
return
in line 621 of 14222.2.diff.
Just a piece of debug. Thanks.
#27
@
13 years ago
Current query:
explain extended SELECT * FROM blank_comments c LEFT JOIN blank_posts p ON c.comment_post_ID = p.ID WHERE p.post_status != 'trash' ORDER BY c.comment_date_gmt DESC LIMIT 0, 50; +----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+ | 1 | SIMPLE | c | index | comment_post_ID | comment_date_gmt | 8 | NULL | 50 | 184620.00 | | | 1 | SIMPLE | p | eq_ref | PRIMARY | PRIMARY | 8 | wordpress.c.comment_post_ID | 1 | 100.00 | Using where | +----+-------------+-------+--------+-----------------+------------------+---------+-----------------------------+------+-----------+-------------+
Move the allowed states check from PHP into MySQL