Make WordPress Core

Opened 12 years ago

Closed 11 years ago

#21435 closed enhancement (fixed)

wp-includes/comment.php line85 causes slow query due to the non-indexed column

Reported by: matsubobo's profile matsubobo Owned by: pento's profile pento
Milestone: 4.0 Priority: normal
Severity: minor Version: 3.4.1
Component: Database Keywords: has-patch commit
Focuses: performance Cc:

Description

Following query is causes slow query if the wp_comment table is huge.

$ok_to_comment = $wpdb->get_var("SELECT comment_approved FROM $wpdb->comments WHERE comment_author = '$author' AND comment_author_email = '$email' and comment_approved = '1' LIMIT 1");                                                         

In my case, I have 600 thousand records in the wp_comments table and the query takes over 10 minutes to complete.

http://matsu.teraren.com/blog/wp-content/uploads/2012/08/a40b1291fd99413dc3057fbe0b792a93.png

To fix this issue, I added index on my running wordpress and returns 0.00sec.

I'll attach the patch for create table file.

Attachments (4)

index.patch (511 bytes) - added by matsubobo 12 years ago.
patch for create table.
21435.diff (868 bytes) - added by pento 11 years ago.
21435.2.diff (973 bytes) - added by pento 11 years ago.
21435.3.diff (974 bytes) - added by nacin 11 years ago.
Updated for whitespace

Download all attachments as: .zip

Change History (22)

@matsubobo
12 years ago

patch for create table.

#1 @nacin
11 years ago

  • Component changed from General to Database
  • Milestone changed from Awaiting Review to Future Release
  • Owner set to pento
  • Status changed from new to assigned

@pento
11 years ago

#2 follow-up: @pento
11 years ago

  • Keywords commit added
  • Milestone changed from Future Release to 3.9

I like this index. I'd also expect it to speed up large import processes, which make liberal use of comment_exists().

attachment:21435.diff​ updates the patch to apply cleanly against trunk, and fixes a couple of typos in the original.

#3 @SergeyBiryukov
11 years ago

  • Focuses performance added

#4 @pento
11 years ago

  • Keywords commit removed

On reflection, attachment:21435.diff isn't exactly the right solution. An index on (comment_author,comment_author_email) would be better, but it could get pretty big, and would be slow to create on big wp_comment tables.

Maybe a 50 character prefix index would be a good trade off between size and query performance, but doesn't solve the problem of creating the index.

#5 @pento
11 years ago

Related: #14711

#6 follow-ups: @mdawaffe
11 years ago

Seems like (comment_author_email,comment_author) would be more generally useful.

#7 in reply to: ↑ 2 @matsubobo
11 years ago

Replying to pento:

I like this index. I'd also expect it to speed up large import processes, which make liberal use of comment_exists().

attachment:21435.diff​ updates the patch to apply cleanly against trunk, and fixes a couple of typos in the original.

Thank you for your feedback but It's not a typo.
The parameter '2' in wcomment_author is what I expected to avoid deep index for string.

KEY comment_author (wcomment_author(2)),

Please refer the following MySQL index logic.
https://dev.mysql.com/doc/refman/5.6/en/create-index.html

#8 in reply to: ↑ 6 @matsubobo
11 years ago

Replying to mdawaffe:

Seems like (comment_author_email,comment_author) would be more generally useful.

I don't think so.
The column, comment_author is stored almost unique and cardinality is high so I just added index for this column to avoid getting RDBMS's index size big and overhead of updating row.

Adding multi column index to character column leads to deep BTree index.

#9 @matsubobo
11 years ago

I'm currently managing 700,000 records in wp_comments table and works fine with this index setting.

Please see the my table information on production.
https://www.evernote.com/shard/s363/sh/09b52bd7-54c0-4ec1-b985-b9cf86cea3fb/6c6a5cc2aa051206c4e375a2ac0274ab

I hope this patch will be applied as soon as possible because this index increase much performance for large scale wordpress site with less cpu and data cost.

#10 in reply to: ↑ 6 ; follow-ups: @pento
11 years ago

Replying to mdawaffe:

Seems like (comment_author_email,comment_author) would be more generally useful.

Aye, that sounds like a better idea.

  • In wp_allow_comment(), comment_author_email is optional, so having comment_author first would be better. On the other hand, I assume the vast majority of sites require an email address.
  • In check_comment_flood_db(), comment_author_email is used, but comment_author isn't. I'd expect the comment_date_gmt index to be more useful here, though.
  • In WP_Comment_Query::query(), comment_author_email is an allowed param, but comment_author isn't, so having comment_author_email first would probably be more useful, depending on how WP_Comment_Query::query() is being used.

Replying to matsubobo:

Please see the my table information on production.
https://www.evernote.com/shard/s363/sh/09b52bd7-54c0-4ec1-b985-b9cf86cea3fb/6c6a5cc2aa051206c4e375a2ac0274ab

That's interesting, I'm surprised it gets that high a cardinality from 2 bytes. I'm curious, is the content of comment_author on your site generally in Japanese? I'm not at all familiar with character distribution in Japanese names or writing, but assuming a random distribution, there are certainly enough characters in Japanese to provide a better cardinality than in the English alphabet.

#11 in reply to: ↑ 10 @matsubobo
11 years ago

Replying to pento:

Replying to matsubobo:

Please see the my table information on production.
https://www.evernote.com/shard/s363/sh/09b52bd7-54c0-4ec1-b985-b9cf86cea3fb/6c6a5cc2aa051206c4e375a2ac0274ab

That's interesting, I'm surprised it gets that high a cardinality from 2 bytes. I'm curious, is the content of comment_author on your site generally in Japanese? I'm not at all familiar with character distribution in Japanese names or writing, but assuming a random distribution, there are certainly enough characters in Japanese to provide a better cardinality than in the English alphabet.

Yes. Japanese.
In this case, the column is not binary so index is constructed with 2 words.
Japanese characters are almost 50,000 words so that index tree would have maximum 50,000^2 node when using depth 2 BTree index.

If other users does not have this kind of performance problem, index length should be minimum.

Last edited 11 years ago by SergeyBiryukov (previous) (diff)

#12 in reply to: ↑ 10 @mdawaffe
11 years ago

Replying to pento:

Replying to mdawaffe:

Seems like (comment_author_email,comment_author) would be more generally useful.

Aye, that sounds like a better idea.

  • In wp_allow_comment(), comment_author_email is optional

Ah - I didn't think about that. By the way, for the use cases I was thinking about (essentially the others you mention), a single column index on comment_author_email would have been just as good.

#13 @nacin
11 years ago

  • Milestone changed from 3.9 to Future Release

I'd like to fix this. However, it is too late in the 3.9 cycle for a schema change.

I'm curious about a few things. Namely:

  • Which index is generally going to be the best balance of speed and size, for what we do now and what we may do in the future? There are some good discussions here and possibly even a consensus, but yet a lot of data to back up a decision.
  • Why haven't others reported this? 600,000 records in wp_comments isn't that unusual. Surely WordPress.com would have noticed this by now? Do giant sites just actively or incidentally avoid the comment_whitelist option? ("Comment author must have a previously approved comment")

@pento
11 years ago

#14 @pento
11 years ago

  • Milestone changed from Future Release to 4.0

All things old become new again.

As suggested by tellyworth some 4 years ago in #14711, I think a prefix index on comment_author_email would provide our best balance of performance vs. size.

For index creation, I did a rough test on 1,000,000 rows, it took 10 seconds with MyISAM, and 9 seconds with InnoDB.

This ticket was mentioned in IRC in #wordpress-dev by nacin. View the logs.


11 years ago

@nacin
11 years ago

Updated for whitespace

#17 @nacin
11 years ago

  • Keywords commit added

This patch looks good. Want to hear from batmoo, tellyworth, bazza if possible before proceeding.

#18 @nacin
11 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 29188:

DB schema change: Add comment_author_email(10) index.

props tellyworth, pento.
fixes #21435.

Note: See TracTickets for help on using tickets.