Make WordPress Core

Opened 14 years ago

Closed 13 years ago

Last modified 13 years ago

#17582 closed defect (bug) (fixed)

Problems with duplicated users

Reported by: scribu's profile scribu Owned by: ryan's profile ryan
Milestone: 3.2 Priority: normal
Severity: normal Version:
Component: Query Keywords: close
Focuses: Cc:

Description

Since WP 3.2, you can do this:

$users = get_users( array(
  'meta_query' => array(
    'relation' => 'OR',
    array(
      'key' => 'foo',
    ),
    array(
      'key' => 'bar',
    ),
  )
) );

which is great, except you can end up with a lot of duplicate users.

There's no straightforward way to tell WP_User_Query to add a distinct clause.

Sure, you can do this:

add_action( 'pre_user_query', function( $user_query ) {
  $user_query->query_fields = 'DISTINCT ' . $user_query->query_fields;
} );

but the query that populates $user_query->total_users will ignore that.

Attachments (5)

found_users_query.17582.diff (769 bytes) - added by scribu 14 years ago.
17582.diff (1.4 KB) - added by scribu 14 years ago.
17582.2.diff (1.3 KB) - added by scribu 14 years ago.
17582.3.diff (1.6 KB) - added by greuben 14 years ago.
17582.4.diff (1.5 KB) - added by scribu 14 years ago.

Download all attachments as: .zip

Change History (41)

#1 @scribu
14 years ago

  • Keywords has-patch added

found_users_query.17582.diff introduces the 'found_users_query' filter, similar to 'found_posts_query'.

It's the minimum required to get this working correctly from a plugin.

#2 @scribu
14 years ago

With that patch, you could do something like this:

add_action( 'pre_user_query', function( $user_query ) {
  $user_query->query_fields = 'SQL_CALC_FOUND_ROWS DISTINCT ' . $user_query->query_fields;
} );

add_filter( 'found_users_query', function( $sql ) {
  return 'SELECT FOUND_ROWS()';
} );

Obviously, a 'distinct' arg for WP_User_Query that takes care of this would be better.

#3 @greuben
14 years ago

ORDER BY by user_login or ID will also avoid duplicates

#4 @scribu
14 years ago

You mean GROUP BY, I assume.

It will avoid duplicates, but it will also mess with the total user count, which is what I'm having trouble with.

#5 @dd32
14 years ago

Core should be adding a GROUP BY ID which applies to both queries -- like we do in WP_Query.. I don't see a need to make this work via a plugin, Core should handle it.

#6 @scribu
14 years ago

Yes, Core should handle it, but like I said, GROUP BY ID isn't going to cut it. You won't get the correct total:

SELECT COUNT(*)
FROM wp_users
...
GROUP BY ID

will always return 1.

@scribu
14 years ago

#7 @scribu
14 years ago

  • Milestone changed from Awaiting Review to 3.2

17582.diff introduces 'distinct' query var and uses FOUND_ROWS() to calculate total.

Moving to 3.2 milestone for consideration.

@scribu
14 years ago

#8 @ryan
14 years ago

"Since WP 3.2, you can do this:"

Should that be since 3.1? If this is not a regression in 3.2, I'm inclined to punt this to 3.3.

@greuben
14 years ago

#9 @greuben
14 years ago

Should that be since 3.1?

Its from 3.2(r17699)

See 17582.3.diff for a simple fix.

#10 @ryan
14 years ago

Ah, forgot relation support came in 3.2. That patch seems okay. scribu?

#11 @scribu
14 years ago

17582.3.diff isn't significantly simpler than 17582.2.diff and there's no proof that it has better performance. Remember query_posts(): #10964

Last edited 14 years ago by scribu (previous) (diff)

#12 @scribu
14 years ago

Also, 17582.2.diff allows you to switch to the other method, if you are absolutely sure that it's faster in a particular case.

#13 follow-up: @greuben
14 years ago

and there's no proof that it has better performance

Actually, .2.diff is faster than .3.diff

Run EXPLAIN on the queries generated by two patches, you'll find that .2.diff is faster.

#14 in reply to: ↑ 13 @westi
14 years ago

Replying to greuben:

and there's no proof that it has better performance

Actually, .2.diff is faster than .3.diff

Run EXPLAIN on the queries generated by two patches, you'll find that .2.diff is faster.

In future it would be really useful to include the explain output on the ticket so that anyone reviewing the patch can see it.

#15 @greuben
14 years ago

In future it would be really useful to include the explain output on the ticket so that anyone reviewing the patch can see it.

Sorry, here it is

17582.2.diff

+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+
| id | select_type | table       | type | possible_keys    | key     | key_len | ref             | rows | Extra                           |
+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+
|  1 | SIMPLE      | wp_users    | ALL  | PRIMARY          | NULL    | NULL    | NULL            |    2 | Using temporary; Using filesort |
|  1 | SIMPLE      | wp_usermeta | ref  | user_id,meta_key | user_id | 8       | dev.wp_users.ID |    7 | Distinct                        |
|  1 | SIMPLE      | mt1         | ref  | user_id,meta_key | user_id | 8       | dev.wp_users.ID |    7 | Using where; Distinct           |
+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+

17582.3.diff

+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+
| id | select_type | table       | type | possible_keys    | key     | key_len | ref             | rows | Extra                           |
+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+
|  1 | SIMPLE      | wp_users    | ALL  | PRIMARY          | NULL    | NULL    | NULL            |    2 | Using temporary; Using filesort |
|  1 | SIMPLE      | wp_usermeta | ref  | user_id,meta_key | user_id | 8       | dev.wp_users.ID |    7 |                                 |
|  1 | SIMPLE      | mt1         | ref  | user_id,meta_key | user_id | 8       | dev.wp_users.ID |    7 | Using where                     |
+----+-------------+-------------+------+------------------+---------+---------+-----------------+------+---------------------------------+

#16 @ryan
14 years ago

.2.diff requires the caller to know when to apply distinct, yes? Whereas .3.diff does not. Can $meta_query->get_sql() be made smart enough to know when DISTINCT is required and return a clause for it?

#17 @greuben
14 years ago

Can $meta_query->get_sql() be made smart enough to know when DISTINCT is required and return a clause for it?

Yes it is possible because DISTINCT is only required for OR'ed queries.

There is one problem with using distinct. Consider two users with same display name and run this query.

new WP_User_Query( array(
		'distinct' => true,
		'fields' => array( 'display_name' ),
		'meta_query' => array(
			'relation' => 'OR',
			array( 'key' => 'wp_capabilities' ),
			array( 'key' => 'name' )
		)
) )

#18 @scribu
14 years ago

Ok, what's the problem with the above query? It looks fine to me.

#19 @scribu
14 years ago

Oh, with the same display name... I see.

#20 @scribu
14 years ago

Arguably, returning the name twice in that case isn't very useful anyway, since you can't distinguish between users.

Anyway, I'm fine with 17582.3.diff as long as it also includes the 'found_users_query' filter, so that one could switch to FOUND_ROWS() if needed.

PS: For consistency's sake, GROUP BY should be uppercase:

$this->query_groupby = 'GROUP BY ID';

#21 @ryan
14 years ago

Or .2.diff could get distinct set via get_sql().

#22 @scribu
14 years ago

I'm not sure that would be so straightforward. What if we want to add a 'distinct' query var to WP_User_Query later on?

Also, WP_Meta_Query::get_sql() is used in WP_Query too.

#23 @scribu
14 years ago

It would be better to set a possible_duplicates flag on both WP_Meta_Query and WP_Tax_Query. Based on that, WP_Query and WP_User_Query could take appropriate action.

#24 @scribu
14 years ago

Sorry, I was thinking of get_sql() returning the entire SELECT clause.

Having it return just 'DISTINCT' or an empty string sounds good.

@scribu
14 years ago

#25 @scribu
14 years ago

Found a simpler solution: 17582.4.diff just checks 'OR' == $meta_query->relation, since that's the only case when duplicates can occur.

We should be doing the same in WP_Query, but that's another ticket.

#26 @ryan
14 years ago

.4.diff looks good. Tested for regressions in wp_dropdown_user() and users.php paging. All seems well.

#27 @ryan
14 years ago

  • Owner set to ryan
  • Resolution set to fixed
  • Status changed from new to closed

In [18178]:

Use DISTINCT to eliminate duplicates when using an OR meta query relation. Props scribu, greuben. fixes #17582

#28 @Denis-de-Bernardy
14 years ago

Curious to know why a group by wasn't used, instead of a distinct.

The distinct will cause the data to be sort by all fields in their select order, at which point dupe are eliminated.

If we were to append the order by clause with the ID and then group by and order by the latter, by contrast, we'd eliminate the dups without needing to re-order at all.

#29 follow-up: @scribu
14 years ago

greuben's test suggested that DISTINCT is faster. Of course, more benchmarks are always welcome.

#30 in reply to: ↑ 29 @Denis-de-Bernardy
14 years ago

Replying to scribu:

greuben's test suggested that DISTINCT is faster.

That was because he was grouping by id. In other words it's ordered twice.

If you order by display_name, the group by/order by should be:

GROUP BY display_name, ID
ORDER BY display_name, ID

That way they'll only need to be ordered once, and the extra ID sub-ordering will strip out the dups (at a negligible cost). Best I'm aware MySQL won't do this automatically when using distinct, though I may be wrong (in which case it would certainly explain why it was faster to use distinct).

Last edited 14 years ago by Denis-de-Bernardy (previous) (diff)

#31 @greuben
14 years ago

If you order by display_name, the group by/order by should be:

GROUP BY display_name, ID
ORDER BY display_name, ID

That way they'll only need to be ordered once, and the extra ID sub-ordering will strip out the dups (at a negligible cost).

Here's some more profiling info based on the suggested query & query with patch .2.diff

+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                                                                                                                                              |
+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        1 | 0.01018900 | SELECT wp_users.* FROM wp_users INNER JOIN wp_usermeta ON (wp_users.ID = wp_usermeta.user_id) INNER JOIN wp_usermeta AS mt1 ON (wp_users.ID = mt1.user_id) group by display_name, ID ORDER BY display_name, ID ASC |
|        2 | 0.00065200 | SELECT SQL_CALC_FOUND_ROWS DISTINCT wp_users.* FROM wp_users INNER JOIN wp_usermeta ON (wp_users.ID = wp_usermeta.user_id) INNER JOIN wp_usermeta AS mt1 ON (wp_users.ID = mt1.user_id) ORDER BY display_name ASC  |
|        3 | 0.01074400 | SELECT wp_users.* FROM wp_users INNER JOIN wp_usermeta ON (wp_users.ID = wp_usermeta.user_id) INNER JOIN wp_usermeta AS mt1 ON (wp_users.ID = mt1.user_id) group by display_name, ID ORDER BY display_name, ID ASC |
|        4 | 0.00065100 | SELECT SQL_CALC_FOUND_ROWS DISTINCT wp_users.* FROM wp_users INNER JOIN wp_usermeta ON (wp_users.ID = wp_usermeta.user_id) INNER JOIN wp_usermeta AS mt1 ON (wp_users.ID = mt1.user_id) ORDER BY display_name ASC  |
+----------+------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I am not sure about the theory but DISTINCT is faster in this case.

#32 @nacin
14 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening for discussion. I'm definitely inclined to agree with Denis

#33 @markjaquith
14 years ago

  • Milestone changed from 3.2 to Future Release

No discussion was had. And no patch was provided. Revisit for 3.3?

#34 @scribu
13 years ago

  • Keywords close added; has-patch removed

#35 @scribu
13 years ago

  • Milestone changed from Future Release to 3.3
  • Resolution set to fixed
  • Status changed from reopened to closed

I'm going to close this again, since no discussion was had in a whole year. If someone has a performance problem, they can just open a new ticket.

#36 @scribu
13 years ago

  • Milestone changed from 3.3 to 3.2

I believe we were still working on 3.2 when the commit was made.

Note: See TracTickets for help on using tickets.