Opened 5 years ago
Last modified 8 days ago
#7394 assigned enhancement
Search: order results by relevance
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Future Release |
| Component: | General | Version: | 2.6 |
| Severity: | normal | Keywords: | has-patch needs-testing 3.7-early |
| Cc: | scribu, eric.andrew.lewis@…, info@…, gibrown, wordpress@…, kovshenin, tollmanz@…, pippin@…, nashwan.doaqan@…, sunnyratilal5@… |
Description
I have 35 pages in my WordPress install. My "About" page is on the second page of results when I search for "about"
We should put hits on the title first in the results list.
I'm open to suggestions for possible technical implementations.
Attachments (7)
Change History (52)
comment:1
joostdevalk — 5 years ago
- Keywords needs-post added
- Milestone changed from 2.8 to Future Release
- Milestone Future Release deleted
- Resolution set to duplicate
- Status changed from new to closed
see #9785
- Cc scribu added
- Keywords changed from search, posts, page, title, relevance needs-patch to search posts, page, title, relevance needs-patch
- Keywords changed from search posts, page, title, relevance needs-patch to search posts page, title, relevance needs-patch
- Milestone set to Future Release
- Resolution duplicate deleted
- Status changed from closed to reopened
- Summary changed from When searching posts/pages in wp-admin, give more emphasis to matches on title to Search: give more emphasis to matches on title
I think this is the low-hanging fruit of the search tickets:
- easy to implement: just play with the ORDER BY clause
- major positive effect on search experience
Therefore, re-opening as a separate task. Patch in the works.
- Keywords changed from search posts page, title, relevance needs-patch to search posts page title, relevance needs-patch
- Owner changed from anonymous to scribu
- Status changed from reopened to accepted
- Summary changed from Search: give more emphasis to matches on title to Search: order results by relevance
comment:10
scribu — 2 years ago
- Component changed from Administration to General
- Keywords changed from search posts page title, relevance needs-patch to search posts page title relevance needs-patch
comment:11
scribu — 2 years ago
Related: #16844
comment:12
scribu — 2 years ago
- Keywords has-patch added; needs-patch removed
basic.7394.diff simply puts posts that have a matching title above those that have only matching content.
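As a rough illustration of the title-first idea (not the contents of basic.7394.diff), the sketch below uses Python's sqlite3 module as a stand-in for MySQL. The `posts` table, its sample rows, and the `search_title_first` helper are assumptions made up for this example.

```python
import sqlite3

# Hypothetical miniature of wp_posts; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, post_title TEXT, "
    "post_content TEXT, post_date TEXT)"
)
conn.executemany(
    "INSERT INTO posts (post_title, post_content, post_date) VALUES (?, ?, ?)",
    [
        ("Hello world", "Read more about this site.", "2013-03-01"),
        ("About", "Who we are.", "2008-07-01"),
        ("Contact", "Questions about billing go here.", "2013-02-01"),
    ],
)


def search_title_first(conn, term):
    """Return matching titles, title hits first, then newest first."""
    # Note: % and _ inside the term are not escaped in this sketch.
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT post_title FROM posts
        WHERE post_title LIKE ? OR post_content LIKE ?
        ORDER BY (post_title LIKE ?) DESC, post_date DESC
        """,
        (pattern, pattern, pattern),
    )
    return [row[0] for row in rows]


titles = search_title_first(conn, "about")
print(titles)
```

With plain date ordering the old "About" page would come last; the extra ORDER BY term moves it ahead of the newer posts that only mention the word in their content.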
comment:13
scribu — 2 years ago
It occurs to me that this should have a query var for easy disabling.
For example, it might not be well suited for an 'event' post type.
comment:14
scribu — 2 years ago
Related: #17139
comment:15
scribu — 2 years ago
Related: #17152
comment:16
scribu — 11 months ago
- Keywords search posts page title relevance removed
- Owner scribu deleted
- Status changed from accepted to assigned
comment:17
ericlewis — 11 months ago
- Cc eric.andrew.lewis@… added
comment:18
azaozz — 9 months ago
#21638 was closed as duplicate.
To continue the discussion from there:
Replying to toscho
Could this be extended to the following order:
- Results with exact matches for the search phrase.
- All words from search phrase in any order.
- Some words from the search phrase.
Yes, that would make the search even better when it's multi-word:
( SELECT * FROM wp_posts WHERE post_title LIKE '%one two three%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_content LIKE '%one two three%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_title LIKE '%one%' AND post_title LIKE '%two%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_content LIKE '%one%' AND post_content LIKE '%two%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE (post_title LIKE '%one%' OR post_title LIKE '%two%') AND ... )
LIMIT 0, 20
Not sure we should do the OR case for post_content. That usually returns tons of results (we don't do it at the moment either).
On my test install with about 2000 posts this query performs relatively well, 0.1 - 0.2 sec. with limit 50 and very common search terms.
comment:19
azaozz — 9 months ago
Thinking more about this: using UNION has some drawbacks, like not being able to use ORDER BY in the individual SELECTs unless there is a LIMIT. Also, five SELECTs get slow once all the ANDs and ORs from the standard query are added.
Modified @scribu's patch to include the same sort conditions and use CASE in the ORDER BY. That is considerably faster. When searching for "test post" on the Posts page the produced query is:
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID
FROM wp_posts
WHERE 1=1
AND (((wp_posts.post_title LIKE '%test%') OR (wp_posts.post_content LIKE '%test%'))
AND ((wp_posts.post_title LIKE '%post%') OR (wp_posts.post_content LIKE '%post%')))
AND wp_posts.post_type = 'post'
AND ( wp_posts.post_status = 'publish' OR wp_posts.post_status = 'future' OR wp_posts.post_status = 'draft' OR wp_posts.post_status = 'pending' OR wp_posts.post_status = 'private' )
ORDER BY (CASE
WHEN wp_posts.post_title LIKE '%test post%' THEN 1
WHEN wp_posts.post_content LIKE '%test post%' THEN 2
WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 3
WHEN wp_posts.post_content LIKE '%test%' AND wp_posts.post_content LIKE '%post%' THEN 4
WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 5
ELSE 6 END), wp_posts.post_date DESC
LIMIT 0, 20
0.0024361610412598
This seems to work well and is quite fast on my test install. Would be good to test on a site with 300 - 400k rows in wp_posts.
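For readers who want to experiment with the CASE-based ordering, here is a minimal sketch that builds the same six tiers from a list of terms and runs them against an in-memory SQLite table. SQLite stands in for MySQL, and the `posts` table, the sample rows, and the `relevance_order_by` helper are assumptions for illustration, not code from the patch.

```python
import sqlite3


def relevance_order_by(terms):
    """Return (sql, params) for a CASE expression ranking rows 1 (best) to 6."""
    phrase = "%" + " ".join(terms) + "%"
    words = ["%" + term + "%" for term in terms]
    title_all = " AND ".join(["post_title LIKE ?"] * len(words))
    content_all = " AND ".join(["post_content LIKE ?"] * len(words))
    title_any = " OR ".join(["post_title LIKE ?"] * len(words))
    sql = (
        "CASE"
        " WHEN post_title LIKE ? THEN 1"      # full phrase in title
        " WHEN post_content LIKE ? THEN 2"    # full phrase in content
        f" WHEN {title_all} THEN 3"           # all words in title
        f" WHEN {content_all} THEN 4"         # all words in content
        f" WHEN {title_any} THEN 5"           # any word in title
        " ELSE 6 END"
    )
    # Parameters follow the order of the placeholders above.
    params = [phrase, phrase] + words + words + words
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (post_title TEXT, post_content TEXT, post_date TEXT)"
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [
        ("Post about a test", "body", "2013-04-01"),
        ("My test post", "body", "2013-01-01"),
        ("Notes", "a test post in the body", "2013-03-01"),
        ("Post only", "test here", "2013-02-01"),
    ],
)

sql, params = relevance_order_by(["test", "post"])
ranked = conn.execute(
    f"SELECT post_title, {sql} AS relevance FROM posts "
    "ORDER BY relevance, post_date DESC",
    params,
).fetchall()
for title, relevance in ranked:
    print(relevance, title)
```

Because CASE stops at the first WHEN that matches, each row gets exactly one tier, and the date only breaks ties within a tier.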
comment:20
follow-up:
↓ 21
tomauger — 9 months ago
Well, it appears that using REGEXP is significantly slower than a brute-force WHERE and ORDER BY clause, though the SQL is arguably more elegant (but who cares, I guess).
However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries. I would argue that a post title called "Best Post Evah" matches the search term "Post" better than "Ten Composting Tricks". See below:
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID, wp_posts.post_title
FROM wp_posts
WHERE 1=1
AND (wp_posts.post_title REGEXP 'one|two|three' OR wp_posts.post_content REGEXP 'one|two|three')
AND wp_posts.post_type IN ('post', 'page', 'attachment')
AND (wp_posts.post_status = 'publish' OR wp_posts.post_author = 1 AND wp_posts.post_status = 'private')
ORDER BY
wp_posts.post_title NOT REGEXP '[[:<:]]one two three[[:>:]]',
wp_posts.post_content NOT REGEXP '[[:<:]]one two three[[:>:]]',
wp_posts.post_title NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]',
wp_posts.post_content NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]',
wp_posts.post_title NOT REGEXP 'one|two|three',
wp_posts.post_date DESC
LIMIT 0, 10
Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.
Of course, this then excludes pluralizations and so forth, so "Post" would no longer have a high relevance rating with "Top Ten Posts of 2012" because of the "s".
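The trade-off can be checked quickly outside the database. The sketch below uses Python's `\b` word boundary as a stand-in for the `[[:<:]]` and `[[:>:]]` markers in the query above; the two helper functions are hypothetical names for this example only.

```python
import re


def substring_match(term, title):
    """Case-insensitive substring test, comparable to LIKE '%term%'."""
    return term.lower() in title.lower()


def whole_word_match(term, title):
    """Case-insensitive match that requires word boundaries on both sides."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, title, re.IGNORECASE) is not None


for title in ("Best Post Evah", "Ten Composting Tricks", "Top Ten Posts of 2012"):
    print(title, substring_match("Post", title), whole_word_match("Post", title))
```

The substring test accepts all three titles, while the word-boundary test accepts only "Best Post Evah": it correctly rejects "Composting" but also rejects the plural "Posts".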
comment:21
in reply to:
↑ 20
azaozz — 9 months ago
Replying to tomauger:
Well, it appears that using REGEXP is significantly slower...
However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries...
Yes, same as in my tests. Using REGEXP makes the search more precise but is considerably slower. On the other hand, it barely affects the highest-relevance case, a full string match in titles. Also, when using REGEXP the whole set goes through all sorting rules, whereas CASE with multiple WHEN ... THEN acts like an if() ... elseif() block in PHP.
Adding sorting to the search query will slow it down in any case. I think it would be best to try to get the "most bang for the buck", i.e. something like:
- full string in title,
- all words in title,
- any word in title,
- full string in content,
- everything else.
This assumes that most searches are for posts by title, and that 'all words' and 'any word' matches in the title would also match in the content.
Of course that makes the sorting less precise but keeps it very fast and is a huge improvement over the current search.
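The five tiers, and the if/elseif behaviour of CASE, can be sketched in plain Python. This is only an illustration of the ordering described above; `relevance_tier` is a hypothetical helper, and simple lowercase substring tests stand in for LIKE.

```python
def relevance_tier(terms, title, content):
    """Return 1 (best) to 5, stopping at the first rule that matches."""
    title_l = title.lower()
    content_l = content.lower()
    words = [term.lower() for term in terms]
    phrase = " ".join(words)

    if phrase in title_l:                      # full string in title
        return 1
    if all(word in title_l for word in words):  # all words in title
        return 2
    if any(word in title_l for word in words):  # any word in title
        return 3
    if phrase in content_l:                    # full string in content
        return 4
    return 5                                   # everything else


posts = [
    ("Other", "test and post"),
    ("Misc", "my test post"),
    ("Testing", "body"),
    ("Post a test", "body"),
    ("Test post ideas", "body"),
]
ordered = sorted(posts, key=lambda p: relevance_tier(["test", "post"], *p))
for title, content in ordered:
    print(relevance_tier(["test", "post"], title, content), title)
```

Each post falls into exactly one tier because evaluation stops at the first matching rule, which is what keeps this cheaper than running every rule against every row.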
Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.
Yes, thinking that too. Seems best to make all matches in titles better than a match in content.
comment:22
azaozz — 9 months ago
The ORDER BY part of the query with 7394-3.patch is:
... ORDER BY (CASE
WHEN wp_posts.post_title LIKE '%test post%' THEN 1
WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2
WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3
WHEN wp_posts.post_content LIKE '%test post%' THEN 4
ELSE 5 END), wp_posts.post_date DESC
LIMIT 0, 20
time as reported by SAVEQUERIES: 0.0022249221801758
And for a single term search:
... ORDER BY wp_posts.post_title LIKE '%test%' DESC, wp_posts.post_date DESC
LIMIT 0, 20
0.0019049644470215
Replacing the first match with REGEXP doesn't affect the speed much; however, it wouldn't match plural forms of the terms, etc., as @tomauger mentioned above. So "test posts" would not have the highest priority when searching for "test post":
... ORDER BY (CASE
WHEN wp_posts.post_title REGEXP '[[:<:]]test post[[:>:]]' THEN 1
WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2
WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3
WHEN wp_posts.post_content LIKE '%test post%' THEN 4
ELSE 5 END), wp_posts.post_date DESC
LIMIT 0, 20
0.0022361278533936
In those terms, the best speed/precision balance seems to come from using the "brute force" LIKE matching in ORDER BY.
comment:23
azaozz — 9 months ago
Related #21688.
comment:24
toscho — 9 months ago
- Cc info@… added
comment:25
gibrown — 8 months ago
- Cc gibrown added
comment:26
azaozz — 7 months ago
7394-4.patch combines 7394-3.patch with 21688/21688-5.patch as they are dependent on each other. Closing #21688 as duplicate/extension of this ticket too.
The combined patch has some enhancements:
- Sorting by relevance is only used when there's no explicit ORDER BY set for the query.
- Filter to specify whether the search ORDER BY should look into post_title and/or post_content (can be used to disable sorting by relevance).
- Looks at Unicode character properties \p{L} to filter out single letter terms but not high UTF-8 chars.
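One possible reading of the single-letter rule is sketched below. It is an interpretation, not the patch's logic: Python's standard `re` module has no `\p{L}`, so the hypothetical `clean_search_terms` helper uses string methods instead, dropping one-character ASCII letters and keeping single characters outside ASCII.

```python
def clean_search_terms(terms):
    """Drop empty terms and single ASCII letters; keep everything else."""
    kept = []
    for term in terms:
        term = term.strip()
        if not term:
            continue
        # A lone "a" or "x" is too broad to search for, but a single
        # non-ASCII character (for example a CJK word) may be meaningful.
        if len(term) == 1 and term.isascii() and term.isalpha():
            continue
        kept.append(term)
    return kept


cleaned = clean_search_terms(["a", "test", "猫", "é", " ", "x", "7"])
print(cleaned)
```

Digits and non-ASCII single characters survive in this sketch; whether the patch treats them the same way is not stated in the ticket.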
comment:27
azaozz — 7 months ago
#21688 was marked as a duplicate.
comment:28
barry — 7 months ago
This has been running on WordPress.com for a while now with no noticeable performance impact (either positive or negative).
comment:29
scribu — 7 months ago
- Keywords needs-refresh added
Could we agree to stop introducing new calls to apply_filters_ref_array()? It's not needed in PHP 5.
Also, if we have a 'posts_search_orderby_on' filter for changing the search orderby fields, it would make sense to have an equivalent filter for changing the fields the actual search is performed on, i.e. #21803
comment:30
scribu — 7 months ago
- Keywords needs-unit-tests added
comment:31
follow-up:
↓ 32
nacin — 7 months ago
Per IRC discussion:
- apply_filters() instead of apply_filters_ref_array().
- Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().
There remain three distinct concerns:
- Plugin compatibility: Does this have the potential to break plugins?
- Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?
- Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.
Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...
comment:32
in reply to:
↑ 31
azaozz — 7 months ago
Replying to nacin:
- Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().
It used to be #21688; however, sanity checks and removal of one-letter terms and stopwords are needed for implementing sorting by relevance.
If you mean separating the "stopwords" functionality in another function, it used to be that way in a previous patch. There might be a possibility to use stopwords somewhere else, so not merging them in _check_search_terms() makes sense.
_search_terms_tidy() was designed to be a callback for array_filter() and has limitations.
There remain three distinct concerns:
- Plugin compatibility: Does this have the potential to break plugins?
Not plugins that implement fulltext index on the posts table. Will look for other plugins that (perhaps) implement something similar.
- Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?
The results from WP.com show no change to the load of MySQL whether it's on the same server or on a dedicated DB server with SSDs, etc. I also ran quite a few tests on my test server and didn't see any MySQL performance problems.
In most cases the ORDER BY runs several more LIKE comparisons on the selected rows. While at first glance this seems slow, in reality it's very fast. Further, the sorting uses only the whole search string when the search is too specific (contains many search terms), and it has some sensible "sanity limits".
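A sketch of how such a limit might look: when the search has more terms than some cap, only the full-string comparisons are generated. The `relevance_case` helper and the `max_terms` default of 6 are assumptions for illustration; the ticket does not state the actual threshold.

```python
def relevance_case(terms, max_terms=6):
    """Return (sql, params); long searches are ranked by full phrase only."""
    phrase = "%" + " ".join(terms) + "%"
    whens = ["WHEN post_title LIKE ? THEN 1"]
    params = [phrase]

    # Per-word comparisons are only added for multi-word searches that
    # stay under the (hypothetical) cap.
    if 1 < len(terms) <= max_terms:
        words = ["%" + term + "%" for term in terms]
        all_words = " AND ".join(["post_title LIKE ?"] * len(words))
        any_word = " OR ".join(["post_title LIKE ?"] * len(words))
        whens.append(f"WHEN {all_words} THEN 2")
        whens.append(f"WHEN {any_word} THEN 3")
        params += words + words

    whens.append("WHEN post_content LIKE ? THEN 4")
    params.append(phrase)
    return "CASE " + " ".join(whens) + " ELSE 5 END", params


short_sql, short_params = relevance_case(["test", "post"])
long_sql, long_params = relevance_case(
    "one two three four five six seven eight".split()
)
print(short_sql)
print(long_sql)
```

The number of LIKE comparisons in the ORDER BY therefore stays bounded no matter how many words the visitor types.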
- Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.
Did quite a bit of research while working on this. The sorting was modelled to mimic how the search engines work. This improvement concerns mostly the front-end searches when a visitor to the site uses our search form. The results we return should be similar to the results Google, Bing, etc. return for the site.
It puts heavy emphasis on term matches in the title with full search string matches receiving the highest priority.
In the particular case for P2s, the auto-generated title is the same as the first few words of the content and may not represent the post very well. For that case matches in the title are disabled but full search string matches in the content are still being used to improve the sorting.
Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...
That's a pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)
The proposed patch makes it many times better both for the site visitors and for the admin. I know the SQL may look scary at first but it's just simple MySQL functionality. It's no more complicated than a join or a subquery.
comment:33
scribu — 7 months ago
That's a pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)
That can go both ways: if it's been broken for so long, then it's clear that users manage somehow, so waiting a few more months isn't the end of the world.
comment:34
Daedalon — 7 months ago
http://wordpress.org/extend/plugins/relevanssi/ is how some users have managed. While it would be great to have this improvement in core by 3.5, I have to agree with Scribu that those who have needed this the most are likely already using a plugin solution.
comment:35
azaozz — 4 months ago
- Milestone changed from Future Release to 3.6
This has been running on WordPress.com for a long while now. Hopefully the "scary SQL" doesn't look so scary any more :)
comment:36
emzo — 3 months ago
- Cc wordpress@… added
comment:37
kovshenin — 3 months ago
- Cc kovshenin added
- Keywords needs-refresh needs-unit-tests removed
Refreshed @azaozz's -4.patch in 7394.diff. Applies cleanly against trunk, added some spacing here and there, drops usage of _ref_array.
Also added some basic unit tests in 7394.tests.diff (run with --group 7394).
comment:38
tollmanz — 5 weeks ago
- Cc tollmanz@… added
comment:39
DrewAPicture — 3 weeks ago
- Keywords needs-testing added
7394.2.diff refreshes the patch.
comment:40
mordauk — 12 days ago
- Cc pippin@… added
comment:41
alex-ye — 10 days ago
- Cc nashwan.doaqan@… added
comment:42
sunnyratilal — 10 days ago
- Cc sunnyratilal5@… added
comment:43
follow-up:
↓ 44
ryan — 8 days ago
- Milestone changed from 3.6 to Future Release
comment:44
in reply to:
↑ 43
;
follow-up:
↓ 45
toscho — 8 days ago
Replying to ryan:
Milestone changed from 3.6 to Future Release
Why? Seems to be good enough for now.
comment:45
in reply to:
↑ 44
DrewAPicture — 8 days ago
- Keywords 3.7-early added

That'd mean ordering by "relevance" instead of ordering by date...