#7394 closed task (blessed) (fixed)
Search: order results by relevance
Reported by: | markjaquith | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.7 | Priority: | normal |
Severity: | normal | Version: | 2.6 |
Component: | General | Keywords: | has-patch commit |
Focuses: | Cc: |
Description
I have 35 pages in my WordPress install. My "About" page is on the second page of results when I search for "about"
We should put hits on the title first in the results list.
I'm open to suggestions for possible technical implementations.
Attachments (14)
Change History (93)
#6
@
15 years ago
- Milestone Future Release deleted
- Resolution set to duplicate
- Status changed from new to closed
see #9785
#7
@
14 years ago
- Cc scribu added
- Keywords changed from search, posts, page, title, relevance needs-patch to search posts, page, title, relevance needs-patch
#8
@
14 years ago
- Keywords changed from search posts, page, title, relevance needs-patch to search posts page, title, relevance needs-patch
- Milestone set to Future Release
- Resolution duplicate deleted
- Status changed from closed to reopened
- Summary changed from When searching posts/pages in wp-admin, give more emphasis to matches on title to Search: give more emphasis to matches on title
I think this is the low-hanging fruit of the search tickets:
- easy to implement: just play with the ORDER BY clause
- major positive effect on search experience
Therefore, re-opening as a separate task. Patch in the works.
#9
@
14 years ago
- Keywords changed from search posts page, title, relevance needs-patch to search posts page title, relevance needs-patch
- Owner changed from anonymous to scribu
- Status changed from reopened to accepted
- Summary changed from Search: give more emphasis to matches on title to Search: order results by relevance
#10
@
14 years ago
- Component changed from Administration to General
- Keywords changed from search posts page title, relevance needs-patch to search posts page title relevance needs-patch
#12
@
14 years ago
- Keywords has-patch added; needs-patch removed
basic.7394.diff simply puts posts that have a matching title above those that have only matching content.
#13
@
14 years ago
It occurs to me that this should have a query var for easy disabling.
For example, it might not be well suited for an 'event' post type.
#16
@
12 years ago
- Keywords search posts page title relevance removed
- Owner scribu deleted
- Status changed from accepted to assigned
#18
@
12 years ago
#21638 was closed as duplicate.
To continue the discussion from there:
Replying to toscho
Could this be extended to the following order:
- Results with exact matches for the search phrase.
- All words from search phrase in any order.
- Some words from the search phrase.
Yes, that would make the search even better when it's multi-word:
( SELECT * FROM wp_posts WHERE post_title LIKE '%one two three%' AND ... ) UNION ( SELECT * FROM wp_posts WHERE post_content LIKE '%one two three%' AND ... ) UNION ( SELECT * FROM wp_posts WHERE post_title LIKE '%one%' AND post_title LIKE '%two%' AND ... ) UNION ( SELECT * FROM wp_posts WHERE post_content LIKE '%one%' AND post_content LIKE '%two%' AND ... ) UNION ( SELECT * FROM wp_posts WHERE (post_title LIKE '%one%' OR post_title LIKE '%two%') AND ... ) LIMIT 0, 20
Not sure we should do the OR
case for post_content. That usually returns tons of results (we don't do it at the moment too).
On my test install with about 2000 posts this query performs relatively well, 0.1 - 0.2 sec. with limit 50 and very common search terms.
#19
@
12 years ago
Thinking more about this: using UNION has some drawbacks like not being able to use ORDER BY in the individual SELECTs unless there is LIMIT. Also 5 SELECTs are getting slow when adding all ANDs and ORs from the standard query.
Modified @scribu's patch to include the same sort conditions and use CASE in the ORDER BY. That is quite faster. When searching for "test post" on the Posts page the produced query is:
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID FROM wp_posts WHERE 1=1 AND (((wp_posts.post_title LIKE '%test%') OR (wp_posts.post_content LIKE '%test%')) AND ((wp_posts.post_title LIKE '%post%') OR (wp_posts.post_content LIKE '%post%'))) AND wp_posts.post_type = 'post' AND ( wp_posts.post_status = 'publish' OR wp_posts.post_status = 'future' OR wp_posts.post_status = 'draft' OR wp_posts.post_status = 'pending' OR wp_posts.post_status = 'private' ) ORDER BY (CASE WHEN wp_posts.post_title LIKE '%test post%' THEN 1 WHEN wp_posts.post_content LIKE '%test post%' THEN 2 WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 3 WHEN wp_posts.post_content LIKE '%test%' AND wp_posts.post_content LIKE '%post%' THEN 4 WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 5 ELSE 6 END), wp_posts.post_date DESC LIMIT 0, 20 0.0024361610412598
This seems to work well and is quite fast on my test install. Would be good to test on a site with 300 - 400k rows in wp_posts.
#20
follow-up:
↓ 21
@
12 years ago
Well, it appears that using REGEXP is significantly slower than a brute-force WHERE and ORDER BY clause, though the SQL is arguably more elegant (but who cares, I guess).
However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries. I would argue that a post title called "Best Post Evah" matches the search term "Post" better than "Ten Composting Tricks". See below:
SELECT SQL_CALC_FOUND_ROWS wp_posts.ID, wp_posts.post_title FROM wp_posts WHERE 1=1 AND (wp_posts.post_title REGEXP 'one|two|three' OR wp_posts.post_content REGEXP 'one|two|three') AND wp_posts.post_type IN ('post', 'page', 'attachment') AND (wp_posts.post_status = 'publish' OR wp_posts.post_author = 1 AND wp_posts.post_status = 'private') ORDER BY wp_posts.post_title NOT REGEXP '[[:<:]]one two three[[:>:]]', wp_posts.post_content NOT REGEXP '[[:<:]]one two three[[:>:]]', wp_posts.post_title NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]', wp_posts.post_content NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]', wp_posts.post_title NOT REGEXP 'one|two|three', wp_posts.post_date DESC LIMIT 0, 10
On 1000 posts, this search clocked in at >100ms where @azaozz' brute-force search was closer to 16ms. Significant delta there.
Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.
Of course, this then excludes pluralizations and so forth, so "Post" would no longer have a high relevance rating with "Top Ten Posts of 2012" because of the "s".
#21
in reply to:
↑ 20
@
12 years ago
Replying to tomauger:
Well, it appears that using REGEXP is significantly slower...
However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries...
Yes, same as in my tests. Using REGEXP makes the search more precise but is quite slower. On the other hand it almost doesn't affect the highest relevance, full string match in titles. Also when using REGEXP the whole set goes through all sorting rules. When using CASE with multiple WHEN ... THEN it acts like a if()... elseif() block in PHP.
Adding sorting to the search query will slow it down in any case. Thinking best would be to try to get the "most bang for the buck", i.e. something like:
- full string in title,
- all words in title,
- any word in title,
- full string in content,
- everything else.
Assuming that most searches are for posts by title, and that 'all words' and 'any word' matches in title would also match in the content.
Of course that makes the sorting less precise but keeps it very fast and is a huge improvement over the current search.
Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.
Yes, thinking that too. Seems best to make all matches in titles better than a match in content.
#22
@
12 years ago
The ORDER BY part of the query with 7394-3.patch is:
... ORDER BY (CASE WHEN wp_posts.post_title LIKE '%test post%' THEN 1 WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2 WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3 WHEN wp_posts.post_content LIKE '%test post%' THEN 4 ELSE 5 END), wp_posts.post_date DESC LIMIT 0, 20 time as reported by SAVEQUERIES: 0.0022249221801758
And for a single term search:
... ORDER BY wp_posts.post_title LIKE '%test%' DESC, wp_posts.post_date DESC LIMIT 0, 20 0.0019049644470215
Replacing the first match with regexp doesn't affect the speed much, however it wouldn't match plural form of the terms, etc. as @tomauger mentioned above. So "test posts" would not have the highest priority when searching for "test post":
... ORDER BY (CASE WHEN wp_posts.post_title REGEXP '[[:<:]]test post[[:>:]]' THEN 1 WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2 WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3 WHEN wp_posts.post_content LIKE '%test post%' THEN 4 ELSE 5 END), wp_posts.post_date DESC LIMIT 0, 20 0.0022361278533936
In that terms the best speed/precision balance seems to be when using the "brute force" LIKE matching in ORDER BY.
#26
@
12 years ago
7394-4.patch combines 7394-3.patch with 21688/21688-5.patch as they are dependent on each other. Closing #21688 as duplicate/extension of this ticket too.
The combined patch has some enhancements:
- Sorting by relevance is only used when there's no explicit ORDER BY set for the query.
- Filter to specify whether the search ORDER BY should look into post_title and/or post_content (can be used to disable sorting by relevance).
- Looks at Unicode character properties
\p{L}
to filter out single letter terms but not high UTF-8 chars.
#28
@
12 years ago
This has been running on WordPress.com for a while now with no noticeable performance impact (either positive or negative).
#29
@
12 years ago
- Keywords needs-refresh added
Could we agree to stop introducing new calls to apply_filters_ref_array()? It's not needed in PHP 5.
Also, if we have a 'posts_search_orderby_on' filter for changing the search orderby fields, it would make sense to have an equivalent filter for changing the fields the actual search is performed on, i.e. #21803
#31
follow-up:
↓ 32
@
12 years ago
Per IRC discussion:
- apply_filters() instead of apply_filters_ref_array().
- Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().
There remain three distinct concerns:
- Plugin compatibility: Does this have the potential to break plugins?
- Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?
- Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.
Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...
#32
in reply to:
↑ 31
@
12 years ago
Replying to nacin:
- Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().
It used to be #21688 however sanity checks and removal of one letter terms and stopwords is needed for implementing sorting by relevance.
If you mean separating the "stopwords" functionality in another function, it used to be that way in a previous patch. There might be a possibility to use stopwords somewhere else, so not merging them in _check_search_terms()
makes sense.
_search_terms_tidy()
was designed to be a callback for array_filter()
and has limitations.
There remain three distinct concerns:
- Plugin compatibility: Does this have the potential to break plugins?
Not plugins that implement fulltext index on the posts table. Will look for other plugins that (perhaps) implement something similar.
- Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?
The results from WP.com show no change to the load of MySQL whether it's on the same server or on a dedicated DB server with SSDs, etc. Also ran quite a bit of tests on my tests server and didn't see any MySQL performance problems.
In most cases the ORDER BY would run several more LIKE on the selected rows. While at first look this seems slow, in reality it's very fast. Further the sorting uses only the whole search string if it's too specific (contains many search terms) and has some sensible "sanity limits".
- Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.
Did quite a bit of research while working on this. The sorting was modelled to mimic how the search engines work. This improvement concerns mostly the front-end searches when a visitor to the site uses our search form. The results we return should be similar to the results Google, Bing, etc. return for the site.
It puts heavy emphasis on term matches in the title with full search string matches receiving the highest priority.
In the particular case for P2s, the auto-generated title is the same as the first few words of the content and may not represent the post very well. For that case matches in the title are disabled but full search string matches in the content are still being used to improve the sorting.
Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...
That's pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)
The proposed patch makes it many times better both for the site visitors and for the admin. I know the SQL may look scary at first but it's just a simple MySQL functionality. It's not more complicated that a join or a subquery.
#33
@
12 years ago
That's pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)
That can go both ways: if it's been broken for so long, then it's clear that users manage somehow, so waiting a few more months isn't the end of the world.
#34
@
12 years ago
http://wordpress.org/extend/plugins/relevanssi/ is how some users have managed. While it would be great to have this improvement in core by 3.5, I have to agree with Scribu that those who have needed this the most are likely already using a plugin solution.
#35
@
12 years ago
- Milestone changed from Future Release to 3.6
This has been running on WordPress.com for a long while now. Hopefully the "scary SQL" doesn't look so scary any more :)
#37
@
12 years ago
- Cc kovshenin added
- Keywords needs-refresh needs-unit-tests removed
Refreshed @azaozz's -4.patch in 7394.diff. Applies cleanly against trunk, added some spacing here and there, drops usage of _ref_array.
Also added some basic unit tests in 7394.tests.diff run with --group 7394
.
#44
in reply to:
↑ 43
;
follow-up:
↓ 45
@
11 years ago
Replying to ryan:
Milestone changed from 3.6 to Future Release
Why? Seems to be good enough for now.
#49
@
11 years ago
What is the status of this? I think it should be committed now or moved to 3.8 so we can commit it when 3.7 is out.
#50
@
11 years ago
May need to review the patch again. Separating the stopwords part of _check_search_terms()
in another function makes sense. Also a better way for plugins to manage and/or change the posts search query would be nice. Perhaps then we can add this functionality with a filter making it easy to create more advanced search plugins.
#53
@
11 years ago
I know the stopwords are filterable, but I think we should probably take an extremely conservative approach here. Words being ignored can give unexpected results. For example a search for "about Jaquith" would actually only search for "Jaquith" which will return all sorts of results but NOT the expected about page.
Having gone through the list a few times, I think it would be fine to just remove about. There might actually be others we could safely add to the list (am, were, do, has, he, she, etc), I'd just prefer to keep it as minimal as possible.
#54
@
11 years ago
This patch has had a lot of good traction, but it's been mainly developer feedback and not as much UX spitballing and conversation.
This ticket could benefit from the "plugin as a feature" approach. Improving search is a huge undertaking, and could serve as a flagship feature in a future release.
#55
follow-up:
↓ 61
@
11 years ago
- Keywords needs-refresh added
This is purely architecture. It's not a candidate for a feature plugin.
This patch needs a refresh. On two fronts:
- The way it integrates into WP_Query is quite messy. We should try to keep this constrained to its own method (or multiple methods).
- We should consider dropping stopword support. Yes, it can cause the query to grow more complicated, but it's also in no way truly localizable. This is way too much of a burden on translators, and even in English, it's fraught with issues. It would be *far* easier to implement such a major change if we avoided stopwords entirely for now.
Also, this could strongly benefit from some basic unit tests.
#56
@
11 years ago
Lines 3686-3732 in wp_old_slug_redirect()
seem to be dropped in as if from mars.
Is that code supposed to live somewhere else? And if not, how is it that $q
, $search_orderby_title, $search_orderby, et al are accessible in a standalone function?
#57
@
11 years ago
Hmm, seems like it accidentally got moved when I refreshed about 5 months ago. Working on fixing that, then I guess we'll look at splitting to separate methods re: comment:55
#59
in reply to:
↑ 58
@
11 years ago
Replying to wonderboymusic:
I have a patch coming in soon, bruh
FYI: I refreshed off of your 7394.3.diff to 7394.4.diff 10 minutes ago, but seems you won't need it ;)
#60
@
11 years ago
RFC @nacin / @DrewAPicture - patch is refreshed with Unit Tests, probably needs filter doc things - take a look at 7394.5.diff
#61
in reply to:
↑ 55
@
11 years ago
Replying to nacin:
This patch needs a refresh. On two fronts:
- The way it integrates into WP_Query is quite messy. We should try to keep this constrained to its own method (or multiple methods).
Was even thinking it can be handled only with filters, even the part that is currently inline in query.php. Also, the current code reuses the foreach ( $q['search_terms'] as $term )
loop in parse_search()
which is not necessary, we can loop through the terms again when building the ORDER BY part. That will keep the two parts independent.
- We should consider dropping stopword support. Yes, it can cause the query to grow more complicated, but it's also in no way truly localizable. This is way too much of a burden on translators, and even in English, it's fraught with issues. It would be *far* easier to implement such a major change if we avoided stopwords entirely for now.
Stopwords are an integral part of all searching. Using them not only produces better results, it also improves performance, quite a bit in some cases. There's no point in using search terms that will match all table rows, and using them several times makes it slower.
For example, currently searching for "about the about" will select all table rows several times and make the results irrelevant. With stopwords it will only search for the whole string making it (much) faster and relevant.
It's a good point that adding stopwords may not be that easy for translators. However doing a simple web search for "stopwords" brings lots of examples for all languages, so thinking they will be able to do it. Of course we can add better explanation, both in a comment in that function and on http://make.wordpress.org/polyglots/.
Also can disable (return empty string) the stopwords if they haven't been translated, although if the default English stopwords are used in another language, it won't affect searching much.
#63
@
11 years ago
- Type changed from task (blessed) to enhancement
Thinking more about translating stopwords. Good approach would be to not use them when not translated. Then explain that they require some research and testing, and can be left untranslated, so we don't pressure the translators.
#64
@
11 years ago
- Type changed from enhancement to task (blessed)
Ugh, sorry for the revert, was accidental :)
#65
@
11 years ago
I agree stopwords are important to searching. But this ticket is about ordering by relevance, which in itself is a good win. The support for stopwords is a bit more controversial and untested.
#66
@
11 years ago
The English stopwords have been tested for a year on .com.
The problem is stopwords are irrelevant. We should at least remove them when ordering by relevance. Also, this will make the ordering slower as when the search string contains a stopword, we will be processing all posts instead of a subset when ordering. Will need more testing.
#67
@
11 years ago
Discussed during the 25 September 2013 dev chat.
https://irclogs.wordpress.org/chanlog.php?channel=wordpress-dev&day=2013-09-25&sort=asc#m696001
#70
@
11 years ago
Looking at .6.diff and .7.diff: array_map( 'trim', ... )
may not be safe for UTF-8 chars. By default trim()
removes more than [\r\n\t ]+
which can break certain UTF-8 character sequences similar to preg_replace( '/\s+/', ... )
. Seems safer to use a loop and control what exactly is trimmed.
Also there's a typo in the stopwords string, "the" is listed twice (my fault).
#71
@
11 years ago
- Keywords commit added; needs-testing 3.7-early needs-refresh removed
Okay, I've updated this with 7394.9.diff.
When writing documentation for new and existing filters, I struggled to explain what the posts_search_orderby_on did. It allowed you to filter array( 'post_title', 'post_content' )
. But, there were a few issues:
- You couldn't add a new field to search, like post_excerpt or post_name.
- You couldn't rearrange them by putting post_content first.
- You could only remove one, or the other, or both.
It seems like if we're going to do relevance searches, we should use both post_title and post_content, as we do now. If we want to add better support for changing the searched fields, we can do that, but I don't think it should be tied to this, and adding a filter that is less than useful is not ideal.
The primary purpose of the filter was to avoid the relevance-based ordering all together by returning an empty array. But we already have the posts_search_orderby filter which can be used for that.
I've chosen to not make our relevance-based ordering directly filterable. posts_search_orderby is instead a slightly more generic filter that always runs when a search is in play — even when orderby is set to something else and our relevance-based ORDER BY is ignored. We now have four new protected methods:
- parse_search(), which is where the 'posts_search' filter ended up. Modeled after parse_tax_query(), etc.
- parse_search_terms(), no filter, used inside parse_search()
- get_search_stopwords(), with a generic filter on stopwords, used inside parse_search_terms()
- parse_search_order(), greatly simplified and without a filter inside of it (see note above)
I've also switched the order of esc_sql() and like_escape(). I'm pretty sure (well, very sure) calling like_escape() before esc_sql() just causes the escape characters to themselves be escaped, thus causing the wildcards to work again. Obviously, not ideal, and core has this problem all over — I figured we can just fix it here for now.
Feedback welcome. Planning to commit this today for Beta 1 tonight.
#72
@
11 years ago
- Owner set to nacin
- Resolution set to fixed
- Status changed from assigned to closed
In 25632:
#73
@
11 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
Small typo: The version parameter in the deprecated _search_terms_tidy
function should be 3.7 and not 3.5.
#74
@
11 years ago
Sorry for the late feedback. Yes, the posts_search_orderby_on
filter could only limit fields we look at when sorting by relevance. As we only look at post_title and post_content, that's not very useful. It was added to cover some border cases. For example if a site doesn't have post titles, or uses something like the P2 theme where titles are auto-generated and not particularly relevant.
#76
follow-up:
↓ 77
@
11 years ago
@nacin, looks like this will work also for the search result rss feed which IMO it is not a desirable outcome as in a feed you expect things to be sorted by date, the latest being first.
#77
in reply to:
↑ 76
@
11 years ago
Replying to mark-k:
@nacin, looks like this will work also for the search result rss feed which IMO it is not a desirable outcome as in a feed you expect things to be sorted by date, the latest being first.
Solid point. There are extremely limited circumstances where you don't want a chronological feed.
That'd mean ordering by "relevance" instead of ordering by date...