Make WordPress Core

Opened 16 years ago

Closed 11 years ago

Last modified 11 years ago

#7394 closed task (blessed) (fixed)

Search: order results by relevance

Reported by: markjaquith's profile markjaquith Owned by: nacin's profile nacin
Milestone: 3.7 Priority: normal
Severity: normal Version: 2.6
Component: General Keywords: has-patch commit
Focuses: Cc:

Description

I have 35 pages in my WordPress install. My "About" page is on the second page of results when I search for "about"

We should put hits on the title first in the results list.

I'm open to suggestions for possible technical implementations.

Attachments (14)

basic.7394.diff (1.4 KB) - added by scribu 14 years ago.
7394-2.patch (2.1 KB) - added by azaozz 12 years ago.
7394-3.patch (2.3 KB) - added by azaozz 12 years ago.
7394-4.patch (7.3 KB) - added by azaozz 12 years ago.
7394.tests.diff (2.5 KB) - added by kovshenin 12 years ago.
7394.diff (7.1 KB) - added by kovshenin 12 years ago.
7394.2.diff (7.2 KB) - added by DrewAPicture 12 years ago.
Refresh at r24174
7394.3.diff (7.3 KB) - added by wonderboymusic 11 years ago.
7394.4.diff (7.3 KB) - added by DrewAPicture 11 years ago.
Baseline refresh at r25605
7394.5.diff (12.2 KB) - added by wonderboymusic 11 years ago.
7394.6.diff (9.4 KB) - added by nacin 11 years ago.
7394.7.diff (12.4 KB) - added by wonderboymusic 11 years ago.
7394.8.diff (12.7 KB) - added by azaozz 11 years ago.
7394.9.diff (12.3 KB) - added by nacin 11 years ago.

Download all attachments as: .zip

Change History (93)

#1 @joostdevalk
16 years ago

That'd mean ordering by "relevance" instead of ordering by date...

#3 @ryan
16 years ago

  • Milestone changed from 2.7 to 2.8

#4 @FFEMTcJ
16 years ago

  • Keywords needs-post added
  • Milestone changed from 2.8 to Future Release

#5 @FFEMTcJ
16 years ago

  • Keywords needs-patch added; needs-post removed

#6 @Denis-de-Bernardy
16 years ago

  • Milestone Future Release deleted
  • Resolution set to duplicate
  • Status changed from new to closed

see #9785

#7 @scribu
14 years ago

  • Cc scribu added
  • Keywords changed from search, posts, page, title, relevance needs-patch to search posts, page, title, relevance needs-patch

#8 @scribu
14 years ago

  • Keywords changed from search posts, page, title, relevance needs-patch to search posts page, title, relevance needs-patch
  • Milestone set to Future Release
  • Resolution duplicate deleted
  • Status changed from closed to reopened
  • Summary changed from When searching posts/pages in wp-admin, give more emphasis to matches on title to Search: give more emphasis to matches on title

I think this is the low-hanging fruit of the search tickets:

  • easy to implement: just play with the ORDER BY clause
  • major positive effect on search experience

Therefore, re-opening as a separate task. Patch in the works.

#9 @scribu
14 years ago

  • Keywords changed from search posts page, title, relevance needs-patch to search posts page title, relevance needs-patch
  • Owner changed from anonymous to scribu
  • Status changed from reopened to accepted
  • Summary changed from Search: give more emphasis to matches on title to Search: order results by relevance

#10 @scribu
14 years ago

  • Component changed from Administration to General
  • Keywords changed from search posts page title, relevance needs-patch to search posts page title relevance needs-patch

#11 @scribu
14 years ago

Related: #16844

@scribu
14 years ago

#12 @scribu
14 years ago

  • Keywords has-patch added; needs-patch removed

basic.7394.diff simply puts posts that have a matching title above those that have only matching content.

#13 @scribu
14 years ago

It occurs to me that this should have a query var for easy disabling.

For example, it might not be well suited for an 'event' post type.

#14 @scribu
14 years ago

Related: #17139

#15 @scribu
14 years ago

Related: #17152

#16 @scribu
12 years ago

  • Keywords search posts page title relevance removed
  • Owner scribu deleted
  • Status changed from accepted to assigned

#17 @ericlewis
12 years ago

  • Cc eric.andrew.lewis@… added

#18 @azaozz
12 years ago

#21638 was closed as duplicate.

To continue the discussion from there:
Replying to toscho

Could this be extended to the following order:

  1. Results with exact matches for the search phrase.
  2. All words from search phrase in any order.
  3. Some words from the search phrase.

Yes, that would make the search even better when it's multi-word:

( SELECT * FROM wp_posts WHERE post_title LIKE '%one two three%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_content LIKE '%one two three%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_title LIKE '%one%' AND post_title LIKE '%two%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE post_content LIKE '%one%' AND post_content LIKE '%two%' AND ... )
UNION
( SELECT * FROM wp_posts WHERE (post_title LIKE '%one%' OR post_title LIKE '%two%') AND ... )
LIMIT 0, 20

Not sure we should do the OR case for post_content. That usually returns tons of results (we don't do it at the moment too).

On my test install with about 2000 posts this query performs relatively well, 0.1 - 0.2 sec. with limit 50 and very common search terms.

#19 @azaozz
12 years ago

Thinking more about this: using UNION has some drawbacks like not being able to use ORDER BY in the individual SELECTs unless there is LIMIT. Also 5 SELECTs are getting slow when adding all ANDs and ORs from the standard query.

Modified @scribu's patch to include the same sort conditions and use CASE in the ORDER BY. That is quite faster. When searching for "test post" on the Posts page the produced query is:

SELECT SQL_CALC_FOUND_ROWS  wp_posts.ID FROM wp_posts  WHERE 1=1
AND (((wp_posts.post_title LIKE '%test%')
  OR (wp_posts.post_content LIKE '%test%'))
AND ((wp_posts.post_title LIKE '%post%')
  OR (wp_posts.post_content LIKE '%post%')))
AND wp_posts.post_type = 'post'
AND (
  wp_posts.post_status = 'publish'
  OR wp_posts.post_status = 'future'
  OR wp_posts.post_status = 'draft'
  OR wp_posts.post_status = 'pending'
  OR wp_posts.post_status = 'private'
)
ORDER BY
(CASE
  WHEN wp_posts.post_title LIKE '%test post%' THEN 1
  WHEN wp_posts.post_content LIKE '%test post%' THEN 2
  WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 3
  WHEN wp_posts.post_content LIKE '%test%' AND wp_posts.post_content LIKE '%post%' THEN 4
  WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 5
  ELSE 6
END),
wp_posts.post_date DESC
LIMIT 0, 20

0.0024361610412598

This seems to work well and is quite fast on my test install. Would be good to test on a site with 300 - 400k rows in wp_posts.

@azaozz
12 years ago

#20 follow-up: @tomauger
12 years ago

Well, it appears that using REGEXP is significantly slower than a brute-force WHERE and ORDER BY clause, though the SQL is arguably more elegant (but who cares, I guess).

However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries. I would argue that a post title called "Best Post Evah" matches the search term "Post" better than "Ten Composting Tricks". See below:

SELECT SQL_CALC_FOUND_ROWS wp_posts.ID, wp_posts.post_title
FROM wp_posts 
WHERE 1=1 
	AND (wp_posts.post_title REGEXP 'one|two|three' OR wp_posts.post_content REGEXP 'one|two|three')
	AND wp_posts.post_type IN ('post', 'page', 'attachment') 
	AND (wp_posts.post_status = 'publish' OR wp_posts.post_author = 1 AND wp_posts.post_status = 'private') 
ORDER BY 
	wp_posts.post_title NOT REGEXP '[[:<:]]one two three[[:>:]]',
	wp_posts.post_content NOT REGEXP '[[:<:]]one two three[[:>:]]',
	wp_posts.post_title NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]',
	wp_posts.post_content NOT REGEXP '[[:<:]]one two[[:>:]]|[[:<:]]two three[[:>:]]',
	wp_posts.post_title NOT REGEXP 'one|two|three',
	wp_posts.post_date DESC 
LIMIT 0, 10

On 1000 posts, this search clocked in at >100ms where @azaozz' brute-force search was closer to 16ms. Significant delta there.

Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.

Of course, this then excludes pluralizations and so forth, so "Post" would no longer have a high relevance rating with "Top Ten Posts of 2012" because of the "s".

Last edited 12 years ago by tomauger (previous) (diff)

@azaozz
12 years ago

#21 in reply to: ↑ 20 @azaozz
12 years ago

Replying to tomauger:

Well, it appears that using REGEXP is significantly slower...
However, one takeaway from the SQL below is that we might want to be a bit more careful around word boundaries...

Yes, same as in my tests. Using REGEXP makes the search more precise but is quite slower. On the other hand it almost doesn't affect the highest relevance, full string match in titles. Also when using REGEXP the whole set goes through all sorting rules. When using CASE with multiple WHEN ... THEN it acts like a if()... elseif() block in PHP.

Adding sorting to the search query will slow it down in any case. Thinking best would be to try to get the "most bang for the buck", i.e. something like:

  • full string in title,
  • all words in title,
  • any word in title,
  • full string in content,
  • everything else.

Assuming that most searches are for posts by title, and that 'all words' and 'any word' matches in title would also match in the content.

Of course that makes the sorting less precise but keeps it very fast and is a huge improvement over the current search.

Note that I'm unsure as to the weighting of the same search sequence within post_content as post_title. We may decide that two search terms with proper word boundaries in the title is still better than the full match in the content.

Yes, thinking that too. Seems best to make all matches in titles better than a match in content.

#22 @azaozz
12 years ago

The ORDER BY part of the query with 7394-3.patch is:

...
ORDER BY 
(CASE 
  WHEN wp_posts.post_title LIKE '%test post%' THEN 1
  WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2
  WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3
  WHEN wp_posts.post_content LIKE '%test post%' THEN 4
  ELSE 5
END),
wp_posts.post_date DESC
LIMIT 0, 20

time as reported by SAVEQUERIES: 0.0022249221801758

And for a single term search:

...
ORDER BY
wp_posts.post_title LIKE '%test%' DESC,
wp_posts.post_date DESC
LIMIT 0, 20

0.0019049644470215

Replacing the first match with regexp doesn't affect the speed much, however it wouldn't match plural form of the terms, etc. as @tomauger mentioned above. So "test posts" would not have the highest priority when searching for "test post":

...
ORDER BY
(CASE
  WHEN wp_posts.post_title REGEXP '[[:<:]]test post[[:>:]]' THEN 1
  WHEN wp_posts.post_title LIKE '%test%' AND wp_posts.post_title LIKE '%post%' THEN 2
  WHEN wp_posts.post_title LIKE '%test%' OR wp_posts.post_title LIKE '%post%' THEN 3
  WHEN wp_posts.post_content LIKE '%test post%' THEN 4
  ELSE 5
END),
wp_posts.post_date DESC
LIMIT 0, 20

0.0022361278533936

In that terms the best speed/precision balance seems to be when using the "brute force" LIKE matching in ORDER BY.

#23 @azaozz
12 years ago

Related #21688.

#24 @toscho
12 years ago

  • Cc info@… added

#25 @gibrown
12 years ago

  • Cc gibrown added

@azaozz
12 years ago

#26 @azaozz
12 years ago

7394-4.patch combines 7394-3.patch with 21688/21688-5.patch as they are dependent on each other. Closing #21688 as duplicate/extension of this ticket too.

The combined patch has some enhancements:

  • Sorting by relevance is only used when there's no explicit ORDER BY set for the query.
  • Filter to specify whether the search ORDER BY should look into post_title and/or post_content (can be used to disable sorting by relevance).
  • Looks at Unicode character properties \p{L} to filter out single letter terms but not high UTF-8 chars.

#27 @azaozz
12 years ago

#21688 was marked as a duplicate.

#28 @barry
12 years ago

This has been running on WordPress.com for a while now with no noticeable performance impact (either positive or negative).

#29 @scribu
12 years ago

  • Keywords needs-refresh added

Could we agree to stop introducing new calls to apply_filters_ref_array()? It's not needed in PHP 5.

Also, if we have a 'posts_search_orderby_on' filter for changing the search orderby fields, it would make sense to have an equivalent filter for changing the fields the actual search is performed on, i.e. #21803

#30 @scribu
12 years ago

  • Keywords needs-unit-tests added

#31 follow-up: @nacin
12 years ago

Per IRC discussion:

  • apply_filters() instead of apply_filters_ref_array().
  • Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().

There remain three distinct concerns:

  1. Plugin compatibility: Does this have the potential to break plugins?
  1. Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?
  1. Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.

Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...

#32 in reply to: ↑ 31 @azaozz
12 years ago

Replying to nacin:

  • Split word cleaning (removal of short words, etc) into a separate patch. This should be considered separately. That probably means we can continue to use _search_terms_tidy() instead of _check_search_terms().

It used to be #21688 however sanity checks and removal of one letter terms and stopwords is needed for implementing sorting by relevance.

If you mean separating the "stopwords" functionality in another function, it used to be that way in a previous patch. There might be a possibility to use stopwords somewhere else, so not merging them in _check_search_terms() makes sense.

_search_terms_tidy() was designed to be a callback for array_filter() and has limitations.

There remain three distinct concerns:

  1. Plugin compatibility: Does this have the potential to break plugins?

Not plugins that implement fulltext index on the posts table. Will look for other plugins that (perhaps) implement something similar.

  1. Performance: This worked well on WP.com, but they use SSDs, query caching, and have mostly vanilla use cases (ties back into plugin compatibility). Does this cause problems under strain?

The results from WP.com show no change to the load of MySQL whether it's on the same server or on a dedicated DB server with SSDs, etc. Also ran quite a bit of tests on my tests server and didn't see any MySQL performance problems.

In most cases the ORDER BY would run several more LIKE on the selected rows. While at first look this seems slow, in reality it's very fast. Further the sorting uses only the whole search string if it's too specific (contains many search terms) and has some sensible "sanity limits".

  1. Results: Does this result in bad search results on occasion by promoting the wrong things to the top? One example could include P2 auto titles. Yes, there is a filter, but if there are concerns that were raised by WP.com developers, I'd like to work them out here.

Did quite a bit of research while working on this. The sorting was modelled to mimic how the search engines work. This improvement concerns mostly the front-end searches when a visitor to the site uses our search form. The results we return should be similar to the results Google, Bing, etc. return for the site.

It puts heavy emphasis on term matches in the title with full search string matches receiving the highest priority.

In the particular case for P2s, the auto-generated title is the same as the first few words of the content and may not represent the post very well. For that case matches in the title are disabled but full search string matches in the content are still being used to improve the sorting.

Overall, not looking likely for 3.5. This is something that needs further review and needs to land early. Also, unit tests...

That's pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)

The proposed patch makes it many times better both for the site visitors and for the admin. I know the SQL may look scary at first but it's just a simple MySQL functionality. It's not more complicated that a join or a subquery.

Last edited 12 years ago by azaozz (previous) (diff)

#33 @scribu
12 years ago

That's pity. Our search has been pretty bad for a very long time, look at when this ticket was opened :)

That can go both ways: if it's been broken for so long, then it's clear that users manage somehow, so waiting a few more months isn't the end of the world.

#34 @Daedalon
12 years ago

http://wordpress.org/extend/plugins/relevanssi/ is how some users have managed. While it would be great to have this improvement in core by 3.5, I have to agree with Scribu that those who have needed this the most are likely already using a plugin solution.

#35 @azaozz
12 years ago

  • Milestone changed from Future Release to 3.6

This has been running on WordPress.com for a long while now. Hopefully the "scary SQL" doesn't look so scary any more :)

#36 @emzo
12 years ago

  • Cc wordpress@… added

@kovshenin
12 years ago

#37 @kovshenin
12 years ago

  • Cc kovshenin added
  • Keywords needs-refresh needs-unit-tests removed

Refreshed @azaozz's -4.patch in 7394.diff. Applies cleanly against trunk, added some spacing here and there, drops usage of _ref_array.

Also added some basic unit tests in 7394.tests.diff run with --group 7394.

#38 @tollmanz
12 years ago

  • Cc tollmanz@… added

@DrewAPicture
12 years ago

Refresh at r24174

#39 @DrewAPicture
12 years ago

  • Keywords needs-testing added

7394.2.diff refreshes the patch.

#40 @mordauk
12 years ago

  • Cc pippin@… added

#41 @alex-ye
12 years ago

  • Cc nashwan.doaqan@… added

#42 @sunnyratilal
12 years ago

  • Cc sunnyratilal5@… added

#43 follow-up: @ryan
11 years ago

  • Milestone changed from 3.6 to Future Release

#44 in reply to: ↑ 43 ; follow-up: @toscho
11 years ago

Replying to ryan:

Milestone changed from 3.6 to Future Release

Why? Seems to be good enough for now.

#45 in reply to: ↑ 44 @DrewAPicture
11 years ago

  • Keywords 3.7-early added

Replying to toscho:

Replying to ryan:

Milestone changed from 3.6 to Future Release

Why? Seems to be good enough for now.

Probably because this is an enhancement and we're in beta 3.

#46 @wonderboymusic
11 years ago

  • Milestone changed from Future Release to 3.7

these are all marked 3.7-early

#47 @desrosj
11 years ago

  • Cc desrosj@… added

#48 @hidgw
11 years ago

  • Cc hidgw added

#49 @markoheijnen
11 years ago

What is the status of this? I think it should be committed now or moved to 3.8 so we can commit it when 3.7 is out.

#50 @azaozz
11 years ago

May need to review the patch again. Separating the stopwords part of _check_search_terms() in another function makes sense. Also a better way for plugins to manage and/or change the posts search query would be nice. Perhaps then we can add this functionality with a filter making it easy to create more advanced search plugins.

Last edited 11 years ago by azaozz (previous) (diff)

#51 @bhengh
11 years ago

  • Cc ben@… added

#52 @jchristopher
11 years ago

  • Cc jonathan@… added

#53 @aaroncampbell
11 years ago

I know the stopwords are filterable, but I think we should probably take an extremely conservative approach here. Words being ignored can give unexpected results. For example a search for "about Jaquith" would actually only search for "Jaquith" which will return all sorts of results but NOT the expected about page.

Having gone through the list a few times, I think it would be fine to just remove about. There might actually be others we could safely add to the list (am, were, do, has, he, she, etc), I'd just prefer to keep it as minimal as possible.

#54 @ericlewis
11 years ago

This patch has had a lot of good traction, but it's been mainly developer feedback and not as much UX spitballing and conversation.

This ticket could benefit from the "plugin as a feature" approach. Improving search is a huge undertaking, and could serve as a flagship feature in a future release.

#55 follow-up: @nacin
11 years ago

  • Keywords needs-refresh added

This is purely architecture. It's not a candidate for a feature plugin.

This patch needs a refresh. On two fronts:

  • The way it integrates into WP_Query is quite messy. We should try to keep this constrained to its own method (or multiple methods).
  • We should consider dropping stopword support. Yes, it can cause the query to grow more complicated, but it's also in no way truly localizable. This is way too much of a burden on translators, and even in English, it's fraught with issues. It would be *far* easier to implement such a major change if we avoided stopwords entirely for now.

Also, this could strongly benefit from some basic unit tests.

#56 @DrewAPicture
11 years ago

Lines 3686-3732 in wp_old_slug_redirect() seem to be dropped in as if from mars.

Is that code supposed to live somewhere else? And if not, how is it that $q, $search_orderby_title, $search_orderby, et al are accessible in a standalone function?

#57 @DrewAPicture
11 years ago

Hmm, seems like it accidentally got moved when I refreshed about 5 months ago. Working on fixing that, then I guess we'll look at splitting to separate methods re: comment:55

@DrewAPicture
11 years ago

Baseline refresh at r25605

#58 follow-up: @wonderboymusic
11 years ago

I have a patch coming in soon, bruh

#59 in reply to: ↑ 58 @DrewAPicture
11 years ago

Replying to wonderboymusic:

I have a patch coming in soon, bruh

FYI: I refreshed off of your 7394.3.diff to 7394.4.diff 10 minutes ago, but seems you won't need it ;)

#60 @wonderboymusic
11 years ago

RFC @nacin / @DrewAPicture - patch is refreshed with Unit Tests, probably needs filter doc things - take a look at 7394.5.diff​

#61 in reply to: ↑ 55 @azaozz
11 years ago

Replying to nacin:

This patch needs a refresh. On two fronts:

  • The way it integrates into WP_Query is quite messy. We should try to keep this constrained to its own method (or multiple methods).

Was even thinking it can be handled only with filters, even the part that is currently inline in query.php. Also, the current code reuses the foreach ( $q['search_terms'] as $term ) loop in parse_search() which is not necessary, we can loop through the terms again when building the ORDER BY part. That will keep the two parts independent.

  • We should consider dropping stopword support. Yes, it can cause the query to grow more complicated, but it's also in no way truly localizable. This is way too much of a burden on translators, and even in English, it's fraught with issues. It would be *far* easier to implement such a major change if we avoided stopwords entirely for now.

Stopwords are an integral part of all searching. Using them not only produces better results, it also improves performance, quite a bit in some cases. There's no point in using search terms that will match all table rows, and using them several times makes it slower.

For example, currently searching for "about the about" will select all table rows several times and make the results irrelevant. With stopwords it will only search for the whole string making it (much) faster and relevant.

It's a good point that adding stopwords may not be that easy for translators. However doing a simple web search for "stopwords" brings lots of examples for all languages, so thinking they will be able to do it. Of course we can add better explanation, both in a comment in that function and on http://make.wordpress.org/polyglots/.

Also can disable (return empty string) the stopwords if they haven't been translated, although if the default English stopwords are used in another language, it won't affect searching much.

Last edited 11 years ago by azaozz (previous) (diff)

#62 @nacin
11 years ago

  • Type changed from enhancement to task (blessed)

#63 @azaozz
11 years ago

  • Type changed from task (blessed) to enhancement

Thinking more about translating stopwords. Good approach would be to not use them when not translated. Then explain that they require some research and testing, and can be left untranslated, so we don't pressure the translators.

#64 @azaozz
11 years ago

  • Type changed from enhancement to task (blessed)

Ugh, sorry for the revert, was accidental :)

#65 @nacin
11 years ago

I agree stopwords are important to searching. But this ticket is about ordering by relevance, which in itself is a good win. The support for stopwords is a bit more controversial and untested.

#66 @azaozz
11 years ago

The English stopwords have been tested for a year on .com.

The problem is stopwords are irrelevant. We should at least remove them when ordering by relevance. Also, this will make the ordering slower as in some cases we will be processing all posts instead of a subset when ordering. Will need more tests.

Version 0, edited 11 years ago by azaozz (next)

@nacin
11 years ago

#68 @jeremyfelt
11 years ago

_deprecated_function( __FUNCTION__, '3.5', '' ); should be 3.7 when it goes in.

#69 @wonderboymusic
11 years ago

.7.diff is nacin's latest + my unit tests

#70 @azaozz
11 years ago

Looking at .6.diff and .7.diff: array_map( 'trim', ... ) may not be safe for UTF-8 chars. By default trim() removes more than [\r\n\t ]+ which can break certain UTF-8 character sequences similar to preg_replace( '/\s+/', ... ). Seems safer to use a loop and control what exactly is trimmed.

Also there's a typo in the stopwords string, "the" is listed twice (my fault).

@azaozz
11 years ago

@nacin
11 years ago

#71 @nacin
11 years ago

  • Keywords commit added; needs-testing 3.7-early needs-refresh removed

Okay, I've updated this with 7394.9.diff.

When writing documentation for new and existing filters, I struggled to explain what the posts_search_orderby_on did. It allowed you to filter array( 'post_title', 'post_content' ). But, there were a few issues:

  • You couldn't add a new field to search, like post_excerpt or post_name.
  • You couldn't rearrange them by putting post_content first.
  • You could only remove one, or the other, or both.

It seems like if we're going to do relevance searches, we should use both post_title and post_content, as we do now. If we want to add better support for changing the searched fields, we can do that, but I don't think it should be tied to this, and adding a filter that is less than useful is not ideal.

The primary purpose of the filter was to avoid the relevance-based ordering all together by returning an empty array. But we already have the posts_search_orderby filter which can be used for that.

I've chosen to not make our relevance-based ordering directly filterable. posts_search_orderby is instead a slightly more generic filter that always runs when a search is in play — even when orderby is set to something else and our relevance-based ORDER BY is ignored. We now have four new protected methods:

  • parse_search(), which is where the 'posts_search' filter ended up. Modeled after parse_tax_query(), etc.
  • parse_search_terms(), no filter, used inside parse_search()
  • get_search_stopwords(), with a generic filter on stopwords, used inside parse_search_terms()
  • parse_search_order(), greatly simplified and without a filter inside of it (see note above)

I've also switched the order of esc_sql() and like_escape(). I'm pretty sure (well, very sure) calling like_escape() before esc_sql() just causes the escape characters to themselves be escaped, thus causing the wildcards to work again. Obviously, not ideal, and core has this problem all over — I figured we can just fix it here for now.

Feedback welcome. Planning to commit this today for Beta 1 tonight.

#72 @nacin
11 years ago

  • Owner set to nacin
  • Resolution set to fixed
  • Status changed from assigned to closed

In 25632:

Order search results by relevance, rather than by date.

The ordering logic is as follows:

  • Full sentence matches in post titles.
  • All search terms in post titles.
  • Any search terms in post titles.
  • Full sentence matches in post content.

Each section and any remaining posts are then sorted by date.

Introduces some filters:

  • wp_search_stopwords, to filter stop words ignored in WHERE.
  • posts_search_orderby, to filter the ORDER BY when ordering search results.

props azaozz, wonderboymusic.
fixes #7394.

#73 @TobiasBg
11 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Small typo: The version parameter in the deprecated _search_terms_tidy function should be 3.7 and not 3.5.

#74 @azaozz
11 years ago

Sorry for the late feedback. Yes, the posts_search_orderby_on filter could only limit fields we look at when sorting by relevance. As we only look at post_title and post_content, that's not very useful. It was added to cover some border cases. For example if a site doesn't have post titles, or uses something like the P2 theme where titles are auto-generated and not particularly relevant.

#75 @nacin
11 years ago

In 25640:

Mark _search_terms_tidy() as deprecated in 3.7. see #7394.

#76 follow-up: @mark-k
11 years ago

@nacin, looks like this will work also for the search result rss feed which IMO it is not a desirable outcome as in a feed you expect things to be sorted by date, the latest being first.

#77 in reply to: ↑ 76 @nacin
11 years ago

Replying to mark-k:

@nacin, looks like this will work also for the search result rss feed which IMO it is not a desirable outcome as in a feed you expect things to be sorted by date, the latest being first.

Solid point. There are extremely limited circumstances where you don't want a chronological feed.

#78 @nacin
11 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 25668:

Don't order feeds of search results by relevance.

Allow for orderby=relevance to explicitly request relevance.

fixes #7394.

#79 @nacin
11 years ago

In 27474:

Add unit tests for searching by relevance.

props wonderboymusic.
see #7394, #20044.

Note: See TracTickets for help on using tickets.