Make WordPress Core

Opened 12 years ago

Closed 11 years ago

#21688 closed enhancement (duplicate)

Add sanity checks and improve performance when searching for posts

Reported by: azaozz
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Query
Keywords:
Focuses:
Cc:

Description

The search part of the main query is quite basic. It needs a few sanity checks that will also improve performance in some cases:

  • Search string length. Most browsers will send between 2,000 and 8,000 characters (2KB - 8KB) in a GET request, including the URL, so a search string longer than about 1,500 - 1,600 characters (urlencoded length) doesn't make sense.
  • Judging by web search engine data and behavior, most searches are 4 words or fewer, and searches of more than 7 words are very rare. We should treat searches with, let's say, 10 or more terms as a "sentence", i.e. match only the whole search string instead of splitting it and matching word by word. This would improve both the quality of results and speed (a rough sketch of these checks follows the list).
  • All search engines discard very common or very short words. We can't get that sophisticated, but we can discard terms that are fewer than 3 characters long from the word-by-word part of the search. Again, this would improve both the quality of results and speed.
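A minimal sketch of the first two checks, assuming the search string lives in $q['s'] as elsewhere in WP_Query; the thresholds are the ones suggested above, not final values:

```php
$s = $q['s'];

// Cap absurdly long search strings; urlencoded GET requests rarely exceed 2KB.
if ( strlen( $s ) > 1600 ) {
    $s = substr( $s, 0, 1600 );
}

// Treat long queries as a "sentence" and match the whole string only.
$terms = preg_split( '/\s+/', trim( $s ) );
if ( count( $terms ) >= 10 ) {
    $terms = array( $s );
}
```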

Attachments (5)

21688.patch (1.8 KB) - added by azaozz 12 years ago.
21688-2.patch (1.8 KB) - added by azaozz 12 years ago.
21688-3.patch (3.6 KB) - added by azaozz 12 years ago.
21688-4.patch (3.7 KB) - added by azaozz 12 years ago.
21688-5.patch (4.6 KB) - added by azaozz 12 years ago.


Change History (23)

#1 @azaozz
12 years ago

Related #7394.

#2 follow-up: @scribu
12 years ago

Treating searches of more than 10 words as a sentence probably makes sense.

discard terms that are less than 3 characters long from the word by word part of the search

I don't think that's a good idea. Maybe someone has a specific number in the post title that they need to search for.

#3 follow-up: @azaozz
12 years ago

Yeah, was thinking about that too. Perhaps we can do strlen($term) < 3 && !is_numeric($term). That would leave any numbers as separate search terms.

#4 in reply to: ↑ 3 @azaozz
12 years ago

Replying to azaozz:

Yeah, was thinking about that too. Perhaps we can do strlen($term) < 3 && !is_numeric($term). That would leave any numbers as separate search terms.

Also the search string would still be used as a "sentence" so searching for just "5" would match any post that has 5 in the title or content.
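A minimal sketch of the rule discussed here (illustrative only, not taken from the attached patches):

```php
// Drop terms shorter than 3 characters, but keep bare numbers like "5".
$terms = array( '3', 'blind', 'mice', 'at' );
$terms = array_filter( $terms, function ( $term ) {
    return strlen( $term ) >= 3 || is_numeric( $term );
} );
// Result: array( '3', 'blind', 'mice' ) -- "at" is dropped, "3" survives.
```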

@azaozz
12 years ago

#5 in reply to: ↑ 2 @azaozz
12 years ago

Replying to scribu:

I don't think that's a good idea (discard terms that are less than 3 characters long). Maybe someone has a specific number in the post title that they need to search for.

Thinking more about that: combined with the improvements from #7394, this works very well. Searching for "3 blind mice" strips "3" and searches only for "blind" and "mice". However, in the ORDER BY the highest priority is given to the whole-string match, so a post titled "About the 3 blind mice" would be at the very top.

@azaozz
12 years ago

#6 @toscho
12 years ago

  • Cc info@… added

#7 follow-up: @toscho
12 years ago

A filter for removed search terms would be useful. This could go into the PHP language files, e.g. de_DE.php. Oh, and please `mb_strlen()`, not `strlen()`. :)

#8 in reply to: ↑ 7 @azaozz
12 years ago

Replying to toscho:

A filter for removed search terms would be useful. This could go into the PHP language files, eg. de_DE.php.

Ha, was thinking just that too :)

I really want to remove "the" from the search terms, as matching it doesn't make sense at all. A filter that returns an array of common words for exclusion makes sense. Then it can either be used directly by the localization file(s), or a CSV list of words can be passed through __() so each translation can set them.

Oh, and please mb_strlen(), not strlen(). :)

Using strlen() is a "trick" to avoid excluding high UTF-8 chars. In some languages a word can be a single char, and we don't want to exclude those. May need to look into this more.

@azaozz
12 years ago

#9 follow-up: @azaozz
12 years ago

21688-3.patch introduces wp_search_stopwords() and a wp_search_stopwords filter, and uses them to filter the separate search terms.
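A sketch of how a translation or plugin might use that filter; the callback name and the word list are made up for illustration, and the patch's exact signature may differ:

```php
// Add locale-specific stopwords via the wp_search_stopwords filter.
function myplugin_search_stopwords( $stopwords ) {
    // Translators: comma-separated list of very common words to ignore in searches.
    $extra = explode( ',', __( 'the,and,a,an,of,in', 'myplugin' ) );
    return array_merge( $stopwords, array_map( 'trim', $extra ) );
}
add_filter( 'wp_search_stopwords', 'myplugin_search_stopwords' );
```

This mirrors the CSV-through-__() idea from comment 8, so each localization can supply its own list.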

#10 in reply to: ↑ 9 @toscho
12 years ago

Replying to azaozz:

Looks good. I think this comment could be more clear:

// (some browsers won't send more than 2000 characters incl. the URL in a urlencoded GET request).


$q['s'] is already urldecoded when it is counted here. Maybe the comment should say 2000 bytes?

Even that is sometimes wrong when mbstring.func_overload is > 1 and strlen() acts like mb_strlen() … but that’s an edge case and probably not important.

#11 @azaozz
12 years ago

Yeah, too much explanation. Will take that line out. I wanted to point out that 2000 bytes is the limit when the URL is urlencoded. We are checking the length after it's been urldecoded, so we expect it to be shorter.

@azaozz
12 years ago

#12 @azaozz
12 years ago

In 21688-4.patch, strlen($term) < 3 was changed to use a string index instead. $string{2} returns the third byte of $string regardless of strlen(), mb_strlen(), or mbstring.func_overload. This is combined with a check for an empty term: empty( $term{2} ).

The purpose is to exclude all terms that are one or two characters long. Using them doesn't make sense; for example, LIKE '%ab%' would match nearly all posts in many languages, so the matches would be irrelevant and the search slower.

The above code is the fastest and simplest way to do this. However, it's not very precise. It treats "higher" UTF-8 characters like è, ä, etc. as two letters, and it fails to catch single characters that are encoded as 3 bytes. On the other hand, that is useful for not removing terms like 東京 (Tokyo), which would be removed if we used mb_strlen().

Thinking this is an acceptable compromise: we may let through some shorter UTF-8 terms that aren't essential, but we won't discard any terms that are needed.
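For reference, a compact sketch of the check as described (curly-brace string offsets were valid PHP at the time; they have since been removed in PHP 8):

```php
foreach ( $terms as $i => $term ) {
    // $term{2} is the third *byte*; empty() is true when it doesn't exist,
    // i.e. the term is at most 2 bytes long -- regardless of mb_strlen()
    // or mbstring.func_overload. Caveat: a literal '0' third byte also
    // counts as empty.
    if ( empty( $term{2} ) ) {
        unset( $terms[ $i ] );
    }
}
```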

Last edited 12 years ago by azaozz

#13 follow-up: @johnbillion
12 years ago

The string length has a very loose correlation to the relevance of a word in a search. Not all short words are irrelevant, especially when it comes to initialisms: TV, 3G, MD, and country abbreviations such as US and UK.

The stopword list is the best method and should also include all the one- and two-letter words in English that are to be filtered out. The Relevanssi search plugin, for example, includes quite an exhaustive list of stopwords.

#14 in reply to: ↑ 13 @azaozz
12 years ago

Replying to johnbillion:

The string length has a very loose correlation to the relevance of the word in a search.

Right. However, our search doesn't use word boundaries; it uses LIKE '%term%', not REGEXP '[[:<:]]term[[:>:]]', so it matches inside words too. This has its advantages (quite a bit faster, matches different forms of the same word, etc.) but it also makes matching most very short terms irrelevant, as they will match all or nearly all posts.

With the current patch and the patch from #7394, terms like TV, 3G, MD, UK, US, etc. will not be used when part of a multi-word search, but they will still be used for sorting the results as part of the whole search string. Also, when a search is only for a short word, the search string is used literally.

The stopword list is the best method and should also include all the one- and two-letter words in English that are to be filtered out. The Relevanssi search plugin, for example, includes quite an exhaustive list of stopwords.

Was thinking about that too, but it would make that array very long. Removing one- and two-letter terms also acts as a sanity check. There is no point in running searches like "q w e r t y u i o p" or "qw er ty ui op as df gh jk" as separate terms (they will still run as a "sentence").

It's possible to improve the removal of one and two letter terms by looking for capitalization and numbers and not remove these. It would slow down that bit of code though. Will look into it.
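To illustrate the trade-off (a sketch using $wpdb; core builds the LIKE form, the REGEXP form is shown only for comparison):

```php
// Substring match, as core builds it: fast, and 'house' also matches "houses",
// but a short term like 'us' matches almost everything ("because", "status").
$where = $wpdb->prepare( "post_title LIKE %s", '%' . like_escape( 'us' ) . '%' );

// Word-boundary match: no hits inside other words, but slower, and searching
// for 'house' would no longer match "houses".
$where = $wpdb->prepare( "post_title REGEXP %s", '[[:<:]]us[[:>:]]' );
```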

Last edited 12 years ago by azaozz

@azaozz
12 years ago

#15 @azaozz
12 years ago

21688-5.patch introduces a private function _check_search_term() that replaces _search_terms_tidy() and tidies and checks each term. Terms of 2 bytes or less are removed unless they contain capital letters or numbers.

There are also some other small improvements: remove all \r and \n from the search string before processing, check whether the splitting regexp actually matched before using the matches, and keep spaces when the search term is in double quotes.
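A rough sketch of the described behavior (the function name mirrors the patch, but the body here is illustrative, not the patch itself):

```php
function _check_search_term( $term ) {
    $term = trim( $term, "\"' " );

    // Keep 1-2 byte terms only when they contain a capital letter or a digit,
    // so initialisms like "TV", "3G", "UK" survive while noise like "qw" is dropped.
    if ( ! isset( $term[2] ) && ! preg_match( '/[A-Z0-9]/', $term ) ) {
        return '';
    }
    return $term;
}
```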

#16 @gibrown
12 years ago

Really like these improvements. Tough problem, given MySQL's text search limitations.

A few thoughts that came to mind. These probably don't all work together, just throwing out ideas:

  • Removing all two-letter words seems like it will have a lot of implications for abbreviations and a few other English words ('id', 'ha', 'ma').
  • To reduce the impact of matching sub-words with "%id%" while still matching "house" to both "house" and "houses", we could explicitly match against something like "%house[s $,.\"']" (see the sketch after this list). In most cases (for English) any ending besides a plural 's' will change the meaning of the word ("person" vs "personal"). We could define another filter that provides the endings to match against based on language.
  • Similarly, we could detect whether there is any whitespace in the query and, if there is, add whitespace (plus punctuation) around short terms (< 4 letters?), for example "%[^ ]cat[s $,.\"']%". The reason to condition this on the query having whitespace is to not break languages written without whitespace. Again, these patterns should probably be language-specific and filterable. The reason for not doing this for all words is to improve matching against compound words ("house" can match "treehouse").
  • We could expand the stopwords list if we didn't have to convert it from a translated string. There are some pretty good stopword lists for multiple languages here: http://www.ranks.nl/resources/stopwords.html And the filter mechanism allows people to modify them. Speed might be more important than the flexibility GlotPress provides. I think generally fewer stopwords is better.
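A hypothetical sketch of the ending-pattern idea from the second bullet. LIKE has no character classes, so this uses MySQL REGEXP, and the 'search_term_endings' filter is invented here for illustration:

```php
// Match "house" plus language-specific endings ("houses"), but not "household".
$endings = apply_filters( 'search_term_endings', 's' ); // hypothetical filter
$pattern = 'house(' . $endings . ')?([[:space:][:punct:]]|$)';
$where   = $wpdb->prepare( "post_content REGEXP %s", $pattern );
```
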
Last edited 12 years ago by azaozz

#17 @gibrown
11 years ago

  • Cc gibrown added

#18 @azaozz
11 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Closing as duplicate of #7394 as the proposed changes are co-dependent.
