Make WordPress Core

Changes between Initial Version and Version 1 of Ticket #21688, comment 16


Timestamp: 09/08/2012 12:14:25 AM (7 years ago)
Author: azaozz

  • Ticket #21688, comment 16

    initial v1  
    3 3 A few thoughts that came to mind. These probably don't all work together; I'm just throwing out ideas:
    4 4 - Removing all two-letter words seems like it will have a lot of implications for abbreviations and a few other English words ('id', 'ha', 'ma').
    5   - To reduce the impact of matching sub-words with "%id%" while still matching "house" to both "house" and "houses", we could explicitly match against something like "%house[s $,.\"']". In most cases (for English) any ending besides a plural 's' will change the meaning of the word ("person" vs "personal"). Could define another filter that provides the endings to match against based on language.
    6   - Similarly, could detect whether there is any whitespace in the query and, if there is, add whitespace (plus punctuation) around short terms (< 4 letters?). For example: "%[^ ]cat[s $,.\"']%". The reason to condition this on the query having whitespace is to not break in foreign languages without whitespace. Again, these patterns should probably be language-specific and filterable. The reason for not doing this for all words is to improve matching against compound words ("house" can match "treehouse").
      5 - To reduce the impact of matching sub-words with "%id%" while still matching "house" to both "house" and "houses", we could explicitly match against something like `"%house[s $,.\"']"`. In most cases (for English) any ending besides a plural 's' will change the meaning of the word ("person" vs "personal"). Could define another filter that provides the endings to match against based on language.
      6 - Similarly, could detect whether there is any whitespace in the query and, if there is, add whitespace (plus punctuation) around short terms (< 4 letters?). For example: `"%[^ ]cat[s $,.\"']%"`. The reason to condition this on the query having whitespace is to not break in foreign languages without whitespace. Again, these patterns should probably be language-specific and filterable. The reason for not doing this for all words is to improve matching against compound words ("house" can match "treehouse").
    7 7 - Could expand the stopwords list if we didn't have to convert it from a translated string. There are some pretty good stopword lists for multiple languages at http://www.ranks.nl/resources/stopwords.html, and the filter mechanism allows people to modify them. Speed might be more important than the flexibility GlotPress provides. I think generally fewer stop words is better.
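
To make the two matching ideas above concrete, here is a rough sketch in Python rather than SQL. It is not the patch on this ticket; `term_pattern`, `matches`, `DEFAULT_ENDINGS`, `SHORT_TERM_MAX` and `BOUNDARY` are all hypothetical names chosen for illustration. It assumes the leading `[^ ]` in the example pattern was meant as "start of text or a boundary character", and it uses a regular expression because MySQL's `LIKE` only understands `%` and `_`, so the bracket notation would presumably need `REGEXP` in a real query.

```python
import re

# Hypothetical defaults; the comment suggests making these filterable per language.
DEFAULT_ENDINGS = "s"             # endings allowed after a term (English plural)
SHORT_TERM_MAX = 3                # "short" terms: fewer than 4 letters
BOUNDARY = r"""[\s,.;:!?"']"""    # whitespace or common punctuation


def term_pattern(term, query_has_whitespace, endings=DEFAULT_ENDINGS):
    """Build a regex for a single search term."""
    body = re.escape(term)
    # Optional plural-style ending, e.g. "house" -> "houses".
    suffix = f"[{re.escape(endings)}]?" if endings else ""
    # The term (plus optional ending) must be followed by a boundary or end of text.
    tail = rf"{suffix}(?:$|{BOUNDARY})"
    head = ""
    # Only short terms in a whitespace-separated query also need a leading
    # boundary; longer terms may still match inside compounds ("treehouse").
    if query_has_whitespace and len(term) <= SHORT_TERM_MAX:
        head = rf"(?:^|{BOUNDARY})"
    return re.compile(head + body + tail, re.IGNORECASE)


def matches(term, text, query=None):
    """Return True if `term` matches `text`; `query` is the full search string."""
    query = term if query is None else query
    has_whitespace = any(ch.isspace() for ch in query)
    return term_pattern(term, has_whitespace).search(text) is not None
```

With these assumptions, `matches("house", "a treehouse")` and `matches("house", "two houses.")` succeed while `matches("person", "personal space")` does not, and `matches("cat", "bobcat", query="cat food")` fails because the short term needs a leading boundary once the query contains whitespace.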