Opened 11 years ago
Last modified 7 months ago
#25585 reopened enhancement
Arabic stopwords comparison
Reported by: | alex-ye | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 3.7 |
Component: | I18N | Keywords: | needs-patch needs-unit-tests |
Focuses: | | Cc: | |
Description
WordPress uses simple string comparison in the WP_Query->parse_search_terms() function, which is fine for many languages, but Arabic needs more than that to provide a really smart search!
Some characters need to be removed before the string comparison, such as:
Hamza
http://en.wikipedia.org/wiki/Hamza
Diacritics
http://en.wikipedia.org/wiki/Arabic_diacritics
Tāʼ marbūṭah
http://en.wikipedia.org/wiki/Taw
If this cannot go into core, could you at least add some filters to make it possible?
Attachments (1)
Change History (22)
#3 follow-up: ↓ 5 @azaozz 11 years ago
If I understand correctly: in some cases in Arabic the search terms have to be additionally processed. Would the filter in 25585.patch do the job?
#4 11 years ago
- Milestone changed from Awaiting Review to 3.7
Moving to 3.7 for consideration as this is new functionality.
#5 in reply to: ↑ 3 11 years ago
Replying to azaozz:
Would the filter in 25585.patch do the job?
Yeah, but:
1- I can't re-use the query stopwords list. I wonder why the function is protected?
2- I would like to skip the strtolower and old stopwords checks for performance reasons.
3- Why don't we introduce a sanitize_search_term() function, so this functionality could be used in other places?
#6 11 years ago
I am now working on a plugin that builds on @azaozz's patch:
https://github.com/nash-ye/wp-arabic-stopwords
#7 follow-up: ↓ 9 @nacin 11 years ago
Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?
Even without the wp_search_stopwords filter, searching is still better in 3.7 than it was in 3.6. We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return fewer terms than search_terms_count specifies. Is that OK?
#9 in reply to: ↑ 7 11 years ago
Replying to nacin:
Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?
Yes. If we can't remove those characters from the user input before the stopwords comparison, the Arabic translators may need to write every stopword 3-5 times, because, as you know, we can't trust the user input.
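To make the point above concrete, here is a minimal sketch of whole-word stopword filtering with and without a normalization step. The helper names `example_normalize_arabic()` and `example_filter_stopwords()` are hypothetical and are not WordPress functions; the word list is only an illustration, and the sketch assumes PHP 7.0+ for the `\u{...}` string escapes.

```php
<?php
// Hypothetical helper (not a WordPress API): fold hamza-carrying alef forms
// to a bare alef and strip the common diacritics U+064B..U+0652.
function example_normalize_arabic( $term ) {
    $alef_forms = array( "\u{0623}", "\u{0625}", "\u{0622}" ); // alef with hamza above, hamza below, madda
    $term       = str_replace( $alef_forms, "\u{0627}", $term ); // bare alef
    return preg_replace( '/[\x{064B}-\x{0652}]/u', '', $term );
}

// Hypothetical whole-word stopword filter: a term is dropped only when the
// entire (optionally normalized) term equals a stopword.
function example_filter_stopwords( array $terms, array $stopwords, $normalize = false ) {
    $kept = array();
    foreach ( $terms as $term ) {
        $key = $normalize ? example_normalize_arabic( $term ) : $term;
        if ( ! in_array( $key, $stopwords, true ) ) {
            $kept[] = $term;
        }
    }
    return $kept;
}

// The stopword "to" stored once, in normalized form (alef, lam, alef maqsura).
$stopwords = array( "\u{0627}\u{0644}\u{0649}" );

// User input: the same word typed with hamza below the alef, plus "school".
$terms = array( "\u{0625}\u{0644}\u{0649}", "\u{0645}\u{062F}\u{0631}\u{0633}\u{0629}" );

$without_normalization = example_filter_stopwords( $terms, $stopwords, false );
$with_normalization    = example_filter_stopwords( $terms, $stopwords, true );
```

Without normalization the hamza spelling slips past the list, so a translator would have to list every spelling variant; with normalization one entry per stopword is enough.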
We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after?
+1
#10 follow-up: ↓ 12 @azaozz 11 years ago
I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return fewer terms than search_terms_count specifies.
Yeah, it can be moved to the proposed filter so plugins could change the pattern for specific languages.
The idea is to remove single letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast as it looks through the Unicode character properties. A better (but considerably slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u which also matches separators (any kind of whitespace or invisible separators), punctuation, and invisible control characters and unused code points.
search_terms_count is the count before the terms were cleaned. It's used to determine if the sorting would use CASE and match in both title and content, or just a sentence match. This part of parse_search_order() has gone through quite a few changes, maybe there is a simpler way to do that now.
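As a rough illustration of the two patterns discussed above, the sketch below wraps them in hypothetical helpers (`example_is_single_letter()` and `example_is_noise_term()` are illustrative names, not WordPress functions). One assumption to flag: as written, the wider pattern anchors only its first and last alternatives, so the sketch adds a non-capturing group on the guess that a whole-term match was intended.

```php
<?php
// Hypothetical helper: true when the term is exactly one Unicode letter.
function example_is_single_letter( $term ) {
    return 1 === preg_match( '/^\p{L}$/u', $term );
}

// Hypothetical helper: true when the whole term is one letter plus optional
// combining marks, or a single separator, punctuation, or control character.
// The (?: ... ) group is an assumption about intent; it makes ^ and $ apply
// to every alternative rather than only the first and last.
function example_is_noise_term( $term ) {
    return 1 === preg_match( '/^(?:\p{L}\p{M}*|\p{Z}|\p{P}|\p{C})$/u', $term );
}

$ba           = "\u{0628}";          // Arabic letter ba
$ba_with_mark = "\u{0628}\u{064E}";  // ba followed by the fatha diacritic

$results = array(
    'latin letter'      => example_is_single_letter( 'a' ),
    'arabic letter'     => example_is_single_letter( $ba ),
    'letter plus mark'  => example_is_single_letter( $ba_with_mark ),
    'noise letter+mark' => example_is_noise_term( $ba_with_mark ),
    'noise comma'       => example_is_noise_term( ',' ),
    'noise whole word'  => example_is_noise_term( 'hello' ),
);

// For comparison: the ungrouped pattern also matches a full word, because
// its first alternative only requires a letter at the start of the string.
$ungrouped_matches_word = 1 === preg_match( '/^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u', 'hello' );
```

The narrow pattern misses a letter that carries a diacritic, which is the case the wider pattern is meant to catch.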
#12 in reply to: ↑ 10 11 years ago
Replying to azaozz:
The idea is to remove single letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast as it looks through the Unicode character properties. A better (but quite slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u which also matches separators (any kind of whitespace or invisible separators), punctuation, and invisible control characters and unused code points (more info).
See the example below. I have used the str_replace() function, and I will try to apply your suggestion about the regular expression:
function ArWP_normalize( $str ) {
    // Normalize the Alef.
    $str = str_replace( array( 'أ','إ','آ' ), 'ا', $str );
    // Normalize the Diacritics.
    $str = str_replace( array( 'َ','ً','ُ','ٌ','ِ','ٍ','ْ','ّ' ), '', $str );
    // Return the new string.
    return $str;
} // end ArWP_normalize()
Is there a simple function in WordPress core to get the Unicode code point?
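On the code point question, one possible approach outside of WordPress itself is PHP's own `mb_ord()`, available from PHP 7.2 when the mbstring extension is loaded. The sketch below assumes that environment; `example_code_points()` and `example_strip_marks()` are hypothetical names, and the second helper shows a regular-expression variant of the diacritic stripping. Note that `\p{Mn}` covers nonspacing marks in every script, so it is broader than the Arabic list in the function above.

```php
<?php
// Hypothetical helper: list the Unicode code points of a UTF-8 string.
// Assumes PHP 7.2+ with the mbstring extension for mb_ord().
function example_code_points( $str ) {
    // The /u modifier makes the empty pattern split between code points.
    $chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );
    return array_map(
        function ( $char ) {
            return sprintf( 'U+%04X', mb_ord( $char, 'UTF-8' ) );
        },
        $chars
    );
}

// Hypothetical helper: strip nonspacing combining marks with one regular
// expression instead of listing each diacritic by hand.
function example_strip_marks( $str ) {
    return preg_replace( '/\p{Mn}+/u', '', $str );
}

$alef_hamza_fatha = "\u{0623}\u{064E}"; // alef with hamza above, then fatha
$points           = example_code_points( $alef_hamza_fatha );
$stripped         = example_strip_marks( "\u{0628}\u{064E}" ); // ba + fatha
```

Inspecting code points this way can help when deciding which characters belong in a normalization list.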
#13 11 years ago
- Component changed from General to I18N
- Milestone changed from 3.7 to Future Release
This is too late for 3.7, but I'd be happy to revisit it. It would be great to robustly support stopwords in other languages. For now, it's no worse than it was in 3.6.
#16 9 years ago
- Milestone Future Release deleted
- Resolution set to maybelater
- Status changed from new to closed
Closing as maybelater: there has been a complete lack of interest in this feature on the ticket over the last 2 years. Feel free to reopen when interest re-emerges (particularly if there's a patch).
#17 9 years ago
- Keywords needs-unit-tests added
- Milestone set to Awaiting Review
- Resolution maybelater deleted
- Status changed from closed to reopened
If this is accepted in core, we could use some functions from the ArPHP project:
http://sourceforge.net/projects/ar-php/