Opened 6 months ago

Last modified 2 months ago

#25585 new enhancement

Arabic stopwords comparison

Reported by: alex-ye
Owned by:
Milestone: Future Release
Priority: normal
Severity: normal
Version: 3.7
Component: I18N
Keywords: needs-patch
Focuses:
Cc:

Description

WordPress uses simple string comparison in the WP_Query->parse_search_terms() method, which is fine for many languages, but Arabic needs more than that to provide a really smart search!

Some characters need to be removed before the string comparison, such as:

Hamza
http://en.wikipedia.org/wiki/Hamza

Diacritics
http://en.wikipedia.org/wiki/Arabic_diacritics

Tāʼ marbūṭah
http://en.wikipedia.org/wiki/Taw

If this cannot be in core, could you at least add some filters to do it?

Attachments (1)

25585.patch (662 bytes) - added by azaozz 6 months ago.

Change History (16)

comment:1 alex-ye 6 months ago

If this is accepted in core, we could use some functions from the ArPHP project:
http://sourceforge.net/projects/ar-php/

comment:2 SergeyBiryukov 6 months ago

  • Version set to trunk

comment:3 follow-up: azaozz 6 months ago

If I understand correctly: in some cases in Arabic the search terms have to be additionally processed. Would the filter in 25585.patch do the job?

comment:4 azaozz 6 months ago

  • Milestone changed from Awaiting Review to 3.7

Moving to 3.7 for consideration as this is new functionality.

comment:5 in reply to: ↑ 3 alex-ye 6 months ago

Replying to azaozz:

Would the filter in 25585.patch do the job?

Yeah, but:
1- I can't reuse the query's stopwords list. Why is the function protected?
2- I would like to skip the strtolower() call and the existing stopword checks for performance reasons.
3- Why don't we introduce a sanitize_search_term() function, so this functionality could be used in other places?

comment:6 alex-ye 6 months ago

I am now working on a plugin that builds on @azaozz's patch:
https://github.com/nash-ye/wp-arabic-stopwords

comment:7 follow-up: nacin 6 months ago

Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?

Even without the wp_search_stopwords filter, searching is still better in 3.7 than it was in 3.6. We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return fewer terms than is specified in search_terms_count. Is that OK?

comment:8 nacin 6 months ago

  • Keywords reporter-feedback added

comment:9 in reply to: ↑ 7 alex-ye 6 months ago

Replying to nacin:

Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?

Yes. If we can't remove those characters from the user input before the stopwords comparison, the Arabic translators may need to write every stopword 3-5 times! As you know, we can't trust the user input.
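
The point about writing every stopword several times can be illustrated with a minimal sketch that normalizes both the search terms and the stopwords before comparing them, so each stopword only has to be listed once. Both function names are hypothetical (they are not WordPress APIs), the sketch does not hook into WP_Query, and folding tāʼ marbūṭah to hāʼ is an assumed, lossy convention. The "\u{...}" escapes need PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not a WordPress API): fold Arabic spelling variants
// so that user input and stopwords compare equal.
function example_normalize_arabic( $str ) {
	// Strip the tatweel elongation mark (U+0640) and the diacritics U+064B..U+0652.
	$str = preg_replace( '/[\x{0640}\x{064B}-\x{0652}]/u', '', $str );

	// Fold alef with hamza or madda (U+0623, U+0625, U+0622) to bare alef (U+0627).
	$str = str_replace( array( "\u{0623}", "\u{0625}", "\u{0622}" ), "\u{0627}", $str );

	// Assumption: fold ta marbuta (U+0629) to ha (U+0647).
	$str = str_replace( "\u{0629}", "\u{0647}", $str );

	return $str;
}

// Hypothetical helper: drop every term whose normalized form equals a normalized stopword.
function example_remove_stopwords( array $terms, array $stopwords ) {
	$normalized_stopwords = array_map( 'example_normalize_arabic', $stopwords );

	$kept = array_filter( $terms, function ( $term ) use ( $normalized_stopwords ) {
		return ! in_array( example_normalize_arabic( $term ), $normalized_stopwords, true );
	} );

	// Re-index so the result is a plain list; the original spelling of kept terms is preserved.
	return array_values( $kept );
}
```

With this approach a stopword stored with a hamza still matches user input typed with a bare alef, and the reverse.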

We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after?

+1

comment:10 follow-up: azaozz 6 months ago

I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return fewer terms than is specified in search_terms_count.

Yeah, it can be moved to the proposed filter so plugins could change the pattern for specific languages.

The idea is to remove single-letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast, as it looks through the Unicode character properties. A better (but much slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u, which also matches separators (any kind of whitespace or invisible separators), punctuation, invisible control characters, and unused code points.
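
A minimal sketch of the single-character check described above. The function name is hypothetical. Note one deliberate change: in the longer pattern as written, the ^ and $ anchors bind only to the first and last alternatives, so it would match any term that merely starts with a letter. The sketch wraps the alternation in a non-capturing group so both anchors apply to every branch.

```php
<?php
// Hypothetical helper: true when $term is exactly one letter (optionally
// followed by combining marks), or one separator, punctuation, or
// control/unassigned character.
function example_is_single_char_term( $term ) {
	// (?: ... ) groups the alternatives so that ^ and $ anchor all of them.
	return 1 === preg_match( '/^(?:\p{L}\p{M}*|\p{Z}|\p{P}|\p{C})$/u', $term );
}

// Example: keep only the terms that are longer than a single character.
$terms    = array( 'a', 'ab', '-', 'wordpress' );
$filtered = array_values( array_filter( $terms, function ( $term ) {
	return ! example_is_single_char_term( $term );
} ) );
```

A base letter followed by a diacritic counts as a single letter here because \p{M}* consumes the combining marks.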

search_terms_count is the count before the terms were cleaned. It's used to determine whether the sorting uses CASE and matches in both title and content, or just a sentence match. This part of parse_search_order() has gone through quite a few changes; maybe there is a simpler way to do that now.

comment:11 alex-ye 6 months ago

  • Keywords reporter-feedback removed

I will try to work on a new patch.

comment:12 in reply to: ↑ 10 alex-ye 6 months ago

Replying to azaozz:

The idea is to remove single-letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast, as it looks through the Unicode character properties. A better (but much slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u, which also matches separators (any kind of whitespace or invisible separators), punctuation, invisible control characters, and unused code points.

See the example below. I have used str_replace() for now, and I will try to apply your suggestion about the regular expression:

function ArWP_normalize( $str ) {

	// Normalize the Alef.
	$str = str_replace( array(
		'أ','إ','آ'
	), 'ا', $str );

	// Normalize the Diacritics.
	$str = str_replace( array(
		'َ','ً','ُ','ٌ','ِ','ٍ','ْ','ّ'
	), '', $str );

	// Return the new string.
	return $str;

} // end ArWP_normalize()

Is there a simple function in WordPress core to get the Unicode code point of a character?
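
For reference, a minimal sketch of reading code points in plain PHP rather than through a WordPress helper. It assumes PHP 7.2 or later with the mbstring extension, where mb_ord() is available; the wrapper name example_code_points() is hypothetical.

```php
<?php
// Hypothetical helper: return the Unicode code point of every character
// in a UTF-8 string, in order.
function example_code_points( $str ) {
	// The /u modifier makes PCRE split on UTF-8 characters instead of bytes.
	$chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );

	return array_map( function ( $char ) {
		// mb_ord() returns the code point of the first character as an integer.
		return mb_ord( $char, 'UTF-8' );
	}, $chars );
}

// Example: alef with hamza above (U+0623) followed by a Latin "a" (U+0061).
$points = example_code_points( "\u{0623}a" );
```

Knowing the code points makes it possible to replace a hard-coded list of diacritics with a range such as [\x{064B}-\x{0652}] in a /u regular expression.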

comment:13 nacin 6 months ago

  • Component changed from General to I18N
  • Milestone changed from 3.7 to Future Release

This is too late for 3.7, but I'd be happy to revisit it. It would be great to robustly support stopwords in other languages. For now, it's no worse than it was in 3.6.

comment:14 SergeyBiryukov 5 months ago

  • Version changed from trunk to 3.7

comment:15 netweb 2 months ago

Related: #26670
