WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 6 months ago

#25585 reopened enhancement

Arabic stopwords comparison

Reported by: alex-ye
Owned by:
Milestone: Future Release
Priority: normal
Severity: normal
Version: 3.7
Component: I18N
Keywords: needs-patch needs-unit-tests
Focuses:
Cc:

Description

WordPress uses simple string comparison in the WP_Query->parse_search_terms() function, which is fine for many languages, but Arabic needs more than that to provide a really smart search!

Some characters need to be removed before the string comparison, such as:

Hamza
http://en.wikipedia.org/wiki/Hamza

Diacritics
http://en.wikipedia.org/wiki/Arabic_diacritics

Tāʼ marbūṭah
http://en.wikipedia.org/wiki/Taw

If this cannot be in core, could you at least add some filters to do it?

Attachments (1)

25585.patch (662 bytes) - added by azaozz 3 years ago.


Change History (21)

#1 @alex-ye
3 years ago

If this is accepted in core, we could use some functions from the ArPHP project:
http://sourceforge.net/projects/ar-php/

#2 @SergeyBiryukov
3 years ago

  • Version set to trunk

@azaozz
3 years ago

#3 follow-up: @azaozz
3 years ago

If I understand correctly: in some cases in Arabic the search terms have to be additionally processed. Would the filter in 25585.patch do the job?

#4 @azaozz
3 years ago

  • Milestone changed from Awaiting Review to 3.7

Moving to 3.7 for consideration as this is new functionality.

#5 in reply to: ↑ 3 @alex-ye
3 years ago

Replying to azaozz:

Would the filter in 25585.patch do the job?

Yeah, but:
1- I can't reuse the query stopwords list. I wonder why the function is protected?
2- I would like to skip the strtolower and old stopwords checks for performance reasons.
3- Why don't we introduce a sanitize_search_term function, so this functionality could be used in other places?

#6 @alex-ye
3 years ago

I am now working on a plugin that builds on @azaozz's patch:
https://github.com/nash-ye/wp-arabic-stopwords

#7 follow-up: @nacin
3 years ago

Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?

Even without the wp_search_stopwords filter, searching is still better in 3.7 than it was in 3.6. We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return less terms than is specified in search_terms_count. Is that OK?

#8 @nacin
3 years ago

  • Keywords reporter-feedback added

#9 in reply to: ↑ 7 @alex-ye
3 years ago

Replying to nacin:

Best I can tell, you cannot add hamza, diacritics, and tāʼ marbūṭah to the stopwords list because the list only operates on entire words, not the stripping of individual characters, correct?

Yes. If we can't remove those characters from the user input before the stopwords comparison, the Arabic translators may need to write every stopword 3-5 times! As you know, we can't trust the user input.

We could also add a filter to the end of the parse_search_terms() method. But at that point, I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after?

+1
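
The point about cleaning user input before the stopword comparison can be sketched roughly as follows. This is only an illustration in plain PHP: example_fold_arabic() and example_remove_stopwords() are hypothetical names, not WordPress functions, and the folding rules are an assumption limited to the alef forms and the eight diacritics mentioned in this ticket.

```php
<?php
// Hypothetical helper: fold spelling variants so that differently typed
// forms of the same word compare equal. Not part of WordPress core.
function example_fold_arabic( $str ) {
	// Alef with hamza above, hamza below, and madda all become bare alef.
	$str = str_replace(
		array( "\u{0623}", "\u{0625}", "\u{0622}" ),
		"\u{0627}",
		$str
	);
	// Strip tanwin, short vowels, shadda and sukun (U+064B to U+0652).
	return preg_replace( '/[\x{064B}-\x{0652}]/u', '', $str );
}

// Hypothetical helper: drop every term whose folded form equals a folded stopword.
function example_remove_stopwords( array $terms, array $stopwords ) {
	$folded_stopwords = array_map( 'example_fold_arabic', $stopwords );
	$kept             = array();
	foreach ( $terms as $term ) {
		if ( ! in_array( example_fold_arabic( $term ), $folded_stopwords, true ) ) {
			$kept[] = $term;
		}
	}
	return $kept;
}

// The translator stores the stopword once (alef with hamza below, lam, alef maqsura).
$stopwords = array( "\u{0625}\u{0644}\u{0649}" );
$terms     = array(
	"\u{0627}\u{0644}\u{0649}",         // Typed without the hamza.
	"\u{0625}\u{0650}\u{0644}\u{0649}", // Typed with a kasra.
	"\u{0643}\u{062A}\u{0627}\u{0628}", // A different word, not a stopword.
);

$result = example_remove_stopwords( $terms, $stopwords );
// $result now holds only the last term.
```

With folding applied to both sides, one stored spelling covers the typed variants, which is the saving described above. The original terms are kept for the query itself; only the comparison uses the folded form.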

#10 follow-up: @azaozz
3 years ago

I think the crazy regular expression in parse_search() should possibly be moved into it, and search_terms_count should be set after? Right now, it's possible for parse_search_terms() to return less terms than is specified in search_terms_count.

Yeah, it can be moved to the proposed filter so plugins could change the pattern for specific languages.

The idea is to remove single letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast as it looks through the Unicode character properties. A better (but considerably slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u, which also matches separators (any kind of whitespace or invisible separators), punctuation, invisible control characters, and unused code points.
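
A small sketch of how the two patterns behave on sample terms, in plain PHP. One caveat is an assumption about intent: read literally, the alternation in the longer pattern binds more loosely than the anchors, so ^ applies only to the first alternative and $ only to the last. The variant below wraps the alternatives in a non-capturing group so that the whole term must be a single unit. example_drop_short_terms() is a hypothetical helper, not a core function.

```php
<?php
// Pattern quoted in the comment above: the term is exactly one Unicode letter.
$single_letter = '/^\p{L}$/u';

// Assumed grouped variant of the longer pattern: one letter with optional
// combining marks, or one separator, punctuation, or control/unassigned character.
$single_unit = '/^(?:\p{L}\p{M}*|\p{Z}|\p{P}|\p{C})$/u';

// Hypothetical helper: keep only the terms that do NOT match the pattern.
function example_drop_short_terms( array $terms, $pattern ) {
	$kept = array();
	foreach ( $terms as $term ) {
		if ( 1 !== preg_match( $pattern, $term ) ) {
			$kept[] = $term;
		}
	}
	return $kept;
}

$terms = array(
	'a',                // One Latin letter.
	"\u{0628}",         // One Arabic letter (beh).
	"\u{0628}\u{064E}", // Beh followed by a fatha (a combining mark).
	'-',                // Punctuation.
	'wp',               // Two letters, should always survive.
);

$by_letter = example_drop_short_terms( $terms, $single_letter );
$by_unit   = example_drop_short_terms( $terms, $single_unit );
// $by_letter keeps the beh+fatha term, '-' and 'wp'; $by_unit keeps only 'wp'.
```

The beh with a fatha is the case that matters for Arabic: it is one letter to a reader but two code points, so the shorter pattern lets it through while the grouped variant removes it.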

search_terms_count is the count before the terms were cleaned. It's used to determine if the sorting would use CASE and match in both title and content, or just a sentence match. This part of parse_search_order() has gone through quite a few changes, maybe there is a simpler way to do that now.

Last edited 3 years ago by azaozz

#11 @alex-ye
3 years ago

  • Keywords reporter-feedback removed

I will try to work on a new patch.

#12 in reply to: ↑ 10 @alex-ye
3 years ago

Replying to azaozz:

The idea is to remove single letter terms from the search. The pattern /^\p{L}$/u is the safest way to match a single letter in any language. It's not particularly fast as it looks through the Unicode character properties. A better (but considerably slower) pattern could be /^\p{L}\p{M}*|\p{Z}|\p{P}|\p{C}$/u, which also matches separators (any kind of whitespace or invisible separators), punctuation, invisible control characters, and unused code points.

See the example below. I have used the str_replace function, and I will try to apply your suggestion about the regular expression:

function ArWP_normalize( $str ) {

	// Fold the alef forms that carry a hamza or madda into bare alef.
	$str = str_replace( array(
		'أ','إ','آ'
	), 'ا', $str );

	// Strip the diacritics (short vowels, tanwin, sukun, shadda).
	$str = str_replace( array(
		'َ','ً','ُ','ٌ','ِ','ٍ','ْ','ّ'
	), '', $str );

	// Return the new string.
	return $str;

} // end ArWP_normalize()

Is there a simple function in WordPress core to get the Unicode code point of a character?
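
On the code point question, a minimal sketch in plain PHP, not relying on any WordPress helper. mb_ord() is a real PHP function (mbstring, PHP 7.2 and later); example_code_point() and example_code_points() are hypothetical names, and the manual fallback assumes the input is one valid UTF-8 character.

```php
<?php
// Hypothetical helper: code point of a single UTF-8 encoded character.
function example_code_point( $char ) {
	if ( function_exists( 'mb_ord' ) ) {
		return mb_ord( $char, 'UTF-8' ); // Available when mbstring is loaded.
	}
	// Fallback: decode the UTF-8 bytes by hand (assumes valid input).
	$b = array_values( unpack( 'C*', $char ) );
	switch ( count( $b ) ) {
		case 1:
			return $b[0];
		case 2:
			return ( ( $b[0] & 0x1F ) << 6 ) | ( $b[1] & 0x3F );
		case 3:
			return ( ( $b[0] & 0x0F ) << 12 ) | ( ( $b[1] & 0x3F ) << 6 ) | ( $b[2] & 0x3F );
		default:
			return ( ( $b[0] & 0x07 ) << 18 ) | ( ( $b[1] & 0x3F ) << 12 )
				| ( ( $b[2] & 0x3F ) << 6 ) | ( $b[3] & 0x3F );
	}
}

// Hypothetical helper: list the code points of a string in U+XXXX notation.
function example_code_points( $str ) {
	// Split into characters; preg_split() returns false on invalid UTF-8.
	$chars = preg_split( '//u', $str, -1, PREG_SPLIT_NO_EMPTY );
	if ( false === $chars ) {
		return array();
	}
	$points = array();
	foreach ( $chars as $char ) {
		$points[] = sprintf( 'U+%04X', example_code_point( $char ) );
	}
	return $points;
}

// Alef with hamza above, followed by a fatha.
$points = example_code_points( "\u{0623}\u{064E}" );
// $points is array( 'U+0623', 'U+064E' ).
```

For stripping a contiguous block such as the diacritics in the function above, a character class with code point escapes in a /u pattern (for example [\x{064B}-\x{0652}]) may avoid the need to look up code points at run time at all.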

Last edited 3 years ago by alex-ye

#13 @nacin
3 years ago

  • Component changed from General to I18N
  • Milestone changed from 3.7 to Future Release

This is too late for 3.7, but I'd be happy to revisit it. It would be great to robustly support stopwords in other languages. For now, it's no worse than it was in 3.6.

#14 @SergeyBiryukov
3 years ago

  • Version changed from trunk to 3.7

#15 @netweb
3 years ago

Related: #26670

#16 @chriscct7
14 months ago

  • Milestone Future Release deleted
  • Resolution set to maybelater
  • Status changed from new to closed

Closing as maybelater. Complete lack of interest in the feature on the ticket over the last 2 years. Feel free to reopen when more interest re-emerges (particularly if there's a patch).

#17 @johnbillion
14 months ago

  • Keywords needs-unit-tests added
  • Milestone set to Awaiting Review
  • Resolution maybelater deleted
  • Status changed from closed to reopened


This ticket was mentioned in Slack in #core-i18n by ocean90.
8 months ago

This ticket was mentioned in Slack in #core-i18n by ocean90.
6 months ago

#20 @ocean90
6 months ago

  • Milestone changed from Awaiting Review to Future Release