Opened 4 years ago

Closed 3 years ago

#19033 closed defect (bug) (fixed)

Problem with Hebrew letter "Nun" hiding search results

Reported by: shirgans
Owned by: shir.gans@…
Milestone: 3.4
Priority: normal
Severity: critical
Version: 3.2.1
Component: I18N
Keywords: has-patch needs-unit-tests dev-feedback
Focuses:
Cc:

Description

In the Hebrew installation, searching the website for words that contain the letter Nun ( נ ) returns no results.

There was a related problem with the same letter in an earlier version of WordPress; please see this ticket:
http://core.trac.wordpress.org/ticket/11669

Please try searching this site for the words נתן, קושניר, and אנטולי, all of which contain the letter נ. No results are found, although all of these names appear on the site.
http://www.pat.co.il/shirg/comm-it.co.il/he/

We need a fix/patch ASAP. Thank you.

Attachments (1)

19033.patch (592 bytes) - added by SergeyBiryukov 4 years ago.


Change History (7)

comment:1 @shirgans 4 years ago

For now, I have changed query.php under wp-includes to get a quick hot fix (from line 2171):

if ( !empty($q['sentence']) ) {
	$q['search_terms'] = array($q['s']);
} else {
	// Hot fix: treat any query that contains Nun as a single search term.
	if ( strstr($q['s'], 'נ') ) {
		$q['search_terms'] = array($q['s']);
	} else {
		preg_match_all('/".*?("|$)|((?<=[\\s",+])|^)[^\\s",+]+/', $q['s'], $matches);
		$q['search_terms'] = array_map('_search_terms_tidy', $matches[0]);
	}
}

@SergeyBiryukov 4 years ago

comment:2 follow-up: @SergeyBiryukov 4 years ago

  • Keywords has-patch needs-unit-tests added; needs-patch removed
  • Milestone changed from Awaiting Review to 3.3

Looks like this has to do with \s in the regexp, similarly to #11528 and [12501].

To reproduce:

preg_match_all('/".*?("|$)|((?<=[\\s",+])|^)[^\\s",+]+/', 'נתן, קושניר, אנטולי', $matches);
var_dump($matches);

Here's what I get on PHP 5.2.14 (Windows), PCRE 8.02 2010-03-19:

array(3) {
  [0]=>
  array(6) {
    [0]=>
    string(1) "�"
    [1]=>
    string(4) "תן"
    [2]=>
    string(7) "קוש�"
    [3]=>
    string(4) "יר"
    [4]=>
    string(3) "א�"
    [5]=>
    string(8) "טולי"
  }
  ...
}

With the regexp from the patch:

array(3) {
  [0]=>
  array(3) {
    [0]=>
    string(6) "נתן"
    [1]=>
    string(12) "קושניר"
    [2]=>
    string(12) "אנטולי"
  }
  ...
}
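
The difference between the two dumps can be sketched as follows. This assumes the patch swaps \s for the explicit class [\r\n\t ] named in the closing commit message; the patch itself is not reproduced in this ticket, and split_search_terms() is a hypothetical helper used only for illustration, not a WordPress function.

```php
<?php
// Hypothetical helper for illustration; not part of WordPress core.
// Assumption: same pattern as the original, with \s replaced by [\r\n\t ].
function split_search_terms( $s ) {
	preg_match_all( '/".*?("|$)|((?<=[\r\n\t ",+])|^)[^\r\n\t ",+]+/', $s, $matches );
	return $matches[0];
}

// The explicit class lists only ASCII bytes, so no byte of a multibyte
// UTF-8 character can be mistaken for a separator.
$terms = split_search_terms( 'נתן, קושניר, אנטולי' );
var_dump( $terms );

// Quoted phrases are still kept together by the first alternative.
$phrase = split_search_terms( '"foo bar" baz' );
var_dump( $phrase );
```

Because the separator class contains only ASCII bytes, the result should not depend on the PCRE build or the server locale.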

comment:3 in reply to: ↑ 2 @azaozz 4 years ago

Replying to SergeyBiryukov:

Looks like this has to do with \s in the regexp, similarly to #11528 and [12501].

Yes, we should be careful not to use \s in a regexp anywhere, as it can grab parts of UTF-8 characters (not only in Hebrew).
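
A likely mechanism, stated here as an assumption rather than something this ticket confirms: נ (U+05E0) is encoded in UTF-8 as the two bytes 0xD7 0xA0, and 0xA0 is a non-breaking space in Latin-1-style single-byte encodings, so a PCRE build whose character tables classify 0xA0 as whitespace would let \s split the character in two. The sketch below does not depend on any locale; it only contrasts the byte-level view with the code-point view, and assumes the source file is saved as UTF-8.

```php
<?php
// Assumption: this file is saved as UTF-8, so 'נ' is the bytes 0xD7 0xA0.
$nun = 'נ';
$hex = bin2hex( $nun );

// Byte mode (no /u modifier): the subject is two separate bytes,
// so a pattern for the lone byte 0xA0 matches inside the character.
$byte_mode = preg_match( '/\xA0/', $nun );

// UTF-8 mode (/u modifier): the subject is one code point, U+05E0,
// so a pattern for the code point U+00A0 does not match.
$utf8_mode = preg_match( '/\x{A0}/u', $nun );

var_dump( $hex, $byte_mode, $utf8_mode );
```

Any byte-oriented class that happens to include 0xA0 can therefore cut נ in half, which is consistent with the broken strings in the first dump above.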

In this case it seems we are looking for word separators in a search string that was entered in an <input type="text"> field. Perhaps \r\n\t should be stripped completely, or the string could even be rejected if any of these are found; then we could use \b.

There should be many examples of search string sanitization and handling; maybe we should look around a bit. For example, characters like !@#$%^& are usually ignored.
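
One way to read the suggestion about stripping \r\n\t is to normalize those characters before splitting. The following is a minimal sketch under that assumption; normalize_search_string() is a hypothetical name, it replaces the control characters with spaces instead of rejecting the string, and it does not handle quoted phrases or punctuation.

```php
<?php
// Hypothetical helper for illustration; not part of WordPress core.
function normalize_search_string( $s ) {
	// Replace CR, LF and TAB with a plain space. These are single ASCII
	// bytes and never occur inside a multibyte UTF-8 sequence.
	$s = str_replace( array( "\r", "\n", "\t" ), ' ', $s );
	// Collapse runs of spaces, then trim the ends.
	return trim( preg_replace( '/ +/', ' ', $s ) );
}

// After normalization a plain space is the only separator left.
$words = explode( ' ', normalize_search_string( "נתן\tקושניר\r\nאנטולי" ) );
var_dump( $words );
```

Splitting on a literal space avoids both \s and \b, neither of which is aware of multibyte characters when the pattern runs in byte mode.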

comment:4 @nacin 4 years ago

  • Keywords dev-feedback added
  • Milestone changed from 3.3 to Future Release

comment:5 @SergeyBiryukov 3 years ago

  • Component changed from Charset to I18N
  • Milestone changed from Future Release to 3.4

comment:6 @nacin 3 years ago

  • Resolution set to fixed
  • Status changed from new to closed

In [19866]:

Use [\r\n\t ], not [\s], to prevent issues with some UTF-8 characters. props SergeyBiryukov, fixes #19033.
