Opened 14 years ago
Closed 14 years ago
#19033 closed defect (bug) (fixed)
Problem with Hebrew letter "Nun" hiding search results
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 3.4 | Priority: | normal |
| Severity: | critical | Version: | 3.2.1 |
| Component: | I18N | Keywords: | has-patch needs-unit-tests dev-feedback |
| Focuses: | | Cc: | |
Description
In the Hebrew installation, searching the website for words containing the letter Nun ( נ ) returns no results.
There was a related problem with the same letter in an earlier version of WordPress; please see the ticket here:
http://core.trac.wordpress.org/ticket/11669
Please try searching this site for the words (נתן, קושניר, אנטולי), all of which contain the letter נ. No results are found, although all of these names appear on the site.
http://www.pat.co.il/shirg/comm-it.co.il/he/
We need a fix/patch ASAP. Thank you.
Attachments (1)
Change History (7)
#2 (follow-up: ↓ 3) 14 years ago
- Keywords has-patch needs-unit-tests added; needs-patch removed
- Milestone changed from Awaiting Review to 3.3
Looks like this has to do with \s in the regexp, similarly to #11528 and [12501].
To reproduce:
preg_match_all('/".*?("|$)|((?<=[\\s",+])|^)[^\\s",+]+/', 'נתן, קושניר, אנטולי', $matches);
var_dump($matches);
Here's what I get on PHP 5.2.14 (Windows), PCRE 8.02 2010-03-19:
array(3) {
[0]=>
array(6) {
[0]=>
string(1) "�"
[1]=>
string(4) "תן"
[2]=>
string(7) "קוש�"
[3]=>
string(4) "יר"
[4]=>
string(3) "א�"
[5]=>
string(8) "טולי"
}
...
}
With the regexp from the patch:
array(3) {
[0]=>
array(3) {
[0]=>
string(6) "נתן"
[1]=>
string(12) "קושניר"
[2]=>
string(12) "אנטולי"
}
...
}
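One plausible reading of the split points in the first dump, offered here as an assumption rather than something stated in the ticket: נ (U+05E0) is encoded in UTF-8 as the bytes 0xD7 0xA0, and 0xA0 is the no-break space in Latin-1, so a byte-oriented \s that follows locale tables may treat that second byte as whitespace. The sketch below only inspects the bytes of the three terms, so it does not depend on the PCRE build; it assumes the file is saved as UTF-8.

```php
<?php
// Inspect the UTF-8 bytes of the three search terms from the report.
$terms = array('נתן', 'קושניר', 'אנטולי');

foreach ($terms as $term) {
    // bin2hex() shows the raw bytes; each Hebrew letter takes two bytes.
    $hex = bin2hex($term);
    // Byte offset of the first 0xA0 byte, or false if there is none.
    $pos = strpos($term, "\xA0");
    echo $hex, ' first 0xA0 at byte offset ', var_export($pos, true), "\n";
}

// Nun on its own: 0xD7 followed by 0xA0.
$nunHex = bin2hex('נ');
echo $nunHex, "\n";
```

The offsets of the first 0xA0 byte (1, 7 and 3) line up with the lengths of the truncated fragments string(1), string(7) and string(3) in the first dump.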
#3 (in reply to: ↑ 2) 14 years ago
Replying to SergeyBiryukov:
Looks like this has to do with \s in the regexp, similarly to #11528 and [12501].
Yes, we should be careful not to use \s in a regexp anywhere, as it can match individual bytes of multi-byte UTF-8 characters (not only in Hebrew).
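A minimal sketch of avoiding \s, assuming the only whitespace separators of interest are space, tab, CR and LF, alongside the quote, comma and plus already in the pattern. The function name split_search_terms is hypothetical, and this is not necessarily the regexp from the attached patch.

```php
<?php
// Hypothetical helper: split a search string into terms without using \s.
function split_search_terms($search) {
    // Same shape as the pattern in the ticket, with \s replaced by an explicit
    // list of ASCII whitespace characters, so no non-ASCII byte can match.
    $pattern = '/".*?("|$)|((?<=[\r\n\t ",+])|^)[^\r\n\t ",+]+/';
    preg_match_all($pattern, $search, $matches);
    return $matches[0];
}

$terms = split_search_terms('נתן, קושניר, אנטולי');
var_dump($terms);
```

Because every character in the class is ASCII, the pattern cannot split a multi-byte sequence. Adding the /u modifier, so that PCRE treats the subject as UTF-8 rather than as single bytes, is another route that might be considered; it is not shown here.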
In this case it seems we are looking for word separators in the search string that was entered in an <input type="text"> field. Perhaps \r\n\t should be stripped completely, or the string should even be rejected if any of these are found; then we could use \b.
There should be many examples of search string sanitization and handling, maybe we should look around a bit. For example chars like !@#$%^& are usually ignored, etc.
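The kind of pre-processing described above could look roughly like the following. This is only a sketch: normalize_search_string is a hypothetical name, the list of ignored characters is just the one given as an example in this comment, and control whitespace is replaced with a space rather than rejected.

```php
<?php
// Hypothetical pre-processing step for a search string from a text input.
function normalize_search_string($search) {
    // Replace CR, LF and tab with a plain space.
    $search = str_replace(array("\r", "\n", "\t"), ' ', $search);
    // Drop the punctuation mentioned above. These are all ASCII bytes, which
    // never occur inside a multi-byte UTF-8 sequence, so letters are untouched.
    $search = str_replace(str_split('!@#$%^&'), '', $search);
    // Collapse runs of spaces and trim spaces from both ends.
    $search = preg_replace('/ {2,}/', ' ', $search);
    return trim($search, ' ');
}

echo normalize_search_string("נתן\t\n קושניר!@#"), "\n";
```

Working with explicit ASCII characters here keeps the step independent of how the regexp engine classifies bytes above 0x7F.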
For now, I have changed query.php under wp-includes as a quick hot fix (from line 2171):
if ( !empty($q['sentence']) ) {
	$q['search_terms'] = array($q['s']);
} else {
	if (strstr($q['s'], 'נ'))
		$q['search_terms'] = array($q['s']);
	else {
		preg_match_all('/".*?("|$)|((?<=[\\s",+])|^)[^\\s",+]+/', $q['s'], $matches);
		$q['search_terms'] = array_map('_search_terms_tidy', $matches[0]);
	}
}