Opened 6 years ago
Last modified 5 years ago
#44296 reviewing defect (bug)
Enable double-width space works as a separator in search query
Reported by: | ryotsun | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | |
Component: | Query | Keywords: | has-patch has-unit-tests needs-refresh |
Focuses: | Cc: |
Description (last modified by )
Related ticket: #43829
It has a bug in search query (search form).
It would be recognizing as one word in case of putting double-width space in search query.
So, it should be recognized as a separator like half-width space " ".
And most important thing in this ticket, it can be a clue in order to approach CJK unified ideographs.
Attachments (2)
Change History (17)
This ticket was mentioned in Slack in #polyglots by nao. View the logs.
6 years ago
#4
@
6 years ago
From my understanding of Chinese, there shouldn't be a problem with that. However, we should not convert the quoted ideographic spaces, since it is likely intentional. e.g. '"I am quoted"'
should be searching for "I am quoted"
instead of "I am quoted"
.
#5
in reply to:
↑ 2
@
6 years ago
@tenpura Thank you for your advice.
I should not have used mbstring. So, I've changed codes to use str_replace
instead.
Replying to tenpura:
The mbstring PHP extension might be unavailable, so
mb_convert_kana()
needsfunction_exists()
check.
I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.
#6
@
6 years ago
- Milestone changed from Awaiting Review to 5.0
- Owner set to SergeyBiryukov
- Status changed from new to reviewing
#8
@
6 years ago
I created a list of tickets related to fix multibyte problems.
https://docs.google.com/spreadsheets/d/13oGbc7AqEN6OUvmze-JDCKXDdruFqsxqSAdI7-b6Lho/edit#gid=0
#9
@
6 years ago
The 44296_2.patch replaces the double-width space for all searches, but what about limiting it to the case where sentence
is false
(the default case)?
Then I think it would be nice to have unit tests for these cases:
1) $query = new WP_Query( array( 's' => $terms ) ); 2) $query = new WP_Query( array( 's' => $terms, 'sentence' => true ) ); 3) $query = new WP_Query( array( 's' => $terms, 'exact' => true ) ); 4) $query = new WP_Query( array( 's' => $terms, 'exact' => true, 'sentence' => true ) );
I wonder if some of the various white spaces characters:
https://en.wikipedia.org/wiki/Whitespace_character
including the double-width space, should be handled?
I played with the u
modifier in PCRE and skimmed through the great info here
Here's a test example, inspired by the question here:
$json = json_decode( '{ "string" : "This\u3000is\u2001a\u2002search\u2003for\u2004þ ö\tá\n" }' ); echo preg_replace( '/[\p{Zs}]/u', '_', $json->string ); echo preg_replace( '/[\pZ]/u', '_', $json->string ); echo preg_replace( '/[\pZ\pC]/u', '_', $json->string );
with output:
This_is_a_search_for_þ_ö á This_is_a_search_for_þ_ö á This_is_a_search_for_þ_ö_á_
I also played with e.g.:
if ( preg_match_all( '/".*?("|$)|((?<=[\t\p{Zs}",+])|^)[^\t\p{Zs}",+]+/u', $q['s'], $matches ) ) {
instead of:
if ( preg_match_all( '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/', $q['s'], $matches ) ) {
in WP_Query::parse_search()
.
I'm not sure about the general support for the PCRE u
modifier though and the security aspect of it.
There's a core _wp_can_use_pcre_u() function and ticket #24661 introduces a possible _wp_can_use_pcre_ucp()
function for character properties.
It's interesting to see how core uses the u
modifier, both with a support check and also without such a check, e.g. in sanitize_file_name()
, WP_Text_Diff_Renderer_inline::_splitOnWords()
and wp_maybe_decline_date()
.
#10
@
6 years ago
Hi @birgire
Japanese keyboard has two types of input method 'Japanese' and 'keyboard'.
Double-width space and single-width space are same key on those input method.
So, I am thinking we should handle only double-width space and single-width space as a separator for CJK.
Thanks. :)
#13
@
6 years ago
- Keywords needs-refresh added
- Milestone changed from 5.1 to 5.2
- Version trunk deleted
I like 44296_2.patch, but I'm inclined to agree that we should be adding support for more than just double-width space.
We could pass an array of spaces that we want treat as separators to the str_replace()
, rather than adding the u
modifier to the regex.
The mbstring PHP extension might be unavailable, so
mb_convert_kana()
needsfunction_exists()
check.I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.