Make WordPress Core

Opened 6 years ago

Last modified 6 years ago

#44296 reviewing defect (bug)

Enable double-width space works as a separator in search query

Reported by: ryotsun's profile ryotsun Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Query Keywords: has-patch has-unit-tests needs-refresh
Focuses: Cc:

Description (last modified by SergeyBiryukov)

Related ticket: #43829

It has a bug in search query (search form).
It would be recognizing as one word in case of putting double-width space in search query.

So, it should be recognized as a separator like half-width space " ".

And most important thing in this ticket, it can be a clue in order to approach CJK unified ideographs.

cf. ) https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

Attachments (2)

44296_1.patch (1.7 KB) - added by ryotsun 6 years ago.
44296_2.patch (1.7 KB) - added by ryotsun 6 years ago.

Download all attachments as: .zip

Change History (17)

@ryotsun
6 years ago

#1 @SergeyBiryukov
6 years ago

  • Description modified (diff)

#2 follow-up: @tenpura
6 years ago

The mbstring PHP extension might be unavailable, so mb_convert_kana() needs function_exists() check.

I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.

This ticket was mentioned in Slack in #polyglots by nao. View the logs.


6 years ago

#4 @southp
6 years ago

From my understanding of Chinese, there shouldn't be a problem with that. However, we should not convert the quoted ideographic spaces, since it is likely intentional. e.g. '"I am quoted"' should be searching for "I am quoted" instead of "I am quoted".

Last edited 6 years ago by southp (previous) (diff)

#5 in reply to: ↑ 2 @ryotsun
6 years ago

@tenpura Thank you for your advice.

I should not have used mbstring. So, I've changed codes to use str_replace instead.

Replying to tenpura:

The mbstring PHP extension might be unavailable, so mb_convert_kana() needs function_exists() check.

I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.

@ryotsun
6 years ago

#6 @SergeyBiryukov
6 years ago

  • Milestone changed from Awaiting Review to 5.0
  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#7 @netweb
6 years ago

Related: #44541

#9 @birgire
6 years ago

The 44296_2.patch replaces the double-width space for all searches, but what about limiting it to the case where sentence is false (the default case)?

Then I think it would be nice to have unit tests for these cases:

1) $query = new WP_Query( array( 's' => $terms ) );
2) $query = new WP_Query( array( 's' => $terms, 'sentence' => true ) );
3) $query = new WP_Query( array( 's' => $terms, 'exact' => true ) );
4) $query = new WP_Query( array( 's' => $terms, 'exact' => true, 'sentence' => true  ) );

I wonder if some of the various white spaces characters:

https://en.wikipedia.org/wiki/Whitespace_character

including the double-width space, should be handled?

I played with the u modifier in PCRE and skimmed through the great info here

Here's a test example, inspired by the question here:

$json = json_decode( '{ "string" : "This\u3000is\u2001a\u2002search\u2003for\u2004þ ö\tá\n" }' );

echo preg_replace( '/[\p{Zs}]/u', '_', $json->string );
echo preg_replace( '/[\pZ]/u',    '_', $json->string );
echo preg_replace( '/[\pZ\pC]/u', '_', $json->string );

with output:

This_is_a_search_for_þ_ö	á

This_is_a_search_for_þ_ö	á

This_is_a_search_for_þ_ö_á_

I also played with e.g.:

if ( preg_match_all( '/".*?("|$)|((?<=[\t\p{Zs}",+])|^)[^\t\p{Zs}",+]+/u', $q['s'], $matches ) ) {

instead of:

if ( preg_match_all( '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/', $q['s'], $matches ) ) {

in WP_Query::parse_search().

I'm not sure about the general support for the PCRE u modifier though and the security aspect of it.

Ticket #24661 introduces the _wp_can_use_pcre_u() function.

It's interesting to see how core uses the u modifier, both with a support check and also without such a check, e.g. in sanitize_file_name(), WP_Text_Diff_Renderer_inline::_splitOnWords() and wp_maybe_decline_date().

Version 2, edited 6 years ago by birgire (previous) (next) (diff)

#10 @miyauchi
6 years ago

Hi @birgire

Japanese keyboard has two types of input method 'Japanese' and 'keyboard'.
Double-width space and single-width space are same key on those input method.

So, I am thinking we should handle only double-width space and single-width space as a separator for CJK.

Thanks. :)

#11 @miyauchi
6 years ago

Oh, I made mistake.

/'Japanese' and 'keyboard'/'Japanese' and 'English'/

#12 @pento
6 years ago

  • Milestone changed from 5.0 to 5.1

#13 @pento
6 years ago

  • Keywords needs-refresh added
  • Milestone changed from 5.1 to 5.2
  • Version trunk deleted

I like 44296_2.patch, but I'm inclined to agree that we should be adding support for more than just double-width space.

We could pass an array of spaces that we want treat as separators to the str_replace(), rather than adding the u modifier to the regex.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


6 years ago

#15 @audrasjb
6 years ago

  • Milestone changed from 5.2 to Future Release

As per today's bug scrub, we are going to move this one to Future Release since beta 3 and RC are approaching.

Note: See TracTickets for help on using tickets.