WordPress.org

Make WordPress Core

Opened 14 months ago

Last modified 3 months ago

#44296 reviewing defect (bug)

Enable double-width space works as a separator in search query

Reported by: ryotsun Owned by: SergeyBiryukov
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Query Keywords: has-patch has-unit-tests needs-refresh
Focuses: Cc:

Description (last modified by SergeyBiryukov)

Related ticket: #43829

It has a bug in search query (search form).
It would be recognizing as one word in case of putting double-width space in search query.

So, it should be recognized as a separator like half-width space " ".

And most important thing in this ticket, it can be a clue in order to approach CJK unified ideographs.

cf. ) https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

Attachments (2)

44296_1.patch (1.7 KB) - added by ryotsun 14 months ago.
44296_2.patch (1.7 KB) - added by ryotsun 14 months ago.

Download all attachments as: .zip

Change History (17)

@ryotsun
14 months ago

#1 @SergeyBiryukov
14 months ago

  • Description modified (diff)

#2 follow-up: @tenpura
14 months ago

The mbstring PHP extension might be unavailable, so mb_convert_kana() needs function_exists() check.

I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.

This ticket was mentioned in Slack in #polyglots by nao. View the logs.


14 months ago

#4 @southp
14 months ago

From my understanding of Chinese, there shouldn't be a problem with that. However, we should not convert the quoted ideographic spaces, since it is likely intentional. e.g. '"I am quoted"' should be searching for "I am quoted" instead of "I am quoted".

Last edited 14 months ago by southp (previous) (diff)

#5 in reply to: ↑ 2 @ryotsun
14 months ago

@tenpura Thank you for your advice.

I should not have used mbstring. So, I've changed codes to use str_replace instead.

Replying to tenpura:

The mbstring PHP extension might be unavailable, so mb_convert_kana() needs function_exists() check.

I'm sure that ideographic space is legitimate as a search word separator in Japanese, but I'm not sure that it is treated exactly the same when it comes to search in other CJK or similar languages. If you are 100% certain about it, fine. If not, it might be safer to confirm it to polyglots community to minimize possibilities of breaking something.

@ryotsun
14 months ago

#6 @SergeyBiryukov
13 months ago

  • Milestone changed from Awaiting Review to 5.0
  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#7 @netweb
13 months ago

Related: #44541

#9 @birgire
12 months ago

The 44296_2.patch replaces the double-width space for all searches, but what about limiting it to the case where sentence is false (the default case)?

Then I think it would be nice to have unit tests for these cases:

1) $query = new WP_Query( array( 's' => $terms ) );
2) $query = new WP_Query( array( 's' => $terms, 'sentence' => true ) );
3) $query = new WP_Query( array( 's' => $terms, 'exact' => true ) );
4) $query = new WP_Query( array( 's' => $terms, 'exact' => true, 'sentence' => true  ) );

I wonder if some of the various white spaces characters:

https://en.wikipedia.org/wiki/Whitespace_character

including the double-width space, should be handled?

I played with the u modifier in PCRE and skimmed through the great info here

Here's a test example, inspired by the question here:

$json = json_decode( '{ "string" : "This\u3000is\u2001a\u2002search\u2003for\u2004þ ö\tá\n" }' );

echo preg_replace( '/[\p{Zs}]/u', '_', $json->string );
echo preg_replace( '/[\pZ]/u',    '_', $json->string );
echo preg_replace( '/[\pZ\pC]/u', '_', $json->string );

with output:

This_is_a_search_for_þ_ö	á

This_is_a_search_for_þ_ö	á

This_is_a_search_for_þ_ö_á_

I also played with e.g.:

if ( preg_match_all( '/".*?("|$)|((?<=[\t\p{Zs}",+])|^)[^\t\p{Zs}",+]+/u', $q['s'], $matches ) ) {

instead of:

if ( preg_match_all( '/".*?("|$)|((?<=[\t ",+])|^)[^\t ",+]+/', $q['s'], $matches ) ) {

in WP_Query::parse_search().

I'm not sure about the general support for the PCRE u modifier though and the security aspect of it.

There's a core _wp_can_use_pcre_u() function and ticket #24661 introduces a possible _wp_can_use_pcre_ucp() function for character properties.

It's interesting to see how core uses the u modifier, both with a support check and also without such a check, e.g. in sanitize_file_name(), WP_Text_Diff_Renderer_inline::_splitOnWords() and wp_maybe_decline_date().

Last edited 12 months ago by birgire (previous) (diff)

#10 @miyauchi
12 months ago

Hi @birgire

Japanese keyboard has two types of input method 'Japanese' and 'keyboard'.
Double-width space and single-width space are same key on those input method.

So, I am thinking we should handle only double-width space and single-width space as a separator for CJK.

Thanks. :)

#11 @miyauchi
12 months ago

Oh, I made mistake.

/'Japanese' and 'keyboard'/'Japanese' and 'English'/

#12 @pento
9 months ago

  • Milestone changed from 5.0 to 5.1

#13 @pento
6 months ago

  • Keywords needs-refresh added
  • Milestone changed from 5.1 to 5.2
  • Version trunk deleted

I like 44296_2.patch, but I'm inclined to agree that we should be adding support for more than just double-width space.

We could pass an array of spaces that we want treat as separators to the str_replace(), rather than adding the u modifier to the regex.

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


3 months ago

#15 @audrasjb
3 months ago

  • Milestone changed from 5.2 to Future Release

As per today's bug scrub, we are going to move this one to Future Release since beta 3 and RC are approaching.

Note: See TracTickets for help on using tickets.