Make WordPress Core

Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#33864 closed defect (bug) (fixed)

Cannot add accented tag if non-accented look-alike exists (eg. szel, szél)

Reported by: gezamiklo's profile geza.miklo Owned by: boonebgorges's profile boonebgorges
Milestone: 4.4 Priority: normal
Severity: normal Version: 4.3
Component: Taxonomy Keywords: has-patch dev-feedback
Focuses: Cc:

Description

The key is the query checking the existence of the keyword. A slight modification will make it work:
See function get_term_by in wp-includes/taxonomy.php

The query in line 1465 is:

$term = $wpdb->get_row( $wpdb->prepare( "SELECT t.*, tt.* FROM $wpdb->terms AS t INNER JOIN $wpdb->term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy = %s AND $field = %s LIMIT 1", $taxonomy, $value ) );

If you add BINARY to the query, then

$term = $wpdb->get_row( $wpdb->prepare( "SELECT t.*, tt.* FROM $wpdb->terms AS t INNER JOIN $wpdb->term_taxonomy AS tt ON t.term_id = tt.term_id WHERE tt.taxonomy = %s AND BINARY $field = %s LIMIT 1", $taxonomy, $value ) );

Attachments (2)

33731.diff (756 bytes) - added by mehulkaklotar 9 years ago.
33731.patch (1.1 KB) - added by tyxla 9 years ago.
Adding unit tests for the 2 cases: term with accent, followed by term without accent; term without accent, followed by term with accent.

Download all attachments as: .zip

Change History (18)

#1 @swissspidy
9 years ago

  • Milestone Awaiting Review deleted
  • Resolution set to duplicate
  • Status changed from new to closed

Duplicate of #33731.

#2 @geza.miklo
9 years ago

No it isn't a duplicate. Accents are supported is appropriate encoding is used. We are using utf-8. So I can add accented tags. But if I add szél at first I cannot add szel anymore. But in hungarian szél is wind and szel means cut. So they are completely different things.

The problem lies in the part createing the slug and checking for the existence of the word as tag name. You can see that szél and szel should not be the same name. However mysql's comparison depends on the database collation. Using the default these two queries' results will be the same:

SELECT * FROM wp_tags WHERE name = 'szel';
-- lists szel and szél if both exists

SELECT * FROM wp_tags WHERE name = 'szél';
-- lists szel and szél two is both exits

SELECT * FROM wp_tags WHERE BINARY name = 'szel';
-- will list only szel

SELECT * FROM wp_tags WHERE BINARY name = 'szél';
-- will list only szél

Please reconsider marking it a duplicate.

#3 @SergeyBiryukov
9 years ago

  • Component changed from General to Taxonomy
  • Keywords needs-patch added
  • Milestone set to 4.4
  • Resolution duplicate deleted
  • Status changed from closed to reopened

Reopening, does not seem to be related to #33731.

@mehulkaklotar
9 years ago

#4 @tyxla
9 years ago

  • Keywords has-patch added; needs-patch removed

Patch works for me. This needs unit tests, adding some right away.

@tyxla
9 years ago

Adding unit tests for the 2 cases: term with accent, followed by term without accent; term without accent, followed by term with accent.

#5 follow-up: @boonebgorges
9 years ago

  • Keywords dev-feedback added

Switching to BINARY means, among other things, that term matches will be case-sensitive. This is a change from current behavior.

Can I ask for a couple things -

  • Let's do a review of where get_term_by() is used in core. Are there places in the core interface where we depend on the case-insensitivity of a get_term_by() check?
  • If we have a problem like this in get_term_by(), it's certain that the same problem arises elsewhere. At a very quick glance, term_exists() and get_terms() (with 'name') will be affected. Likewise with Posts. We should have a single solution for all of these.
  • It looks like casting the field to BINARY means that we lose the use of the index. The comments on https://dev.mysql.com/doc/refman/5.0/en/charset-binary-op.html suggest that we should cast the comparison value instead: $field = BINARY %s. This needs testing.
  • Another (maybe lame) idea is to do the byte comparison in PHP, where we can more easily control for things like case. So, maybe instead of get_row() we get_results(), and then loop through to find the one that matches the field value exactly.

@pento or someone else with more MySQL expertise than me: what other side effects will we risk by doing a BINARY comparison here?

#6 in reply to: ↑ 5 @geza.miklo
9 years ago

You are absolutely right on being case-sensitive this way. That's not good too.

I see 2 options:

  • add BINARY LOWER($field) = LOWER(%s)
  • check equality in PHP and use get_results as yout suggested

Replying to boonebgorges:

Switching to BINARY means, among other things, that term matches will be case-sensitive. This is a change from current behavior.

Can I ask for a couple things -

  • Let's do a review of where get_term_by() is used in core. Are there places in the core interface where we depend on the case-insensitivity of a get_term_by() check?
  • If we have a problem like this in get_term_by(), it's certain that the same problem arises elsewhere. At a very quick glance, term_exists() and get_terms() (with 'name') will be affected. Likewise with Posts. We should have a single solution for all of these.
  • It looks like casting the field to BINARY means that we lose the use of the index. The comments on https://dev.mysql.com/doc/refman/5.0/en/charset-binary-op.html suggest that we should cast the comparison value instead: $field = BINARY %s. This needs testing.
  • Another (maybe lame) idea is to do the byte comparison in PHP, where we can more easily control for things like case. So, maybe instead of get_row() we get_results(), and then loop through to find the one that matches the field value exactly.

@pento or someone else with more MySQL expertise than me: what other side effects will we risk by doing a BINARY comparison here?

#7 follow-up: @pento
9 years ago

From a MySQL perspective, inefficient index use is the biggest concern.

I also don't like the idea of just dropping the collation behaviour - while it causes incorrect results in this instance, dropping it is just as likely to cause other incorrect results. For example, using a generic english keyboard, I'd want to type "resume" and have it match "résumé", because typing accented characters is not fun.

#8 in reply to: ↑ 7 @geza.miklo
9 years ago

Please note, that in foreign languages other than english two words with the same slug can have completely different meaning.

By the way now you cannot add both "resume" and "résumé". See my problem. So you brought an english example which shows that the current is not the best behavior.

I understand your concerns but this is not an acceptable behaviour in non english languages having accented characters. We've changed the collation on the database. Now it's sideeffect is:

  • lowercase words are on the end of the list
  • The whole stuff got case-sensitive

Seems the best solution can be using get_results and check the match in PHP.

Replying to pento:

From a MySQL perspective, inefficient index use is the biggest concern.

I also don't like the idea of just dropping the collation behaviour - while it causes incorrect results in this instance, dropping it is just as likely to cause other incorrect results. For example, using a generic english keyboard, I'd want to type "resume" and have it match "résumé", because typing accented characters is not fun.

#9 follow-up: @boonebgorges
9 years ago

Thanks for the feedback, pento. Whatever happens here, we'll be sure to preserve the proper use of the index.

I agree with geza.miklo that the current state of affairs is unacceptable. Having 'resume' match 'résumé' (etc) is a minor convenience in English, but it's debilitating in other languages.

It's worth noting that the particular issue in this ticket has to do only with the 'name' field for terms. There are fairly few places in the interface where this value is queried based on user input. I'm thinking that the best solution might be something like this: allow get_term_by( 'name' ) (and maybe term_exists()) to do strict vs loose lookups. When checking for the uniqueness of a term - as happens in wp_insert_term() - the lookup should be strict, so that you can have terms 'szél' and 'szel'. But in cases where a person is trying to match a name from the interface - namely, the tags autocomplete field - we can do a loose match; typing 'szel' will return both 'szel' and 'szél'. This would mean something like the following:

function get_term_by( $field, $value, $taxonomy, $output, $filter, $match = 'loose' ) {
    // ...
    if ( 'strict' === $match && 'name' === $field ) {
        $matches = $wpdb->get_results( 'SELECT ...' );
        foreach ( $matches as $match ) {
            if ( mb_strtolower( $match->name ) === mb_strtolower( $value ) ) {
                $term = $match;
                break;
            }
        }
    }
}

We'd then identify which uses of get_term_by() should be 'strict' (probably just the ones used for uniqueness in wp_insert_term() and wp_update_term(), at least for now). All existing uses would remain loose, maintaining backward compatibility.

The only change from current behavior for languages like English will be a term with the name 'resume' will not prevent the creation of a term with the name 'résumé' on the post edit page. But this seems minor, especially since the tag suggest box would return 'résumé' if typed sans accent.

#10 in reply to: ↑ 9 @geza.miklo
9 years ago

This seems to be a fair solution. Nice!

Replying to boonebgorges:

Thanks for the feedback, pento. Whatever happens here, we'll be sure to preserve the proper use of the index.

I agree with geza.miklo that the current state of affairs is unacceptable. Having 'resume' match 'résumé' (etc) is a minor convenience in English, but it's debilitating in other languages.

It's worth noting that the particular issue in this ticket has to do only with the 'name' field for terms. There are fairly few places in the interface where this value is queried based on user input. I'm thinking that the best solution might be something like this: allow get_term_by( 'name' ) (and maybe term_exists()) to do strict vs loose lookups. When checking for the uniqueness of a term - as happens in wp_insert_term() - the lookup should be strict, so that you can have terms 'szél' and 'szel'. But in cases where a person is trying to match a name from the interface - namely, the tags autocomplete field - we can do a loose match; typing 'szel' will return both 'szel' and 'szél'. This would mean something like the following:

function get_term_by( $field, $value, $taxonomy, $output, $filter, $match = 'loose' ) {
    // ...
    if ( 'strict' === $match && 'name' === $field ) {
        $matches = $wpdb->get_results( 'SELECT ...' );
        foreach ( $matches as $match ) {
            if ( mb_strtolower( $match->name ) === mb_strtolower( $value ) ) {
                $term = $match;
                break;
            }
        }
    }
}

We'd then identify which uses of get_term_by() should be 'strict' (probably just the ones used for uniqueness in wp_insert_term() and wp_update_term(), at least for now). All existing uses would remain loose, maintaining backward compatibility.

The only change from current behavior for languages like English will be a term with the name 'resume' will not prevent the creation of a term with the name 'résumé' on the post edit page. But this seems minor, especially since the tag suggest box would return 'résumé' if typed sans accent.

#11 @boonebgorges
9 years ago

I've been thinking more about this, and I'm less a fan of my previously suggested approach than I was when I wrote it :) For one thing, adding a strictness parameter to get_term_by() - especially one that is only used when field=name - is quite ugly and unwieldy. For another thing, I am concerned about backward compatibility and performance concerns in get_term_by().

Since this is a very local problem - it only affects the creation of new terms via wp_insert_term() - I think we can go with a more local solution. Instead of using get_term_by( 'name' ) to find potential duplicates, we'll use get_terms() with the name parameter. This'll return *all* terms that MySQL deems a match. Then we do a stricter comparison in PHP before deciding that we've found an actual match. Very similar in spirit to the get_results() approach suggested above, but much less likely to cause damage.

#12 @boonebgorges
9 years ago

  • Owner set to boonebgorges
  • Resolution set to fixed
  • Status changed from reopened to closed

In 34809:

When creating terms, avoid false dupe checks due to accented characters.

wp_insert_term() doesn't allow the creation of a term when the term name
is the same as another term in the same hierarchy level of the same taxonomy.
Previously, this duplicate check used get_term_by( 'name' ), which uses the
database collation to determine sameness. But common collations do not
distinguish between accented and non-accented versions of a character. As a
result, it was impossible to create a term 'Foo' if a sibling term with an
accented character existed.

We address this problem by using get_terms() to do the duplicate check. This
query returns all potentially matching terms. We then do a stricter check
for equivalence in PHP, before determining whether one of the matches is
indeed a duplicate.

Props boonebgorges, tyxla, geza.miklo, mehulkaklotar.
Fixes #33864.

This ticket was mentioned in Slack in #core by boone. View the logs.


9 years ago

This ticket was mentioned in Slack in #core by boone. View the logs.


9 years ago

#15 @boonebgorges
9 years ago

#16567 was marked as a duplicate.

#16 @boonebgorges
9 years ago

#22296 was marked as a duplicate.

Note: See TracTickets for help on using tickets.