Make WordPress Core

Opened 10 years ago

Last modified 7 years ago

#28058 new defect (bug)

Taxonomies defined with UTF8 encoded names cause notices when adding a new term

Reported by: mikejolley Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 3.9
Component: Taxonomy Keywords:
Focuses: Cc:

Description

This one is easy to reproduce as follows:

  1. Register a new taxonomy with UTF-8 characters in its name, e.g. pa_資料庫版本. WooCommerce (WC) in particular allows this through its attribute system.
  2. Add a term via the admin panel
  3. You get notices like:
Notice: Trying to get property of non-object in /Users/patrick/Documents/woothemes/woocommerce/wp-includes/link-template.php on line 685

I traced it back to https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/screen.php#L413

After adding any term, this sanitize_key() call turns pa_資料庫版本 into just 'pa', so the taxonomy is not loaded because no taxonomy named 'pa' exists.

Removing the sanitize_key() call fixes the issue, so the sanitisation could be removed, modified, or moved after the taxonomy checks.
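One way to read "moved after the taxonomy checks" is sketched below, purely as an illustration and not as a proposed core patch. The $registered array and the resolve_taxonomy() helper are hypothetical stand-ins (WordPress itself would use taxonomy_exists()), and the character filter is a standalone approximation of sanitize_key().

```php
<?php
// Hypothetical stand-in for the taxonomy registry; in WordPress this role
// is played by taxonomy_exists(), which is not loaded in this sketch.
$registered = array( 'category', 'post_tag', 'pa_資料庫版本' );

// Hypothetical helper: look the raw name up first, sanitize only as a fallback.
function resolve_taxonomy( string $raw, array $registered ): ?string {
	if ( in_array( $raw, $registered, true ) ) {
		return $raw; // Exact match with a registered name: keep it untouched.
	}

	// Approximation of sanitize_key(): lowercase, then keep a-z, 0-9, "_" and "-".
	$key = (string) preg_replace( '/[^a-z0-9_\-]/', '', strtolower( $raw ) );

	return in_array( $key, $registered, true ) ? $key : null;
}

var_dump( resolve_taxonomy( 'pa_資料庫版本', $registered ) === 'pa_資料庫版本' ); // bool(true)
var_dump( resolve_taxonomy( 'Post_Tag', $registered ) );                        // string(8) "post_tag"
var_dump( resolve_taxonomy( 'missing', $registered ) );                         // NULL
```

Because the raw value is only accepted when it exactly matches a registered name, arbitrary request input still never passes through unsanitized.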

This was originally logged at https://github.com/woothemes/woocommerce/issues/5314.

Change History (7)

#1 @jesin
10 years ago

The real culprit is this - https://core.trac.wordpress.org/browser/tags/3.9/src/wp-includes/formatting.php#L1040

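That regex (/[^a-z0-9_\-]/) removes every character other than lowercase ASCII letters, digits, underscores and hyphens, so all non-ASCII characters are stripped.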
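The effect can be reproduced outside WordPress with a minimal sketch. The strip_to_key_chars() function is a hypothetical name for a standalone approximation built from the regex quoted above; it is not sanitize_key() itself and applies no filter hook.

```php
<?php
// Hypothetical helper: standalone approximation of the character filter
// quoted above. It does not load WordPress or run the 'sanitize_key' filter.
function strip_to_key_chars( string $raw_key ): string {
	// Lowercase, then drop every byte that is not a-z, 0-9, "_" or "-".
	return (string) preg_replace( '/[^a-z0-9_\-]/', '', strtolower( $raw_key ) );
}

$raw = 'pa_資料庫版本';
$key = strip_to_key_chars( $raw );

var_dump( $key === $raw );                           // bool(false): the name was altered
var_dump( preg_match( '/^[a-z0-9_\-]*$/', $key ) );  // int(1): only ASCII key characters remain
var_dump( strpos( $raw, $key ) === 0 );              // bool(true): what is left is the ASCII prefix
```

Since the pattern has no /u modifier, it works byte by byte, and every byte of a multi-byte UTF-8 character falls outside the allowed set.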

As a workaround, this particular key can be exempted from sanitization:

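add_filter( 'sanitize_key', 'utf_untouched', 10, 2 );

// Return the raw key unchanged for this one taxonomy name; all other keys keep their sanitized value.
function utf_untouched( $key, $raw_key ) {
	if ( 'pa_資料庫版本' === $raw_key ) {
		return $raw_key;
	}

	return $key;
}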

#2 @knutsp
10 years ago

  • Keywords reporter-feedback added

Maybe I misunderstand this, but why is it important for the name of the taxonomy to be non-ASCII? The name is internal and used in a URL, either as an archive or as ?my-tax-name=term

The label, however, is the visible part of the taxonomy identification.

#3 @mikejolley
10 years ago

@knutsp I'd personally never need to use non-ASCII chars, but if you are Chinese and speak Chinese, how else would you represent 資料庫版本?

In WooCommerce, users can create global attributes where they define the name and label. This issue occurs when they use non-ASCII characters in the name, which is understandable if they are not English speakers, I guess.

#4 @SergeyBiryukov
10 years ago

  • Keywords reporter-feedback removed

#5 @helen
10 years ago

  • Focuses administration removed
  • Version changed from trunk to 3.9

#6 @boonebgorges
9 years ago

  • Milestone changed from Awaiting Review to Future Release

To clarify: The issue here is not with *terms*, it's with the taxonomy name itself, correct? E.g.:

register_taxonomy( 'pa_資料庫版本', $args );

Looking through the component, it looks like we don't explicitly support UTF8 characters in taxonomy names, though we don't forbid them either; in most places, the use of these characters for taxonomy names will work fine, but clearly there are some finer points where things break. (The same thing is almost certainly true of post types.)

It would be great to clean this up and provide full support for taxonomies/post types with non-ASCII characters in their names. This will take a pretty thorough review, however. Some things to check:

  • The 'taxonomy' field in the 'wp_term_taxonomy' table is VARCHAR(32), which imposes an absolute maximum length on taxonomy names. We throw a related _doing_it_wrong() notice in register_taxonomy() based on strlen(). This check would need to use mb_strlen() instead.
  • Non-ASCII characters will be stored differently (or sometimes not at all) in databases with different character encoding. This means that a taxonomy name that works properly on one WP installation may not work properly on another one, just due to the DB charset/collation. This might be an education issue for plugin authors; or it might suggest that core should be stricter about not allowing certain character types in certain fields that are used as keys in plugins/themes.
  • We should take special care testing rewrite issues, as non-ASCII characters will be encoded in various places in the context of URLs.
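The length and URL points in the list above can be illustrated with a short sketch. It assumes the mbstring extension is available and that the name is UTF-8 encoded; it only measures the string and does not call register_taxonomy().

```php
<?php
$taxonomy = 'pa_資料庫版本';

$bytes = strlen( $taxonomy );             // Counts bytes.
$chars = mb_strlen( $taxonomy, 'UTF-8' ); // Counts characters (requires mbstring).

printf( "bytes: %d, characters: %d\n", $bytes, $chars ); // bytes: 18, characters: 8

// Percent-encoding, as used in URLs, turns each non-ASCII byte into three characters.
$encoded = rawurlencode( $taxonomy );
printf( "encoded length: %d\n", strlen( $encoded ) );    // encoded length: 48
```

A byte-based check therefore rejects a CJK name well before it reaches 32 characters, and the encoded form that appears in rewrite rules is longer again.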