Make WordPress Core

Opened 15 years ago

Last modified 5 years ago

#10543 accepted defect (bug)

Incorrect (non-UTF-8) character handling in tag's name and slug

Reported by: sirzooro's profile sirzooro Owned by: westi's profile westi
Milestone: Priority: normal
Severity: normal Version: 2.8.2
Component: Charset Keywords: has-patch needs-unit-tests dev-feedback needs-refresh
Focuses: Cc:

Description

Incorrect (non-UTF-8) character tag's name and slug are handled in different way: name is truncated on 1st such character, and in slug they are just removed (no truncation). WP should handle both in the same way - drop invalid characters, instead of truncation.

I found this issue recently. One of the Polish programs for adding posts to the Wordpresses does not encode tags in UTF-8 - it left them in ISO-8859-2. I notified author of this bug. Unfortunately there are many copies around, so it may take a long time before everyone upgrade.

Attachments (2)

wp-includes-taxonomy.php.patch (1.2 KB) - added by miqrogroove 15 years ago.
wp-includes-taxonomy.php.2.patch (1.2 KB) - added by miqrogroove 15 years ago.
Moves lines to better match 2.9 style.

Download all attachments as: .zip

Change History (27)

#1 @miqrogroove
15 years ago

I don't see anything relevant to character encoding in the Taxonomy API. I wonder if this is happening in one of the Remote Publishing protocols?

#2 @sirzooro
15 years ago

Taxonomy API assumes that encoding is the same as current blog encoding. I think that Remote Publishing protocols also may be affected - this not a bug in WP itself, but in client which uses incorrect encoding.

#3 @miqrogroove
15 years ago

sirzooro, have you done any more troubleshooting on this issue? For example, did you confirm this is an input issue by seeing if the truncated version was saved in the database? And could you tell me how the tags are being input exactly? Does that Polish program use XML-RPC? Is there a specific byte sequence I can use to reproduce that behavior?

#4 @sirzooro
15 years ago

This program (called Adder) adds new posts by simulating browser (it sends forms data via POST; no XML-RRC or similar methods).

I have analysed this problem and released plugin Adder Tags Fix, which fixes it. It checks if POST'ed data contains tags, and fixes tags encoding if necessary.

#5 @sirzooro
15 years ago

One more note: this plugin fixes tags no matter how they are passed. As I remember, Adder uses only old method (used in WP 2.7 and earlier) to pass tags.

#6 @miqrogroove
15 years ago

I need to take notes because this is a bug nut to crack.

The flow of control for tags input is as follows:

wp-admin/post.php
wp-admin/includes/post.php/write_post()
wp-admin/includes/post.php/wp_write_post()
wp-includes/post.php/wp_insert_post()
wp-includes/post.php/wp_set_post_tags()
wp-includes/post.php/wp_set_post_terms()
wp-includes/taxonomy.php/wp_set_object_terms()

#7 @miqrogroove
15 years ago

sirzooro, I'll have to give you some suggestions at this point because I don't have any slick tools for injecting tag strings.

Go to the wp-includes/post.php file and start stubbing-in debug commands to the function wp_insert_post(). Specifically, look at the value of $tags_input and $tax_input right after the extract() is done. You might be able to use something like this:

error_log(print_r($tax_input, TRUE));

Figure out, at that point in the code, has the tag been truncated or modified in any way? This should give us a pretty good idea of how much string sanitizing has been done before the tags are passed to the Taxonomy API.

#8 @sirzooro
15 years ago

miqrogroove,
You can try to add following code somewhere early in WP code, e.g. in wp-config.php:

$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' );

This code will inject tags with incorrect encoding to the $_POST array. Make sure you save file with utf-8 encoding. With this you should be able to reproduce issue.

#9 @miqrogroove
15 years ago

Steps to Reproduce:

Place code at top of wp-admin/post.php

$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' );

Navigate to wp-admin/post-new.php and save a new draft.

Expected Results:
Tag name and slug should be similar, and should be attached to the new post.

Actual Results:
Tag name "ace", slug "ace-aee-ace", tag not attached to the new post.

#10 @miqrogroove
15 years ago

Getting warmer... the bug appears to be somewhere in the Taxonomy API afterall.

#11 @miqrogroove
15 years ago

Here is the code being generated by wp_insert_term() when it calls $wpdb->insert()

INSERT INTO `mwp_terms` (`name`,`slug`,`term_group`) VALUES ('ace ±æê ace','ace-aee-ace','0')

It's starting to look like sanitize_term() isn't working as intended.

#12 @miqrogroove
15 years ago

There is another bug in here that needs to be patched. For clarity, I filed a separate ticket #11166

#13 @miqrogroove
15 years ago

Nix previous comment. When testing with above steps, it is necessary to add:

unset($_POST['tax_input']);

Patch is on its way :)

#14 @miqrogroove
15 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

Steps to Test Patch (when not using other WP client):

Place code at top of wp-admin/post.php

$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' );
unset($_POST['tax_input']);

Navigate to wp-admin/post-new.php and save a new draft.

Expected Results: Tag name and slug should be similar, and should be attached to the new post. Any tags containing invalid byte sequences should be truncated before slug generation. Any tags truncated to zero bytes should not be added to the database.

#15 @miqrogroove
15 years ago

sirzooro, my patch will truncate tags consistently so there is less confusion. I don't think WordPress has a function to "drop invalid characters, instead of truncation". What you are seeing in the slug is that WP tries to remove "accents" from certain characters for SEO reasons, uses a special function to URLencode the UTF-8 chars, and then deletes all remaining bytes. This is not possible to do in the tag names, because then everyone would have spelling errors in their tags. Once this patch is applied, your Polish software with mixed encodings will probably not work with tags until it is upgraded too.

#16 @sirzooro
15 years ago

  • Keywords needs-patch added; has-patch needs-testing removed

miqrogroove,
As I checked, wp_check_invalid_utf8() truncates string, instead of dropping invalid chars only. I have just created ticked #11175 to address this.

I have tried to test your patch. Unfortunately it does not work with latest nightly build of WP 2.9. When I try to add tag with invalid encoding, WP created new tag with empty name and slug set to tag's ID. As I checked, sanitize_term_field() calls few filters when $context is set to 'db'. One of them (pre_term_name) in turn calls following functions: sanitize_text_field, wp_filter_kses and _wp_specialchars. One of these functions returns empty string when tag contains chars with invalid encoding. Looks that you should call wp_check_invalid_utf8() before these filters.

Please also check why WP adds tag with empty name and id as slug.

#17 @miqrogroove
15 years ago

Ah, sanitize_text_field() is a new function that calls wp_check_invalid_utf8( $str ). Unfortunately, that means my patch and your new ticket wont help.

I'll stare at the patch some more, but it should have blocked empty tags if it was installed correctly.

@miqrogroove
15 years ago

Moves lines to better match 2.9 style.

#18 @sirzooro
15 years ago

  • Keywords has-patch tested added; needs-patch removed

Thanks. Looks that it works correctly now.

#19 @westi
15 years ago

  • Keywords needs-unit-tests early added
  • Milestone changed from 2.9 to 3.0
  • Owner changed from filosofo to westi
  • Status changed from new to accepted

I don't think we should change this at this stage in the release cycle.

I would prefer to do this early in 3.0 along with #11175 and give the change time to get a good lot of testing.

Moving to 3.0 for an early patch.

Is there any other way of testing this issue without injecting the invalid utf8 by hacking the core?

#20 @miqrogroove
15 years ago

Is there any other way of testing this issue without injecting the invalid utf8 by hacking the core?

For bottom-up testing, load the default filters and call your favorite function from the flow. wp_write_post() is the only function that relies on the superglobal input.

wp-admin/post.php
wp-admin/includes/post.php/write_post()
wp-admin/includes/post.php/wp_write_post()
wp-includes/post.php/wp_insert_post()
wp-includes/post.php/wp_set_post_tags()
wp-includes/post.php/wp_set_post_terms()
wp-includes/taxonomy.php/wp_set_object_terms()
wp-includes/taxonomy.php/wp_insert_term()
wp-includes/taxonomy.php/sanitize_term()
wp-includes/taxonomy.php/sanitize_term_field()

#21 @scribu
15 years ago

  • Component changed from Taxonomy to Charset

#22 @ryan
14 years ago

  • Milestone changed from 3.0 to 3.1

#23 @nacin
14 years ago

  • Milestone changed from Awaiting Triage to Future Release

#24 @chriscct7
10 years ago

  • Keywords dev-feedback added; tested early removed

#25 @chriscct7
9 years ago

  • Keywords needs-refresh added
Note: See TracTickets for help on using tickets.