Opened 15 years ago
Last modified 5 years ago
#10543 accepted defect (bug)
Incorrect (non-UTF-8) character handling in tag's name and slug
Reported by: | sirzooro | Owned by: | westi |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 2.8.2 |
Component: | Charset | Keywords: | has-patch needs-unit-tests dev-feedback needs-refresh |
Focuses: | Cc: |
Description
Incorrect (non-UTF-8) character tag's name and slug are handled in different way: name is truncated on 1st such character, and in slug they are just removed (no truncation). WP should handle both in the same way - drop invalid characters, instead of truncation.
I found this issue recently. One of the Polish programs for adding posts to the Wordpresses does not encode tags in UTF-8 - it left them in ISO-8859-2. I notified author of this bug. Unfortunately there are many copies around, so it may take a long time before everyone upgrade.
Attachments (2)
Change History (27)
#2
@
15 years ago
Taxonomy API assumes that encoding is the same as current blog encoding. I think that Remote Publishing protocols also may be affected - this not a bug in WP itself, but in client which uses incorrect encoding.
#3
@
15 years ago
sirzooro, have you done any more troubleshooting on this issue? For example, did you confirm this is an input issue by seeing if the truncated version was saved in the database? And could you tell me how the tags are being input exactly? Does that Polish program use XML-RPC? Is there a specific byte sequence I can use to reproduce that behavior?
#4
@
15 years ago
This program (called Adder) adds new posts by simulating browser (it sends forms data via POST; no XML-RRC or similar methods).
I have analysed this problem and released plugin Adder Tags Fix, which fixes it. It checks if POST'ed data contains tags, and fixes tags encoding if necessary.
#5
@
15 years ago
One more note: this plugin fixes tags no matter how they are passed. As I remember, Adder uses only old method (used in WP 2.7 and earlier) to pass tags.
#6
@
15 years ago
I need to take notes because this is a bug nut to crack.
The flow of control for tags input is as follows:
wp-admin/post.php wp-admin/includes/post.php/write_post() wp-admin/includes/post.php/wp_write_post() wp-includes/post.php/wp_insert_post() wp-includes/post.php/wp_set_post_tags() wp-includes/post.php/wp_set_post_terms() wp-includes/taxonomy.php/wp_set_object_terms()
#7
@
15 years ago
sirzooro, I'll have to give you some suggestions at this point because I don't have any slick tools for injecting tag strings.
Go to the wp-includes/post.php file and start stubbing-in debug commands to the function wp_insert_post(). Specifically, look at the value of $tags_input and $tax_input right after the extract() is done. You might be able to use something like this:
error_log(print_r($tax_input, TRUE));
Figure out, at that point in the code, has the tag been truncated or modified in any way? This should give us a pretty good idea of how much string sanitizing has been done before the tags are passed to the Taxonomy API.
#8
@
15 years ago
miqrogroove,
You can try to add following code somewhere early in WP code, e.g. in wp-config.php:
$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' );
This code will inject tags with incorrect encoding to the $_POST array. Make sure you save file with utf-8 encoding. With this you should be able to reproduce issue.
#9
@
15 years ago
Steps to Reproduce:
Place code at top of wp-admin/post.php
$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' );
Navigate to wp-admin/post-new.php and save a new draft.
Expected Results:
Tag name and slug should be similar, and should be attached to the new post.
Actual Results:
Tag name "ace", slug "ace-aee-ace", tag not attached to the new post.
#11
@
15 years ago
Here is the code being generated by wp_insert_term() when it calls $wpdb->insert()
INSERT INTO `mwp_terms` (`name`,`slug`,`term_group`) VALUES ('ace ±æê ace','ace-aee-ace','0')
It's starting to look like sanitize_term() isn't working as intended.
#12
@
15 years ago
There is another bug in here that needs to be patched. For clarity, I filed a separate ticket #11166
#13
@
15 years ago
Nix previous comment. When testing with above steps, it is necessary to add:
unset($_POST['tax_input']);
Patch is on its way :)
#14
@
15 years ago
- Keywords has-patch needs-testing added; needs-patch removed
Steps to Test Patch (when not using other WP client):
Place code at top of wp-admin/post.php
$_POST['tags_input'] = mb_convert_encoding( 'ace ąćę ace', 'iso-8859-2', 'utf-8' ); unset($_POST['tax_input']);
Navigate to wp-admin/post-new.php and save a new draft.
Expected Results: Tag name and slug should be similar, and should be attached to the new post. Any tags containing invalid byte sequences should be truncated before slug generation. Any tags truncated to zero bytes should not be added to the database.
#15
@
15 years ago
sirzooro, my patch will truncate tags consistently so there is less confusion. I don't think WordPress has a function to "drop invalid characters, instead of truncation". What you are seeing in the slug is that WP tries to remove "accents" from certain characters for SEO reasons, uses a special function to URLencode the UTF-8 chars, and then deletes all remaining bytes. This is not possible to do in the tag names, because then everyone would have spelling errors in their tags. Once this patch is applied, your Polish software with mixed encodings will probably not work with tags until it is upgraded too.
#16
@
15 years ago
- Keywords needs-patch added; has-patch needs-testing removed
miqrogroove,
As I checked, wp_check_invalid_utf8() truncates string, instead of dropping invalid chars only. I have just created ticked #11175 to address this.
I have tried to test your patch. Unfortunately it does not work with latest nightly build of WP 2.9. When I try to add tag with invalid encoding, WP created new tag with empty name and slug set to tag's ID. As I checked, sanitize_term_field() calls few filters when $context is set to 'db'. One of them (pre_term_name) in turn calls following functions: sanitize_text_field, wp_filter_kses and _wp_specialchars. One of these functions returns empty string when tag contains chars with invalid encoding. Looks that you should call wp_check_invalid_utf8() before these filters.
Please also check why WP adds tag with empty name and id as slug.
#17
@
15 years ago
Ah, sanitize_text_field() is a new function that calls wp_check_invalid_utf8( $str ). Unfortunately, that means my patch and your new ticket wont help.
I'll stare at the patch some more, but it should have blocked empty tags if it was installed correctly.
#18
@
15 years ago
- Keywords has-patch tested added; needs-patch removed
Thanks. Looks that it works correctly now.
#19
@
15 years ago
- Keywords needs-unit-tests early added
- Milestone changed from 2.9 to 3.0
- Owner changed from filosofo to westi
- Status changed from new to accepted
I don't think we should change this at this stage in the release cycle.
I would prefer to do this early in 3.0 along with #11175 and give the change time to get a good lot of testing.
Moving to 3.0 for an early patch.
Is there any other way of testing this issue without injecting the invalid utf8 by hacking the core?
#20
@
15 years ago
Is there any other way of testing this issue without injecting the invalid utf8 by hacking the core?
For bottom-up testing, load the default filters and call your favorite function from the flow. wp_write_post() is the only function that relies on the superglobal input.
wp-admin/post.php wp-admin/includes/post.php/write_post() wp-admin/includes/post.php/wp_write_post() wp-includes/post.php/wp_insert_post() wp-includes/post.php/wp_set_post_tags() wp-includes/post.php/wp_set_post_terms() wp-includes/taxonomy.php/wp_set_object_terms() wp-includes/taxonomy.php/wp_insert_term() wp-includes/taxonomy.php/sanitize_term() wp-includes/taxonomy.php/sanitize_term_field()
I don't see anything relevant to character encoding in the Taxonomy API. I wonder if this is happening in one of the Remote Publishing protocols?