Opened 20 years ago
Closed 16 years ago
#1955 closed defect (bug) (fixed)
Outgoing trackback ping could include 'charset' attribute for international trackbacks
Reported by: |
|
Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 2.0 |
Component: | Optimization | Keywords: | trackback charset has-patch |
Focuses: | Cc: |
Description
Although the original trackback API (Six Apart) requires only title, excerpt, url and blog_name attributes in trackback pings, 'charset' attribute is getting appreciated to correclty parse trackbacks encoded in multibyte character sets.
Current WordPress does parse 'charset' attribute in incoming trackbacks (please see wp-trackback.php), but 'charset' attribute is lacking in outgoing trackbacks. Recieving trackbacks without 'charset' attribute, wp-trackback.php tries to convert characters using PHP mb_convert_encoding function, and this sometimes fails even for UTF-8 trackbacks causing broken characters.
Proposal:
/wp-includes/functions.php, line 776-
// Send a Trackback function trackback($trackback_url, $title, $excerpt, $ID) { global $wpdb, $wp_version; if ( empty($trackback_url) ) return; $title = urlencode($title); $excerpt = urlencode($excerpt); $blog_name = urlencode(get_settings('blogname')); $tb_url = $trackback_url; $url = urlencode(get_permalink($ID)); $query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt";
could be like this
// Send a Trackback function trackback($trackback_url, $title, $excerpt, $ID) { global $wpdb, $wp_version; if (empty($trackback_url)) return; $title = urlencode($title); $excerpt = urlencode($excerpt); $blog_name = urlencode(get_settings('blogname')); $tb_url = $trackback_url; $url = urlencode(get_permalink($ID)); $charset = get_settings('blog_charset'); $query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt&charset=$charset";
When 'charset' argument is received by other blog systems that do not require charset attribute, it will be just ignored. This modification may help many of multibyte character users. Thank you.
Attachments (3)
Change History (25)
#3
@
20 years ago
Hi philor,
Thanks for the comment. I agree that it is reasonable to parse charset in the header instead of the post. I may have misunderstood about the change of trackback specification seeing outdated websites. So we'd better to tweak wp-trackback.php, right?
#4
@
20 years ago
I modified wp-trackback.php to parse charset parameter in the request header. Below codes need to be verified, works for me though.
Thank you.
/wp-trackback.php, lines 33-
$tb_url = $_POST['url']; $title = $_POST['title']; $excerpt = $_POST['excerpt']; $blog_name = $_POST['blog_name']; $charset = $_POST['charset']; if ($charset) $charset = strtoupper( trim($charset) ); else $charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS';
change to
$tb_url = $_POST['url']; $title = $_POST['title']; $excerpt = $_POST['excerpt']; $blog_name = $_POST['blog_name']; $headers = apache_request_headers(); $charset = stristr ($headers['Content-Type'], 'charset='); if ($charset) { $charset = (strpos ($charset, ';')) ? substr ($charset, 8, (strpos($charset, ';') - 8)) : substr ($charset, 8); $charset = strtoupper (trim ($charset)); } else { $charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS'; }
#5
@
19 years ago
- Milestone set to 2.1
- Version changed from 1.6 to 2.0
$headers = apache_request_headers(); $charset = stristr ($headers['Content-Type'], 'charset=');
change to below:
$charset = stristr ($_SERVER['Content-Type'], 'charset=');
is possible?
#6
@
19 years ago
- Type changed from defect to enhancement
- 'ASCII' is 100% compatible with 'UTF-8' so it will be better to remove 'ASCII' from $charset;
- I've never seen error in conversion string from 'iso8859-1' to 'utf-8' (or other) if the string was encoded with 'euc-kr' or whatever. I don't know how mb_convert_encoding works, but I think mb_convert_encoding will judge 'iso8859-1' is the appropriate encoding because there's no error in conversion 'iso8859-1' to any other encoding. How about move 'iso8859-1' to the end of $charset.
- Why there's no Korean encoding like 'euc-kr', 'cp949' in $charset? (cp949 is compatible with euc-kr, and cp949 has every character in euc-kr, so it will be enough adding 'cp949' to $chraset)
- For system which doesn't support mbstrings, how about using iconv if there not exist mbstring module.
if ($_POST['charset']) $charset = strtoupper( trim($_POST['charset']) ); else $charset = 'UTF-8, EUC-KR, JIS, EUC-JP, SJIS, ISO-8859-1'; $passed = array( 0, 0, 0 ); if ( function_exists('mb_convert_encoding') ) { $title = mb_convert_encoding($title, get_settings('blog_charset'), $charset); $excerpt = mb_convert_encoding($excerpt, get_settings('blog_charset'), $charset); $blog_name = mb_convert_encoding($blog_name, get_settings('blog_charset'), $charset); } else if ( function_exists('iconv') ) { $charsets = explode( ', ', $charset ); for( $i = 0 ; $i < count($charsets); $i++ ){ if ( !$passed[0] ) $passed[0] = iconv($charsets[$i], get_settings('blog_charset'), $title); if ( !$passed[1] ) $passed[1] = iconv($charsets[$i], get_settings('blog_charset'), $excerpt); if ( !$passed[2] ) $passed[2] = iconv($charsets[$i], get_settings('blog_charset'), $blog_name); } $title = $passed[0]; $excerpt = $passed[1]; $blog_name = $passed[2]; }
#8
@
19 years ago
Is there a compelling reason not to just add a filter and make this pluggable? Then plugins, especially multilingual ones, can take advantage and send whatever trackback charset is required. (I'm assuming this area of the code isn't already pluggable, but I haven't checked).
#9
@
19 years ago
First patch upload
include:
- matopc's update
- iconv fallback
ps. trackback response is always UTF-8. so in LINE9, charset is set fixed value utf-8.
need : validate
#10
@
19 years ago
majelbstoat Unfortunately we can't make this pluggable. there's no do_action(...) in trackback function. I made korean trackback plugin, but it's almost hack! do_action('trackback_post') called after inserting trackback to the database. (in wp-trackback.php).
drssay here's utf-8 validation code
// Returns true if $string is valid UTF-8 and false otherwise. // utf-8 validation with regex expression function is_utf8($string) { return preg_match('%^(?: [\x09\x0A\x0D\x20-\x7E] # ASCII | [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte | \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte | \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates | \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 | [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 | \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 )*$%xs', $string); }
Here's another one. I referenced glib's utf-8 validating interface .
function utf8_validation( $str, &$i ){ $i = 0; $len = strlen($str); while( $i < $len ){ if( (ord($str[$i]) & 0xF0) == 0xE0 ){ if( $i <= ($len-3) && (ord($str[$i+1])&0x80 == 0x80) && (ord($str[$i+2])&0x80 == 0x80) ) $i += 3; else return; } // 2Byte else if( (ord($str[$i]) & 0xE0) == 0xC0 ){ if( $i <= ($len-2) && (ord($str[$i+1])&0x80 == 0x80) ) $i += 2; else return; } // 1Byte else { $i++; } } }
you can test the second code in the url below.
http://mytears.org/resources/mysrc/php/unicode/
#11
@
19 years ago
drssay I find bug in the patch :)
iconv's charset argument must be a single encoding like 'utf-8' or 'euc-kr', it can't be multiple encodings like 'utf-8, jis, euc-jp, ...'. So we should explode $charset to a single encoding.
#13
@
19 years ago
My suggestion was to make it pluggable!
This patch (1955.diff) adds the filter 'trackback_charset', so that any plugin can do whatever it wants to to the charset. Admittedly, it doesn't allow for iconv, but that whole section of the code could be plugged too if it was really needed.
#14
@
19 years ago
Also, regardless, we don't need to do any of that stuff if the tb_id isn't valid, so that check should come first.
#15
@
19 years ago
Hi,
I'm finding an error in the way wordpress is sending trackbacks in my blog, and I'd like to know if it's the same issue described here, so I don't file a duplicate bug report.
The article here:
http://eldiabloenlosdetalles.net/2006/08/16/%c2%bfque-es-the-long-tail-parte-i/
has a a trackback from:
http://eldiabloenlosdetalles.net/2006/08/20/%c2%bfque-es-the-long-tail-parte-ii/
but it shows up as:
http://eldiabloenlosdetalles.net/2006/08/20/c2bfque-es-the-long-tail-parte-ii/
so the link is broken. I appreciatte any feedback. Is that the same issue?
Carlos
#18
@
19 years ago
majelbstoat,
It is already pluggable. You can bypass or recode the whole part through preprocess_comment hook.
Note to matt and ryan:
I think this encoding conversion section is one of the worst code in the WP. Currently only UTF-8, ISO-8859-1 and JIS input are seemingly converted safely and many of the rest would possibly be destroyed or result in error (Notice that even EUC-JP and SJIS are destroyed!). It seems that the demerit of the existence of the code is bigger than the merit.
Suggested Solution:
Simply remove the input encoding conversion section and let local people write their own plugins.
Reasons:
- It is pluggable.
- Current incomming pingback section does not have this kind of engine and people seem fine without it.
- Implementing almighty input encoding conversion logic is almost impossible.
- If one has proper MySQL character set settings (like skip-character-set-client-handshake), theoretically any foreign encoding input can be retrievable later on.
- Nobody seems to like the current code.
Hmm. Isn't the bug that we ignore an incoming charset param on the content-type header, rather than that we don't send a charset param in the post data that hasn't ever seen the light of any spec? According to [1734] the WP Japan folks wanted it that way, but I'm not sure quite why, since that was after the August 2004 change in the Trackback spec that called for it on the content-type header, where we do send it.