Make WordPress Core

Opened 20 years ago

Closed 16 years ago

#1955 closed defect (bug) (fixed)

Outgoing trackback ping could include 'charset' attribute for international trackbacks

Reported by: matopc's profile matopc Owned by:
Milestone: Priority: normal
Severity: normal Version: 2.0
Component: Optimization Keywords: trackback charset has-patch
Focuses: Cc:

Description

Although the original trackback API (Six Apart) requires only title, excerpt, url and blog_name attributes in trackback pings, 'charset' attribute is getting appreciated to correclty parse trackbacks encoded in multibyte character sets.

Current WordPress does parse 'charset' attribute in incoming trackbacks (please see wp-trackback.php), but 'charset' attribute is lacking in outgoing trackbacks. Recieving trackbacks without 'charset' attribute, wp-trackback.php tries to convert characters using PHP mb_convert_encoding function, and this sometimes fails even for UTF-8 trackbacks causing broken characters.

Proposal:
/wp-includes/functions.php, line 776-

// Send a Trackback
function trackback($trackback_url, $title, $excerpt, $ID) {
	global $wpdb, $wp_version;

	if ( empty($trackback_url) )
		return;

	$title = urlencode($title);
	$excerpt = urlencode($excerpt);
	$blog_name = urlencode(get_settings('blogname'));
	$tb_url = $trackback_url;
	$url = urlencode(get_permalink($ID));
	$query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt";

could be like this

// Send a Trackback
function trackback($trackback_url, $title, $excerpt, $ID) {
	global $wpdb, $wp_version;
 
	if (empty($trackback_url))
	return;
 
	$title = urlencode($title);
	$excerpt = urlencode($excerpt);
	$blog_name = urlencode(get_settings('blogname'));
	$tb_url = $trackback_url;
	$url = urlencode(get_permalink($ID));
	$charset = get_settings('blog_charset');
	$query_string = "title=$title&url=$url&blog_name=$blog_name&excerpt=$excerpt&charset=$charset";

When 'charset' argument is received by other blog systems that do not require charset attribute, it will be just ignored. This modification may help many of multibyte character users. Thank you.

Attachments (3)

trackback.patch (1.5 KB) - added by drssay 19 years ago.
1955.diff (1.3 KB) - added by majelbstoat 19 years ago.
Filter trackback charset.
trackback2.patch (1.8 KB) - added by drssay 19 years ago.

Download all attachments as: .zip

Change History (25)

#1 @davidhouse
20 years ago

  • Keywords bg|has-patch added

#2 @philor
20 years ago

Hmm. Isn't the bug that we ignore an incoming charset param on the content-type header, rather than that we don't send a charset param in the post data that hasn't ever seen the light of any spec? According to [1734] the WP Japan folks wanted it that way, but I'm not sure quite why, since that was after the August 2004 change in the Trackback spec that called for it on the content-type header, where we do send it.

#3 @matopc
20 years ago

Hi philor,
Thanks for the comment. I agree that it is reasonable to parse charset in the header instead of the post. I may have misunderstood about the change of trackback specification seeing outdated websites. So we'd better to tweak wp-trackback.php, right?

#4 @matopc
20 years ago

I modified wp-trackback.php to parse charset parameter in the request header. Below codes need to be verified, works for me though.
Thank you.

/wp-trackback.php, lines 33-

$tb_url    = $_POST['url'];
$title     = $_POST['title'];
$excerpt   = $_POST['excerpt'];
$blog_name = $_POST['blog_name'];
$charset   = $_POST['charset'];

if ($charset)
	$charset = strtoupper( trim($charset) );
else
	$charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS';

change to

$tb_url    = $_POST['url'];
$title     = $_POST['title'];
$excerpt   = $_POST['excerpt'];
$blog_name = $_POST['blog_name'];
$headers = apache_request_headers();
$charset = stristr ($headers['Content-Type'], 'charset=');

if ($charset) {
    $charset = (strpos ($charset, ';')) ? substr ($charset, 8, (strpos($charset, ';') - 8)) : substr ($charset, 8);
    $charset = strtoupper (trim ($charset));
    } else {
    $charset = 'ASCII, UTF-8, ISO-8859-1, JIS, EUC-JP, SJIS';
    }

#5 @drssay
19 years ago

  • Milestone set to 2.1
  • Version changed from 1.6 to 2.0
$headers = apache_request_headers();
$charset = stristr ($headers['Content-Type'], 'charset=');

change to below:

$charset = stristr ($_SERVER['Content-Type'], 'charset=');

is possible?

#6 @Tae-young
19 years ago

  • Type changed from defect to enhancement
  • 'ASCII' is 100% compatible with 'UTF-8' so it will be better to remove 'ASCII' from $charset;
  • I've never seen error in conversion string from 'iso8859-1' to 'utf-8' (or other) if the string was encoded with 'euc-kr' or whatever. I don't know how mb_convert_encoding works, but I think mb_convert_encoding will judge 'iso8859-1' is the appropriate encoding because there's no error in conversion 'iso8859-1' to any other encoding. How about move 'iso8859-1' to the end of $charset.
  • Why there's no Korean encoding like 'euc-kr', 'cp949' in $charset? (cp949 is compatible with euc-kr, and cp949 has every character in euc-kr, so it will be enough adding 'cp949' to $chraset)
  • For system which doesn't support mbstrings, how about using iconv if there not exist mbstring module.
    if ($_POST['charset'])
        $charset = strtoupper( trim($_POST['charset']) );
    else
        $charset = 'UTF-8, EUC-KR, JIS, EUC-JP, SJIS, ISO-8859-1';

    $passed = array( 0, 0, 0 );
    if ( function_exists('mb_convert_encoding') ) {

        $title     = mb_convert_encoding($title, get_settings('blog_charset'), $charset);
        $excerpt   = mb_convert_encoding($excerpt, get_settings('blog_charset'), $charset);
        $blog_name = mb_convert_encoding($blog_name, get_settings('blog_charset'), $charset);
    }
    else if ( function_exists('iconv') ) {
        $charsets = explode( ', ', $charset  );
        
        for( $i = 0 ; $i < count($charsets); $i++ ){
            if ( !$passed[0] )
                $passed[0] = iconv($charsets[$i], get_settings('blog_charset'), $title);
            if ( !$passed[1] )
                $passed[1] = iconv($charsets[$i], get_settings('blog_charset'), $excerpt);
            if ( !$passed[2] )
                $passed[2] = iconv($charsets[$i], get_settings('blog_charset'), $blog_name);
        }
        $title = $passed[0];
        $excerpt = $passed[1];
        $blog_name = $passed[2];

    }

#7 @aqua0125
19 years ago

  • Type changed from enhancement to defect

#8 @majelbstoat
19 years ago

Is there a compelling reason not to just add a filter and make this pluggable? Then plugins, especially multilingual ones, can take advantage and send whatever trackback charset is required. (I'm assuming this area of the code isn't already pluggable, but I haven't checked).

#9 @drssay
19 years ago

First patch upload

include:

  1. matopc's update
  2. iconv fallback

ps. trackback response is always UTF-8. so in LINE9, charset is set fixed value utf-8.

need : validate

@drssay
19 years ago

#10 @Tae-young
19 years ago

majelbstoat Unfortunately we can't make this pluggable. there's no do_action(...) in trackback function. I made korean trackback plugin, but it's almost hack! do_action('trackback_post') called after inserting trackback to the database. (in wp-trackback.php).

drssay here's utf-8 validation code

// Returns true if $string is valid UTF-8 and false otherwise.
// utf-8 validation with regex expression
function is_utf8($string) {
return preg_match('%^(?:
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*$%xs', $string);
}

Here's another one. I referenced glib's utf-8 validating interface .

function utf8_validation( $str, &$i ){

    $i = 0;
    $len = strlen($str);

    while( $i < $len ){
    
        if( (ord($str[$i]) & 0xF0) == 0xE0 ){
            if( $i <= ($len-3) && 
                (ord($str[$i+1])&0x80 == 0x80) &&
                (ord($str[$i+2])&0x80 == 0x80) )
                    $i += 3;
            else
                return;
        }
        // 2Byte
        else if( (ord($str[$i]) & 0xE0) == 0xC0 ){
            if( $i <= ($len-2) && 
                (ord($str[$i+1])&0x80 == 0x80) )
                    $i += 2;
            else
                return;
        }
        // 1Byte
        else {
            $i++;
        }
        
    }
    
}

you can test the second code in the url below.
http://mytears.org/resources/mysrc/php/unicode/

#11 @Tae-young
19 years ago

drssay I find bug in the patch :)

iconv's charset argument must be a single encoding like 'utf-8' or 'euc-kr', it can't be multiple encodings like 'utf-8, jis, euc-jp, ...'. So we should explode $charset to a single encoding.

#12 @drssay
19 years ago

iconv charset error resolved

#13 @majelbstoat
19 years ago

My suggestion was to make it pluggable!

This patch (1955.diff) adds the filter 'trackback_charset', so that any plugin can do whatever it wants to to the charset. Admittedly, it doesn't allow for iconv, but that whole section of the code could be plugged too if it was really needed.

@majelbstoat
19 years ago

Filter trackback charset.

#14 @majelbstoat
19 years ago

Also, regardless, we don't need to do any of that stuff if the tb_id isn't valid, so that check should come first.

@drssay
19 years ago

#15 @raghos
19 years ago

Hi,

I'm finding an error in the way wordpress is sending trackbacks in my blog, and I'd like to know if it's the same issue described here, so I don't file a duplicate bug report.

The article here:
http://eldiabloenlosdetalles.net/2006/08/16/%c2%bfque-es-the-long-tail-parte-i/

has a a trackback from:
http://eldiabloenlosdetalles.net/2006/08/20/%c2%bfque-es-the-long-tail-parte-ii/

but it shows up as:
http://eldiabloenlosdetalles.net/2006/08/20/c2bfque-es-the-long-tail-parte-ii/

so the link is broken. I appreciatte any feedback. Is that the same issue?

Carlos

#16 @ryan
19 years ago

raghos, that is a different issue that has already been fixed in trunk.

#17 @matt
19 years ago

  • Milestone changed from 2.1 to 2.2

#18 @tenpura
19 years ago

majelbstoat,
It is already pluggable. You can bypass or recode the whole part through preprocess_comment hook.

Note to matt and ryan:
I think this encoding conversion section is one of the worst code in the WP. Currently only UTF-8, ISO-8859-1 and JIS input are seemingly converted safely and many of the rest would possibly be destroyed or result in error (Notice that even EUC-JP and SJIS are destroyed!). It seems that the demerit of the existence of the code is bigger than the merit.

Suggested Solution:
Simply remove the input encoding conversion section and let local people write their own plugins.

Reasons:

  • It is pluggable.
  • Current incomming pingback section does not have this kind of engine and people seem fine without it.
  • Implementing almighty input encoding conversion logic is almost impossible.
  • If one has proper MySQL character set settings (like skip-character-set-client-handshake), theoretically any foreign encoding input can be retrievable later on.
  • Nobody seems to like the current code.

#19 @foolswisdom
18 years ago

  • Keywords has-patch added; bg|has-patch removed
  • Milestone changed from 2.2 to 2.3

#20 @Nazgul
18 years ago

  • Milestone changed from 2.3 to 2.4

#21 @pishmishy
17 years ago

  • Milestone changed from 2.5 to 2.6

Bumping milestone for feature freeze.

#22 @Denis-de-Bernardy
16 years ago

  • Milestone 2.9 deleted
  • Resolution set to fixed
  • Status changed from new to closed

closing as fixed -- lack of activity and obvious lack of interest from the devs. plus, the odds are this has been fixed when the function was changed to use wp_remote_post().

Note: See TracTickets for help on using tickets.