Make WordPress Core

Opened 13 years ago

Closed 10 years ago

Last modified 10 years ago

#19368 closed defect (bug) (invalid)

UTF-8 characters truncated mid-byte sequence in excerpt in RSS2 feed

Reported by: kurtmckee
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Feeds
Keywords:
Focuses:
Cc:

Description

I received a bug report at a project I maintain and discovered what appears to be a bug in WordPress 3.2.1.

The trouble is that the description element is being truncated in the middle of a UTF-8 multibyte character, which leaves an invalid byte sequence in the feed. An example can be found at:

http://www.arnaudmontebourg.fr/?feed=rss2

I downloaded the site's theme but found nothing that would affect post_excerpt or the_excerpt_rss. I then downloaded WordPress trunk and tried to figure out where the problem might be, but I'm unfamiliar with the WordPress source and couldn't find anything after tracing through multiple files using grep.

I did discover that trackback_url_list() in wp-includes/post.php appears to be using a simple substr() call that might cause problems with multibyte characters. However, I'm more concerned with the potential for malformed feeds.

I've attached a copy of the feed XML in question in case the live feed changes.

Attachments (1)

truncated-utf8.xml (11.5 KB) - added by kurtmckee 13 years ago.
Copy of feed from arnaudmontebourg.fr


Change History (6)

@kurtmckee
13 years ago

Copy of feed from arnaudmontebourg.fr

#1 @dd32
13 years ago

  • Component changed from General to Feeds

The template for the RSS feed is this one: http://core.trac.wordpress.org/browser/trunk/wp-includes/feed-rss2.php

The description element uses the_excerpt_rss(), which ultimately uses wp_trim_excerpt() to generate the excerpt.

That looks multibyte-safe to me, as it only splits on "\r\n\t ". I've not tested anything here, just traced it for you.
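A minimal sketch of why splitting on ASCII whitespace is safe for UTF-8: every byte of a multibyte sequence has its high bit set, so the bytes for "\r", "\n", "\t" and space can never occur inside a character. The function name trim_words_sketch() and its parameters are hypothetical and only approximate word-based trimming; this is not the wp_trim_excerpt() implementation.

```php
<?php
// Hypothetical helper (not WordPress code): keep the first $num_words
// whitespace-separated words and append a marker if anything was dropped.
function trim_words_sketch( $text, $num_words, $more = '[...]' ) {
    // Split on runs of ASCII whitespace only. With a limit of
    // $num_words + 1, the last element holds the untouched remainder.
    $words = preg_split( "/[\n\r\t ]+/", $text, $num_words + 1, PREG_SPLIT_NO_EMPTY );

    if ( count( $words ) > $num_words ) {
        array_pop( $words ); // discard the remainder
        return implode( ' ', $words ) . ' ' . $more;
    }
    return implode( ' ', $words );
}

// "Le résumé est très tronqué ici", written with explicit UTF-8 bytes.
$text    = "Le r\xC3\xA9sum\xC3\xA9 est tr\xC3\xA8s tronqu\xC3\xA9 ici";
$trimmed = trim_words_sketch( $text, 3 );

// preg_match() with the /u modifier returns false on invalid UTF-8.
var_dump( preg_match( '//u', $trimmed ) !== false ); // bool(true)
```

Because the cut points are always whitespace bytes, the result stays valid UTF-8 no matter how many multibyte characters the words contain.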

#2 @SergeyBiryukov
13 years ago

wp_trim_excerpt() is extensively used in many themes via the_excerpt() and has no UTF-8 issues that I know of, so steps to reproduce this on a clean install would be helpful.

My guess is that the feed in question is broken by some plugin or a theme customization.

#3 @kurtmckee
13 years ago

Unfortunately I don't have a clean install, and after looking at the feed more carefully this morning, the contents of the description elements are consistent in neither word count nor byte length. The problem with the feed is probably caused by a plugin or theme customization, as you've noted.

Before closing this ticket, is my concern about trackback_url_list() valid? The PHP documentation suggests that using substr() on a UTF-8 string can produce truncated byte sequences. At line 3092 in wp-includes/post.php:

if (strlen($excerpt) > 255) {
    $excerpt = substr($excerpt,0,252) . '...';
}
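To illustrate the concern, here is a sketch of how a byte-based cut at 252 can split a character, next to one way to avoid it. The helper utf8_safe_truncate() is hypothetical and not part of WordPress, and the sample string is made up; PHP's mbstring extension offers mb_strcut() for the same purpose where it is available.

```php
<?php
// Hypothetical helper (not WordPress code): truncate a UTF-8 string to at
// most $max_bytes bytes without splitting a multibyte character.
function utf8_safe_truncate( $text, $max_bytes ) {
    if ( strlen( $text ) <= $max_bytes ) {
        return $text;
    }
    $cut = $max_bytes;
    // UTF-8 continuation bytes look like 10xxxxxx (0x80-0xBF). If the first
    // dropped byte is one, the cut falls inside a character, so back up to
    // that character's lead byte and drop the character whole.
    while ( $cut > 0 && ( ord( $text[ $cut ] ) & 0xC0 ) === 0x80 ) {
        $cut--;
    }
    return substr( $text, 0, $cut );
}

// 251 ASCII bytes, then "é" (two bytes: 0xC3 0xA9), then padding.
$excerpt = str_repeat( 'a', 251 ) . "\xC3\xA9" . str_repeat( 'b', 10 );

$naive = substr( $excerpt, 0, 252 ) . '...';          // keeps 0xC3, drops 0xA9
$safe  = utf8_safe_truncate( $excerpt, 252 ) . '...'; // drops "é" entirely

// preg_match() with the /u modifier returns false on invalid UTF-8.
var_dump( preg_match( '//u', $naive ) !== false ); // bool(false)
var_dump( preg_match( '//u', $safe ) !== false );  // bool(true)
```

The safe version can return slightly fewer than 252 bytes, which seems an acceptable trade for output that is always valid UTF-8.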

#4 @chriscct7
10 years ago

  • Resolution set to invalid
  • Status changed from new to closed

That function is for URLs, which have to be percent-encoded, so that's not a bug.

#5 @DrewAPicture
10 years ago

  • Milestone Awaiting Review deleted