Opened 4 years ago

Closed 10 months ago

Last modified 10 months ago

#19368 closed defect (bug) (invalid)

UTF-8 characters truncated mid-byte sequence in excerpt in RSS2 feed

Reported by: kurtmckee Owned by:
Milestone: Priority: normal
Severity: normal Version:
Component: Feeds Keywords:
Focuses: Cc:

Description

I received a bug report for a project I maintain and discovered what appears to be a bug in WordPress 3.2.1.

The trouble is that the description element is being truncated in the middle of a multibyte UTF-8 character, which leaves an invalid byte sequence in the feed. An example can be found at:

http://www.arnaudmontebourg.fr/?feed=rss2
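
The failure mode described above can be reproduced in isolation. What follows is a minimal Python sketch, not WordPress code: the sample string and the cut offset are made up for illustration, and it only shows that slicing UTF-8 bytes at an arbitrary offset can leave an incomplete sequence.

```python
# Illustrative only: the sample text and cut offset are assumptions,
# not taken from the affected feed.
text = "\u00e9t\u00e9"            # "été": the accented letters take 2 bytes each in UTF-8
encoded = text.encode("utf-8")    # b'\xc3\xa9t\xc3\xa9' (5 bytes)

# Cutting after 4 bytes keeps the lead byte 0xC3 of the last character
# but drops its continuation byte 0xA9.
truncated = encoded[:4]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


whole_ok = is_valid_utf8(encoded)
cut_ok = is_valid_utf8(truncated)
```

A strict XML parser would be expected to reject a feed containing bytes like `truncated`, which is why this matters more for feeds than for on-page excerpts.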

I downloaded the site's theme but found nothing that would affect post_excerpt or the_excerpt_rss. I then downloaded WordPress trunk and tried to find where the problem might be, but I'm unfamiliar with the WordPress source and couldn't find anything after tracing through multiple files using grep.

I did discover that trackback_url_list() in wp-includes/post.php appears to use a plain substr() call, which works on bytes and might therefore split multibyte characters. However, I'm more concerned with the potential for malformed feeds.

I've attached a copy of the feed XML in question in case the live feed changes.

Attachments (1)

truncated-utf8.xml (11.5 KB) - added by kurtmckee 4 years ago.
Copy of feed from arnaudmontebourg.fr

Change History (6)

@kurtmckee 4 years ago

Copy of feed from arnaudmontebourg.fr

comment:1 @dd32 4 years ago

  • Component changed from General to Feeds

The template for the RSS feed is this one: http://core.trac.wordpress.org/browser/trunk/wp-includes/feed-rss2.php

The description element uses the_excerpt_rss(), which ultimately uses wp_trim_excerpt() to generate the excerpt.

That looks multibyte-safe to me, as it only splits on "\r\n\t ". I haven't tested anything here, just traced it for you.
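
The reasoning here rests on a property of UTF-8: bytes below 0x80, which include "\r", "\n", "\t" and space, never occur inside a multibyte sequence. The sketch below illustrates that in Python rather than PHP. `trim_words_bytes` is a hypothetical helper and not a port of wp_trim_excerpt(); it only shows that splitting raw UTF-8 bytes on those four characters cannot cut a character in half.

```python
import re


# Hypothetical helper: keeps the first `num_words` words of UTF-8 *bytes*,
# splitting only on the ASCII whitespace bytes \r, \n, \t and space.
def trim_words_bytes(data: bytes, num_words: int) -> bytes:
    words = re.split(rb"[\r\n\t ]+", data.strip(b"\r\n\t "))
    return b" ".join(words[:num_words])


# Made-up sample containing multibyte characters and mixed whitespace.
sample = "Les \u00e9lecteurs fran\u00e7ais\tont vot\u00e9\r\nhier".encode("utf-8")

trimmed = trim_words_bytes(sample, 3)
# decode() would raise UnicodeDecodeError if a character had been split.
decoded = trimmed.decode("utf-8")
```

Because the cut points are always whitespace bytes, the result stays valid UTF-8 regardless of how many words are kept.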

comment:2 @SergeyBiryukov 4 years ago

wp_trim_excerpt() is extensively used in many themes via the_excerpt() and has no UTF-8 issues that I know of, so steps to reproduce this on a clean install would be helpful.

My guess is that the feed in question is broken by some plugin or a theme customization.

comment:3 @kurtmckee 4 years ago

Unfortunately I don't have a clean install, and after looking at the feed more carefully this morning, I see that the contents of the description elements don't even have a consistent word count or byte length. The problem with the feed is probably caused by a plugin or theme customization, as you've noted.

Before closing this ticket, is my concern about trackback_url_list() valid? The PHP documentation suggests that using substr() on a UTF-8 string can produce truncated byte sequences. At line 3092 in wp-includes/post.php:

if (strlen($excerpt) > 255) {
    $excerpt = substr($excerpt,0,252) . '...';
}
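
For comparison, here is a minimal Python sketch of a byte-limited truncation that backs off to a character boundary. It is not a proposed patch: `truncate_utf8` and its parameters are hypothetical names, and the 255 and "..." values simply mirror the PHP snippet above.

```python
# Hypothetical helper mirroring the limits in the PHP snippet above:
# strings longer than max_bytes are cut so that head + suffix fits.
def truncate_utf8(text: str, max_bytes: int = 255, suffix: str = "...") -> str:
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    budget = max_bytes - len(suffix.encode("utf-8"))
    # The input was valid UTF-8, so the only bytes that can fail to decode
    # are a trailing incomplete sequence; errors="ignore" drops them.
    head = encoded[:budget].decode("utf-8", errors="ignore")
    return head + suffix


# Made-up excerpt: 1 ASCII byte followed by 200 two-byte characters (401 bytes).
excerpt = "a" + "\u00e9" * 200

naive = excerpt.encode("utf-8")[:252]   # ends on a dangling 0xC3 lead byte
safe = truncate_utf8(excerpt)
```

In this example the naive cut ends on an orphaned lead byte, while the boundary-aware version gives up one byte of the budget and stays decodable.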

comment:4 @chriscct7 10 months ago

  • Resolution set to invalid
  • Status changed from new to closed

That function is for URLs, which have to be percent-encoded, so that's not a bug.
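
One way to read this reasoning: a percent-encoded string contains only ASCII characters, so cutting it at any byte offset cannot split a UTF-8 sequence. The Python sketch below illustrates that property with a made-up value; whether the excerpt passed to trackback_url_list() is actually percent-encoded at that point is not shown in this ticket.

```python
from urllib.parse import quote

# Made-up sample value, not taken from WordPress.
raw = "caf\u00e9"
encoded = quote(raw, safe="")     # non-ASCII characters become %XX escapes

# Every character is ASCII, so any slice is still valid UTF-8 ...
only_ascii = encoded.isascii()

# ... although a cut can still land inside a %XX escape.
cut = encoded[:5]
cut_bytes_ok = cut.encode("ascii").decode("utf-8") == cut
```

The slice remains well-formed text, but it may end in an incomplete escape, which is a separate concern from the truncated byte sequences discussed earlier.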

comment:5 @DrewAPicture 10 months ago

  • Milestone Awaiting Review deleted