Opened 18 months ago
Last modified 18 months ago
#19368 new defect (bug)
UTF-8 characters truncated mid-byte sequence in excerpt in RSS2 feed
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | Feeds | Version: | |
| Severity: | normal | Keywords: | |
| Cc: | | | |
Description
I received a bug report at a project I maintain and discovered what appears to be a bug in WordPress 3.2.1.
The trouble is that the description element is being truncated in the middle of a UTF-8 multibyte character, which is producing garbage binary data. An example can be found at:
http://www.arnaudmontebourg.fr/?feed=rss2
I downloaded the site's theme but found nothing that would affect post_excerpt or the_excerpt_rss. I then downloaded WordPress trunk and tried to work out where the problem might be, but I'm unfamiliar with the WordPress source and couldn't find anything after tracing through multiple files using grep.
I did discover that trackback_url_list() in wp-includes/post.php appears to be using a simple substr() call that might cause problems with multibyte characters. However, I'm more concerned with the potential for malformed feeds.
I've attached a copy of the feed XML in question so that it remains available if the live feed changes.
Attachments (1)
Change History (4)
- Component changed from General to Feeds
The template for the RSS feed is this one: http://core.trac.wordpress.org/browser/trunk/wp-includes/feed-rss2.php
The description element uses the_excerpt_rss(), which ultimately uses wp_trim_excerpt() to generate the excerpt.
That looks multibyte-safe to me, as it only splits on "\r\n\t ". I haven't tested anything here, just traced it for you.
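To illustrate why splitting on ASCII whitespace should be safe for UTF-8: every byte of a multibyte sequence has its high bit set, so it can never match "\r", "\n", "\t" or a space. A minimal sketch, assuming a hypothetical helper trim_words_sketch() that only approximates the word-based trimming described above (it is not the actual wp_trim_excerpt() code):

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
// Keeps at most $max_words whitespace-separated words and appends $more
// when the text was shortened.
function trim_words_sketch(string $text, int $max_words, string $more = '[...]'): string {
    // Split on ASCII whitespace only. UTF-8 lead and continuation bytes are
    // all >= 0x80, so a split can never land inside a multibyte character.
    $words = preg_split('/[\r\n\t ]+/', $text, $max_words + 1, PREG_SPLIT_NO_EMPTY);
    if (count($words) > $max_words) {
        array_pop($words); // drop the unsplit remainder
        return implode(' ', $words) . ' ' . $more;
    }
    return implode(' ', $words);
}

$excerpt = trim_words_sketch("Déjà vu à l'école élémentaire", 3);
// preg_match() with the /u flag returns 1 only when the subject is valid UTF-8.
var_dump(preg_match('//u', $excerpt) === 1);
```

Because the cut points are always whole words, the result stays valid UTF-8 regardless of how many bytes each character occupies.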
comment:2
SergeyBiryukov — 18 months ago
wp_trim_excerpt() is extensively used in many themes via the_excerpt() and has no UTF-8 issues that I know of, so steps to reproduce this on a clean install would be helpful.
My guess is that the feed in question is broken by some plugin or a theme customization.
Unfortunately I don't have a clean install, and after looking at the feed more carefully this morning, the contents of the description element have neither a consistent word count nor a consistent byte length. The problem with the feed is probably caused by a plugin or theme customization, as you've noted.
Before closing this ticket, is my concern about trackback_url_list() valid? The PHP documentation suggests that using substr() on a UTF-8 string can produce truncated byte sequences. At line 3092 in wp-includes/post.php:
if (strlen($excerpt) > 255) {
    $excerpt = substr($excerpt,0,252) . '...';
}
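To make the concern concrete, here is a minimal sketch of a byte-bounded truncation that avoids cutting inside a UTF-8 sequence. The helper name utf8_safe_truncate() is hypothetical and not part of WordPress, and the sketch assumes the input is already valid UTF-8. It keeps the same 255-byte limit but backs the cut point up until it no longer sits on a continuation byte:

```php
<?php
// Hypothetical helper for illustration only; not WordPress core code.
// Assumes $text is valid UTF-8. Limits the result to $max_bytes bytes,
// including $suffix, without splitting a multibyte character.
function utf8_safe_truncate(string $text, int $max_bytes, string $suffix = '...'): string {
    if (strlen($text) <= $max_bytes) {
        return $text;
    }
    $cut = max(0, $max_bytes - strlen($suffix));
    // $text[$cut] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character, so
    // move back to the start of that character.
    while ($cut > 0 && (ord($text[$cut]) & 0xC0) === 0x80) {
        $cut--;
    }
    return substr($text, 0, $cut) . $suffix;
}

// 'a' followed by 200 two-byte characters: byte offset 252 falls mid-character.
$excerpt = 'a' . str_repeat('é', 200);
$naive   = substr($excerpt, 0, 252) . '...';
$safe    = utf8_safe_truncate($excerpt, 255);

// preg_match() with the /u flag returns 1 only when the subject is valid UTF-8.
var_dump(preg_match('//u', $naive) === 1); // invalid: ends with a dangling lead byte
var_dump(preg_match('//u', $safe) === 1);  // valid
```

Where the mbstring extension is available, mb_strcut() performs a similar byte-bounded cut on character boundaries and may be a simpler choice.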

Copy of feed from arnaudmontebourg.fr