#19368 closed defect (bug) (invalid)
UTF-8 characters truncated mid-byte sequence in excerpt in RSS2 feed
Reported by: | kurtmckee | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | normal | Version: | |
Component: | Feeds | Keywords: | |
Focuses: | | Cc: | |
Description
I received a bug report at a project I maintain and discovered what appears to be a bug in WordPress 3.2.1.
The trouble is that the description element is being truncated in the middle of a UTF-8 multibyte character, which leaves an invalid byte sequence in the feed. An example can be found at:
http://www.arnaudmontebourg.fr/?feed=rss2
I downloaded the site's theme but found nothing that would affect post_excerpt or the_excerpt_rss. I then downloaded WordPress trunk and tried to work out where the problem might be, but I'm unfamiliar with the WordPress source and couldn't find anything after tracing through multiple files using grep.
I did discover that trackback_url_list() in wp-includes/post.php appears to be using a simple substr() call that might cause problems with multibyte characters. However, I'm more concerned with the potential for malformed feeds.
I've included a copy of the feed XML in question for longevity.
Attachments (1)
Change History (6)
#1
@
13 years ago
- Component changed from General to Feeds
The template for the RSS feed is this one: http://core.trac.wordpress.org/browser/trunk/wp-includes/feed-rss2.php
The description element uses the_excerpt_rss(), which ultimately uses wp_trim_excerpt() to generate the excerpt.
That looks multibyte-safe to me, as it only splits on "\r\n\t ". I've not tested anything here, just traced it for you.
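The reasoning above can be sketched as follows. This is a minimal illustration, not the actual wp_trim_excerpt() code: trim_words_sketch() is a hypothetical helper, the sample sentence is made up, the file is assumed to be saved as UTF-8, and the mbstring extension is assumed to be available for the validity check. The point is only that every byte of a multibyte UTF-8 sequence is 0x80 or higher, so a split on ASCII whitespace can never land inside a character.

```php
<?php
// Hypothetical helper (not WordPress code): keep the first $limit words.
function trim_words_sketch(string $text, int $limit): string {
    // Split on ASCII whitespace only. Bytes inside a multibyte UTF-8
    // sequence are all >= 0x80, so these delimiters never match mid-character.
    $words = preg_split('/[\r\n\t ]+/', $text, $limit + 1, PREG_SPLIT_NO_EMPTY);
    if (count($words) > $limit) {
        array_pop($words); // drop the unsplit remainder of the text
        return implode(' ', $words) . ' [...]';
    }
    return implode(' ', $words);
}

$excerpt = trim_words_sketch("Été à Paris, déjà très chaud", 3);
var_dump(mb_check_encoding($excerpt, 'UTF-8')); // bool(true)
```

A word-based cut like this produces excerpts of varying byte length but always ends on a character boundary.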
#2
@
13 years ago
wp_trim_excerpt() is extensively used in many themes via the_excerpt() and has no UTF-8 issues that I know of, so steps to reproduce this on a clean install would be helpful.
My guess is that the feed in question is broken by some plugin or a theme customization.
#3
@
13 years ago
Unfortunately I don't have a clean install, and after looking at the feed more carefully this morning, the contents of the description element have neither a consistent word count nor a consistent byte length. The problem with the feed is probably an issue with a plugin or theme customization, as you've noted.
Before closing this ticket, is my concern about trackback_url_list() valid? The PHP documentation suggests that using substr() on a UTF-8 string can produce truncated byte sequences. At line 3092 in wp-includes/post.php:
if (strlen($excerpt) > 255) { $excerpt = substr($excerpt,0,252) . '...'; }
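To illustrate the concern, here is a small sketch of how a byte-based cut at offset 252 can split a two-byte character, alongside one possible alternative using mb_strcut(). The sample string is made up, mbstring is assumed to be installed, and this is not a proposed patch: mb_strcut() keeps the limit in bytes but backs off to the previous character boundary, whereas mb_substr() would change the limit to a character count.

```php
<?php
// Made-up sample: one ASCII byte followed by 200 copies of "é" (0xC3 0xA9),
// so every "é" starts on an odd byte offset and offset 252 falls mid-character.
$excerpt = 'a' . str_repeat("\xC3\xA9", 200);

if (strlen($excerpt) > 255) {
    // Same cut as the quoted line: ends on a lone 0xC3 lead byte.
    $bytewise = substr($excerpt, 0, 252) . '...';

    // Byte-limited cut that respects UTF-8 character boundaries.
    $safe = mb_strcut($excerpt, 0, 252, 'UTF-8') . '...';
}

var_dump(mb_check_encoding($bytewise, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($safe, 'UTF-8'));     // bool(true)
```

Whether this matters in practice depends on how the receiving end of the trackback handles invalid sequences.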
Copy of feed from arnaudmontebourg.fr