Opened 10 years ago
Last modified 5 years ago
#28816 new defect (bug)
HTML entities in post titles break feeds
Reported by: | blowery | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | normal | Version: | 3.9.1 |
Component: | Feeds | Keywords: | |
Focuses: | | Cc: | |
Description
If the title of a blog post contains escaped HTML entities, like &amp;ndash; or &amp;rsaquo;, the feed containing that title becomes invalid XML. To repro:
- Start a new post
- Use a title of Broken &amp;ndash; Escaping
- Publish the post
- Load up the /feed/ URL for the blog and notice that the feed is invalid due to an unknown entity reference.
It appears the culprit is calling ent2ncr followed by esc_html as part of the the_title_rss filter. esc_html turns the &amp; into an actual &, which in the replaced string appears as an entity reference. Reversing the calling order of those two filter calls outputs the entity as an XML-style numeric reference, which fixes the feed, but is also wrong.
The title should be output as "Broken &amp;ndash; Escaping". Had the title been "Broken &ndash; Escaping" it should be output as "Broken &#8211; Escaping".
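For reference, XML predefines only five named entities (amp, lt, gt, quot, apos), so a named HTML entity such as &ndash; inside a feed is an unknown reference unless it is written numerically or its ampersand is escaped. The sketch below checks this with Python's standard XML parser. It is only an illustration: the helper name parse_title is made up here and nothing in it is WordPress code.

```python
import xml.etree.ElementTree as ET


def parse_title(xml_fragment):
    """Return the text of a <title> element, or None if the XML is not well-formed."""
    try:
        return ET.fromstring(xml_fragment).text
    except ET.ParseError:
        # Raised, among other things, for an undefined entity reference.
        return None


# Named HTML entity: not defined in XML, so the fragment is rejected.
named = parse_title("<title>Broken &ndash; Escaping</title>")

# Numeric character reference: always valid XML, decodes to an en dash.
numeric = parse_title("<title>Broken &#8211; Escaping</title>")

# Escaped ampersand: valid XML, decodes to the literal text "&ndash;".
escaped = parse_title("<title>Broken &amp;ndash; Escaping</title>")

print(named, numeric, escaped, sep="\n")
```

The last two forms are both well-formed but mean different things to a feed reader, which is why swapping the filter order can make the feed validate and still produce the wrong title.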
Attachments (1)
Change History (4)
#2
@
10 years ago
Good bug report, and I can confirm that the problem is still present in 4.1.
In this case, I think there are actually three separate issues:
- Function ent2ncr() is being too aggressive by substituting the standard XML entities with their numeric equivalents. Although this is not totally incorrect, it is unnecessary, and this function should really only need to replace the HTML entities that are not defined in XML.
- Function esc_html() should not be used for generating feed output, as it only performs a "single encode" rather than the "double encode" that is needed for including HTML within XML. For some background on this, see http://php.net/manual/en/function.htmlspecialchars.php.
- There is a bug in function esc_html() which means it effectively "eats" the string "&amp;" when this is followed by text that looks like a valid HTML entity.
Since this ticket is largely about feeds, I suggest we create a new ticket for the third issue.
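To make the "single encode" versus "double encode" distinction concrete, here is a rough Python sketch. The functions encode_once and encode_always are hypothetical stand-ins, loosely modeled on the double_encode flag of PHP's htmlspecialchars(). They are not the WordPress implementations and do not reproduce the esc_html() bug itself.

```python
import html
import re
import xml.etree.ElementTree as ET

# An ampersand that does not already start an entity or character reference.
BARE_AMP = re.compile(r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)")


def encode_once(text):
    """Hypothetical single encode: existing references are left untouched."""
    text = BARE_AMP.sub("&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


def encode_always(text):
    """Hypothetical double encode: every '&', '<' and '>' is escaped."""
    return html.escape(text, quote=False)


def xml_text(payload):
    """Wrap payload in <title> and return the parsed text, or None if invalid."""
    try:
        return ET.fromstring("<title>" + payload + "</title>").text
    except ET.ParseError:
        return None


html_title = "Fish &ndash; Chips & <b>peas</b>"

single = encode_once(html_title)    # still contains "&ndash;"
double = encode_always(html_title)  # "&ndash;" has become "&amp;ndash;"

print(single, xml_text(single))
print(double, xml_text(double))
```

With the single encode the named entity survives into the XML and the parser rejects it. With the double encode the parser accepts the payload and hands the original HTML string back to the consumer.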
#3
@
10 years ago
Added #31190 for the specific issue with esc_html(). Having looked at this and related escaping issues for a couple of days, my brain has turned to mush and I'm not sure what I believe anymore :-)
Also relevant for future investigation may be the fact that the <title> element in RSS is only supposed to contain text, unlike the <description> and <content:encoded> elements, which should contain HTML (see the RSS Best Practices Profile at http://www.rssboard.org/rss-profile).
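One way to picture the text-versus-HTML distinction is the Python sketch below. It assumes the title is reduced to plain text by decoding its entities and the body is carried as HTML, with XML escaping left to the serializer. The function build_item is a made-up name, and a real feed would also need to strip any tags from the title, which this sketch does not do.

```python
import html
import xml.etree.ElementTree as ET


def build_item(title_html, body_html):
    """Serialize a minimal RSS <item>: plain-text title, HTML description."""
    item = ET.Element("item")
    # Title: decode HTML entities so only text remains (tags are not handled here).
    ET.SubElement(item, "title").text = html.unescape(title_html)
    # Description: keep the HTML as-is; the serializer escapes it for XML.
    ET.SubElement(item, "description").text = body_html
    return ET.tostring(item, encoding="unicode")


item_xml = build_item("Broken &ndash; Escaping", "<p>Fish &amp; chips</p>")
parsed = ET.fromstring(item_xml)

print(item_xml)
print(parsed.find("title").text)
print(parsed.find("description").text)
```

Because the serializer does the escaping once, at the XML layer, the title carries the en dash as text and the description round-trips as the original HTML string.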
patch to apply ent2ncr after esc_html for the_title_rss