Make WordPress Core

Opened 10 years ago

Last modified 5 years ago

#28816 new defect (bug)

HTML entities in post titles break feeds

Reported by: blowery
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 3.9.1
Component: Feeds
Keywords:
Focuses:
Cc:

Description

If the title of a blog post contains escaped HTML entities, such as &amp;ndash; or &amp;rsaquo;, the feed containing that title becomes invalid XML. To reproduce:

  1. Start a new post
  2. Use a title of Broken &amp;ndash; Escaping
  3. Publish the post
  4. Load the /feed/ URL for the blog and notice that the feed is invalid because of an unknown entity reference.
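
The failure in step 4 can be checked without a WordPress install. The following is a minimal sketch in Python; the helper name is_well_formed and the hand-written RSS fragment are assumptions made for illustration. XML predefines only five named entities (amp, lt, gt, quot, apos), so a bare &ndash; reference is rejected by a conforming parser unless a DTD declares it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(title_markup: str) -> bool:
    """Return True if an RSS <title> holding the given markup parses as XML.

    Hypothetical helper for illustration; not part of WordPress.
    """
    document = "<rss><channel><title>" + title_markup + "</title></channel></rss>"
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# A named HTML entity that XML does not define makes the document invalid.
print(is_well_formed("Broken &ndash; Escaping"))      # False
# A numeric character reference is valid XML.
print(is_well_formed("Broken &#8211; Escaping"))      # True
# An escaped ampersand is valid XML; the title text is then literally "&ndash;".
print(is_well_formed("Broken &amp;ndash; Escaping"))  # True
```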

It appears the culprit is calling ent2ncr followed by esc_html as part of the the_title_rss filter. esc_html turns the &amp; into a literal &, which in the resulting string reads as the start of an entity reference. Reversing the order of those two filter calls outputs the entity as an XML-style numeric reference, which fixes the feed but is also wrong.
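
The effect of the call order can be illustrated outside WordPress. The sketch below is in Python and uses two hypothetical stand-ins, model_ent2ncr and model_esc_html, that mimic only the behaviour described in this ticket; they are assumptions for illustration, not ports of the real ent2ncr() and esc_html(). It also assumes the title is stored with an escaped ampersand (&amp;ndash;).

```python
import re
from html.entities import name2codepoint

# Matches text that looks like an entity reference once the "&" is consumed.
ENTITY_TAIL = r"(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"


def model_ent2ncr(text: str) -> str:
    """Stand-in: replace every known named entity with a numeric reference."""
    def to_numeric(match):
        codepoint = name2codepoint.get(match.group(1))
        return match.group(0) if codepoint is None else "&#%d;" % codepoint
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_numeric, text)


def model_esc_html(text: str) -> str:
    """Stand-in: escape for HTML without double-encoding existing entities."""
    # Assumed behaviour: encoded ampersands are decoded first ...
    text = re.sub(r"&(?:amp|#0*38);", "&", text)
    # ... and only ampersands that do not start an entity-like run are re-encoded.
    text = re.sub(r"&(?!" + ENTITY_TAIL + ")", "&amp;", text)
    return text.replace("<", "&lt;").replace(">", "&gt;")


title = "Broken &amp;ndash; Escaping"

# Order described in the ticket: ent2ncr first, then esc_html.
current = model_esc_html(model_ent2ncr(title))
# Reversed order, as proposed in the attached patch.
patched = model_ent2ncr(model_esc_html(title))

print(current)  # Broken &ndash; Escaping
print(patched)  # Broken &#8211; Escaping
```

Under these assumptions the first result contains an entity that XML does not define, and the second parses as XML but shows a dash where the author wrote the literal text "&ndash;".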

The title should be output as "Broken – Escaping". Had the title been "Broken – Escaping" it should be output as "Broken – Escaping".

Attachments (1)

order.patch (721 bytes) - added by blowery 10 years ago.
Patch to apply ent2ncr after esc_html for the_title_rss


Change History (4)

@blowery
10 years ago

Patch to apply ent2ncr after esc_html for the_title_rss

#1 @stevenkword
10 years ago

See also #9993


#2 @mdgl
10 years ago

Good bug report. I can confirm that the problem is still present in 4.1.

In this case, I think there are actually three separate issues:

  • Function ent2ncr() is too aggressive: it replaces the standard XML entities with their numeric equivalents. This is not incorrect, but it is unnecessary; the function only needs to replace the HTML entities that are not defined in XML.
  • Function esc_html() should not be used for generating feed output, as it performs only a "single encode" rather than the "double encode" that is needed for including HTML within XML. For some background on this, see http://php.net/manual/en/function.htmlspecialchars.php.
  • There is a bug in function esc_html() which means it effectively "eats" the string "&amp;" when this is followed by text that looks like a valid HTML entity.

Since this ticket is largely about feeds, I suggest we create a new ticket for the third issue.
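
As a rough illustration of the first point, here is a minimal sketch in Python of a narrower replacement that leaves the five XML-predefined entities alone and converts only the other named HTML entities. The function name replace_non_xml_entities is hypothetical, and the entity table comes from Python's standard library rather than from WordPress.

```python
import re
from html.entities import name2codepoint

# Named entities that XML itself defines; these need no replacement.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_non_xml_entities(text: str) -> str:
    """Convert named HTML entities to numeric references, except XML's own."""
    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML entities and unknown names exactly as written.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", to_numeric, text)


print(replace_non_xml_entities("a &amp; b &ndash; c &rsaquo;"))
# a &amp; b &#8211; c &#8250;
```

Because &amp; is left untouched, an escaped title such as "Broken &amp;ndash; Escaping" passes through this sketch unchanged.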

#3 @mdgl
10 years ago

Added #31190 for the specific issue with esc_html(). Having looked at this and related escaping issues for a couple of days, my brain has turned to mush and I'm not sure what I believe anymore :-)

Also relevant for future investigation: the <title> element in RSS is supposed to contain only text, unlike the <description> and <content:encoded> elements, which should contain HTML (see the RSS Best Practices Profile at http://www.rssboard.org/rss-profile).
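
One possible reading of that distinction, sketched in Python under stated assumptions: the stored title is treated as HTML text and reduced to plain characters before XML escaping, while HTML destined for <description> is XML-escaped as it stands (the "double encode" mentioned earlier). The function names title_for_feed and description_for_feed are hypothetical, and the sketch does not strip tags or handle CDATA.

```python
import html
from xml.sax.saxutils import escape


def title_for_feed(html_title: str) -> str:
    """Plain-text element: decode HTML entities once, then escape for XML."""
    return escape(html.unescape(html_title))


def description_for_feed(html_fragment: str) -> str:
    """HTML-bearing element: keep the markup and escape all of it for XML."""
    return escape(html_fragment)


# The literal text "&ndash;" survives as text in the title.
print(title_for_feed("Broken &amp;ndash; Escaping"))
# Existing HTML escapes are encoded a second time inside the description.
print(description_for_feed("<p>Fish &amp; chips</p>"))
```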
