WordPress.org

Make WordPress Core

Opened 8 years ago

Last modified 5 years ago

#3260 new defect (bug)

XML output (rss, atom, rdf ...) should always use UTF-8 or CDATA for user input

Reported by: deremder Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 2.8.2
Component: Feeds Keywords: has-patch
Focuses: Cc:

Description

If an encoding other than UTF-8 is used, user-generated text such as titles or categories may contain entities that are not allowed. The elements <title>, <tagline>, <dc:subject>, <category> (there may be more) are not protected with <![CDATA[ ]]>.

The conversion can be done with the PHP Multibyte String (mbstring) functions. If mbstring is not available, <![CDATA[ ]]> should be used instead.
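
The two approaches named in the description can be sketched outside WordPress as follows. This is only an illustration in Python, not the patch attached to this ticket; the helper names `to_utf8` and `cdata` are hypothetical. Note that a CDATA section cannot contain the sequence `]]>`, so the wrapper splits it across two sections.

```python
import xml.etree.ElementTree as ET


def to_utf8(raw: bytes, blog_charset: str) -> bytes:
    """Re-encode bytes from the blog's charset (an assumed input) to UTF-8."""
    return raw.decode(blog_charset).encode("utf-8")


def cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any ']]>' so it cannot
    terminate the section early."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Conversion: an ISO-8859-1 encoded title becomes valid UTF-8.
utf8_title = to_utf8(b"Caf\xe9", "iso-8859-1")

# CDATA: markup characters and even ']]>' survive a round trip.
title = "Fish & Chips ]]> <b>today</b>"
parsed_title = ET.fromstring("<title>" + cdata(title) + "</title>").text
```

The round trip through an XML parser is what matters here: the parsed title is identical to the input string, without any entity escaping having been applied.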

Please contact me for development help.

Attachments (1)

3260.patch (9.0 KB) - added by hakre 5 years ago.

Change History (22)

comment:1 foolswisdom 8 years ago

#3252 seems to deal with the category element.

comment:2 Nazgul 7 years ago

  • Milestone set to 2.4 (future)

comment:3 thee17 6 years ago

  • Milestone changed from 2.5 to 2.6

comment:4 Denis-de-Bernardy 5 years ago

  • Component changed from Optimization to Feeds
  • Owner anonymous deleted

comment:5 peaceablewhale 5 years ago

  • Keywords reporter-feedback added

I wonder if this problem still persists. Any test case?

comment:6 Denis-de-Bernardy 5 years ago

utf-8 is probably a no go because a site might use utf-16.

that being said, no idea if this is still valid. there are tons of tickets that got closed to fix this/that field here and there due to the fact that they contained data that needed cdata or entities.

suggesting invalid, personally, and opening tickets as the need arises as we've done in the past years.

comment:7 peaceablewhale 5 years ago

I think outputting UTF-16 or other non-ASCII-compatible encoded content is not possible in WordPress... the themes are written in ASCII...

comment:8 Denis-de-Bernardy 5 years ago

  • Keywords close added

comment:9 iron_xman 5 years ago

  • Cc iron_xman added

Can we look at wrapping the title tag in the RSS feed in CDATA, as it throws up when there are entities in the title? I've been building a script to bring in an RSS feed from a WordPress site and display it on my site, and it keeps coming back with the &#8217; as a weird garbage character. When it's wrapped in CDATA tags, it works as expected. In looking at the code, the content of the RSS is in CDATA, but not the title. Why? The fix would be very simple to do, as long as there aren't any side effects. I haven't noticed any in my own install.
I hate hacking core files, so I'm requesting that this be added. And as it pertains to this bug, I thought I would re-open it.

comment:10 iron_xman 5 years ago

  • Keywords close removed

comment:11 iron_xman 5 years ago

  • Keywords reporter-feedback removed

comment:12 iron_xman 5 years ago

  • Version changed from 2.0.4 to 2.8.2

comment:13 peaceablewhale 5 years ago

&#8217; is U+2019 ('RIGHT SINGLE QUOTATION MARK'). It may be normal if the title contains this character. Would you mind posting the URL of your feed here?
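
The character reference can be checked with a few lines of Python. The second half is only an assumption about one possible cause of the "garbage character" reported above (UTF-8 bytes being read as Windows-1252 by the consuming script); the ticket does not confirm it.

```python
import html
import unicodedata

# Resolve the numeric character reference to the character it denotes.
char = html.unescape("&#8217;")
char_name = unicodedata.name(char)
print(char_name)

# Assumed failure mode: the UTF-8 bytes of U+2019 decoded as Windows-1252
# turn into three unrelated characters.
mojibake = char.encode("utf-8").decode("cp1252")
print(mojibake)
```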

comment:14 hakre 5 years ago

  • Keywords has-patch added

I made a patch taking care of the CDATA part. Before that I did some tests that confirmed that the bug is partially still present and needs a fix.

Encoding is based on the blog so this does not need a modification, it's already in the output.

comment:15 follow-up: peaceablewhale 5 years ago

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><[[CDATA&amp;]]></title> is &amp;.

comment:16 hakre 5 years ago

Related: #3884

comment:17 in reply to: ↑ 15 hakre 5 years ago

Replying to peaceablewhale:

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><[[CDATA&amp;]]></title> is &amp;.

unallowed entities? there are no unallowed entities in CDATA. infact there are not entities at all in CDATA that's why it is used there. This is by intention not by mistake. For example if you've got the term "&nbsp;" in a posts title it should be in the RSS-Feeds title as well. Currently it is not. After the patch it is.

comment:18 peaceablewhale 5 years ago

The RSS title element and Atom title element are text by default; using <![CDATA[&nbsp;]]> is a violation of the specifications.

I think the functions that return the title, subtitle, etc should handle those entities (resolve &nbsp; to numeric character reference).
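
A rough sketch of that idea, in Python rather than WordPress PHP: named HTML entities are rewritten as numeric character references, while the five entities predefined by XML are left alone. The function name `named_to_numeric` is hypothetical, and the entity table is Python's HTML 4 list, not anything WordPress ships.

```python
import re
from html.entities import name2codepoint

# Entities every XML parser already understands; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text: str) -> str:
    """Replace named HTML entities (e.g. &nbsp;) with numeric references."""

    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names untouched
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


converted = named_to_numeric("Tom&nbsp;&amp; Jerry&rsquo;s")
print(converted)  # Tom&#160;&amp; Jerry&#8217;s
```

Numeric references are valid in any XML document without a DTD, which is why this form is safe in a feed where `&nbsp;` itself would be an undefined entity.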

comment:19 hakre 5 years ago

I do not get your point. CDATA is perfectly valid XML, therefore anything built on XML should not have any problems with it. It's just a way to express that there is character data. Please read this first: http://en.wikipedia.org/wiki/CDATA to understand what a CDATA section means for SGML and XML documents, e.g. RSS and Atom. What should and should not be done is pretty much already defined, so please stick to the specs. I see no use in inventing new specs only for this bug report.

comment:20 peaceablewhale 5 years ago

I am sorry that I did not express it very clearly.

&nbsp; is an entity reference defined by HTML/XHTML, but the RSS title element and Atom title element accept plain text only by default.

comment:21 ryan 5 years ago

  • Milestone changed from 2.9 to Future Release