Opened 11 years ago

Last modified 2 years ago

#3260 assigned defect (bug)

XML output (rss, atom, rdf ...) should always use UTF-8 or CDATA for user input

Reported by: deremder Owned by: stevenkword
Milestone: Future Release Priority: normal
Severity: normal Version: 2.8.2
Component: Feeds Keywords: has-patch
Focuses: Cc:

Description

If an encoding other than UTF-8 is used, user-generated text such as titles or categories may contain disallowed entities. The elements <title>, <tagline>, <dc:subject>, and <category> (there may be more) are not protected with <![CDATA[ ]]>.

The conversion can be done with the PHP Multibyte String (mbstring) functions. If mbstring is not available, <![CDATA[ ]]> should be used instead.
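
The two options described above can be sketched as follows. This is a minimal illustration in Python, not WordPress code: the helper names to_utf8 and wrap_cdata are hypothetical, and the source charset is assumed to be known. Note that a CDATA section cannot contain the sequence "]]>", so the sketch splits it across two sections.

```python
import xml.etree.ElementTree as ET


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Transcode bytes from the site's charset (assumed known) to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section.

    "]]>" would end the section early, so it is split across
    two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


if __name__ == "__main__":
    # Option 1: convert a Latin-1 title to UTF-8.
    print(to_utf8("Caf\xe9".encode("latin-1"), "iso-8859-1"))

    # Option 2: protect the title with CDATA and parse it back.
    element = ET.fromstring("<title>" + wrap_cdata("AT&T ]]> rocks") + "</title>")
    print(element.text)
```

With CDATA, markup characters such as & and < pass through without escaping, but the character encoding of the document still has to match what the XML declaration states.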

Please contact me for development help.

Attachments (1)

3260.patch (9.0 KB) - added by hakre 8 years ago.

Change History (25)

#1 @foolswisdom
11 years ago

#3252 seems to deal with the category element.

#2 @Nazgul
10 years ago

  • Milestone set to 2.4 (future)

#3 @thee17
10 years ago

  • Milestone changed from 2.5 to 2.6

#4 @Denis-de-Bernardy
9 years ago

  • Component changed from Optimization to Feeds
  • Owner anonymous deleted

#5 @peaceablewhale
9 years ago

  • Keywords reporter-feedback added

I wonder if this problem still persists. Any test case?

#6 @Denis-de-Bernardy
9 years ago

utf-8 is probably a no go because a site might use utf-16.

That being said, I have no idea whether this is still valid. Tons of tickets have been closed after fixing this or that field because it contained data that needed CDATA or entities.

suggesting invalid, personally, and opening tickets as the need arises as we've done in the past years.

#7 @peaceablewhale
9 years ago

I think outputting UTF-16 or other non-ASCII-compatible encoded content is not possible in WordPress... the themes are written in ASCII...

#8 @Denis-de-Bernardy
8 years ago

  • Keywords close added

#9 @iron_xman
8 years ago

  • Cc iron_xman added

Can we look at wrapping the title tag in the RSS feed in CDATA, as it throws up when there are entities in the title? I've been building a script to bring in an RSS feed from a WordPress site and display it on my site, and it keeps coming back with &#8217; as a weird garbage character. When the title is wrapped in CDATA tags, it works as expected. Looking at the code, the content of the RSS is in CDATA, but the title is not. Why? The fix would be very simple, as long as there aren't any side effects. I haven't noticed any in my own install.
I hate hacking core files, so I'm requesting that this be added. And as it pertains to this bug, I thought I would re-open it.

#10 @iron_xman
8 years ago

  • Keywords close removed

#11 @iron_xman
8 years ago

  • Keywords reporter-feedback removed

#12 @iron_xman
8 years ago

  • Version changed from 2.0.4 to 2.8.2

#13 @peaceablewhale
8 years ago

&#8217; is U+2019 ('RIGHT SINGLE QUOTATION MARK'). That may be normal if the title contains this character. Would you mind posting the URL of your feed here?
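
As a quick check of that mapping, here is a small Python sketch (an illustration only, unrelated to the WordPress code base) that resolves the numeric character reference and looks up the Unicode name of the result:

```python
import html
import unicodedata

# Decimal 8217 and hexadecimal 0x2019 are the same code point.
char = html.unescape("&#8217;")

print(hex(ord(char)))          # code point in hexadecimal
print(unicodedata.name(char))  # official Unicode character name
```

A consumer that shows garbage for this character is often decoding the feed with the wrong charset or not resolving character references at all.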

@hakre
8 years ago

#14 @hakre
8 years ago

  • Keywords has-patch added

I made a patch taking care of the CDATA part. Before that I did some tests that confirmed the bug is partially still present and needs a fix.

The encoding is based on the blog's setting, so it does not need modification; it's already in the output.

#15 follow-up: @peaceablewhale
8 years ago

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.
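
The double-escaping concern can be reproduced with any XML parser. A minimal sketch using Python's standard xml.etree.ElementTree (chosen only for illustration; feed readers may use other parsers):

```python
import xml.etree.ElementTree as ET

# The same five characters "&amp;" inside and outside a CDATA section.
plain = ET.fromstring("<title>&amp;</title>")
cdata = ET.fromstring("<title><![CDATA[&amp;]]></title>")

print(plain.text)  # the entity reference is resolved by the parser
print(cdata.text)  # inside CDATA the characters are taken literally
```

So text that has already been entity-escaped must not be wrapped in CDATA as-is; it would have to be unescaped first, or the reader ends up with a literal "&amp;".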

#16 @hakre
8 years ago

Related: #3884

#17 in reply to: ↑ 15 @hakre
8 years ago

Replying to peaceablewhale:

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.

Unallowed entities? There are no unallowed entities in CDATA. In fact, there are no entities at all in CDATA; that's why it is used there. This is by intention, not by mistake. For example, if you've got the term "&nbsp;" in a post's title, it should be in the RSS feed's title as well. Currently it is not. After the patch it is.

#18 @peaceablewhale
8 years ago

The RSS title element and the Atom title element are text by default; using <![CDATA[&nbsp;]]> is a violation of the specifications.

I think the functions that return the title, subtitle, etc should handle those entities (resolve &nbsp; to numeric character reference).
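
One way to picture that suggestion is the Python sketch below. It is only an illustration under stated assumptions: the function name named_to_numeric is hypothetical, it relies on the HTML 4 entity table in Python's html.entities module, and it leaves the predefined XML entities and unknown names untouched.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# Entities that XML itself predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text: str) -> str:
    """Replace HTML named entities (e.g. &nbsp;) with numeric character references."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


if __name__ == "__main__":
    title = named_to_numeric("Tom&nbsp;&amp;&nbsp;Jerry&rsquo;s")
    print(title)
    # The result is well-formed XML without any DTD or CDATA section.
    print(repr(ET.fromstring("<title>" + title + "</title>").text))
```

Numeric character references are understood by every XML parser, so this approach avoids both the undefined-entity problem and the double-escaping problem.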

#19 @hakre
8 years ago

I do not get your point. CDATA is perfectly valid in XML, therefore anything built on it should not have any problems with it. It's just a way to express that there is character data. Please read this first: http://en.wikipedia.org/wiki/CDATA to understand what a CDATA section means for SGML and XML documents, e.g. RSS and Atom. What should and should not be done is already pretty much defined, so please stick to the specs; I see no use in inventing new specs only for this bug report.

#20 @peaceablewhale
8 years ago

I am sorry that I did not express it very clearly.

&nbsp; is an entity reference defined by HTML/XHTML, but the RSS title element and the Atom title element accept plain text only by default.
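
This distinction shows up as soon as a feed is read by a plain XML parser, which knows only the five predefined XML entities. A small Python sketch (illustrative only; the helper name parses_ok is hypothetical):

```python
import xml.etree.ElementTree as ET


def parses_ok(xml_text: str) -> bool:
    """Return True if the snippet is well-formed for a plain XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# &nbsp; is not defined in XML unless a DTD declares it.
print(parses_ok("<title>Tom&nbsp;Jerry</title>"))
# The numeric character reference for the same character is always valid.
print(parses_ok("<title>Tom&#160;Jerry</title>"))
```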

#21 @ryan
8 years ago

  • Milestone changed from 2.9 to Future Release

#22 @chriscct7
3 years ago

  • Keywords close added

As there are several drawbacks to this proposal (UTF-16 support, unintended escaping), this should probably be closed.

#23 @stevenkword
3 years ago

  • Owner set to stevenkword
  • Status changed from new to assigned

#24 @wonderboymusic
2 years ago

  • Keywords close removed

leaving open since @stevenkword grabbed it
