WordPress.org

Make WordPress Core

Opened 13 years ago

Last modified 9 months ago

#3260 assigned defect (bug)

XML output (rss, atom, rdf ...) should always use UTF-8 or CDATA for user input

Reported by: deremder
Owned by: stevenkword
Milestone: Future Release
Priority: normal
Severity: normal
Version: 2.8.2
Component: Feeds
Keywords: has-patch
Focuses:
Cc:
PR Number:

Description

If an encoding other than UTF-8 is used, user-generated text such as titles or categories may contain entities that are not allowed. The elements <title>, <tagline>, <dc:subject>, <category> (there may be more) are not protected with <![CDATA[ ]]>.

The conversion can be done with the PHP Multibyte String (mbstring) functions. If mbstring is not available, a <![CDATA[ ]]> section should be used instead.
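
A minimal sketch of the two strategies proposed above, written in Python for illustration rather than as WordPress code. The function names (`to_utf8`, `wrap_cdata`, `feed_title`) and the `can_convert` flag are hypothetical and stand in for "mbstring is available"; none of them come from the ticket or the patch.

```python
from xml.sax.saxutils import escape


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from the blog's charset to UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section.

    A literal ']]>' inside the text would end the section early,
    so it is split across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def feed_title(raw: bytes, source_encoding: str, can_convert: bool = True) -> str:
    """Build a <title> element from user input (hypothetical helper).

    can_convert is an assumption standing in for 'a conversion
    function is available'; when it is False the CDATA fallback is used.
    """
    if can_convert:
        text = to_utf8(raw, source_encoding).decode("utf-8")
        # Escape the XML special characters &, < and > in the converted text.
        return "<title>" + escape(text) + "</title>"
    return "<title>" + wrap_cdata(raw.decode(source_encoding)) + "</title>"
```

With this sketch, `feed_title("Tom & Jerry".encode("latin-1"), "latin-1")` produces an escaped title, while passing `can_convert=False` produces the same text inside a CDATA section.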

Please contact me for development help.

Attachments (1)

3260.patch (9.0 KB) - added by hakre 10 years ago.

Change History (25)

#1 @foolswisdom
13 years ago

#3252 seems to deal with the category element.

#2 @Nazgul
12 years ago

  • Milestone set to 2.4 (future)

#3 @thee17
12 years ago

  • Milestone changed from 2.5 to 2.6

#4 @Denis-de-Bernardy
11 years ago

  • Component changed from Optimization to Feeds
  • Owner anonymous deleted

#5 @peaceablewhale
11 years ago

  • Keywords reporter-feedback added

I wonder if this problem still persists. Any test case?

#6 @Denis-de-Bernardy
11 years ago

UTF-8 is probably a no-go because a site might use UTF-16.

That being said, I have no idea whether this is still valid. Tons of tickets have been closed that fixed this or that field because it contained data that needed CDATA or entities.

Suggesting invalid, personally, and opening tickets as the need arises, as we've done in past years.

#7 @peaceablewhale
11 years ago

I think outputting UTF-16 or other non-ASCII-compatible encoded content is not possible in WordPress... the themes are written in ASCII...

#8 @Denis-de-Bernardy
10 years ago

  • Keywords close added

#9 @iron_xman
10 years ago

  • Cc iron_xman added

Can we look at wrapping the title tag in the RSS feed in CDATA, as it throws up when there are entities in the title? I've been building a script to bring in an RSS feed from a WordPress site and display it on my site, and it keeps coming back with the &#8217; as a weird garbage character. When it's wrapped in CDATA tags, it works as expected. Looking at the code, the content of the RSS is in CDATA, but not the title. Why? The fix would be very simple to do, as long as there aren't any side effects. I haven't noticed any in my own install.
I hate hacking core files, so I'm requesting that this be added. And as it pertains to this bug, I thought I would re-open it.

#10 @iron_xman
10 years ago

  • Keywords close removed

#11 @iron_xman
10 years ago

  • Keywords reporter-feedback removed

#12 @iron_xman
10 years ago

  • Version changed from 2.0.4 to 2.8.2

#13 @peaceablewhale
10 years ago

&#8217; is U+2019 ('RIGHT SINGLE QUOTATION MARK'). It may be normal if the title contains this character. Would you mind posting the URL of your feed here?
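
For illustration, a small Python sketch of what that character reference decodes to. The last step shows one common way such a character can end up looking like garbage: its UTF-8 bytes being read as Windows-1252. That cause is an assumption for the sake of the example, not something confirmed in this ticket.

```python
import html

# Decode the numeric character reference to the actual character.
decoded = html.unescape("&#8217;")

# The same character as UTF-8 bytes.
utf8_bytes = decoded.encode("utf-8")

# Assumed failure mode: a consumer reads the UTF-8 bytes as cp1252,
# which turns one quotation mark into three unrelated characters.
misread = utf8_bytes.decode("cp1252")
```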

#14 @hakre
10 years ago

  • Keywords has-patch added

I made a patch taking care of the CDATA part. Before that, I ran some tests that confirmed the bug is still partially present and needs a fix.

The encoding is based on the blog's setting, so that part needs no modification; it is already in the output.

#15 follow-up: @peaceablewhale
10 years ago

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.
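
The difference can be checked with any XML parser. Below is a short sketch using Python's standard `xml.etree.ElementTree`; the helper name `title_text` is only for illustration.

```python
import xml.etree.ElementTree as ET


def title_text(xml_fragment: str) -> str:
    """Parse a single element and return its text content as a reader would see it."""
    return ET.fromstring(xml_fragment).text


# Entity reference in normal character data: the parser resolves it.
escaped = title_text("<title>&amp;</title>")

# Same five characters inside CDATA: the parser passes them through untouched.
cdata = title_text("<title><![CDATA[&amp;]]></title>")
```

Here `escaped` is the single character `&`, while `cdata` is the literal string `&amp;`. Wrapping text that has already been escaped therefore shows the escape sequence to the reader.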

#16 @hakre
10 years ago

Related: #3884

#17 in reply to: ↑ 15 @hakre
10 years ago

Replying to peaceablewhale:

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.

Unallowed entities? There are no unallowed entities in CDATA. In fact, there are no entities at all in CDATA; that's why it is used there. This is by intention, not by mistake. For example, if you've got the term "&nbsp;" in a post's title, it should be in the RSS feed's title as well. Currently it is not. After the patch it is.

#18 @peaceablewhale
10 years ago

The RSS title element and Atom title element are text by default; using <![CDATA[&nbsp;]]> is a violation of the specifications.

I think the functions that return the title, subtitle, etc. should handle those entities (resolve &nbsp; to a numeric character reference).
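
A rough sketch of that idea in Python, assuming the HTML 4 entity table in `html.entities.name2codepoint` is an acceptable source of names. The function name `named_to_numeric` is hypothetical, and this is not how WordPress implements it.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself are valid in a feed and are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Replace HTML named entities such as &nbsp; with numeric character references."""

    def repl(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Leave XML's own entities and unknown names untouched.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY.sub(repl, text)
```

For example, `named_to_numeric("a&nbsp;b &amp; c")` yields `a&#160;b &amp; c`, which an XML parser can read without knowing any HTML entity definitions.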

#19 @hakre
10 years ago

I do not get your point. CDATA is perfectly valid in XML, so anything built on it should not have any problems with it. It's just a way to express that there is character data. Please read this first: http://en.wikipedia.org/wiki/CDATA to understand what a CDATA section means for SGML and XML documents, e.g. RSS and Atom. What should and should not be done is pretty much already defined, so please stick to the specs; I see no use in inventing new specs only for this bug report.

#20 @peaceablewhale
10 years ago

I am sorry that I did not express it very clearly.

&nbsp; is an entity reference defined by HTML/XHTML, but the RSS title element and Atom title element accept plain text only by default.

#21 @ryan
10 years ago

  • Milestone changed from 2.9 to Future Release

#22 @chriscct7
5 years ago

  • Keywords close added

As there are several drawbacks with this proposal (UTF-16 support, unintended escaping), this should probably be closed.

#23 @stevenkword
5 years ago

  • Owner set to stevenkword
  • Status changed from new to assigned

#24 @wonderboymusic
4 years ago

  • Keywords close removed

leaving open since @stevenkword grabbed it
