Opened 17 years ago

Last modified 5 weeks ago

#3260 assigned defect (bug)

XML output (rss, atom, rdf ...) should always use UTF-8 or CDATA for user input

Reported by: deremder
Owned by: stevenkword
Milestone: Future Release
Priority: normal
Severity: normal
Version: 2.8.2
Component: Feeds
Keywords: has-patch
Focuses:
Cc:

Description

If an encoding other than UTF-8 is used, user-generated text such as titles or categories may contain unallowed entities. The elements <title>, <tagline>, <dc:subject>, <category> (there may be more) are not protected with <![CDATA[ ]]>.

The conversion can be done with the PHP Multibyte String (mbstring) functions. If mbstring is not available, <![CDATA[ ]]> should be used instead.
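
A minimal sketch of the two options described above, written in Python rather than WordPress's PHP purely for illustration. The helper names `to_utf8` and `wrap_cdata` are hypothetical and not part of WordPress, and the `iso-8859-1` source charset is an assumed example.

```python
def to_utf8(raw: bytes, source_charset: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from an assumed source charset to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    # ']]>' may not appear inside CDATA, so close and reopen the section around it.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# Hypothetical title stored in a non-UTF-8 blog charset.
latin1_title = "Caf\u00e9 & Bar".encode("iso-8859-1")
utf8_title = to_utf8(latin1_title)
cdata_title = wrap_cdata("Caf\u00e9 & Bar")
```

Note that CDATA only protects markup characters such as `&` and `<`; it does not change the byte encoding, so the two options address different problems.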

Please contact me for development help.

Attachments (1)

3260.patch (9.0 KB) - added by hakre 15 years ago.

Change History (26)

#1 @foolswisdom
17 years ago

#3252 seems to deal with the category element.

#2 @Nazgul
17 years ago

  • Milestone set to 2.4 (future)

#3 @thee17
16 years ago

  • Milestone changed from 2.5 to 2.6

#4 @Denis-de-Bernardy
15 years ago

  • Component changed from Optimization to Feeds
  • Owner anonymous deleted

#5 @peaceablewhale
15 years ago

  • Keywords reporter-feedback added

I wonder if this problem still persists. Any test case?

#6 @Denis-de-Bernardy
15 years ago

UTF-8 is probably a no-go because a site might use UTF-16.

that being said, no idea if this is still valid. there are tons of tickets that got closed to fix this/that field here and there due to the fact that they contained data that needed CDATA or entities.

suggesting invalid, personally, and opening tickets as the need arises as we've done in the past years.

#7 @peaceablewhale
15 years ago

I think outputting UTF-16 or other non-ASCII-compatible encoded content is not possible in WordPress... the themes are written in ASCII...

#8 @Denis-de-Bernardy
15 years ago

  • Keywords close added

#9 @iron_xman
15 years ago

  • Cc iron_xman added

Can we look at wrapping the title tag in the RSS feed in CDATA, as it breaks when there are entities in the title? I've been building a script to bring in an RSS feed from a WordPress site and display it on my site, and &#8217; keeps coming back as a weird garbage character. When the title is wrapped in CDATA tags, it works as expected. Looking at the code, the content of the RSS is in CDATA, but not the title. Why? The fix would be very simple to do, as long as there aren't any side effects. I haven't noticed any in my own install.
I hate hacking core files, so I'm requesting that this be added. And as it pertains to this bug, I thought I would re-open it.

#10 @iron_xman
15 years ago

  • Keywords close removed

#11 @iron_xman
15 years ago

  • Keywords reporter-feedback removed

#12 @iron_xman
15 years ago

  • Version changed from 2.0.4 to 2.8.2

#13 @peaceablewhale
15 years ago

&#8217; is U+2019 ('RIGHT SINGLE QUOTATION MARK'). It may be normal if the title contains this character. Would you mind posting the URL of your feed here?
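
A small Python sketch of the mapping mentioned above. The second step is only an assumption about one common way such a character can turn into "garbage" on the consuming side (its UTF-8 bytes being read as Windows-1252); the ticket does not confirm that this is what happened here.

```python
import html

# The numeric character reference from the feed resolves to U+2019.
quote = html.unescape("&#8217;")

# Assumed failure mode: the character's UTF-8 bytes are misread as Windows-1252.
garbled = quote.encode("utf-8").decode("cp1252")
```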

#14 @hakre
15 years ago

  • Keywords has-patch added

I made a patch taking care of the CDATA part. Before that I did some tests, which confirmed that the bug is partially still present and needs a fix.

The encoding is taken from the blog's setting, so that part does not need a modification; it's already in the output.

#15 follow-up: @peaceablewhale
15 years ago

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.
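
The double-escaping concern can be checked with any XML parser. A minimal sketch using Python's standard library parser, chosen here only for illustration:

```python
import xml.etree.ElementTree as ET

# Plain element content: the parser resolves the entity reference.
plain = ET.fromstring("<title>&amp;</title>")

# CDATA content: the same five characters are passed through literally.
cdata = ET.fromstring("<title><![CDATA[&amp;]]></title>")

print(plain.text)
print(cdata.text)
```

So text that was already entity-escaped before being wrapped in CDATA reaches the feed reader with the escaping still visible.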

#16 @hakre
15 years ago

Related: #3884

#17 in reply to: ↑ 15 @hakre
15 years ago

Replying to peaceablewhale:

When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.

<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.

Unallowed entities? There are no unallowed entities in CDATA. In fact, there are no entities at all in CDATA; that's why it is used there. This is by intention, not by mistake. For example, if you've got the term "&nbsp;" in a post's title, it should be in the RSS feed's title as well. Currently it is not. After the patch it is.

#18 @peaceablewhale
15 years ago

The RSS title element and Atom title element are text by default, so using <![CDATA[&nbsp;]]> is a violation of the specifications.

I think the functions that return the title, subtitle, etc should handle those entities (resolve &nbsp; to numeric character reference).
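
A rough sketch of that suggestion in Python (not WordPress code): resolve HTML named references first, then re-escape for XML using numeric character references only. The function name `named_to_numeric` is hypothetical, and escaping every non-ASCII character is an assumption made to keep the output independent of the feed's encoding.

```python
import html


def named_to_numeric(text: str) -> str:
    """Resolve HTML references, then re-escape with numeric references."""
    decoded = html.unescape(text)  # e.g. "&nbsp;" becomes U+00A0
    out = []
    for ch in decoded:
        # Escape XML markup characters and anything outside ASCII.
        if ch in "&<>" or ord(ch) > 127:
            out.append("&#{};".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


title = named_to_numeric("Fish&nbsp;&amp;&nbsp;Chips")
```

Numeric references are valid in any XML document without a DTD, whereas `&nbsp;` is only defined by HTML/XHTML.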

#19 @hakre
14 years ago

I do not get your point. CDATA is perfectly valid for XML, therefore anything built on it should not have any problems with it. It's just a way to express that there is character data. Please read this first: http://en.wikipedia.org/wiki/CDATA to understand what a CDATA section means for SGML and XML documents, e.g. RSS and Atom. What is and is not allowed is already pretty much defined, so please stick to the specs. I see no use in inventing new specs only for this bug report.

#20 @peaceablewhale
14 years ago

I am sorry that I did not express it very clearly.

&nbsp; is an entity reference defined by HTML/XHTML, but the RSS title element and Atom title element accept plain text only by default.

#21 @ryan
14 years ago

  • Milestone changed from 2.9 to Future Release

#22 @chriscct7
9 years ago

  • Keywords close added

As there are several drawbacks with this proposal (UTF-16 support, unintended escaping), this should probably be closed.

#23 @stevenkword
9 years ago

  • Owner set to stevenkword
  • Status changed from new to assigned

#24 @wonderboymusic
9 years ago

  • Keywords close removed

leaving open since @stevenkword grabbed it

#25 @Mte90
14 months ago

Reading this ticket today, I am wondering whether it is still an issue.
There is no specification of which Unicode encoding is recommended for RSS. I had to deal with an RSS aggregator, and the problem was generating a feed that works with varied input in which everyone writes unsupported symbols and so on.
Wrapping everything in CDATA would probably simplify many things, but CDATA only prevents the content inside it from being parsed as XML (https://en.wikipedia.org/wiki/CDATA).
It does not help to avoid strange symbols; that is a matter of Unicode.

That said, as of today there is a function that prints, for example, the title (https://github.com/WordPress/wordpress-develop/blob/0cb8475c0d07d23893b1d73d755eda5f12024585/src/wp-includes/feed.php#L156) and that has a hook.
This function uses WordPress's get_the_title.

The description (https://github.com/WordPress/wordpress-develop/blob/0cb8475c0d07d23893b1d73d755eda5f12024585/src/wp-includes/feed.php#L27) uses convert_chars (https://developer.wordpress.org/reference/functions/convert_chars/), which replaces such characters with ASCII symbols.

As of today, I think the ticket can be closed, as this bug no longer happens.
