Opened 18 years ago
Last modified 12 months ago
#3260 assigned defect (bug)
XML output (rss, atom, rdf ...) should always use UTF-8 or CDATA for user input
Reported by: | | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 2.8.2 |
Component: | Feeds | Keywords: | has-patch |
Focuses: | Cc: |
Description
If an encoding other than UTF-8 is used, user-generated text such as titles or categories may contain entities that are not allowed. The elements <title>, <tagline>, <dc:subject>, <category> (there may be more) are not protected with <![CDATA[ ]]>.
The conversion can be done with the PHP multibyte (mbstring) functions. If mbstring is not available, <![CDATA[ ]]> should be used instead.
Please contact me for development help.
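The two approaches above can be sketched as follows. This is only an illustration in Python, not WordPress's PHP code: the helper names `to_utf8` and `wrap_cdata` are hypothetical, and the blog's source charset is assumed to be known.

```python
def to_utf8(raw: bytes, source_charset: str) -> bytes:
    """Re-encode bytes from the blog's charset (assumed known) to UTF-8."""
    return raw.decode(source_charset).encode("utf-8")


def wrap_cdata(text: str) -> str:
    """Wrap text in a CDATA section.

    A literal "]]>" would end the section early, so it is split
    across two adjacent CDATA sections.
    """
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


title = "Tom & Jerry <live>"
print("<title>" + wrap_cdata(title) + "</title>")
```

Note that CDATA only stops the parser from interpreting markup; the bytes inside still have to match the encoding the feed declares.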
Attachments (1)
Change History (26)
#5
@
16 years ago
- Keywords reporter-feedback added
I wonder if this problem still persists. Any test case?
#6
@
16 years ago
utf-8 is probably a no go because a site might use utf-16.
that being said, no idea if this is still valid. there are tons of tickets that got closed to fix this/that field here and there due to the fact that they contained data that needed cdata or entities.
suggesting invalid, personally, and opening tickets as the need arises as we've done in the past years.
#7
@
16 years ago
I think outputting UTF-16 or other non-ASCII-compatible encodings is not possible in WordPress... the themes are written in ASCII...
#9
@
15 years ago
- Cc iron_xman added
Can we look at wrapping the title tag in the rss feed in CDATA as it throws up when there are entities in the title? I've been building a script to bring in an rss feed from a WordPress site, and display it on my site, and it keeps coming back with the &#8217; as a weird garbage character. When it's wrapped in CDATA tags, it works as expected. In looking at the code, the content of the rss is in CDATA, but not the title, why? The fix would be very simple to do, as long as there aren't any side effects. I haven't noticed any in my own install.
I hate hacking core files, so I'm requesting that this be added. And as it pertains to this bug, I thought I would re-open it.
#13
@
15 years ago
&#8217; is U+2019 ('RIGHT SINGLE QUOTATION MARK'). It may be normal if the title contains this character. Would you mind posting the URL of your feed here?
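A quick check of that mapping, as a small Python sketch. The second half shows one common way such a character can turn into garbage (UTF-8 bytes read as Windows-1252); that this is what happened in the reporter's script is only an assumption.

```python
import html

# Numeric character reference 8217 (decimal) is code point U+2019.
decoded = html.unescape("&#8217;")
print(decoded == "\u2019")

# Assumed failure mode: the UTF-8 bytes of U+2019 decoded as Windows-1252.
mojibake = decoded.encode("utf-8").decode("cp1252")
print(mojibake)
```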
#14
@
15 years ago
- Keywords has-patch added
I made a patch taking care of the CDATA part. Before that I did some tests that confirmed the bug is still partially present and needs a fix.
Encoding is based on the blog so this does not need a modification, it's already in the output.
#15
follow-up:
↓ 17
@
15 years ago
When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.
<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.
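The double-escaping concern can be checked with any XML parser. A minimal sketch using Python's standard library (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

# The same five characters "&amp;" placed in a plain element and in CDATA.
escaped = ET.fromstring("<title>&amp;</title>")
in_cdata = ET.fromstring("<title><![CDATA[&amp;]]></title>")

print(escaped.text)   # the parser resolves the entity reference
print(in_cdata.text)  # CDATA content is taken literally, so the reference survives
```

So wrapping already-escaped text in CDATA hands the consumer the escaped form instead of the intended character.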
#17
in reply to:
↑ 15
@
15 years ago
Replying to peaceablewhale:
When will unallowed entities be output to the feed? I don't think that using CDATA is a good solution as it may cause double escaping.
<title>&amp;</title> is &, but <title><![CDATA[&amp;]]></title> is &amp;.
unallowed entities? there are no unallowed entities in CDATA. in fact there are no entities at all in CDATA, that's why it is used there. This is by intention, not by mistake. For example if you've got the term "&nbsp;" in a post's title it should be in the RSS feed's title as well. Currently it is not. After the patch it is.
#18
@
15 years ago
The RSS title element and Atom title element are text by default; using <![CDATA[ ]]> is a violation of the specifications.
I think the functions that return the title, subtitle, etc should handle those entities (resolve to numeric character reference).
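One way to read this suggestion, as a rough Python sketch rather than WordPress code: resolve any HTML entities first, then escape for XML and emit everything outside ASCII as numeric character references. The function name `to_numeric_refs` is hypothetical.

```python
import html
from xml.sax.saxutils import escape


def to_numeric_refs(text: str) -> str:
    """Resolve HTML entities, then re-escape for XML using only ASCII."""
    plain = html.unescape(text)   # named and numeric references become characters
    xml_safe = escape(plain)      # escapes &, < and > for XML
    # Non-ASCII characters become decimal numeric character references.
    return xml_safe.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("Tom&nbsp;&amp; Jerry&#8217;s"))
```

The result contains no HTML-only named entities and no raw non-ASCII bytes, so it does not depend on the feed's declared encoding.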
#19
@
15 years ago
I do not get your point. CDATA is perfectly valid for XML, therefore anything built on it should not have any problems with it. It's just a way to express that there is character data. Please read this first: http://en.wikipedia.org/wiki/CDATA to understand what a CDATA section means for SGML and XML documents, e.g. RSS and Atom. What should and should not be done is pretty much already defined, so please stick to the specs. I see no use in inventing new specs only for this bug report.
#20
@
15 years ago
I am sorry that I did not express it very clearly.
&nbsp; is a reference of HTML/XHTML, but the RSS title element and Atom title element accept plain text only by default.
#22
@
10 years ago
- Keywords close added
As there are several drawbacks with this proposal (UTF-16 support, unintended escaping), this probably should be closed.
#25
@
2 years ago
Reading this ticket today, I wonder whether it is still an issue.
There is no specification of which Unicode encoding is recommended for RSS. I had to deal with an RSS aggregator, and the problem was generating an RSS feed that works with varied input, where everyone writes unsupported symbols and so on.
Wrapping everything in CDATA would probably simplify many things, but CDATA is only meant to keep the enclosed content from being parsed as XML: https://en.wikipedia.org/wiki/CDATA.
It does not work as a way to avoid strange symbols; that is a matter of Unicode.
That said, as of today there is a function that prints, for example, the title https://github.com/WordPress/wordpress-develop/blob/0cb8475c0d07d23893b1d73d755eda5f12024585/src/wp-includes/feed.php#L156 and that has a hook.
This function uses WordPress's get_the_title.
The description https://github.com/WordPress/wordpress-develop/blob/0cb8475c0d07d23893b1d73d755eda5f12024585/src/wp-includes/feed.php#L27 uses https://developer.wordpress.org/reference/functions/convert_chars/, which replaces such characters with ASCII symbols.
As of today, I think the ticket can be closed, as this bug no longer happens.
#3252 seems to deal with the category element.