WordPress.org

Make WordPress Core

Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#9992 closed defect (bug) (invalid)

Atom feed titles are CDATA'ed and XML encoded.

Reported by: pm24601 Owned by: josephscott
Milestone: Priority: normal
Severity: normal Version: 2.7.1
Component: Feeds Keywords: has-patch tested reporter-feedback dev-feedback
Focuses: Cc:

Description

Create a blog post with a title "> & test" as the title of the post.

The atom feed has the title:

<title type="html"><![CDATA[&amp; &gt; test ]]></title>

The title should not be CDATA'ed or xml encoded. but not both.

Attachments (1)

9992.patch (600 bytes) - added by peaceablewhale 6 years ago.

Download all attachments as: .zip

Change History (20)

comment:1 @pm246016 years ago

  • Cc pm24601 added

comment:2 @Denis-de-Bernardy6 years ago

  • Keywords needs-patch added
  • Milestone changed from Unassigned to 2.8

comment:4 @peaceablewhale6 years ago

  • Component changed from XML-RPC to Feeds
  • Keywords needs-patch removed
  • Milestone 2.8 deleted
  • Resolution set to invalid
  • Status changed from new to closed

It's valid. Please note that the type is "html", not "text".

The CDATA block is used for escaping HTML character entity references and it's required. For example, if the title is "What is &reg;?", it has to be "What is &amp;reg;?" or "<![CDATA[What is &reg;?]]>" in feed.

@peaceablewhale6 years ago

comment:5 @peaceablewhale6 years ago

  • Keywords has-patch needs-testing added
  • Milestone set to 2.8
  • Resolution invalid deleted
  • Status changed from closed to reopened

Even though the existing format is valid and working properly, I found that the "ent2ncr" function will automatically change the HTML character entity references to numeric character references. In this case, the extra escaping doesn't seem necessary. I have uploaded a patch to remove the CDATA block and change to the title content type to text. Please test if it breaks anything.

comment:7 @Denis-de-Bernardy6 years ago

  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

comment:8 @peaceablewhale6 years ago

  • Keywords tested added; needs-testing removed

Tested

comment:11 @markjaquith6 years ago

  • Milestone changed from 2.9 to 3.0

Punting pending feedback from Joseph Scott. Feel free to rescue this (and upgrade its severity) if this is a critical issue for Atom validation. If it's not, let's fix it in early 3.0.

comment:12 @josephscott6 years ago

Both the W3C validator and beta.feedvalidator.org show the Atom feed as valid (on -trunk with a test post using the example title). I agree that from an XML point of view this does feel odd though, where doing one or the other would be technically right and doing both is, well, some what strange.

What might explain this more is context of the data. I look at it and see escaped HTML in a CDATA string, as opposed to escaped XML, inside another XML escaped string.

I also added my test feed with this in it to Google Reader and it did the right thing. I don't see this as being a major issue for 2.9, so punt for now and do more testing during 3.0 dev round to see if clients will choke on the proposed changes or not.

comment:13 @hakre6 years ago

  • Keywords reporter-feedback added

Well shure does a validator show that it's valid. Infact even entities in CDATA should be translated by useragents as of the SGML definition of CDATA.

It would be interesing to know of a concrete useragent (feedreader) that is having a problem with the atom feed.

Technically spoken this bug report is bogus. It should be considered to resolve this as invalid. The bugreporter should provide a step-by-step scenario to reproduce the bug within wordpress and name a feedreader where the display is not handeled properly.

comment:14 @Denis-de-Bernardy6 years ago

I'm leaning on the side of closing as well.

comment:15 @hakre5 years ago

Since reporter feedback still is pending, I suggest to close this as wontfix. If the reporter is able to report an actual mis-behaviour with an atom client software, she/he should re-open the ticket then. Until then I'm missing a propper bug-report here.

comment:16 @jarrettc5 years ago

  • Cc jarrettc added

The XML is wrong, and this is a real problem. I'll explain the reasons for both assertions.

The problem is that the XML is escaped twice. Entity-encoding is one method for escaping XML's control characters. CDATA is another method. Either one can be used. But if you use them both, you're escaping twice. Take the following string as an example:

Johnson & Johnson

If I entity-encode it, I have:

Johnson &amp; Johnson

Now, if I wrap it in CDATA, I have:

<![CDATA[Johnson &amp; Johnson]]>

A well-behaved XML parser will decode this string as "Johnson &amp; Johnson," which is not what we want. The decoded string should be "Johnson & Johnson."

Here's the W3C's spec on CDATA:

http://www.w3.org/TR/REC-xml/#sec-cdata-sect

As the W3C says, ampersands inside CDATA are treated literally. This is why <![CDATA[Johnson &amp; Johnson]]> is decoded as "Johnson &amp; Johnson."

Even if we can't name a specific client that chokes on the twice-escaped XML Wordpress produces, it is a very bad idea to spit out incorrect XML. The practical reason--and this is the practical reason for all W3C standards--is that you want your output to be readable by any standards-compliant client. The fact that we can't name a client that requires proper XML doesn't mean one doesn't exist. Nor should we expect developers of future clients to pander to the incorrect XML we produce. If we continue double-escaping our XML, we run the risk of creating something analogous to quirks mode in web browsers: clients will have to figure out on a case-by-case basis whether a given feed uses proper XML, or the quirky, double-escaped Wordpress style. They'll have to say, "Is this a Wordpress feed? If so, I should take into account that the markup is improperly escaped. But if not, I should follow the W3C standard."

The fact that the markup validates does not prove it is correct. To the contrary, the XML's encoding is improper according to the W3C standard. So why doesn't the validator complain? Beceause it doesn't know the markup is double-escaped. It thinks you intend for the HTML entities to be literals, rather than markup. For all the validator knows, you could be writing a how-to on HTML entities, which would properly include the entities inside CDATA. For example, this would be perfectly valid:

<![CDATA[The XML entity for the ampersand is &amp;.]]>

In the above example, the author's intent was for &amp; to be treated as a literal string, rather than be replaced with "&" after the XML is parsed. So the code is correct. But the W3C validator can't guess your intent, so it can't complain about the following:

<![CDATA[Johnson &amp; Johnson]]>

From the validator's perspective, it's quite possible that you wanted &amp; to appear as a literal after parsing, when in fact you were most likely trying to write "Johnson & Johnson." But the validator doesn't (and, practically speaking, can't) know what you intended, so it has to give you the benefit of the doubt when you include entities inside CDATA.

comment:17 @dd325 years ago

  • Keywords dev-feedback added

Is there any agreement here on what the exact output should be?

From what i can tell, CDATA would be the "better" option?

comment:18 @markjaquith5 years ago

  • Milestone changed from 3.0 to Unassigned
  • Resolution set to invalid
  • Status changed from reopened to closed

jarrettc — after consulting both the XML and the Atom specs, I've determined that while what WordPress is doing here is unorthodox and likely redundant, it isn't incorrect and it is not double-encoding in the context of Atom.

CDATA acts as a passthrough. Everything inside is passed through. Johnson &amp; Johnson comes out as Johnson &amp; Johnson. That would be an issue if this were just XML. It is Atom, so from here, we have to look to the Atom spec.

I quote:

If type="html", then this element contains entity escaped html.

Thus, if type="html", Atom sees Johnson &amp; Johnson as Johnson & Johnson. If type were "text," then we'd have the problem of Atom seeing it as Johnson &amp; Johnson (which is the mistake that peacablewhale's patch makes).

No feed parsing clients that I tested have any problem with WordPress' current feeds. However unorthodox, it doesn't violate the specs that I can tell, it doesn't get flagged in any validator, and it works in all tested feed parsers. Consequently, I'm closing as invalid.

comment:19 @nacin5 years ago

  • Milestone Unassigned deleted
Note: See TracTickets for help on using tickets.