WordPress.org

Make WordPress Core

Opened 7 years ago

Last modified 10 days ago

#9993 assigned defect (bug)

Rss and atom feeds are dropping some characters

Reported by: pm24601 Owned by: stevenkword
Milestone: Future Release Priority: normal
Severity: normal Version: 2.7.1
Component: Feeds Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

Blog post with title:

& > test <

has the less than ( < ) is stripped from the atom and rss feed.

Attachments (3)

9993.patch (530 bytes) - added by peaceablewhale 6 years ago.
9993-tests.diff (1014 bytes) - added by stevenkword 17 months ago.
Unit tests for the_title_rss
9993.2.patch (594 bytes) - added by stevenkword 3 months ago.
Refresh @ 4.5.1

Download all attachments as: .zip

Change History (32)

#1 @pm24601
7 years ago

  • Cc pm24601 added

#2 @redsweater
7 years ago

Confirmed in WordPress trunk. The test case example loses the final "<" character, specifically the final output in the feed looks like:

<![CDATA[&amp; &gt; test ]]>

#3 @Denis-de-Bernardy
7 years ago

  • Keywords needs-patch added
  • Milestone changed from Unassigned to 2.8

#5 @peaceablewhale
7 years ago

  • Component changed from XML-RPC to Feeds
  • Keywords has-patch added; needs-patch removed
  • Owner josephscott deleted

The PHP's built-in "strip_tags" function has caused the error. It also seems not necessary to have it applied to the_title_rss(). I have uploaded a patch to remove it.

#6 @Denis-de-Bernardy
7 years ago

trouble with the patch is the strip_tags() call might be there for a reason. isn't it possible to have html in a post title?

#7 @Denis-de-Bernardy
7 years ago

  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

#8 @peaceablewhale
7 years ago

Yes. However, WordPress will escape the HTML (via the "ent2ncr" function) and therefore they are no longer HTML.

I think calling strip_tags is problmatic because an author can have a post title like "Use <h1> to <h6> for headings, <p> for paragraphy, but not formatting". strip_tags will remove the HTML unexpectedly.

#9 @Denis-de-Bernardy
7 years ago

  • Keywords needs-review added

some core dev feedback would be useful here

#10 @peaceablewhale
7 years ago

The priorities of ent2ncr are no longer modified by the patch. The new patch only removes the unnecessary 'strip_tags' filter.

#11 @peaceablewhale
7 years ago

  • Keywords needs-testing added; needs-review removed

Marking this report "needs-testing" in order to gain attention in the upcoming bug hunt.

#12 @ryan
7 years ago

  • Milestone changed from 2.9 to 3.0

There are other areas where we need a smarter strip. Let's figure it out in 3.0.

#13 @dd32
6 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1

Punting due to impending beta

@peaceablewhale
6 years ago

#14 follow-up: @peaceablewhale
6 years ago

Patch updated for 3.1.

"& > test <" now works even without the patch, but "<test>This is a test</test>" not.

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

#15 @nacin
6 years ago

  • Keywords 3.2-early added; early removed
  • Milestone changed from Awaiting Triage to Future Release

Bumping once more.

#16 @MikeHansenMe
4 years ago

  • Cc mdhansen@… added
  • Keywords 3.2-early removed

I was able to reproduce the problem in trunk (3.5-beta2-22252). Confirmed the patch fixes the problem.

#17 @nacin
4 years ago

  • Keywords needs-unit-tests added

@stevenkword
17 months ago

Unit tests for the_title_rss

#18 @stevenkword
17 months ago

  • Keywords needs-unit-tests removed

Errors still exists. Patch 9993-tests.diff adds test cases for both lone XML entities and XML tags.

#19 in reply to: ↑ 14 ; follow-up: @mdgl
16 months ago

Replying to peaceablewhale:

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

According to the RSS Best Practices Profile (http://www.rssboard.org/rss-profile), the <title> element for both channels and items is plain text character data and so should not contain embedded HTML tags.

I also don't see how post titles are already "escaped".

#20 @stevenkword
15 months ago

This directly relates to #29621 in the Export component.

#21 in reply to: ↑ 19 @stevenkword
15 months ago

Replying to mdgl:

Replying to peaceablewhale:

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

According to the RSS Best Practices Profile (http://www.rssboard.org/rss-profile), the <title> element for both channels and items is plain text character data and so should not contain embedded HTML tags.

I also don't see how post titles are already "escaped".

I believe what @peaceablewhale is saying is that the filters applied to the the_title_rss include methods for escaping HTML. If an additional filter was added to the_title_rss that also escaped the disallowed XML characters, there would be no reason to strip the tags.

The string, <h1>, would be disallowed inside an XML element, but &lt;h1> is valid even thought &lt;h1&gt; would be best.

The point is, properly escaped HTML is technically plain text. The only necessary addition would be an additional filter to encode the disallowed characters.

Last edited 15 months ago by stevenkword (previous) (diff)

@stevenkword
3 months ago

Refresh @ 4.5.1

#22 @stevenkword
3 months ago

  • Owner set to stevenkword
  • Status changed from new to assigned

I'm standing behind (attachment:9993.2.patch) as a reasonable approach for the time being. Since the output is also passed through esc_html(), it is not necessary to strip these tags. It also opens up the benefit of having parity between the web view and the feed content.

Unit tests, (attachment:9993-tests.diff), are present and show failure before the patch is applied and success after.

#23 @stevenkword
3 months ago

  • Keywords has-unit-tests added; needs-testing removed
  • Milestone changed from Future Release to 4.6

#24 follow-up: @netweb
3 months ago

@stevenkword Maybe add tests for Atom also?

#25 in reply to: ↑ 24 ; follow-up: @stevenkword
3 months ago

Replying to netweb:

@stevenkword Maybe add tests for Atom also?

This might be a case where the naming conventions don't quite make sense. Both Atom and RSS feeds use the same the_title_rss() method in their respective templates. The only difference being that the Atom template wraps the output in CDATA ,which would be unnecessary, but not problematic after this change.

Ideally, since both templates use XML (the export component too), naming the method the_title_xml() would have been more appropriate. While we do have the ability to add unit tests for Atom as of 4.5, I think the tests proposed in this ticket cover those test cases. As a future part of adding test coverage for feeds, I would like to add a new common.php to /tests/feed/ to help make things more obvious.

Last edited 3 months ago by stevenkword (previous) (diff)

#26 in reply to: ↑ 25 @netweb
3 months ago

Replying to stevenkword:

This might be a case where the naming conventions don't quite make sense. Both Atom and RSS feeds use the same the_title_rss() method in their respective templates. The only difference being that the Atom template wraps the output in CDATA ,which would be unnecessary, but not problematic after this change.

Ideally, since both templates use XML (the export component too), naming the method the_title_xml() would have been more appropriate. While we do have the ability to add unit tests for Atom as of 4.5, I think the tests proposed in this ticket cover those test cases. As a future part of adding test coverage for feeds, I would like to add a new common.php to /tests/feed/ to help make things more obvious.

Because the functionality is different isn't this even more reason to add unit tests for Atom?

Also looking and the_title_rss() is also used in src/wp-admin/includes/export.php#L426, sadly there are zero export unit tests at the moment, there is a patch adding export unit tests in #22435 but they are dependant on the new Export API enhancements in that ticket.

#27 follow-up: @ocean90
3 weeks ago

The strip_tags() call exists since b2, see https://github.com/ericandrewlewis/b2/blame/e0de2a065d3e1d139fbb635e505e7fed5271954e/b2-include/b2template.functions.php#L269.
I think one of the reasons might be that titles can include HTML like <br> or <strong> for visual purposes and not as a part of the title. These shouldn't be displayed in a feed reader.

#28 in reply to: ↑ 27 ; follow-up: @stevenkword
3 weeks ago

Replying to ocean90:

The strip_tags() call exists since b2, see https://github.com/ericandrewlewis/b2/blame/e0de2a065d3e1d139fbb635e505e7fed5271954e/b2-include/b2template.functions.php#L269.
I think one of the reasons might be that titles can include HTML like <br> or <strong> for visual purposes and not as a part of the title. These shouldn't be displayed in a feed reader.

How would you feel about str_replace() those explicity rather than running them through strip_tags()?

#29 in reply to: ↑ 28 @ocean90
10 days ago

  • Milestone changed from 4.6 to Future Release

Replying to stevenkword:

How would you feel about str_replace() those explicity rather than running them through strip_tags()?

Those were just two examples, it can be any HTML.

Maybe we can add another filter for the_title_rss which runs before strip_tags() and encodes < which are not part of a HTML tag as &lt;.

Note: See TracTickets for help on using tickets.