Make WordPress Core

Opened 17 years ago

Last modified 2 weeks ago

#9993 assigned defect (bug)

Rss and atom feeds are dropping some characters

Reported by: pm24601's profile pm24601 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 2.7.1
Component: Feeds Keywords: has-patch has-unit-tests needs-refresh needs-test-info
Focuses: Cc:

Description

Blog post with title:

& > test <

has the less than ( < ) is stripped from the atom and rss feed.

Attachments (3)

9993.patch (530 bytes) - added by peaceablewhale 16 years ago.
9993-tests.diff (1014 bytes) - added by stevenkword 11 years ago.
Unit tests for the_title_rss
9993.2.patch (594 bytes) - added by stevenkword 10 years ago.
Refresh @ 4.5.1

Download all attachments as: .zip

Change History (38)

#1 @pm24601
17 years ago

  • Cc pm24601 added

#2 @redsweater
17 years ago

Confirmed in WordPress trunk. The test case example loses the final "<" character, specifically the final output in the feed looks like:

<![CDATA[&amp; &gt; test ]]>

#3 @Denis-de-Bernardy
17 years ago

  • Keywords needs-patch added
  • Milestone changed from Unassigned to 2.8

#5 @peaceablewhale
17 years ago

  • Component changed from XML-RPC to Feeds
  • Keywords has-patch added; needs-patch removed
  • Owner josephscott deleted

The PHP's built-in "strip_tags" function has caused the error. It also seems not necessary to have it applied to the_title_rss(). I have uploaded a patch to remove it.

#6 @Denis-de-Bernardy
17 years ago

trouble with the patch is the strip_tags() call might be there for a reason. isn't it possible to have html in a post title?

#7 @Denis-de-Bernardy
17 years ago

  • Milestone changed from 2.8 to 2.9

punting per IRC discussion

#8 @peaceablewhale
17 years ago

Yes. However, WordPress will escape the HTML (via the "ent2ncr" function) and therefore they are no longer HTML.

I think calling strip_tags is problmatic because an author can have a post title like "Use <h1> to <h6> for headings, <p> for paragraphy, but not formatting". strip_tags will remove the HTML unexpectedly.

#9 @Denis-de-Bernardy
17 years ago

  • Keywords needs-review added

some core dev feedback would be useful here

#10 @peaceablewhale
17 years ago

The priorities of ent2ncr are no longer modified by the patch. The new patch only removes the unnecessary 'strip_tags' filter.

#11 @peaceablewhale
16 years ago

  • Keywords needs-testing added; needs-review removed

Marking this report "needs-testing" in order to gain attention in the upcoming bug hunt.

#12 @ryan
16 years ago

  • Milestone changed from 2.9 to 3.0

There are other areas where we need a smarter strip. Let's figure it out in 3.0.

#13 @dd32
16 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1

Punting due to impending beta

#14 follow-up: @peaceablewhale
16 years ago

Patch updated for 3.1.

"& > test <" now works even without the patch, but "<test>This is a test</test>" not.

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

#15 @nacin
15 years ago

  • Keywords 3.2-early added; early removed
  • Milestone changed from Awaiting Triage to Future Release

Bumping once more.

#16 @MikeHansenMe
13 years ago

  • Cc mdhansen@… added
  • Keywords 3.2-early removed

I was able to reproduce the problem in trunk (3.5-beta2-22252). Confirmed the patch fixes the problem.

#17 @nacin
13 years ago

  • Keywords needs-unit-tests added

@stevenkword
11 years ago

Unit tests for the_title_rss

#18 @stevenkword
11 years ago

  • Keywords needs-unit-tests removed

Errors still exists. Patch 9993-tests.diff adds test cases for both lone XML entities and XML tags.

#19 in reply to: ↑ 14 ; follow-up: @mdgl
11 years ago

Replying to peaceablewhale:

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

According to the RSS Best Practices Profile (http://www.rssboard.org/rss-profile), the <title> element for both channels and items is plain text character data and so should not contain embedded HTML tags.

I also don't see how post titles are already "escaped".

#20 @stevenkword
11 years ago

This directly relates to #29621 in the Export component.

#21 in reply to: ↑ 19 @stevenkword
11 years ago

Replying to mdgl:

Replying to peaceablewhale:

I still think there is no reason for having 'strip_tags' in feed title as they are escaped already.

According to the RSS Best Practices Profile (http://www.rssboard.org/rss-profile), the <title> element for both channels and items is plain text character data and so should not contain embedded HTML tags.

I also don't see how post titles are already "escaped".

I believe what @peaceablewhale is saying is that the filters applied to the the_title_rss include methods for escaping HTML. If an additional filter was added to the_title_rss that also escaped the disallowed XML characters, there would be no reason to strip the tags.

The string, <h1>, would be disallowed inside an XML element, but &lt;h1> is valid even thought &lt;h1&gt; would be best.

The point is, properly escaped HTML is technically plain text. The only necessary addition would be filter to encode the disallowed characters.

Version 2, edited 11 years ago by stevenkword (previous) (next) (diff)

@stevenkword
10 years ago

Refresh @ 4.5.1

#22 @stevenkword
10 years ago

  • Owner set to stevenkword
  • Status changed from new to assigned

I'm standing behind (attachment:9993.2.patch) as a reasonable approach for the time being. Since the output is also passed through esc_html(), it is not necessary to strip these tags. It also opens up the benefit of having parity between the web view and the feed content.

Unit tests, (attachment:9993-tests.diff), are present and show failure before the patch is applied and success after.

#23 @stevenkword
10 years ago

  • Keywords has-unit-tests added; needs-testing removed
  • Milestone changed from Future Release to 4.6

#24 follow-up: @netweb
10 years ago

@stevenkword Maybe add tests for Atom also?

#25 in reply to: ↑ 24 ; follow-up: @stevenkword
10 years ago

Replying to netweb:

@stevenkword Maybe add tests for Atom also?

This might be a case where the naming conventions don't quite make sense. Both Atom and RSS feeds use the same the_title_rss() method in their respective templates. The only difference being that the Atom template wraps the output in CDATA ,which would be unnecessary, but not problematic after this change.

Ideally, since both templates use XML (the export component too), naming the method the_title_xml() would have been more appropriate. While we do have the ability to add unit tests for Atom as of 4.5, I think the tests proposed in this ticket cover those test cases. As a future part of adding test coverage for feeds, I would like to add a new common.php to /tests/feed/ to help make things more obvious.

Last edited 10 years ago by stevenkword (previous) (diff)

#26 in reply to: ↑ 25 @netweb
10 years ago

Replying to stevenkword:

This might be a case where the naming conventions don't quite make sense. Both Atom and RSS feeds use the same the_title_rss() method in their respective templates. The only difference being that the Atom template wraps the output in CDATA ,which would be unnecessary, but not problematic after this change.

Ideally, since both templates use XML (the export component too), naming the method the_title_xml() would have been more appropriate. While we do have the ability to add unit tests for Atom as of 4.5, I think the tests proposed in this ticket cover those test cases. As a future part of adding test coverage for feeds, I would like to add a new common.php to /tests/feed/ to help make things more obvious.

Because the functionality is different isn't this even more reason to add unit tests for Atom?

Also looking and the_title_rss() is also used in src/wp-admin/includes/export.php#L426, sadly there are zero export unit tests at the moment, there is a patch adding export unit tests in #22435 but they are dependant on the new Export API enhancements in that ticket.

#27 follow-up: @ocean90
10 years ago

The strip_tags() call exists since b2, see https://github.com/ericandrewlewis/b2/blame/e0de2a065d3e1d139fbb635e505e7fed5271954e/b2-include/b2template.functions.php#L269.
I think one of the reasons might be that titles can include HTML like <br> or <strong> for visual purposes and not as a part of the title. These shouldn't be displayed in a feed reader.

#28 in reply to: ↑ 27 ; follow-up: @stevenkword
10 years ago

Replying to ocean90:

The strip_tags() call exists since b2, see https://github.com/ericandrewlewis/b2/blame/e0de2a065d3e1d139fbb635e505e7fed5271954e/b2-include/b2template.functions.php#L269.
I think one of the reasons might be that titles can include HTML like <br> or <strong> for visual purposes and not as a part of the title. These shouldn't be displayed in a feed reader.

How would you feel about str_replace() those explicity rather than running them through strip_tags()?

#29 in reply to: ↑ 28 @ocean90
10 years ago

  • Milestone changed from 4.6 to Future Release

Replying to stevenkword:

How would you feel about str_replace() those explicity rather than running them through strip_tags()?

Those were just two examples, it can be any HTML.

Maybe we can add another filter for the_title_rss which runs before strip_tags() and encodes < which are not part of a HTML tag as &lt;.

#30 @desrosj
13 months ago

  • Keywords needs-testing needs-refresh added
  • Milestone set to Awaiting Review
  • Owner stevenkword deleted

Found this going through a list of tickets missing a milestone.

Since it has been a good amount of time since this was last investigated, looking at the current state of this report in the context of today's code base would be a good first step to reviving.

I'm going to remove the ticket owner to make it more clear that no one is actively working on this one.

#31 @redsweater
13 months ago

I just tested against a WordPress.com blog (most convenient) and the bug is still there. It seems minor, but it would still be nice to fix. Probably still an issue in .org blogs as well.

#32 @sukhendu2002
7 months ago

  • Keywords has-test-info added

Reproduction Report

Description

This report validates whether the issue can be reproduced.

Environment

  • WordPress: 6.9-alpha-60093-src
  • PHP: 8.2.29
  • Server: nginx/1.29.0
  • Database: mysqli (Server: 8.4.5 / Client: mysqlnd 8.2.29)
  • Browser: Chrome 138.0.0.0
  • OS: macOS
  • Theme: Twenty Fifteen 4.0
  • MU Plugins: None activated
  • Plugins:
    • Hello Dolly 1.7.2
    • Test Reports 1.2.0

Actual Results

Error condition does not occur .

  • When creating a post with the title & > test <, the generated RSS/Atom feed encodes the < character as &#060; (numeric entity), not as a missing or stripped character.
  • The feed output is: `xml <title> Comments on: &#038; &gt; test &#060; </title> `
  • The < character is present and correctly encoded, not stripped.

Additional Notes

  • The original concern was that < was missing from the feed output. However, with the current filter configuration, < is encoded as &#060;.

#33 @redsweater
7 months ago

For what it's worth, I just reproduced this with the latest trunk sources. I think that perhaps the test @sukhendu2002 performed made it appear that the bug is fixed because an entity encoding was created by the web admin interface when creating the post. When using the XMLRPC API, for example, to send a post to the blog, the "<" character is still allowed to reach the database, whereupon the bug can be observed.

I don't know that it's necessarily worth fixing after all these years, but the bug still exists.

#34 @SirLouen
2 weeks ago

  • Keywords needs-test-info added; needs-testing has-test-info removed

Given that @redsweater has reproduced it, but has not provided exact instructions, I think it would be nice to have them

#35 @redsweater
2 weeks ago

As I mentioned, I'm not sure it's worth fixing this after all these years. Anybody who would work on a fix would probably be able to create arbitrary XML-RPC requests for testing at will, so the steps to reproduce are:

  1. Establish a test blog e.g. at http://localhost:8889/
  2. POST to the http://localhost:8889/xmlrpc.php endpoint with XML representing a new post request, where the title is "& > test <"
  3. Using a newsreader or curl, access the feed for the site:

curl http://localhost:8889/feed/

Expected: the text within <title></title> tag on the post in the feed should include a "<" character at the end.
Actual: It's stripped.

Last edited 2 weeks ago by redsweater (previous) (diff)
Note: See TracTickets for help on using tickets.