Opened 6 months ago
Last modified 19 seconds ago
#63611 assigned defect (bug)
wp_widget_rss_output: HTML entities that are part of HTML tags should be removed
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | Widgets | Keywords: | good-first-bug has-test-info has-patch commit has-unit-tests |
| Focuses: | | Cc: | |
Description
Related to GB 70477
Some RSS feeds seem to have HTML tags in the title field escaped to HTML entities.
- RSS feed example: https://pubmed.ncbi.nlm.nih.gov/rss/search/16cUU5Jcud0BSYRzHgbqJGm_F6kq07gr9atM8kZoogUmZdX5oj/
- Title example:
Oral administration of &lt;em&gt;Lactiplantibacillus plantarum&lt;/em&gt; GKK1 ameliorates atopic dermatitis in a mouse model
The wp_widget_rss_output() function correctly strips HTML tags, but doesn't remove tags that have already been escaped to HTML entities, so they show up as literal text in the title.
Maybe we need to decode the HTML before stripping the tags:
$title = esc_html( trim( strip_tags( html_entity_decode( $item->get_title() ) ) ) );
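As a rough illustration of why the order matters, the sketch below runs the sample title through both orderings. It is plain PHP rather than WordPress code: `escape_title()` is a hypothetical stand-in for `esc_html()` (assumed here to behave like `htmlspecialchars()` without double-encoding), and `$raw_title` is an assumed value for what `$item->get_title()` returns for this feed.

```php
<?php
// Hypothetical stand-in for esc_html(): escape without double-encoding existing entities.
function escape_title( string $text ): string {
	return htmlspecialchars( $text, ENT_QUOTES, 'UTF-8', false );
}

// Current order: strip tags, then escape. Escaped tags are not tags yet, so they survive.
function format_title_current( string $title ): string {
	return escape_title( trim( strip_tags( $title ) ) );
}

// Proposed order: decode entities first, so escaped tags become real tags and get stripped.
function format_title_proposed( string $title ): string {
	return escape_title( trim( strip_tags( html_entity_decode( $title ) ) ) );
}

// Assumed value of $item->get_title() for the feed above.
$raw_title = 'Oral administration of &lt;em&gt;Lactiplantibacillus plantarum&lt;/em&gt; GKK1';

echo format_title_current( $raw_title ), PHP_EOL;
echo format_title_proposed( $raw_title ), PHP_EOL;
```

With the current order the entities pass through untouched and the browser renders them as visible `<em>` text; with the proposed order only the plain words remain.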
Attachments (4)
Change History (38)
#1
@
6 months ago
- Summary changed from wp_widget_rss_output: should escape HTML entities to wp_widget_rss_output: HTML entities that are part of HTML tags should be removed
#2
follow-up:
↓ 3
@
6 months ago
- Keywords needs-patch needs-unit-tests good-first-bug has-test-info added
Reproduction Report
Description
✅ This report validates that the issue can be reproduced.
Environment
- WordPress: 6.9-alpha-60093-src
- PHP: 8.2.28
- Server: nginx/1.27.5
- Database: mysqli (Server: 8.4.5 / Client: mysqlnd 8.2.28)
- Browser: Chrome 137.0.0.0
- OS: Windows 10/11
- Theme: Twenty Nineteen 3.1
- MU Plugins: None activated
- Plugins:
- Test Reports 1.2.0
Reproduction Instructions
- Use the URL provided in OP, in a RSS Block
- 🐞 Text appears with HTML tags.
Actual Results
- ✅ Error condition occurs (reproduced, check artifacts).
Additional Notes
- On top of your proposed code, adding some unit tests with some mock RSS data like this one would be a good idea (obviously adapted to this test case)
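A mock feed for such a test might look like the sketch below. It is an assumption about how the title is encoded in the feed (escaped once as HTML and once more for XML), and it uses SimpleXML only to keep the example self-contained; core parses feeds through SimplePie, so a real unit test would go through that path instead.

```php
<?php
// Mock feed: the <em> tags in the item title are escaped twice,
// once as HTML entities and once more so the XML stays well-formed.
$mock_feed = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
	<channel>
		<title>Mock feed</title>
		<item>
			<title>Oral administration of &amp;lt;em&amp;gt;Lactiplantibacillus plantarum&amp;lt;/em&amp;gt; GKK1</title>
			<link>https://example.com/item-1</link>
		</item>
	</channel>
</rss>
XML;

// Parsing the XML removes one layer of escaping; the HTML entities remain.
$feed  = simplexml_load_string( $mock_feed );
$title = (string) $feed->channel->item->title;

// Stripping tags alone leaves the escaped tags in place.
$stripped_only = trim( strip_tags( $title ) );

// Decoding first turns them into real tags, which strip_tags() then removes.
$decoded_then_stripped = trim( strip_tags( html_entity_decode( $title ) ) );

echo $stripped_only, PHP_EOL;
echo $decoded_then_stripped, PHP_EOL;
```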
Supplemental Artifacts
#3
in reply to:
↑ 2
;
follow-up:
↓ 4
@
6 months ago
Replying to SirLouen: Thanks for the testing.
Perhaps my explanation was insufficient, but the wp_widget_rss_output() function is not used in the RSS block. The title is escaped by the block itself, so it needs to be fixed separately in Gutenberg.
This core ticket targets the wp_widget_rss_output() function and features that use it, such as the RSS widget in Classic Widgets.
#4
in reply to:
↑ 3
@
6 months ago
Replying to wildworks:
Replying to SirLouen: Thanks for the testing.
Perhaps my explanation was insufficient, but the wp_widget_rss_output() function is not used in the RSS block. The title is escaped by the block itself, so it needs to be fixed separately in Gutenberg.
This core ticket targets the wp_widget_rss_output() function and features that use it, such as the RSS widget in Classic Widgets.
I did not review the code, but I was assuming that the RSS block also used that function. I could also see incorrect HTML parsing in the block.
I'm assuming that you brought this because it was also found in the Gutenberg repo?
This ticket was mentioned in PR #9042 on WordPress/wordpress-develop by @ankitkumarshah.
6 months ago
#5
- Keywords has-patch added; needs-patch removed
https://core.trac.wordpress.org/ticket/63611
### What
RSS feeds sometimes contain HTML tags that have been escaped as HTML entities in their titles. The wp_widget_rss_output() function currently strips HTML tags but doesn't decode HTML entities first, causing escaped tags like &lt;em&gt; to display as literal text instead of being properly removed.
Example:
- RSS feed title: Oral administration of &lt;em&gt; Lactiplantibacillus plantarum &lt;/em&gt; GKK1 ameliorates atopic dermatitis
- Current display: Shows <em> as visible text
- Expected display: Clean title without HTML entities or tags
### How
Add html_entity_decode() before strip_tags() in the title processing logic.
#6
@
6 months ago
I'm assuming that you brought this because it was also found in the Gutenberg repo?
That's right. RSS titles need to be formatted consistently in both core and Gutenberg.
I think it would be a good idea to fix that in core first and then apply the same approach to the Gutenberg RSS block.
@mukesh27 commented on PR #9042:
6 months ago
#7
Thanks @Infinite-Null, the changes look good to me.
6 months ago
#8
@Infinite-Null these changes look good, can you also make the change to the same code in the wp-includes/blocks/rss.php file?
@mukesh27 commented on PR #9042:
6 months ago
#9
Let's add unit tests that check the updated functionality.
@wildworks commented on PR #9042:
6 months ago
#11
@Infinite-Null Thanks for the PR.
Are the ENT_QUOTES and get_option( 'blog_charset' ) options necessary? Because the title will be escaped by esc_html at the end.
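One way to check this question is to compare both variants on a title that contains escaped quotes as well as escaped tags. The sketch below is plain PHP under stated assumptions: `escape_like_esc_html()` is a hypothetical stand-in for `esc_html()`, and the charset is hard-coded to UTF-8 instead of reading `get_option( 'blog_charset' )`.

```php
<?php
// Hypothetical stand-in for esc_html(): escape both quote styles, never double-encode.
function escape_like_esc_html( string $text ): string {
	return htmlspecialchars( $text, ENT_QUOTES, 'UTF-8', false );
}

// Sample title with an escaped apostrophe, escaped double quotes, and escaped tags.
$title = 'It&#039;s &quot;ok&quot; &lt;b&gt;now&lt;/b&gt;';

// Variant with explicit flags and charset.
$explicit = escape_like_esc_html(
	trim( strip_tags( html_entity_decode( $title, ENT_QUOTES, 'UTF-8' ) ) )
);

// Variant relying on the html_entity_decode() defaults.
$default = escape_like_esc_html(
	trim( strip_tags( html_entity_decode( $title ) ) )
);

echo $explicit, PHP_EOL;
echo $default, PHP_EOL;
```

For this input both variants end up identical, because any quote that the decode step leaves encoded is preserved by the final escape, and any quote it decodes is encoded again.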
@ankitkumarshah commented on PR #9042:
6 months ago
#12
@t-hamano Thanks for the feedback! I've updated the patch to use the simpler html_entity_decode( $item->get_title() ) without the additional parameters.
Please review at your convenience.
@wildworks commented on PR #9042:
6 months ago
#13
@Infinite-Null can you check the following feedback?
Let's add unit tests that check the updated functionality.
@ankitkumarshah commented on PR #9042:
6 months ago
#14
Sure @mukeshpanchal27 @t-hamano , I will start working on it shortly.
#15
@
6 months ago
- Keywords needs-testing added
Reproduction Report
Description
This report validates whether the issue reported here can be reproduced.
Environment
- WordPress: 6.9-alpha-60093-src
- PHP: 8.2.28
- Server: nginx/1.27.5
- Database: mysqli (Server: 8.4.5 / Client: mysqlnd 8.2.28)
- Browser: Chrome 137.0.0.0
- OS: Linux
- Theme: Twenty Fifteen 4.0
- MU Plugins: None activated
- Plugins:
- Classic Widgets 0.3
- Test Reports 1.2.0
Steps to Reproduce
- Install and activate the Classic Widgets plugin — https://wordpress.org/plugins/classic-widgets/
- Switch to the Twenty Fifteen theme or any other classic theme.
- Go to Appearance → Widgets.
- Add an RSS widget to the sidebar.
- Use this RSS feed URL: https://pubmed.ncbi.nlm.nih.gov/rss/search/16cUU5Jcud0BSYRzHgbqJGm_F6kq07gr9atM8kZoogUmZdX5oj/
- Save the widget.
- View the site frontend.
Actual Results
✅ RSS feed titles that contain escaped HTML (such as &lt;em&gt;) are not decoded before being stripped, resulting in raw HTML entities being displayed in the title.
ℹ️ Additional Notes
- ⚠️ Ensure you are using an older bundled theme such as Twenty Fifteen.
- ⚠️ The Classic Widgets plugin is required to disable the Block Widgets UI and restore access to the legacy RSS widget.
- Plugin source for reference:
https://plugins.trac.wordpress.org/browser/classic-widgets/tags/0.3/classic-widgets.php
Supplemental Artifacts
#16
@
6 months ago
- Keywords needs-testing removed
Test Report
Description
This report validates whether the proposed patch resolves the issue described in this ticket.
Patch tested: https://github.com/WordPress/wordpress-develop/pull/9042
Environment
- WordPress: 6.9-alpha-60093-src
- PHP: 8.2.28
- Server: nginx/1.27.5
- Database: mysqli (Server: 8.4.5 / Client: mysqlnd 8.2.28)
- Browser: Chrome 137.0.0.0
- OS: Linux
- Theme: Twenty Fifteen 4.0
- MU Plugins: None activated
- Plugins:
- Classic Widgets 0.3
- Test Reports 1.2.0
Actual Results
Tested using the same reproduction steps outlined in comment:15.
📝 For others testing this patch, please refer to those steps. In my case, I did not need to repeat them since my local environment was already configured.
✅ The patch resolves the issue.
Escaped HTML entities such as &lt;em&gt; in RSS feed titles are now properly decoded and stripped before display, resulting in clean, readable titles.
ℹ️ Additional Notes
- ⚠️ Ensure you are using an older bundled theme (e.g., Twenty Fifteen) that supports classic sidebars.
- ⚠️ Activate the Classic Widgets plugin — https://wordpress.org/plugins/classic-widgets/
- ⚠️ This plugin disables the block-based widget editor to expose the legacy RSS Widget, which is necessary for testing this behavior. Reference: https://plugins.trac.wordpress.org/browser/classic-widgets/tags/0.3/classic-widgets.php
Supplemental Artifacts
Trunk - Literally before the patch

After fetching the PR patch - I have outlined the right side of the screenshot to avoid confusion.

@ankitkumarshah commented on PR #9042:
6 months ago
#17
Hi @mukeshpanchal27 and @t-hamano, I have completed the unit test for this PR. Can you please review the test at your convenience?
@wildworks commented on PR #9042:
5 months ago
#18
Thanks for the update, the unit tests look good to me.
@
4 months ago
Refresh patch for #63611 – Decodes HTML entities, strips tags, and sanitizes output to fix entity handling.
#20
@
4 months ago
@sachinrajcp123, We are discussing the RSS title bug fixes. If improvements are needed for the RSS description, we can discuss this in a separate ticket.
#21
@
4 months ago
Decode HTML entities before stripping tags in wp_widget_rss_output() titles to prevent escaped HTML tags (e.g., &lt;em&gt;) from appearing in RSS widget output.
#22
@
4 months ago
@sachinrajcp123 Please don't submit patches that are not directly related to this ticket or that are the same as a patch that has already been submitted.
This ticket was mentioned in Slack in #core by welcher. View the logs.
5 weeks ago
#24
@
5 weeks ago
- Keywords commit added
This was reviewed in the 6.9 bug scrub and this seems ready to commit. We'll need to have it committed before RC1 next week. cc @wildworks
@wildworks commented on PR #9042:
5 weeks ago
#26
I have merged the latest trunk branch into this branch. Once all CI checks pass, I will commit this pull request.
#27
@
5 weeks ago
- Owner set to wildworks
- Status changed from new to assigned
@SirLouen @mukesh27, I plan to commit PR 9042 with the following message before the Beta 3 release. If you have any feedback regarding the commit message, please leave a comment.
Widgets: Decode HTML entities in RSS widget titles before escaping. Some RSS feeds include HTML tags that have been escaped to entities (for example `<em>` appears as `&lt;em&gt;`), causing the literal entity text to appear in the title. The `wp_widget_rss_output()` function now runs `html_entity_decode()` on the title before `strip_tags()` and escaping, ensuring titles render cleanly without displaying escaped tags. Fixes #63611. Props ankitkumarshah, SirLouen, mukesh27, n8finch, rollybueno, sachinrajcp123.
@wildworks commented on PR #9042:
5 weeks ago
#28
@dmsnell, I noticed your feedback regarding html_entity_decode() on a separate ticket: https://core.trac.wordpress.org/ticket/64177#comment:10
I was planning to commit this PR before beta3, but do you have any feedback regarding this PR? In this case, I'm wondering whether simply using html_entity_decode() is sufficient.
This ticket was mentioned in PR #10463 on WordPress/wordpress-develop by @SirLouen.
5 weeks ago
#29
- Keywords has-unit-tests added
I'm refactoring here the tests for #9042
#30
@
5 weeks ago
@wildworks I've reviewed the unit tests, and they don't look good to me.
Also, I've run an XML validation test:
https://validator.w3.org/feed/docs/warning/ContainsHTML.html
It appears that HTML entities in titles are not recommended by the standard.
I've added a new PR (10463), with slightly improved tests (taking advantage of already made tests just to introduce this specific scenario)
PS: Rather than adding that sachinraj bro to props, I would ban the guy from Trac. He has been adding a dozen+ useless patches all over the place (plus the ones already removed), just adding no value or copying already existing patches. Not sure if the guy is aura farming this way, he is simply clueless, or he has gone wild. The fact is that he has been confusing many testers, that saw his patch as the last added to many tickets and thought it was the patch to test mistakenly. It's not 100% his fault, because testers should be more aware of what they are testing, but especially newbs or new contributors in contributor days are being confused by the person. Anyway, it's going to be funny to see the gentleman with a couple of props by the end of the period. Maybe he wants to be the living proof of the dubious utility of the system. Just making you aware of this situation.
#31
@
5 weeks ago
- Milestone changed from 6.9 to 7.0
Since the RC1 release is coming soon, I'd like to punt this ticket to 7.0.
I believe this issue has likely existed for a long time, so its urgency is low.
Both of you have submitted pull requests, but I would appreciate it if you could review the feedback from @dmsnell.
#34
@
19 seconds ago
Today I ran some analysis on a set of around 30,000 RSS feeds I found, which were sourced from ingesting a Bluesky feed. Following are some insights. For context, we currently rely on SimplePie for parsing the RSS feeds, which seems to be built around the various RSS and Atom specifications. Unfortunately, with RSS/Atom feeds, producers frequently implement the specifications in diverse ways.
There is potential to switch to a content-based approach where WordPress infers the content type based on what it sees. For example, let us consider content-carrying elements like TITLE, DESCRIPTION, CONTENT, and CONTENT:ENCODED (unfortunately there’s no universal agreement on what encoded means here, as it could be HTML or XML).
<?php
// Some malformed HTML contains things which look like CDATA sections and aren’t,
// but usually in an RSS feed if one is present, it’s XML. Common RSS feeds also contain
// elements comprising only of a single CDATA section, which could also be checked for.
// These CDATA sections are purely for packaging the content, not for indicating what
// type of content they are; so unpack it and try again.
if ( contains_cdata_section( $content ) ) {
	return 'xml-decode-data-then-reassess';
}

// Assuming there are no CDATA sections, there could still be raw tags, but
// these raw tags might be XHTML embedded within the XML of the feed, or
// HTML found inside the feed. A giveaway of directly-embedded XHTML is
// the presence of namespace directives. These should not contain encoded
// HTML because they are the content themselves.
if ( contains_tag( $content ) ) {
	if ( contains_xmlns_attribute( $content ) ) {
		return 'parse-as-xhtml';
	} else {
		return 'parse-as-html';
	}
}

// With no tags and no character references it’s all plaintext, it’s all the same.
if ( ! contains_character_reference( $content ) ) {
	return 'plaintext-nothing-to-do';
}

// XML 1.0 only defines &gt; &lt; &amp; &quot; and &apos; so if other named character references
// are present it should be decoded as an HTML text node.
if ( contains_named_character_reference_other_than_big_5( $content ) ) {
	return 'parse-as-html';
}

// At this point the content could be HTML or HTML encoded inside XML. The only character
// references are the syntax characters and numeric character references, which do not give
// away the nature of the content. The guessing comes from detecting the pattern of &lt;div&gt;
// as these are unlikely to occur in normal text. Unfortunately, this leads to mis-detection if someone
// is writing _about_ HTML tags and literally encoded the syntax to preserve it. There should be a
// heuristic here to make a choice in the presence of ambiguity, but it’s likely best to assume that
// encodings of tags are actually tags.
$decoded = WP_HTML_Decoder::decode_text_node( $content );
if ( contains_tag( $decoded ) ) {
	return 'decode-then-parse-as-html';
}

return 'decode-then-plaintext';
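To make the decision tree above concrete, here is a minimal runnable sketch. Everything in it is an assumption made for illustration: the helper bodies are rough regular-expression approximations rather than real parsers, `guess_content_strategy()` is a hypothetical wrapper name, and `html_entity_decode()` stands in for `WP_HTML_Decoder::decode_text_node()` so the code runs outside WordPress.

```php
<?php
// Hypothetical helpers: crude pattern checks, not spec-compliant parsing.
function contains_cdata_section( string $content ): bool {
	return false !== strpos( $content, '<![CDATA[' );
}

function contains_tag( string $content ): bool {
	// An opening or closing tag whose name starts right after "<".
	return 1 === preg_match( '~</?[a-zA-Z][^<>]*>~', $content );
}

function contains_xmlns_attribute( string $content ): bool {
	return 1 === preg_match( '~<[a-zA-Z][^<>]*\sxmlns(:[\w.-]+)?\s*=~', $content );
}

function contains_character_reference( string $content ): bool {
	return 1 === preg_match( '~&(#[0-9]+|#[xX][0-9a-fA-F]+|[a-zA-Z][a-zA-Z0-9]*);~', $content );
}

function contains_named_character_reference_other_than_big_5( string $content ): bool {
	// Any named reference except the five that XML 1.0 predefines.
	return 1 === preg_match( '~&(?!(?:gt|lt|amp|quot|apos);)[a-zA-Z][a-zA-Z0-9]*;~', $content );
}

// Hypothetical wrapper around the decision tree described in the comment above.
function guess_content_strategy( string $content ): string {
	if ( contains_cdata_section( $content ) ) {
		return 'xml-decode-data-then-reassess';
	}

	if ( contains_tag( $content ) ) {
		return contains_xmlns_attribute( $content ) ? 'parse-as-xhtml' : 'parse-as-html';
	}

	if ( ! contains_character_reference( $content ) ) {
		return 'plaintext-nothing-to-do';
	}

	if ( contains_named_character_reference_other_than_big_5( $content ) ) {
		return 'parse-as-html';
	}

	// Stand-in for WP_HTML_Decoder::decode_text_node(), which needs WordPress loaded.
	$decoded = html_entity_decode( $content, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

	return contains_tag( $decoded ) ? 'decode-then-parse-as-html' : 'decode-then-plaintext';
}

echo guess_content_strategy( 'Oral administration of &lt;em&gt;L. plantarum&lt;/em&gt; GKK1' ), PHP_EOL;
echo guess_content_strategy( 'Fish &amp; chips' ), PHP_EOL;
```

The title from this ticket lands in the last, ambiguous branch: it only contains the XML-predefined references, and it is the appearance of a tag after decoding that tips the guess toward HTML.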
Unfortunately, tags like <content:encoded> suggest that we have some underlying HTML or XHTML content inside them, but that indicator doesn’t tell us which, and its absence doesn’t imply there isn’t underlying HTML or XHTML.
We might look to best practices such as in a feed like https://nijigen-daily.com/atom.xml which provides tags like this…
<content type="text/html" mode="escaped" xml:lang="ja" xml:base="https://nijigen-daily.com/archives/12944226.html"> <![CDATA[<a target="_blank" href="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg"><img src="https://livedoor.blogimg.jp/nijigen_daily/imgs/3/e/3ebc3caf.jpg" class="res-img" alt="【Key】無自覚にイケメン4人侍らせてるやつ|にじげん!デイリー"></a><div class="res-thread"> </div> <div class="res-thread"><div class="res-block"><div class="res-head"><span class="res-name">1: 名無しさん</span><span class="res-datetime">25/12/08(月)20:42</span><span class="res-likes"></span></div><div class="res-text">女子からしたら結構羨ましい立ち位置なんだろうか</div></div> <div class="res-replies"><div class="res-block res-reply"><div class="res-head"><span class="res-name">21: 名無しさん</span><span class="res-datetime">25/12/08(月)20:52</span><span class="res-likes">そうだねx12</span></div><div class="res-text pink"><span class="res-anchor">>>1</span><br />女子から嫌がらせされるくらいガチで嫌われてたはず</div></div> </div></div> <div class="res-thread"><div class="res-block"><div class="res-head"><span class="res-name">2: 名無しさん</span><span class="res-datetime">25/12/08(月)20:43</span><span class="res-likes">そうだねx5</span></div><div class="res-text purple">小毬ちゃん入るまで女友達0だったからな…</div></div> </div> <div class="res-thread"><div class="res-block"><div class="res-head"><span class="res-name">3: 名無しさん</span><span class="res-datetime">25/12/08(月)20:43</span><span class="res-likes">そうだねx25</span></div><div class="res-text red">本当に侍らせるのは理樹</div></div> </div> <link href="https://nijigen-daily.com/nijigen_daily.css" rel="stylesheet"> <a href="https://nijigen-daily.com/archives/12944238.html">続きを読む</a>]]> </content>
And we can say, “yes, thankfully someone indicates the encoding in the attributes” because indeed, the content is HTML serialized inside XML, not as XHTML but as an opaque text value of the element. However, earlier in the same document we find this…
<summary type="text/plain"> &gt;一体誰なのだ…だ…だれがいうかーーーーっ!! 風のようすがへんなのだ 雲じゃねーか! 新一という秘孔を突いた ユ… ユ…!! ゆうかーーーっ!! ぬぅ!志村けんのカキタレ…!! |北斗の拳|ジャンプ|漫画・アニメ・ゲーム記事のまとめサイトならにじげん!デイリー </summary>
So while this positively identifies the content as plaintext, we find after properly decoding the XML text node that we start with &gt;一体誰なのだ… which almost certainly should start >一体誰なのだ…, meaning the type should be type="text/html" mode="escaped".
Another example is https://www.gadgetguy.com.au/feed/. Here we find <description> with no attributes and it contains an encoded form of what would parse equally as HTML or XHTML. Later in the same feed we find <content:encoded> containing the same thing, which would seem to imply that CONTENT is encoded but DESCRIPTION isn’t, but that’s not true. This comes from WordPress.
For https://www.tagesschau.de/index~atom.xml we find this interesting oddity:
<summary type="text/html" mode="escaped">Die USA haben…</summary>
<content mode="escaped"><![CDATA[<p> <a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html"><img src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920" alt="Donald Trump | AFP" /></a> <br/> <br/>Die USA haben…[<a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html">mehr</a>]</p>]]></content>
Here, both SUMMARY and CONTENT are mode="escaped", but SUMMARY is implied to be different, as type="text/html". Ironically, it contains only plaintext and lacks even a single character reference. Meanwhile, the CONTENT actually has XML double-encoded as XML, which then encodes HTML. This requires some level of recursion if not intending to hard-code it.
<?php
$content      = get_content_element()->textContent;
$first_decode = html_entity_decode( $content, ENT_XML1 | ENT_SUBSTITUTE, 'UTF-8' );
$html         = parse_xml( $first_decode )->textContent;
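As an alternative to hard-coding the number of layers, the unwrapping could be written as a bounded loop. The following is only a sketch under assumptions: `unwrap_nested_encoding()` is a hypothetical name, the CDATA check is a simple pattern match rather than an XML parse, and the pass limit of 3 is arbitrary.

```php
<?php
// Hypothetical helper: peel off escaping layers until real markup (or nothing more) is found.
function unwrap_nested_encoding( string $content, int $max_passes = 3 ): string {
	for ( $pass = 0; $pass < $max_passes; $pass++ ) {
		$trimmed = trim( $content );

		// The value is exactly one CDATA section: unpack it and look again.
		if ( 1 === preg_match( '~^<!\[CDATA\[(.*)\]\]>$~s', $trimmed, $matches ) ) {
			$content = $matches[1];
			continue;
		}

		// No raw "<" but escaped ones: decode a single XML layer and look again.
		if ( false === strpos( $content, '<' ) && false !== strpos( $content, '&lt;' ) ) {
			$content = html_entity_decode( $content, ENT_XML1 | ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8' );
			continue;
		}

		// Nothing left to unwrap.
		break;
	}

	return $content;
}

// A CDATA section that was itself escaped, similar in shape to the CONTENT example above.
$double_encoded = '&lt;![CDATA[&lt;p&gt;Die USA haben…&lt;/p&gt;]]&gt;';

echo unwrap_nested_encoding( $double_encoded ), PHP_EOL;
```

The loop stops as soon as raw markup is visible, so already-decoded content passes through unchanged. It shares the ambiguity discussed earlier: text that is deliberately writing about tags would also be unwrapped.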
Most feeds, for <content mode="escaped"> seem to produce this instead…
<![CDATA[<p> <a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html"><img src="https://images.tagesschau.de/image/0ff925d9-86a2-4888-9d98-56b86ee94412/AAABmwnwQxU/AAABmt42H9g/16x9-big/trump-4488.jpg?width=1920" alt="Donald Trump | AFP" /></a> <br/> <br/>Die USA haben vor der Küste Venezuelas einen Tanker unter ihre Kontrolle gebracht. Das bestätigte US-Präsident Trump. Seit Wochen erhöhen die USA den Druck auf Venezuela und verlegen Seestreitkräfte in die Region.[<a href="https://www.tagesschau.de/ausland/amerika/trump-venezuela-tanker-100.html">mehr</a>]</p>]]>
We can note how someone took the intended serialized XML and then ran it through something like htmlspecialchars() to hide it, much like what happened as the motivating case for this ticket.
Get on with it!
- type=text/plain might indicate that we should avoid decoding after deserializing from XML.
- mode="escaped" doesn’t communicate anything, because all HTML seems to be escaped, and if it’s missing that, it can only be plaintext or embedded XHTML; however, if there are tag-like things, it’s almost certainly XHTML. if, on the other hand, it’s missing the mode and there are things which look like tags after unescaping, it’s probably escaped anyway.
- this is the kind of thing that probably has to rely on some heuristics based on the content in the item itself. feeds sometimes aggregate items and encoding models may diverge within the same XML document.
I hope to automate the scanning of all of the RSS feeds I downloaded, including categorizing these into RSS vs. ATOM explorations, but that will take more time than I had today. Needless to say, I think the current approach (parsing based on our inference of the specifications) is failing us. SimplePie is supposed to already decode and “sanitize” content, and that causes confusion in the diverse world of feeds.


RSS feeds with HTML entities in the title