Opened 13 years ago
Closed 13 years ago
#20999 closed defect (bug) (fixed)
WordPress 3.4 doesn't handle rss url with percent code correctly
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | 3.5 | Priority: | normal |
Severity: | normal | Version: | 3.4 |
Component: | External Libraries | Keywords: | dev-feedback has-patch |
Focuses: | Cc: |
Description
I have a rss widget with this feed: http://news.google.com/news?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss
The rss handler makes all the percent code into question symbols (represent unknown characters). These characters also have differnt percent codes: http://news.google.com/news?gl=vn&pz=1&ned=vi_vn&hl=vi&q=ch%EF%BF%BD%EF%BF%BD%EF%BF%BDng+h%EF%BF%BD%EF%BF%BDng+gi%EF%BF%BD%EF%BF%BD%EF%BF%BD
Keep the rss url in unicode format doesn't help.
The bug only appears in 3.4 version.
Attachments (1)
Change History (21)
#1
@
13 years ago
- Cc kpayne@… added
- Keywords reporter-feedback added
- Severity changed from major to normal
#2
@
13 years ago
It still shows up with content because both of the link still have result from google news. Can you recheck it again and see that the content showed up in your site is the same as the original link's XML output or the error link's XML output?
I have deleted and recreated the widget many times and even installed third-party rss widget but may be because they all use wordpress internal function for parsing rss feed, the results are the same with the wrong content.
#3
follow-up:
↓ 9
@
13 years ago
- Component changed from Feeds to External Libraries
- Keywords dev-feedback added; reporter-feedback removed
Thanks for the extra information. I can reproduce this with the code below:
$url = 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss'; var_dump($url); // outputs: // string 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss' (length=117) $iri =& new SimplePie_IRI($url); var_dump($iri->get_iri()); // outputs: // string 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=chống+hàng+giả&output=rss' (length=101)
The problem is happening in SimplePie_IRI::replace_invalid_with_pct_encoding()
.
Related #18309
#5
@
13 years ago
- Summary changed from Wordpress 3.4 doesn't handle rss url with percent code correctly to WordPress 3.4 doesn't handle rss url with percent code correctly
#6
@
13 years ago
This appears to be broken in simplepie master as well: https://raw.github.com/simplepie/simplepie/master/SimplePie/IRI.php
#7
follow-up:
↓ 8
@
13 years ago
For 1.2.0, I get:
string(143) "https://news.google.com/news/feeds?gl%3Dvn%26pz%3D1%26cf%3Dall%26ned%3Dvi_vn%26hl%3Dvi%26q%3Dch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3%26output%3Drss"
Not sure if that is right or not. Seems like it worked in 3.3, so I imagine yes.
Per rmccue, relevant issue: https://github.com/simplepie/simplepie/issues/112.
#8
in reply to:
↑ 7
@
13 years ago
- Owner set to rmccue
- Status changed from new to assigned
Replying to nacin:
For 1.2.0, I get:
string(143) "https://news.google.com/news/feeds?gl%3Dvn%26pz%3D1%26cf%3Dall%26ned%3Dvi_vn%26hl%3Dvi%26q%3Dch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3%26output%3Drss"Not sure if that is right or not. Seems like it worked in 3.3, so I imagine yes.
Nope, that's incorrect. Note how the & and = in the query string have been encoded. That was issue 112, which was fixed. This appears to be a new issue, but related.
This appears to be related to https://github.com/simplepie/simplepie/issues/162 and possibly https://github.com/simplepie/simplepie/issues/58
From the issue description:
http://news.google.com/news?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss http://news.google.com/news?gl=vn&pz=1&ned=vi_vn&hl=vi&q=ch%EF%BF%BD%EF%BF%BD%EF%BF%BDng+h%EF%BF%BD%EF%BF%BDng+gi%EF%BF%BD%EF%BF%BD%EF%BF%BD
Note that &cf=all
and &output=rss
have disappeared. All the characters in the original URL have been turned into a � character (U+FFFD, aka 0xEF 0xBF 0xBD). That would indicate to me a possible issue with iconv or mbstring on the version of PHP on the server. I'll have a dig into this when I've got stable internet.
#9
in reply to:
↑ 3
@
13 years ago
Replying to kurtpayne:
Thanks for the extra information. I can reproduce this with the code below:
$url = 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss'; var_dump($url); // outputs: // string 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss' (length=117) $iri =& new SimplePie_IRI($url); var_dump($iri->get_iri()); // outputs: // string 'https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=chống+hàng+giả&output=rss' (length=101)
Note that this is not the same as the original issue, although it's likely that it's linked. SimplePie_IRI
decodes some percent-encoded strings to try and form a canonical IRI. It appears in this case that it is doing so incorrectly. The original issue though has the characters replaced by U+FFFD.
My guess is that the same code is affecting both instances, but that the original issue is also occurring because of iconv/mbstring errors.
#10
@
13 years ago
The commit which broke this was 4e022de12 for 1.3-dev and 76b5fd63 for 1.2.1. This commit was a backport of the more up-to-date IRI-parsing code from the ComplexPie fork (possibly future SimplePie 2).
The issue is that SimplePie thinks the characters are valid for a URL when it checks. While they are valid UTF-8/UCS characters, they don't appear to be valid for URLs based on my reading of RFC3986. At the moment, I'm lacking the mental power to work out exactly why that test is there, so I'll have to come back to doing this later.
#12
@
13 years ago
This should now be fixed in https://github.com/simplepie/simplepie/commit/90a771f21985e01cb0133c89631debfa951f6171
The problem was due to backporting the incomplete updated IRI library. SimplePie_IRI::get_iri()
always returned an IRI, which is as designed. The problem is that all the rest of the code pretends it returns a URI.
The above commit fixes that issue by breaking what get_iri()
should do, in favour of making it work. Let me know if this fixes your issue (I believe it should).
#13
follow-up:
↓ 14
@
13 years ago
Testing the user supplied URL -- RSS widget now displays the same content as visiting the URL directly.
Sample code from above also displays identical output for get_iri()
and the original URL.
Also tested with the WordPress RSS feed URL -- passed.
Patch works.
#14
in reply to:
↑ 13
@
13 years ago
- Keywords has-patch added
Replying to kurtpayne:
Patch works.
Thanks for confirming and patchifying. Marking as has-patch; this is now ready for commit.
#15
@
13 years ago
- Keywords has-patch removed
- Resolution set to invalid
- Status changed from assigned to closed
Previous patch is obsoleted by #21183.
I created a new RSS feed widget on my site and pasted in this URL:
https://news.google.com/news/feeds?gl=vn&pz=1&cf=all&ned=vi_vn&hl=vi&q=ch%E1%BB%91ng+h%C3%A0ng+gi%E1%BA%A3&output=rss
The widget showed up on my site with content. Also, the latest changes to
esc_url
were in [18469] for 3.2.1.If you delete and recreate the widget, does this solve the problem?