Make WordPress Core

Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#11219 closed defect (bug) (fixed)

could not be converted to UTF-8 / WordPress should cache failed feed fetches so as to avoid overloading feed sources

Reported by: denis-de-bernardy's profile Denis-de-Bernardy Owned by:
Milestone: 2.9.1 Priority: high
Severity: major Version: 2.9
Component: Optimization Keywords: has-patch tested
Focuses: Cc:

Description (last modified by Denis-de-Bernardy)

SimplePie should not fail to convert UTF-8 to UTF-8.

Also, in some situations SimplePie cannot parse a feed due to it being malformed we should cache that feed so as to not DOS servers.

Attachments (5)

10877.diff (482 bytes) - added by Denis-de-Bernardy 14 years ago.
10877.png (20.7 KB) - added by Denis-de-Bernardy 14 years ago.
11219.diff (471 bytes) - added by Denis-de-Bernardy 14 years ago.
11219.2.diff (624 bytes) - added by ryan 14 years ago.
11219.3.diff (488 bytes) - added by Denis-de-Bernardy 14 years ago.

Download all attachments as: .zip

Change History (39)

#1 @Denis-de-Bernardy
14 years ago

Basically, the problem is this loop:

			// Loop through each possible encoding, till we return something, or run out of possibilities
			foreach ($encodings as $encoding)
			{
				// Change the encoding to UTF-8 (as we always use UTF-8 internally)
				if ($utf8_data = SimplePie_Misc::change_encoding($data, $encoding, 'UTF-8'))
				{
					// Create new parser
					$parser =& new $this->parser_class();
...
			if(isset($parser))
			{
				// We have an error, just set SimplePie_Misc::error to it and quit
				$this->error = sprintf('XML error: %s at line %d, column %d', $parser->get_error_string(), $parser->get_current_line(), $parser->get_current_column());
			}
			else
			{
				$this->error = 'The data could not be converted to UTF-8';
			}

and then change_encoding() doesn't check if it needs to convert nothing at all. and it ends up returning false.

#2 @Denis-de-Bernardy
14 years ago

  • Keywords tested added

I haven't experienced any problems while running this for the past 3 days.

#4 @Denis-de-Bernardy
14 years ago

  • Keywords blocker added

#5 @Denis-de-Bernardy
14 years ago

Marked this as blocker... without the patch applied on my desktop, my dev site's admin area takes loads of time to load on grounds that the UTF-8 feed, which gets retrieved and parsed perfectly fine, could not be converted to UTF-8.

The site's settings are as follows:

define('DB_CHARSET', 'utf8');
define('DB_COLLATE', '');

A show create table claims it's utf8 collated needed.

Feed returns the following header and data:

$ curl -I http://www.semiologic.com/news/wordpress/feed/
HTTP/1.1 200 OK
Date: Thu, 10 Dec 2009 13:56:46 GMT
Server: Apache/2.2.13 (FreeBSD) mod_ssl/2.2.13 OpenSSL/0.9.8e DAV/2 mod_python/3.3.1 Python/2.6.2 SVN/1.6.5
Last-Modified: Mon, 07 Dec 2009 21:39:38 GMT
X-Pingback: http://www.semiologic.com/xmlrpc.php
ETag: "ca1a1329c5c1f838063c134d9b5f3f39"
Content-Type: text/xml; charset=utf-8

$ curl http://www.semiologic.com/news/wordpress/feed/ -s | head -n 1
<?xml version="1.0" encoding="utf-8"?>

#6 @link92
14 years ago

The point of change_encoding in that case is to return a string that is guaranteed to be valid under that encoding. In the UTF-8 case, something like "\xff" is invalid, and should go through change_encoding and come out as whatever the UTF-8 encoding of U+FFFD is. This is needed for web compat with feeds. If there are bugs in the impl is change_encoding, they should be fixed.

#7 @ryan
14 years ago

  • Keywords has-patch tested blocker removed
  • Severity changed from normal to major

#8 @Denis-de-Bernardy
14 years ago

  • Component changed from Administration to Optimization
  • Severity changed from major to normal

If there are bugs in the impl is change_encoding, they should be fixed.

It was actually related to an issue on my localhost (it had neither of mbstring or iconv).

I've attached a patch to make it bail in this case.

What would more ideally be needed, however, is that WP caches simplepie errors for an hour or so. Else, trying to load a site full of RSS widgets on a misconfigured development box ends up crippling.

#9 @Denis-de-Bernardy
14 years ago

Loading the dashboard just as I write, for instance, hammers WP.org on every page load because of:

http://planet.wordpress.org/feed/

XML Parsing Error: not well-formed
Location: http://planet.wordpress.org/feed/
Line Number 169, Column 1:03[13:09] * stas (n=stas@86-107-52-231.Asconet.ro) has joined #bbpress&lt;br /&gt;
^

#10 follow-up: @westi
14 years ago

  • Description modified (diff)
  • Keywords needs-patch added
  • Milestone changed from 2.9 to 3.0
  • Summary changed from SimplePie fails when trying to convert UTF-8 to UTF-8 to WordPress should cache failed feed fetches so as to avoid overloading feed sources

The current patch doesn't seem to be the right thing to do here.

SimplePie has determined that it wants to change the encoding and if this fails the function is meant to return false.

As for the caching of failed lookups it is something we could do but not for 2.9.

We need to work out what we should cache, and for how long.

#11 @miqrogroove
14 years ago

See also #11168. There are lots of complaints about this in the forum now.

#12 @Denis-de-Bernardy
14 years ago

  • Description modified (diff)
  • Summary changed from WordPress should cache failed feed fetches so as to avoid overloading feed sources to could not be converted to UTF-8 / WordPress should cache failed feed fetches so as to avoid overloading feed sources

#13 in reply to: ↑ 10 @Denis-de-Bernardy
14 years ago

  • Priority changed from normal to high
  • Severity changed from normal to major

Replying to westi:

The current patch doesn't seem to be the right thing to do here.

the current patch does exactly what should be done.

SimplePie has determined that it wants to change the encoding and if this fails the function is meant to return false.

No, SimplePie has determined that it cannot convert an UTF-8 string into UTF-8, not because it couldn't convert it or even needed to convert it, but because of crappy logic in its conversion function.

I'll buy the security argument that we need to pass it through iconv or mb_string in the event the UTF-8 string is malformed, but if neither of those are available, we should succeed to "convert" UTF-8 into UTF-8.

#14 @Denis-de-Bernardy
14 years ago

  • Keywords has-patch tested added; needs-patch removed

#15 @ryan
14 years ago

link92 is a SimplePie maintainer, I believe. Any feedback link92?

#16 follow-up: @hakre
14 years ago

about converting UTF8 into UTF8: It might look a bit akward at first but I know about that principle from programming and it often works well. When I remember correctly there is actually an encoding in the iconv lib that is for converting "utf8 into utf8" so to validate / ensure that a "hey I guess it's utf8" afterwards definetly is utf8. But right now I do not remember in which PHP code I've seen that. But I'm pretty shure it exists.

#17 in reply to: ↑ 16 @miqrogroove
14 years ago

Replying to hakre:

right now I do not remember in which PHP code I've seen that.

I've seen it done in WordPress.. it's at... http://core.trac.wordpress.org/browser/trunk/wp-includes/formatting.php#L459

#18 follow-up: @Denis-de-Bernardy
14 years ago

  • Milestone changed from 3.0 to 2.9.1

I hate to be punting this back to 2.9.1, but this ticket is the main reason that RSS widgets have caused so much trouble in WP 2.9. It *should* get fixed before we even think about rushing WP 2.9.1 in the wild; or at the very least, we should have an acceptable workaround.

#19 in reply to: ↑ 18 ; follow-up: @westi
14 years ago

  • Keywords reporter-feedback added

Replying to Denis-de-Bernardy:

I hate to be punting this back to 2.9.1, but this ticket is the main reason that RSS widgets have caused so much trouble in WP 2.9. It *should* get fixed before we even think about rushing WP 2.9.1 in the wild; or at the very least, we should have an acceptable workaround.

In order to understand this in more detail are you able to provide answers to the following questions:

  • Is this a behavioural change between SimplePie 1.1.3 and SimplePie 1.2 that means 2.9 is worse than 2.8?
  • Does a 2.8 install choke in the same way?
  • Is there an easy step by step reproduction method for the issue?
  • Have you tried SimplePie 1.1.3 + WP 2.9 to see if it helps?

#20 in reply to: ↑ 19 ; follow-up: @Denis-de-Bernardy
14 years ago

  • Keywords blocker added; reporter-feedback removed

Replying to westi:

Replying to Denis-de-Bernardy:

I hate to be punting this back to 2.9.1, but this ticket is the main reason that RSS widgets have caused so much trouble in WP 2.9. It *should* get fixed before we even think about rushing WP 2.9.1 in the wild; or at the very least, we should have an acceptable workaround.

In order to understand this in more detail are you able to provide answers to the following questions:

  • Is this a behavioural change between SimplePie 1.1.3 and SimplePie 1.2 that means 2.9 is worse than 2.8?

yes

  • Does a 2.8 install choke in the same way?

no

  • Is there an easy step by step reproduction method for the issue?

yes. I've been screaming how to reproduce this over and over further up. disable multibyte functions, and disable iconv. create a new site, using the UTF8 charset, and watch -- with amusement -- how the dashboard chokes in WP 2.9.

  • Have you tried SimplePie 1.1.3 + WP 2.9 to see if it helps?

no, I haven't. the only thing I do know is where it comes from in WP 2.9: when SimplePie seeks to "convert" the UTF8 feed into UTF8, and finds neither of multibyte functions nor iconv, it fails miserably -- in spite of the fact that it technically has nothing to convert.

so again, what my patch does is only the following: it adds a use case *after* it tries MB functions and iconv, which basically goes: if input charset = output charset, then don't convert a thing and just return the original data.

I'll happily buy that there might be security implications if we accept a UTF8 string *as is*, so maybe we could try to at least validate the mess if we've the available tools and functions. the main difference between my first and second patch, in fact, is that I've simply placed the code further down in the convert function.

Still, the point stands: SimplePie and WP should not fail, as it currently does, when trying to convert data from a charset to that same charset when there are neither of MB function and iconv.

#21 @miqrogroove
14 years ago

Also remember a standalone UTF-8 validation function can be written very easily.

#22 @ryan
14 years ago

The latest in the SimplePie repo seems to be doing the same thing still.

http://github.com/rmccue/SimplePie/blob/master/simplepie.inc

Now that the check is moved down, it seems fine to me. How many hosts will have neither extension? We either need to fix it this way, add some validation that we probably don't have time for, or cache the false returns and live with ugly errors.

@ryan
14 years ago

#23 @ryan
14 years ago

Perhaps a silly patch, but I made an attempt at using SimplePie_Misc::utf8_bad_replace().

#24 follow-up: @westi
14 years ago

  • Keywords reporter-feedback added; blocker removed

OK the 2.9 SimplePie has one change over the 2.8 one in the function change_encoding

The test which starts {{{function_exists('mb_convert_encoding')}} has an extra check for
@mb_convert_encoding("\x80", 'UTF-16BE', $input) !== "\x00\x80"

So if 2.8 works on the server then is it likely this check that is failing now not that function_exists('mb_convert_encoding') fails?

This check passes if $input is UTF-8 as 0x80 is an invalid UTF-8 sequence - It has the top bit set and therefore is the second, third, or fourth byte of a multi-byte sequence.

This check doesn't pass for ASCII as the $input.

If you revert just this change does it work in 2.9?

It looks like in 2.9 that part of change_encoding may only work if $input is UTF-8.

I can't see how that function can work in 2.8 and not in 2.9 if neither iconv or mb_convert_encoding are present.

#25 @westi
14 years ago

This change in SimplePie 1.2 was introduced here:
http://github.com/skyzyx/simplepie/commit/343fdd24e616f7ab9ad4bf911bc3f0581e2fff0a

For this ticket:
http://bugs.simplepie.org/issues/show/116

Maybe we should revert that one change

#26 in reply to: ↑ 20 @Denis-de-Bernardy
14 years ago

Replying to Denis-de-Bernardy:

Replying to westi:

  • Is there an easy step by step reproduction method for the issue?

yes. I've been screaming how to reproduce this over and over further up. disable multibyte functions, and disable iconv. create a new site, using the UTF8 charset, and watch -- with amusement -- how the dashboard chokes in WP 2.9.

  • Have you tried SimplePie 1.1.3 + WP 2.9 to see if it helps?

no, I haven't.

I just did, and I can confirm it doesn't chock when I take the 2.8.6 file over on a 2.9 install without mb_string and iconv.

#27 in reply to: ↑ 24 @Denis-de-Bernardy
14 years ago

Replying to westi:

OK the 2.9 SimplePie has one change over the 2.8 one in the function change_encoding

The test which starts {{{function_exists('mb_convert_encoding')}} has an extra check for
@mb_convert_encoding("\x80", 'UTF-16BE', $input) !== "\x00\x80"

If you revert just this change does it work in 2.9?

I'm afraid it doesn't. and that's normal, since it checks for mb_string beforehand.

I can't see how that function can work in 2.8 and not in 2.9 if neither iconv or mb_convert_encoding are present.

Because it doesn't try to convert in SP 1.1.3. I've yet to identify why it tries to so in 2.9 though.

D.

#28 @Denis-de-Bernardy
14 years ago

  • Keywords reporter-feedback removed

update: SP 1.1.3 works because it improperly detects UTF-8 feeds as ISO-8859-1.

				// Change the encoding to UTF-8 (as we always use UTF-8 internally)
				if ($utf8_data = SimplePie_Misc::change_encoding($data, $encoding, 'UTF-8'))
				{
					// Create new parser
					$parser =& new $this->parser_class();

					// If it's parsed fine
					if ($parser->parse($utf8_data, 'UTF-8'))
					{
						var_dump($encoding);die; // ISO-8859-1

Placing the same code in SP 1.2 never dies (it fails to create a parser in the first place).

#29 follow-up: @Denis-de-Bernardy
14 years ago

11219.3.diff works too, and it should be applied to WP 2.9.1.

In 1.1.3, the encoding() function returns Windows-1252, whereas 1.2 returns windows-1252 (lowercase), when the input is ISO-8859-1. The patch fixes the check in change_encoding() accordingly.

I think we should also consider applying Ryan's patch, or something similar, in WP 3.0. In order to handle UTF-8 to UTF-8 "conversion" in a secure manner.

#30 @ryan
14 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [12528]) windows-1252 is canonical, not Windows-1252. Props Denis-de-Bernardy. fixes #11219

#31 @ryan
14 years ago

[12529] for 2.9

#32 in reply to: ↑ 29 ; follow-up: @hakre
14 years ago

Replying to Denis-de-Bernardy:

I think we should also consider applying Ryan's patch, or something similar, in WP 3.0. In order to handle UTF-8 to UTF-8 "conversion" in a secure manner.

After this changeset now I suggest to create a new ticket for this.

#33 in reply to: ↑ 32 @Denis-de-Bernardy
14 years ago

Replying to hakre:

Replying to Denis-de-Bernardy:

I think we should also consider applying Ryan's patch, or something similar, in WP 3.0. In order to handle UTF-8 to UTF-8 "conversion" in a secure manner.

After this changeset now I suggest to create a new ticket for this.

we could also leave that to the folks at SimplePie. Has link92 or another SimplePie maintainer been monitoring this ticket?

#34 @link92
14 years ago

I've been vaguely watching this, but have been generally busy over the past few weeks with the Opera 10.50 pre-alpha release, and am now away over the holidays.

Basically, probably the best quick fix is to use utf8_bad_replace() (which older releases of SP did, though it violates the XML spec). I'll try and fix this some better way when I next look at SP (which is fairly rare).

Note: See TracTickets for help on using tickets.