Opened 17 years ago

Closed 16 years ago

Last modified 16 years ago

#7563 closed defect (bug) (fixed)

html_entity_decode at RSS Feed import doesn't respect charset of Blog

Reported by: codestyling
Owned by: ryan
Milestone: 2.8
Priority: high
Severity: critical
Version: 2.5.1
Component: Feeds
Keywords: rss bug feed encoding damage has-patch

Description

Error: When the dashboard RSS module or the RSS widget imports a feed whose content contains German special characters written as HTML entities, such as &auml; for ä, they are displayed as �.

Solution: Passing get_option('blog_charset') as the charset argument of html_entity_decode() makes the feed content display correctly.

file: wp-admin/includes/dashboard.php

line: 431


$description = wp_specialchars( strip_tags(html_entity_decode($item['description'], ENT_QUOTES, get_option('blog_charset'))) );

file: wp-includes/widgets.php

lines: 1130 and 1132


if ( isset( $item['description'] ) && is_string( $item['description'] ) )
	$desc = $summary = str_replace(array("\n", "\r"), ' ', attribute_escape(strip_tags(html_entity_decode($item['description'], ENT_QUOTES, get_option('blog_charset')))));
elseif ( isset( $item['summary'] ) && is_string( $item['summary'] ) )
	$desc = $summary = str_replace(array("\n", "\r"), ' ', attribute_escape(strip_tags(html_entity_decode($item['summary'], ENT_QUOTES, get_option('blog_charset')))));

With the blog charset passed in, the feed content is decoded to match the blog's configured charset.
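
As a rough illustration of why the charset argument matters, the following standalone PHP sketch decodes the same entity with two different target charsets. It does not call WordPress: the variable $blog_charset is only a stand-in for whatever get_option('blog_charset') would return, the value 'UTF-8' is an assumption, and the sample string is made up.

```php
<?php
// Stand-in for get_option( 'blog_charset' ); the value is an assumption.
$blog_charset = 'UTF-8';

// Made-up sample description containing an HTML entity.
$description = 'M&auml;rkische Allgemeine';

// Decoded for a UTF-8 blog: the entity becomes the two-byte sequence C3 A4.
$utf8 = html_entity_decode( $description, ENT_QUOTES, $blog_charset );

// Decoded as ISO-8859-1: the entity becomes the single byte E4, which is
// not valid UTF-8 on its own and tends to render as a replacement
// character on a UTF-8 page.
$latin1 = html_entity_decode( $description, ENT_QUOTES, 'ISO-8859-1' );

// Print both results as hex so the byte difference is visible.
echo bin2hex( $utf8 ), "\n";
echo bin2hex( $latin1 ), "\n";
```

PHP versions before 5.4 used ISO-8859-1 as the default for the third argument, which is consistent with the symptom described above when the charset is left out on a UTF-8 blog.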

Attachments (2)

rss.zip (6.5 KB) - added by codestyling 17 years ago.
patched WP 2.6.1 file: /wp-includes/rss.php
html_entity_decode.patch (5.3 KB) - added by adferguson 16 years ago.
Actual patch to change all html_entity_decode calls to respect blog charset

Change History (21)

#1 @codestyling
17 years ago

Additional note: because of the unresolved PHP4 bug http://bugs.php.net/bug.php?id=27626, it may be necessary to prepend @ to html_entity_decode to avoid warnings like this:

Warning: cannot yet handle MBCS in html_entity_decode()! in /www/htdocs/xxx/wp-includes/widgets.php on line 1130

PHP5 works as expected, but this PHP4 bug seems to remain unresolved, so it should be handled with warning suppression.

@codestyling
17 years ago

patched WP 2.6.1 file: /wp-includes/rss.php

#2 @codestyling
17 years ago

  • Keywords rss bug feed encoding damage added
  • Version set to 2.5.1

I have created a patch for the MagpieRSS class so that it handles imported feeds correctly.
The patch covers PHP4 versions, which don't detect the feed's encoding (UTF-8 feeds are handled as ISO feeds), and PHP5 versions (with detection), to ensure that ISO-based HTML entities get converted into the UTF-8 target.
Here are 2 feeds that get damaged if added to the dashboard:

-> ISO-8859-1 feed

http://www.maerkischeallgemeine.de/cms/list/6947650?style_only=J&cms_encoding=iso

-> UTF-8 feed with ISO entities (like &auml;)

http://blog.wordpress-deutschland.org/feed

The patch has been tested on PHP4 and PHP5 with both example feeds, and both are now shown correctly. The database also no longer stores damaged option values (the original rss.php sometimes produced broken serialized data, depending on feed content).
The input encoding is detected by running a regular expression on the raw data, and the output encoding is set to the blog charset from the corresponding option value.
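
The detect-then-convert idea described here could look roughly like the sketch below. This is not the code from the attached rss.zip: both function names are hypothetical, the regular expression only looks at the XML declaration, and the conversion assumes the mbstring extension is available.

```php
<?php
// Hypothetical helpers for illustration only; these names do not exist
// in WordPress or MagpieRSS.

// Read the encoding from the XML declaration of the raw feed, if present.
function sketch_detect_feed_encoding( $raw, $fallback = 'UTF-8' ) {
	$pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
	if ( preg_match( $pattern, $raw, $matches ) ) {
		return strtoupper( $matches[1] );
	}
	return $fallback;
}

// Convert the raw feed to the target charset (requires mbstring).
// The encoding named inside the XML declaration is left unchanged.
function sketch_convert_feed( $raw, $target_charset ) {
	$source_charset = sketch_detect_feed_encoding( $raw );
	if ( strtoupper( $target_charset ) === $source_charset ) {
		return $raw;
	}
	return mb_convert_encoding( $raw, $target_charset, $source_charset );
}

// Made-up ISO-8859-1 feed fragment; \xF6 is "ö" in that charset.
$iso_feed  = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><title>L\xF6ns</title>";
$converted = sketch_convert_feed( $iso_feed, 'UTF-8' );

echo sketch_detect_feed_encoding( $iso_feed ), "\n";
echo bin2hex( $converted ), "\n";
```

In WordPress the target charset would presumably come from get_option('blog_charset') rather than a literal; the literal is used here only to keep the sketch self-contained.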

#3 @adferguson
17 years ago

In the current trunk, html_entity_decode occurs in 4 files:

wp-admin/import/blogger.php (4 times)
wp-admin/includes/dashboard.php (1 time, covered above)
wp-includes/feed.php (1 time)
wp-includes/widgets.php (3 times, 2 covered above)

Can this bug please be fixed? The patch is *minimal* and without the fix, the W3C Validator refuses to validate any pages which have an RSS feed incorrectly decoded to the default ISO-8859-1.

thanks,
Andrew

#4 @jtatum
17 years ago

  • Keywords has-patch added

#5 @DD32
17 years ago

  • Milestone changed from 2.7 to 2.7.1

#6 @ryan
16 years ago

  • Component changed from General to Feeds
  • Owner anonymous deleted

#7 @hakre
16 years ago

Please do not place an "@" in front of the function, because:

a) this is php4, php4 is not supported any longer. if you use php4 you have to live with it.
b) this makes wordpress harder to debug.
c) this increases execution time.
d) this is bad practise and signals a location in the code that needs fixing.

thanks for your cooperation.

An additional thought I had: RSS is XML that is UTF-8 encoded. Why are certain characters placed in entities instead of their correct UTF-8 encoding? Isn't this the root of evil?

#8 @westi
16 years ago

  • Keywords needs-patch added; rss bug feed encoding damage has-patch removed

This needs an actual patch to progress rather than a zip file.

Setting back to needs-patch

@adferguson
16 years ago

Actual patch to change all html_entity_decode calls to respect blog charset

#9 follow-up: @adferguson
16 years ago

  • Keywords rss bug feed encoding damage has-patch added; needs-patch removed

#10 @ryan
16 years ago

  • Milestone changed from 2.7.2 to 2.8
  • Resolution set to fixed
  • Status changed from new to closed

#11 follow-ups: @kretzschmar
16 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

The title of the feed (in the rss-widget) is still corrupt (the rest seems to be fixed here).

Title should be "Löns-Realschule" but "Löns-Realschule" is displayed (http://loensschule.de/feed/).

#12 @ryan
16 years ago

  • Owner set to ryan
  • Status changed from reopened to new

#13 in reply to: ↑ 9 ; follow-up: @hakre
16 years ago

Replying to adferguson:

Your patch uses an option that is not always set. That option is optional. That turns your code into guesswork. This will at least create more - not less - problems.

Core-developer feedback is needed here. Is there at least some kind of documentation that takes a strict position on the encoding concept in WordPress? Or is it still the good'ol'legacy-as-hell guessing code fragments spread over the whole codebase?

#14 in reply to: ↑ 11 @adferguson
16 years ago

Replying to kretzschmar:

The title of the feed (in the rss-widget) is still corrupt (the rest seems to be fixed here).

Title should be "Löns-Realschule" but "Löns-Realschule" is displayed (http://loensschule.de/feed/).

If you look at the source of that feed, it has the ö character in it directly -- is that the correct behavior? If so, then the character is not being charset-decoded properly, but that is a separate issue from this bug, which is about decoding HTML entities.
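
The distinction between the two problems can be sketched in standalone PHP. The example below is an assumption about one common way such damage arises (UTF-8 bytes being treated as ISO-8859-1), not a diagnosis of the feed above, and it needs the mbstring extension.

```php
<?php
// "Löns" as raw UTF-8 bytes, with no HTML entities involved.
$title = "L\xC3\xB6ns-Realschule";

// Entity decoding has nothing to do here, so the bytes pass through unchanged.
$decoded = html_entity_decode( $title, ENT_QUOTES, 'UTF-8' );

// Treating the same UTF-8 bytes as ISO-8859-1 and converting them to UTF-8
// again double-encodes the "ö".
$double_encoded = mb_convert_encoding( $title, 'UTF-8', 'ISO-8859-1' );

// Print both results as hex so the byte difference is visible.
echo bin2hex( $decoded ), "\n";
echo bin2hex( $double_encoded ), "\n";
```

A title damaged this way would not be repaired by changing the charset argument of html_entity_decode, which matches the point that it is a separate issue.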

#15 in reply to: ↑ 13 @adferguson
16 years ago

Replying to hakre:

Replying to adferguson:

Your patch uses an option that is not always set. That option is optional. That turns your code into guesswork. This will at least create more - not less - problems.

I'm not sure I understand what you're talking about -- get_option('blog_charset') is used throughout the wordpress 2.7.1 source for situations like this.

#16 in reply to: ↑ 11 @Denis-de-Bernardy
16 years ago

  • Resolution set to fixed
  • Status changed from new to closed

Re-closing as fixed. Magpie is no longer used

#17 @Denis-de-Bernardy
16 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

This ticket causes #9616

#18 @Denis-de-Bernardy
16 years ago

the changeset (r10688) that fixed it, even...

#19 @Denis-de-Bernardy
16 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed