#7563 closed defect (bug) (fixed)
html_entity_decode at RSS Feed import doesn't respect charset of Blog
| Reported by: | | Owned by: | |
|---|---|---|---|
| Milestone: | 2.8 | Priority: | high |
| Severity: | critical | Version: | 2.5.1 |
| Component: | Feeds | Keywords: | rss bug feed encoding damage has-patch |
| Focuses: | | Cc: | |
Description
Error: If the dashboard RSS or the RSS widgets import feeds whose content contains German special characters written as HTML entities, such as &auml; (equal to ä), the character is displayed as �
Solution: Passing get_option('blog_charset') as the charset argument of html_entity_decode restores the correct display of feed content.
file: wp-admin/includes/dashboard.php
line: 431
$description = wp_specialchars( strip_tags(html_entity_decode($item['description'], ENT_QUOTES, get_option('blog_charset'))) );
file: wp-includes/widgets.php
lines: 1130 and 1132
if ( isset( $item['description'] ) && is_string( $item['description'] ) )
    $desc = $summary = str_replace(array("\n", "\r"), ' ', attribute_escape(strip_tags(html_entity_decode($item['description'], ENT_QUOTES, get_option('blog_charset')))));
elseif ( isset( $item['summary'] ) && is_string( $item['summary'] ) )
    $desc = $summary = str_replace(array("\n", "\r"), ' ', attribute_escape(strip_tags(html_entity_decode($item['summary'], ENT_QUOTES, get_option('blog_charset')))));
With the blog charset, the feed content is displayed correctly according to the blog's configuration!
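To illustrate why the charset argument matters, here is a minimal sketch outside of WordPress. The `$blog_charset` variable and the `decoded_hex()` helper are hypothetical; `$blog_charset` only stands in for the value of get_option('blog_charset') and is assumed to be UTF-8.

```php
<?php
// Assumption: stands in for get_option('blog_charset') on a UTF-8 blog.
$blog_charset = 'UTF-8';

// Hypothetical helper: decode entities and return the resulting bytes as hex,
// so the difference between the charsets is visible.
function decoded_hex($text, $charset) {
    return bin2hex(html_entity_decode($text, ENT_QUOTES, $charset));
}

// Decoded as ISO-8859-1 the entity becomes the single byte 0xE4,
// which is not a valid sequence on a UTF-8 page.
echo decoded_hex('&auml;', 'ISO-8859-1'), "\n"; // e4

// Decoded with the blog charset it becomes the two-byte UTF-8 sequence.
echo decoded_hex('&auml;', $blog_charset), "\n"; // c3a4
```

A lone 0xE4 byte on a UTF-8 page is what a browser typically renders as �.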
Attachments (2)
Change History (21)
#2
@
17 years ago
- Keywords rss bug feed encoding damage added
- Version set to 2.5.1
I have created a patch for the MagpieRSS class so that it handles imported feeds correctly.
The patch targets PHP4, which does not detect the feed's encoding (UTF-8 feeds are handled as ISO feeds), and also PHP5 (which does detect it), to ensure that ISO-based HTML entities are converted correctly into the UTF-8 target.
Here are two feeds that get damaged if added to the dashboard:
-> ISO-8859-1 feed
http://www.maerkischeallgemeine.de/cms/list/6947650?style_only=J&cms_encoding=iso
-> UTF-8 feed with ISO entities (like &auml;)
http://blog.wordpress-deutschland.org/feed
The patch has been tested on PHP4 and PHP5 with both example feeds, and both are now shown correctly. The database also no longer stores damaged option values (the original rss.php sometimes produced broken serialized data, depending on the feed content).
The input encoding is detected by applying a regular expression to the raw data, and the output encoding is set to the blog's charset, taken from the corresponding option value.
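The approach described above could look roughly like the following. This is only a sketch, not the code from the attached patch: the function names `detect_feed_encoding()` and `convert_feed()` are hypothetical, the regular expression only looks at the XML declaration, and the target charset is passed in as a plain string where WordPress would use get_option('blog_charset').

```php
<?php
// Hypothetical helper: read the encoding from the XML declaration of the raw feed.
// Falls back to UTF-8, the XML default when no encoding is declared.
function detect_feed_encoding($raw) {
    $pattern = '/^\s*<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']/i';
    if (preg_match($pattern, $raw, $m)) {
        return strtoupper($m[1]);
    }
    return 'UTF-8';
}

// Hypothetical helper: convert the raw feed data to the target charset.
// Note: the XML declaration itself is left untouched by this sketch.
function convert_feed($raw, $target_charset) {
    $source = detect_feed_encoding($raw);
    if ($source === strtoupper($target_charset)) {
        return $raw; // nothing to convert
    }
    $converted = iconv($source, $target_charset, $raw);
    // Keep the original data if the conversion fails.
    return $converted === false ? $raw : $converted;
}

// Example: an ISO-8859-1 feed fragment containing the byte 0xF6 (ö).
$feed = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?><title>L\xF6ns</title>";
echo detect_feed_encoding($feed), "\n";
echo bin2hex(convert_feed($feed, 'UTF-8')), "\n";
```

Detecting the encoding before parsing matters mostly for PHP4, where (as noted above) the parser does not detect it itself.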
#3
@
17 years ago
In the current trunk, html_entity_decode occurs in 4 files:
wp-admin/import/blogger.php (4 times)
wp-admin/includes/dashboard.php (1 time, covered above)
wp-includes/feed.php (1 time)
wp-includes/widgets.php (3 times, 2 covered above)
Can this bug please be fixed? The patch is *minimal* and without the fix, the W3C Validator refuses to validate any pages which have an RSS feed incorrectly decoded to the default ISO-8859-1.
thanks,
Andrew
#7
@
16 years ago
Please do not place an "@" in front of the function, because:
a) this is php4, php4 is not supported any longer. if you use php4 you have to live with it.
b) this makes wordpress harder to debug.
c) this increases execution time.
d) this is bad practise and signals a location in the code that needs fixing.
thanks for your cooperation.
An additional thought I had: RSS is XML that is UTF-8 encoded. Why are certain characters placed in entities instead of their correct UTF-8 encoding? Isn't this the root of the evil?
#8
@
16 years ago
- Keywords needs-patch added; rss bug feed encoding damage has-patch removed
This needs an actual patch to progress rather than a zip file.
Setting back to needs-patch
#9
follow-up:
↓ 13
@
16 years ago
- Keywords rss bug feed encoding damage has-patch added; needs-patch removed
#10
@
16 years ago
- Milestone changed from 2.7.2 to 2.8
- Resolution set to fixed
- Status changed from new to closed
#11
follow-ups:
↓ 14
↓ 16
@
16 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
The title of the feed (in the rss-widget) is still corrupt (the rest seems to be fixed here).
Title should be "Löns-Realschule" but "Löns-Realschule" is displayed (http://loensschule.de/feed/).
#13
in reply to:
↑ 9
;
follow-up:
↓ 15
@
16 years ago
Replying to adferguson:
Your patch uses an option that is not always set. That option is optional. That turns your code into guesswork. This will create more, not fewer, problems.
Core developer feedback is needed here. Is there at least some documentation that clearly specifies the encoding concept in WordPress? Or is it still the good ol' legacy-as-hell guessing code fragments spread over the whole codebase?
#14
in reply to:
↑ 11
@
16 years ago
Replying to kretzschmar:
The title of the feed (in the rss-widget) is still corrupt (the rest seems to be fixed here).
Title should be "Löns-Realschule" but "Löns-Realschule" is displayed (http://loensschule.de/feed/).
If you look at the source of that feed, it has the ö character in it directly -- is that the correct behavior? If so, then the character is not being charset-decoded properly, but that is a separate issue from this bug, which is about decoding HTML entities.
#15
in reply to:
↑ 13
@
16 years ago
Replying to hakre:
Replying to adferguson:
Your patch uses an option that is not always set. That option is optional. That turns your code into guesswork. This will create more, not fewer, problems.
I'm not sure I understand what you're talking about -- get_option('blog_charset') is used throughout the wordpress 2.7.1 source for situations like this.
#16
in reply to:
↑ 11
@
16 years ago
- Resolution set to fixed
- Status changed from new to closed
Re-closing as fixed. Magpie is no longer used
Additional note: because of the unsolved PHP4 bug http://bugs.php.net/bug.php?id=27626, it may be necessary to prepend @ to html_entity_decode to avoid warnings like this:
PHP5 works as expected but this PHP4 Bug seems to remain unsolved anyway, so it should be handled with warning suppression.