Opened 16 months ago
Last modified 3 months ago
#19998 new defect (bug)
Feeds can contain characters that are not valid XML
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | Feeds | Version: | 3.3.1 |
| Severity: | normal | Keywords: | has-patch |
| Cc: | westi, luke.gedeon@… |
Description
It is possible for any of the feeds to contain control characters that are not valid in XML, e.g. http://www.fileformat.info/info/unicode/char/1c/index.htm
When outputting user-supplied content in an XML context we should strip these control characters: they are unprintable and just break feed parsers.
http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.
I guess we need a strip_for_xml() function or something.
Attachments (1)
Change History (5)
comment:1
solarissmoke — 16 months ago
Should we just remove the offending characters or replace them with a space or ?
How do we find out if we are in UTF-8? Do we just skip this step if we are not, at least for now?
- Cc luke.gedeon@… added
I found an answer for the second question in /wp-includes/formatting.php:
$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );
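
To make the "skip this step if we are not UTF-8" idea concrete, here is a minimal sketch. The names `charset_is_utf8()` and `maybe_filter_for_xml()` are hypothetical and not part of WordPress. The charset is passed in as a parameter (in WordPress it would presumably come from `get_option( 'blog_charset' )`) so that the snippet runs on its own, and the label is lowercased before comparison so that mixed-case variants such as `Utf-8` are also accepted, which the quoted check would miss.

```php
<?php
// Hypothetical helper: case-insensitive version of the charset check quoted above.
function charset_is_utf8( $charset ) {
	return in_array( strtolower( trim( $charset ) ), array( 'utf8', 'utf-8' ), true );
}

// Hypothetical gate: apply $filter only when the charset is UTF-8,
// otherwise return the text untouched (i.e. skip the step).
function maybe_filter_for_xml( $text, $charset, $filter ) {
	if ( ! charset_is_utf8( $charset ) ) {
		return $text;
	}
	return call_user_func( $filter, $text );
}
```

Any callable can be passed as `$filter`, so the gate stays independent of whichever stripping function is eventually chosen.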

One approach could be to filter using the set of valid characters from the spec:
function strip_for_xml( $utf8 ) {
	return preg_replace( '/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', ' ', $utf8 );
}

This assumes that the feed is served as UTF-8. I have no idea what it would do to XML in other charsets.
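
One detail worth sketching: with the `/u` modifier, `preg_replace()` returns `null` when the subject is not valid UTF-8, so a naive filter could blank out an entire feed item. The variant below is only an illustration under that assumption. The name `strip_for_xml_safe()` and the `$replacement` parameter are hypothetical, and falling back to the original string on error is one possible choice rather than a settled one. The character class follows the XML 1.0 `Char` production, including the supplementary planes.

```php
<?php
// Hypothetical variant of the filter: replaces each run of characters that
// are not allowed in XML 1.0 with $replacement (pass '' to remove them).
function strip_for_xml_safe( $text, $replacement = ' ' ) {
	$pattern = '/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
	$clean   = preg_replace( $pattern, $replacement, $text );

	// preg_replace() returns null on error, e.g. an invalid UTF-8 subject.
	// Assumption: returning the input unchanged is preferable to returning null.
	return null === $clean ? $text : $clean;
}

// U+001C (the file separator from the ticket description) becomes a space.
echo strip_for_xml_safe( "Title\x1Cwith separator" ), "\n"; // prints "Title with separator"
```

Because the pattern ends in `+`, a run of several invalid characters collapses into a single replacement, which answers the "remove or replace with a space" question in a way that avoids runs of spaces.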