Opened 17 months ago
Last modified 4 months ago
#19998 new defect (bug)
Feeds can contain characters that are not valid XML
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | Feeds | Version: | 3.3.1 |
| Severity: | normal | Keywords: | has-patch |
| Cc: | westi, luke.gedeon@… |
Description
It is possible for any of the feeds to contain control characters which are not valid in XML, e.g. U+001C: http://www.fileformat.info/info/unicode/char/1c/index.htm
When outputting user-supplied content in an XML context we should strip these control characters: they are unprintable and just break feed parsers.
http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.
I guess we need a strip_for_xml() function or something.
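To illustrate the problem, here is a minimal sketch of how offending content could be detected, assuming the input is UTF-8. The function name `has_invalid_xml_chars()` is hypothetical and not part of WordPress; the character class follows the `Char` production of the XML 1.0 specification.

```php
<?php
/**
 * Hypothetical helper (not a WordPress API): report whether a UTF-8 string
 * contains any character outside the XML 1.0 Char production.
 */
function has_invalid_xml_chars( $utf8 ) {
	$result = preg_match(
		'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
		$utf8
	);

	// preg_match() returns false when the subject is not valid UTF-8.
	// Treat that as invalid too, since such bytes cannot be emitted safely.
	return 1 === $result || false === $result;
}

var_dump( has_invalid_xml_chars( "Hello\x1Cworld" ) ); // contains U+001C
var_dump( has_invalid_xml_chars( "Hello world" ) );    // plain ASCII text
```

The first call should report `true` and the second `false`; tab, line feed and carriage return are the only C0 control characters the production allows.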
Attachments (1)
Change History (5)
comment:1
solarissmoke
— 17 months ago
comment:2
lgedeon
— 4 months ago
Should we just remove the offending characters, or replace them with a space or something else?
How do we find out whether we are in UTF-8? Do we just skip this step if we are not, at least for now?
One approach could be to filter using the set of valid characters from the spec:
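
    function strip_for_xml( $utf8 ) {
        // Replace each run of characters outside the XML 1.0 Char production with one space.
        // The spec also allows U+10000 to U+10FFFF, so that range is kept as well.
        return preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', ' ', $utf8 );
    }

This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.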
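
A more defensive variant could look like the sketch below. It rests on several assumptions and is not the attached patch: the name `strip_for_xml_safe()` and the `$charset` parameter are hypothetical (in WordPress the value would presumably come from `get_option( 'blog_charset' )`), input in any other charset is returned untouched, and input that is not valid UTF-8 is also returned untouched rather than blanked. The character class follows the XML 1.0 `Char` production, including U+10000 to U+10FFFF.

```php
<?php
/**
 * Hypothetical sketch (not a WordPress API): replace characters that are
 * not valid in XML 1.0 with a space, but only when the text is UTF-8.
 */
function strip_for_xml_safe( $text, $charset = 'UTF-8' ) {
	// The /u modifier only understands UTF-8, so skip other charsets for now.
	if ( ! in_array( strtoupper( $charset ), array( 'UTF-8', 'UTF8' ), true ) ) {
		return $text;
	}

	$clean = preg_replace(
		'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
		' ',
		$text
	);

	// preg_replace() returns null when $text is not valid UTF-8.
	// Falling back to the input is an assumption; emptying it is another option.
	return null === $clean ? $text : $clean;
}

echo strip_for_xml_safe( "Title\x1Chere" ), "\n"; // prints "Title here"
```

Falling back to the original text on malformed UTF-8 keeps the feed from silently losing content, at the cost of possibly still emitting bad bytes; whether that trade-off is right is an open question.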