Opened 17 months ago

Last modified 4 months ago

#19998 new defect (bug)

Feeds can contain characters that are not valid XML

Reported by: westi
Owned by:
Priority: normal
Milestone: Awaiting Review
Component: Feeds
Version: 3.3.1
Severity: normal
Keywords: has-patch
Cc: westi, luke.gedeon@…

Description

Any of the feeds can contain control characters that are not valid in XML, e.g. U+001C: http://www.fileformat.info/info/unicode/char/1c/index.htm

When outputting user-supplied content in an XML context we should strip these control characters. They are unprintable and just break feed parsers.

http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.

I guess we need a strip_for_xml() function or something.
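
To make the valid set concrete, here is a minimal sketch of a check for offending characters. It assumes the input is UTF-8 and uses the character ranges from the XML 1.0 Char production; the function name has_invalid_xml_chars() is hypothetical and not part of WordPress.

```php
<?php
// Hypothetical helper (not part of WordPress): reports whether a UTF-8 string
// contains any code point outside the XML 1.0 Char production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function has_invalid_xml_chars( $utf8 ) {
	$valid = '\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}';

	// preg_match() returns 1 on a match, 0 on no match, and false on error
	// (for example when $utf8 is not valid UTF-8). Only a real match counts here.
	return 1 === preg_match( '/[^' . $valid . ']/u', $utf8 );
}
```

For example, has_invalid_xml_chars( "Hello\x1Cworld" ) reports the U+001C character, while a string containing only tabs, newlines and printable text passes. Input that is not valid UTF-8 makes preg_match() fail, so this sketch returns false for it.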

Attachments (1)

19998b.patch (1.2 KB) - added by lgedeon 4 months ago.

Change History (5)

comment:1 solarissmoke 17 months ago

One approach could be to filter using the set of valid characters from the spec:

function strip_for_xml( $utf8 ) {
  return preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $utf8 );
}

This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.
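
Two details of the snippet above may be worth sketching. The XML 1.0 Char production also allows U+10000 to U+10FFFF, which the character class above would strip, and preg_replace() returns null when the subject is not valid UTF-8. The variant below is only an illustration under those observations: the name strip_for_xml_sketch(), the $replacement parameter, and the choice to return the input unchanged on failure are all assumptions, not what the attached patch does.

```php
<?php
// Sketch only: a variant of the strip_for_xml() idea that also keeps
// U+10000-U+10FFFF, which XML 1.0 allows. Name and parameters are hypothetical.
function strip_for_xml_sketch( $utf8, $replacement = ' ' ) {
	$valid    = '\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}';
	$stripped = preg_replace( '/[^' . $valid . ']+/u', $replacement, $utf8 );

	// preg_replace() returns null when the subject is not valid UTF-8.
	// Returning the input unchanged in that case is an assumption of this sketch.
	return null === $stripped ? $utf8 : $stripped;
}
```

Passing an empty string as $replacement removes the offending characters instead of replacing each run with a space.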

comment:2 lgedeon 4 months ago

Should we just remove the offending characters, replace them with a space, or do something else?

How do we find out whether we are in UTF-8? Do we just skip this step if we are not, at least for now?

comment:3 lgedeon 4 months ago

  • Cc luke.gedeon@… added

I found an answer for the second question in /wp-includes/formatting.php:

$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );
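
Putting the charset check and the regex from comment:1 together might look like the sketch below. To stay self-contained it takes the charset as a parameter; in WordPress that value would come from get_option( 'blog_charset' ). The helper names is_utf8_charset() and maybe_strip_for_xml() are hypothetical, and the case-insensitive comparison is an assumption rather than what formatting.php does.

```php
<?php
// Hypothetical helper: true when the charset label means UTF-8.
// Lowercasing first covers 'utf8', 'utf-8', 'UTF8', 'UTF-8' and mixed case.
function is_utf8_charset( $charset ) {
	return in_array( strtolower( trim( $charset ) ), array( 'utf8', 'utf-8' ), true );
}

// Hypothetical wrapper: applies the UTF-8 regex from comment:1 only when the
// feed charset is UTF-8, and skips the step otherwise.
function maybe_strip_for_xml( $text, $charset ) {
	if ( ! is_utf8_charset( $charset ) ) {
		return $text; // Non-UTF-8 feeds are left untouched.
	}

	$cleaned = preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $text );

	// preg_replace() returns null if $text is not valid UTF-8; fall back to the input.
	return null === $cleaned ? $text : $cleaned;
}
```

Skipping non-UTF-8 charsets is the conservative choice here, since the /u pattern would fail or misinterpret bytes in other encodings.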

Attachment 19998b.patch added by lgedeon 4 months ago

comment:4 lgedeon 4 months ago

  • Keywords has-patch added; needs-patch removed

Patch 19998b.patch takes the approach proposed by solarissmoke and uses the function name proposed by westi.