Make WordPress Core

Opened 3 years ago

Last modified 5 days ago

#19998 new defect (bug)

Feeds can contain characters that are not valid XML

Reported by: westi
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version: 3.3.1
Component: Feeds
Keywords: has-patch
Focuses:
Cc:


It is possible for any of the feeds to contain control characters which are not valid in XML e.g. http://www.fileformat.info/info/unicode/char/1c/index.htm

When outputting user-supplied content in an XML context, we should strip these control characters: they are unprintable and just break feed parsers.

http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.

I guess we need a strip_for_xml() function or something.

Attachments (1)

19998b.patch (1.2 KB) - added by lgedeon 2 years ago.


Change History (8)

comment:1 @solarissmoke 3 years ago

One approach could be to filter using the set of valid characters from the spec:

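function strip_for_xml( $utf8 ) {
  // Replace each run of characters outside the XML 1.0 Char production (including U+10000 to U+10FFFF as valid) with one space.
  return preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', ' ', $utf8 );
}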

This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.

Last edited 3 years ago by solarissmoke

comment:2 @lgedeon 2 years ago

Should we just remove the offending characters, or replace them with a space or something else?

How do we find out if we are in UTF-8? Do we just skip this step if we are not, at least for now?

comment:3 @lgedeon 2 years ago

  • Cc luke.gedeon@… added

I found an answer for the second question in /wp-includes/formatting.php:

$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );
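
Putting that check together with the regex from comment:1, a minimal sketch might look like the following. The function name strip_for_xml_if_utf8() and the $charset parameter are hypothetical (the parameter stands in for get_option( 'blog_charset' ) so the snippet runs outside WordPress), and skipping non-UTF-8 content is only the "at least for now" assumption from comment:2, not a decided behaviour.

```php
<?php
// Hypothetical helper, not part of WordPress: replaces characters that are
// invalid in XML 1.0 with a space, but only when the charset is a UTF-8 variant.
// $charset stands in for get_option( 'blog_charset' ).
function strip_for_xml_if_utf8( $text, $charset ) {
	$is_utf8 = in_array( $charset, array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ), true );
	if ( ! $is_utf8 ) {
		// Assumption: leave non-UTF-8 content untouched for now.
		return $text;
	}
	$stripped = preg_replace(
		'/[^\x{0009}\x{000A}\x{000D}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
		' ',
		$text
	);
	// preg_replace() returns null when the input is not valid UTF-8;
	// fall back to the original text in that case.
	return null === $stripped ? $text : $stripped;
}
```

For example, strip_for_xml_if_utf8( "a\x1Cb", 'UTF-8' ) replaces the U+001C control character with a space, while the same input with a charset of 'ISO-8859-1' is returned unchanged.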

@lgedeon 2 years ago

comment:4 @lgedeon 2 years ago

  • Keywords has-patch added; needs-patch removed

Patch 19998b.patch takes the approach proposed by solarissmoke and uses the function name proposed by westi.

comment:5 @SergeyBiryukov 20 months ago

#24701 was marked as a duplicate.

comment:7 @mdgl 5 days ago

See #3670 for some initial ideas about a broader esc_xml() abstraction that would allow us to more easily clean up problems such as this one.

Many characters are invalid within XML (see http://www.w3.org/TR/REC-xml/) and these need to be removed whether they occur directly or as part of (numeric) character references (e.g. an explicit &#x0b; in the text).
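
As a rough illustration of the numeric-reference case, the sketch below removes decimal and hexadecimal character references whose code point falls outside the XML 1.0 Char production. Both function names are hypothetical and nothing here is taken from the attached patch. It only looks at references written as &#NNN; or &#xHHH; and leaves everything else, including named entities, untouched.

```php
<?php
// Hypothetical helper: true if $cp is allowed by the XML 1.0 Char production.
function is_valid_xml_codepoint( $cp ) {
	return 0x9 === $cp || 0xA === $cp || 0xD === $cp
		|| ( $cp >= 0x20 && $cp <= 0xD7FF )
		|| ( $cp >= 0xE000 && $cp <= 0xFFFD )
		|| ( $cp >= 0x10000 && $cp <= 0x10FFFF );
}

// Hypothetical helper: removes numeric character references that point at
// code points which are not valid in XML; valid references are kept as-is.
function strip_invalid_xml_char_refs( $text ) {
	return preg_replace_callback(
		'/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/',
		function ( $m ) {
			// Group 1 is the hex form; it is an empty string when the decimal form matched.
			$cp = ( '' !== $m[1] ) ? hexdec( $m[1] ) : intval( $m[2], 10 );
			return is_valid_xml_codepoint( $cp ) ? $m[0] : '';
		},
		$text
	);
}
```

With this sketch, strip_invalid_xml_char_refs( 'a&#x0b;b' ) drops the vertical-tab reference, whereas a reference such as &#xA3; passes through unchanged.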

Note that function wp_kses_normalize_entities() already deals with invalid (numeric) character references by escaping them where necessary. Unfortunately, this function also allows a hard-wired list of HTML named entities, many of which are not valid in XML. Some simple re-factoring could allow this function to support both HTML and XML.

The situation is slightly complicated for XML fields that will subsequently be processed as HTML, as occurs in RSS and Atom. Here, if we encode as CDATA we could leave such invalid named entities alone for later processing.
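
A minimal sketch of the CDATA idea, assuming a hypothetical helper named wrap_in_cdata(): the content is wrapped as-is, so HTML named entities such as &nbsp; survive for the later HTML pass, and any literal "]]>" is split so it cannot close the section early. Note that CDATA does not make invalid control characters legal, so those would still need stripping first.

```php
<?php
// Hypothetical helper: wraps $text in a CDATA section. A literal "]]>" inside
// the text is split across two adjacent CDATA sections so the first one is
// not terminated prematurely.
function wrap_in_cdata( $text ) {
	return '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $text ) . ']]>';
}
```

For instance, wrap_in_cdata( 'a &nbsp; b' ) yields '<![CDATA[a &nbsp; b]]>', leaving the named entity for the consumer's HTML parser.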

I notice also that we recently added some clean-up of directly-occurring control characters to wp_kses_no_null() as part of #28506. However, this strips rather than escapes the characters, and it does not deal with many of the other UTF-8 characters that are not valid in XML. That might be slightly harder to address, given there appears to be some confusion over exactly what character encoding is being used in all situations (is using blog_charset sufficient?), as well as over the availability of PHP support for UTF-8.

See also #25872.

Last edited 5 days ago by mdgl