Make WordPress Core

Opened 4 years ago

Last modified 7 months ago

#19998 new defect (bug)

Feeds can contain characters that are not valid XML

Reported by: westi Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 3.3.1
Component: Feeds Keywords: has-patch
Focuses: Cc:


It is possible for any of the feeds to contain control characters which are not valid in XML e.g. http://www.fileformat.info/info/unicode/char/1c/index.htm

When outputting user supplied content in an XML context we should strip these control characters - they are unprintable and just break feed parsers.

http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.

I guess we need a strip_for_xml() function or something.

Attachments (1)

19998b.patch (1.2 KB) - added by lgedeon 3 years ago.

Download all attachments as: .zip

Change History (11)

#1 @solarissmoke
4 years ago

One approach could be to filter usgin the set of valid characters from the spec:

function strip_for_xml( $utf8 ) {
  return preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $utf8 );

This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.

Version 0, edited 4 years ago by solarissmoke (next)

#2 @lgedeon
3 years ago

Should we just remove the offending characters or replace them with a space or ?

How do we find out if we are in utf8? do we just skip this step if we are not? At least for now?

#3 @lgedeon
3 years ago

  • Cc luke.gedeon@… added

I found an answer for the second question in /wp-includes/formatting.php:

$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );

3 years ago

#4 @lgedeon
3 years ago

  • Keywords has-patch added; needs-patch removed

Patch 19998b.patch takes approach proposed by solarissmoke and uses function name proposed by Westi.

#5 @SergeyBiryukov
2 years ago

#24701 was marked as a duplicate.

#7 @mdgl
9 months ago

See #3670 for some initial ideas about a broader esc_xml() abstraction that would allow us to more easily clean-up problems such as this one.

Many characters are invalid within XML (see http://www.w3.org/TR/REC-xml/) and these need to be removed whether they occur directly or as part of (numeric) character references (e.g. an explicit £#x0b; in the text).

Note that function wp_kses_normalize_entities() already deals with invalid (numeric) character references by escaping them where necessary. Unfortunately, this function also allows a hard-wired list of HTML named entities, many of which are not valid in XML. Some simple re-factoring could allow this function to support both HTML and XML.

The situation is slightly complicated for XML fields that will subsequently be processed as HTML, as occurs in RSS and Atom. Here, if we encode as CDATA we could leave such invalid named entities alone for later processing.

I notice also we recently added some clean-up of directly-occurring control characters to wp_kses_no_null() as part of #28506. This strips, rather than escapes the characters, however and does not deal with many of the other UTF-8 characters that are not valid in XML. That might be slightly harder to address, given there appears to be some confusion over exactly what character encoding is being used in all situations (is using blog_charset sufficient?) as well as availability of PHP support for UTF-8.

See also #25872.

Last edited 9 months ago by mdgl (previous) (diff)

#8 follow-up: @stevenkword
9 months ago

@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem. The plugin introduces the methods strip_for_xml() and esc_xml(). It also includes a few patches from throughout the feeds component which can be toggled via filters. I also have a few test cases in there, but are by no means complete yet.

#9 in reply to: ↑ 8 @mdgl
8 months ago

Replying to stevenkword:

@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem.

Sorry for the delay but I've been busy for a while and have had only limited time to take a look at your plugin. It appears to be at the beginning of its development and I'm afraid I can't quite see where you are headed with your XML escaping abstraction.

It may be old-fashioned to do so but I think we first need to identify the use cases and requirements for an esc_xml() function. For example, there appear to be four main situations where we may need to perform XML escaping:

  1. Source data is plain text and XML element supports just plain text.
  2. Source data is HTML but XML element supports just plain text.
  3. Source data is plain text but XML element supports embedded HTML.
  4. Source data is HTML and XML element supports embedded HTML.

I think cases (1), (3) and (4) might be a relatively straightforward matter of just escaping the XML special characters. Case (2) is more interesting as here we would first need to strip HTML tags and replace any HTML entities that are not also valid in XML before escaping the other XML special characters. Perhaps this is best left to two separate functions but that does make it harder to avoid double escaping. In each case, we also need to deal with any invalid character encodings (whether UTF-8 or otherwise). The choice of whether to escape using a CDATA block or just using basic escaping of the XML special characters should be configurable through a filter, perhaps defaulting on a field-by-field basis.

Anyway, I'm away for the next couple of weeks but will try to find time to take another look at this sometime next month.

This ticket was mentioned in Slack in #core by jorbin. View the logs.

7 months ago

Note: See TracTickets for help on using tickets.