Make WordPress Core

Opened 12 years ago

Last modified 4 years ago

#19998 new defect (bug)

Feeds can contain characters that are not valid XML

Reported by: westi's profile westi Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 3.3.1
Component: Feeds Keywords: has-patch needs-unit-tests
Focuses: Cc:

Description

It is possible for any of the feeds to contain control characters which are not valid in XML e.g. http://www.fileformat.info/info/unicode/char/1c/index.htm

When outputting user supplied content in an XML context we should strip these control characters - they are unprintable and just break feed parsers.

http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.

I guess we need a strip_for_xml() function or something.

Attachments (1)

19998b.patch (1.2 KB) - added by lgedeon 11 years ago.

Download all attachments as: .zip

Change History (18)

#1 @solarissmoke
12 years ago

One approach could be to filter using the set of valid characters from the spec:

function strip_for_xml( $utf8 ) {
  return preg_replace( '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $utf8 );
}

This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.

Last edited 12 years ago by solarissmoke (previous) (diff)

#2 @lgedeon
11 years ago

Should we just remove the offending characters or replace them with a space or ?

How do we find out if we are in utf8? do we just skip this step if we are not? At least for now?

#3 @lgedeon
11 years ago

  • Cc luke.gedeon@… added

I found an answer for the second question in /wp-includes/formatting.php:

$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );

@lgedeon
11 years ago

#4 @lgedeon
11 years ago

  • Keywords has-patch added; needs-patch removed

Patch 19998b.patch takes approach proposed by solarissmoke and uses function name proposed by Westi.

#5 @SergeyBiryukov
11 years ago

#24701 was marked as a duplicate.

#7 @mdgl
9 years ago

See #3670 for some initial ideas about a broader esc_xml() abstraction that would allow us to more easily clean-up problems such as this one.

Many characters are invalid within XML (see http://www.w3.org/TR/REC-xml/) and these need to be removed whether they occur directly or as part of (numeric) character references (e.g. an explicit £#x0b; in the text).

Note that function wp_kses_normalize_entities() already deals with invalid (numeric) character references by escaping them where necessary. Unfortunately, this function also allows a hard-wired list of HTML named entities, many of which are not valid in XML. Some simple re-factoring could allow this function to support both HTML and XML.

The situation is slightly complicated for XML fields that will subsequently be processed as HTML, as occurs in RSS and Atom. Here, if we encode as CDATA we could leave such invalid named entities alone for later processing.

I notice also we recently added some clean-up of directly-occurring control characters to wp_kses_no_null() as part of #28506. This strips, rather than escapes the characters, however and does not deal with many of the other UTF-8 characters that are not valid in XML. That might be slightly harder to address, given there appears to be some confusion over exactly what character encoding is being used in all situations (is using blog_charset sufficient?) as well as availability of PHP support for UTF-8.

See also #25872.

Last edited 9 years ago by mdgl (previous) (diff)

#8 follow-up: @stevenkword
9 years ago

@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem. The plugin introduces the methods strip_for_xml() and esc_xml(). It also includes a few patches from throughout the feeds component which can be toggled via filters. I also have a few test cases in there, but are by no means complete yet.

#9 in reply to: ↑ 8 @mdgl
9 years ago

Replying to stevenkword:

@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem.

Sorry for the delay but I've been busy for a while and have had only limited time to take a look at your plugin. It appears to be at the beginning of its development and I'm afraid I can't quite see where you are headed with your XML escaping abstraction.

It may be old-fashioned to do so but I think we first need to identify the use cases and requirements for an esc_xml() function. For example, there appear to be four main situations where we may need to perform XML escaping:

  1. Source data is plain text and XML element supports just plain text.
  2. Source data is HTML but XML element supports just plain text.
  3. Source data is plain text but XML element supports embedded HTML.
  4. Source data is HTML and XML element supports embedded HTML.

I think cases (1), (3) and (4) might be a relatively straightforward matter of just escaping the XML special characters. Case (2) is more interesting as here we would first need to strip HTML tags and replace any HTML entities that are not also valid in XML before escaping the other XML special characters. Perhaps this is best left to two separate functions but that does make it harder to avoid double escaping. In each case, we also need to deal with any invalid character encodings (whether UTF-8 or otherwise). The choice of whether to escape using a CDATA block or just using basic escaping of the XML special characters should be configurable through a filter, perhaps defaulting on a field-by-field basis.

Anyway, I'm away for the next couple of weeks but will try to find time to take another look at this sometime next month.

This ticket was mentioned in Slack in #core by jorbin. View the logs.


9 years ago

#11 @stevenkword
8 years ago

  • Keywords needs-unit-tests added

#12 follow-up: @marco.marsala
7 years ago

You can simply use PHP DomDocument->saveXml() to escape XML

#13 @marco.marsala
7 years ago

#41212 was marked as a duplicate.

#14 in reply to: ↑ 12 @mdgl
7 years ago

Replying to marco.marsala:

You can simply use PHP DomDocument->saveXml() to escape XML

Only if you have a DOM tree to output! WordPress doesn't (yet) work like that. XML output is just assembled with a series of "print" statements.

#15 @programmin
5 years ago

Still a problem - even vertical tab character ( ) is enough to break anything with the post in its feed. Any chance this can be fixed in wp core soon?

#16 @swissspidy
4 years ago

In 48072:

Sitemaps: Add XML sitemaps functionality to WordPress.

While web crawlers are able to discover pages from links within the site and from other sites, XML sitemaps supplement this approach by allowing crawlers to quickly and comprehensively identify all URLs included in the sitemap and learn other signals about those URLs using the associated metadata.

See https://make.wordpress.org/core/2020/06/10/merge-announcement-extensible-core-sitemaps/ for more details.

This feature exposes the sitemap index via /wp-sitemap.xml and exposes a variety of new filters and hooks for developers to modify the behavior. Users can disable sitemaps completely by turning off search engine visibility in WordPress admin.

This change also introduces a new esc_xml() function to escape strings for output in XML, as well as XML support to wp_kses_normalize_entities().

Props Adrian McShane, afragen, adamsilverstein, casiepa, flixos90, garrett-eclipse, joemcgill, kburgoine, kraftbj, milana_cap, pacifika, pbiron, pfefferle, Ruxandra Gradina, swissspidy, szepeviktor, tangrufus, tweetythierry.
Fixes #50117.
See #3670. See #19998.

#17 @pbiron
4 years ago

As mentioned in the commit message above, core now has an esc_xml() function (and accompanying esc_xml filter).

However, that function does not (yet) strip characters that are not legal in XML. As stated in the PR that added esc_xml() to the sitemaps feature plugin:

it is an open question what to do when the text passed to these functions/filter contain characters which are not allowed in XML.

It would be easy to add functionality like 19998b.patch to esc_xml() (once the regex is adjusted to include code points beyond the Basic Multilingual Plane, see below), if that is the consensus of what people want.

Stripping illegal characters may be appropriate for feeds (and sitemaps and oembeds), but may not be appropriate for exports (i.e., WXR) as there is a small (but not 0) probability that those illegal chars actually have meaning in the context of the plugin that wrote them to the DB (e.g., stored in post meta) and stripping them out might cause a problem when the content with them stripped out is imported into another site.

So, maybe esc_xml() could take a 2nd parameter indicating what to do when there are illegal characters (that accepted values like strip, empty_string and wp_error)? Default would be strip; emtpy_string would indicate that an empty string should be returned if there are any illegal chars; wp_error would indicate that a WP_Error should be returned...so that, for instance, the exporter could skip exporting something that contained illegal chars.

Instead of a 2nd parameter, another possibility would be have the strip, empty_string, or wp_error behavior handled by hooking into the esc_xml filter with various callbacks.

p.s. the regex to strip illegal chars should actually be as follows (and the replacement should be the empty string, and not a space character):

preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $string
);
Note: See TracTickets for help on using tickets.