Opened 13 years ago
Last modified 4 years ago
#19998 new defect (bug)
Feeds can contain characters that are not valid XML
Reported by: | westi | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | 3.3.1 |
Component: | Feeds | Keywords: | has-patch needs-unit-tests |
Focuses: | Cc: |
Description
It is possible for any of the feeds to contain control characters which are not valid in XML e.g. http://www.fileformat.info/info/unicode/char/1c/index.htm
When outputting user supplied content in an XML context we should strip these control characters - they are unprintable and just break feed parsers.
http://en.wikipedia.org/wiki/Valid_characters_in_XML has a good list of what is and isn't valid.
I guess we need a strip_for_xml()
function or something.
Attachments (1)
Change History (18)
#2
@
12 years ago
Should we just remove the offending characters or replace them with a space or ?
How do we find out if we are in utf8? do we just skip this step if we are not? At least for now?
#3
@
12 years ago
- Cc luke.gedeon@… added
I found an answer for the second question in /wp-includes/formatting.php:
$is_utf8 = in_array( get_option( 'blog_charset' ), array( 'utf8', 'utf-8', 'UTF8', 'UTF-8' ) );
#4
@
12 years ago
- Keywords has-patch added; needs-patch removed
Patch 19998b.patch takes approach proposed by solarissmoke and uses function name proposed by Westi.
#7
@
10 years ago
See #3670 for some initial ideas about a broader esc_xml()
abstraction that would allow us to more easily clean-up problems such as this one.
Many characters are invalid within XML (see http://www.w3.org/TR/REC-xml/) and these need to be removed whether they occur directly or as part of (numeric) character references (e.g. an explicit £#x0b;
in the text).
Note that function wp_kses_normalize_entities()
already deals with invalid (numeric) character references by escaping them where necessary. Unfortunately, this function also allows a hard-wired list of HTML named entities, many of which are not valid in XML. Some simple re-factoring could allow this function to support both HTML and XML.
The situation is slightly complicated for XML fields that will subsequently be processed as HTML, as occurs in RSS and Atom. Here, if we encode as CDATA we could leave such invalid named entities alone for later processing.
I notice also we recently added some clean-up of directly-occurring control characters to wp_kses_no_null()
as part of #28506. This strips, rather than escapes the characters, however and does not deal with many of the other UTF-8 characters that are not valid in XML. That might be slightly harder to address, given there appears to be some confusion over exactly what character encoding is being used in all situations (is using blog_charset
sufficient?) as well as availability of PHP support for UTF-8.
See also #25872.
#8
follow-up:
↓ 9
@
10 years ago
@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem. The plugin introduces the methods strip_for_xml()
and esc_xml()
. It also includes a few patches from throughout the feeds component which can be toggled via filters. I also have a few test cases in there, but are by no means complete yet.
#9
in reply to:
↑ 8
@
9 years ago
Replying to stevenkword:
@mdgl I've created a project on GitHub (https://github.com/stevenkword/Finely-Tuned-Feeds) where I am taking a look a the bigger XML escaping problem.
Sorry for the delay but I've been busy for a while and have had only limited time to take a look at your plugin. It appears to be at the beginning of its development and I'm afraid I can't quite see where you are headed with your XML escaping abstraction.
It may be old-fashioned to do so but I think we first need to identify the use cases and requirements for an esc_xml()
function. For example, there appear to be four main situations where we may need to perform XML escaping:
- Source data is plain text and XML element supports just plain text.
- Source data is HTML but XML element supports just plain text.
- Source data is plain text but XML element supports embedded HTML.
- Source data is HTML and XML element supports embedded HTML.
I think cases (1), (3) and (4) might be a relatively straightforward matter of just escaping the XML special characters. Case (2) is more interesting as here we would first need to strip HTML tags and replace any HTML entities that are not also valid in XML before escaping the other XML special characters. Perhaps this is best left to two separate functions but that does make it harder to avoid double escaping. In each case, we also need to deal with any invalid character encodings (whether UTF-8 or otherwise). The choice of whether to escape using a CDATA block or just using basic escaping of the XML special characters should be configurable through a filter, perhaps defaulting on a field-by-field basis.
Anyway, I'm away for the next couple of weeks but will try to find time to take another look at this sometime next month.
This ticket was mentioned in Slack in #core by jorbin. View the logs.
9 years ago
#14
in reply to:
↑ 12
@
7 years ago
Replying to marco.marsala:
You can simply use PHP DomDocument->saveXml() to escape XML
Only if you have a DOM tree to output! WordPress doesn't (yet) work like that. XML output is just assembled with a series of "print" statements.
#15
@
5 years ago
Still a problem - even vertical tab character ( ) is enough to break anything with the post in its feed. Any chance this can be fixed in wp core soon?
#17
@
4 years ago
As mentioned in the commit message above, core now has an esc_xml()
function (and accompanying esc_xml
filter).
However, that function does not (yet) strip characters that are not legal in XML. As stated in the PR that added esc_xml()
to the sitemaps feature plugin:
it is an open question what to do when the text passed to these functions/filter contain characters which are not allowed in XML.
It would be easy to add functionality like 19998b.patch to esc_xml()
(once the regex is adjusted to include code points beyond the Basic Multilingual Plane, see below), if that is the consensus of what people want.
Stripping illegal characters may be appropriate for feeds (and sitemaps and oembeds), but may not be appropriate for exports (i.e., WXR) as there is a small (but not 0) probability that those illegal chars actually have meaning in the context of the plugin that wrote them to the DB (e.g., stored in post meta) and stripping them out might cause a problem when the content with them stripped out is imported into another site.
So, maybe esc_xml()
could take a 2nd parameter indicating what to do when there are illegal characters (that accepted values like strip
, empty_string
and wp_error
)? Default would be strip
; emtpy_string
would indicate that an empty string should be returned if there are any illegal chars; wp_error
would indicate that a WP_Error
should be returned...so that, for instance, the exporter could skip exporting something that contained illegal chars.
Instead of a 2nd parameter, another possibility would be have the strip
, empty_string
, or wp_error
behavior handled by hooking into the esc_xml
filter with various callbacks.
p.s. the regex to strip illegal chars should actually be as follows (and the replacement should be the empty string, and not a space character):
preg_replace(
'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
'',
$string
);
One approach could be to filter using the set of valid characters from the spec:
This assumes that the feed is served as UTF-8. I've no idea what it would do to XML in other charsets.