WordPress.org

Make WordPress Core

Opened 21 months ago

Last modified 5 months ago

#25872 new defect (bug)

WXR export tool generates XML which is not well-formed

Reported by: tomdxw
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version: 3.7.1
Component: Export
Keywords:
Focuses:
Cc:

Description

  1. Paste a form feed character (aka \f or U+000C) into a post
  2. Tools > Export > Download Export File
  3. Validate the exported file (e.g. xmlstarlet validate --well-formed ~/Downloads/test.wordpress.2013-11-07.xml)

The resulting file is not well-formed XML because WordPress does not strip characters that the XML specification disallows (http://www.w3.org/TR/REC-xml/#charsets).
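
The failure can also be reproduced without a full export. The following is a minimal sketch in Python, not WordPress code: the helper name is_well_formed and the sample element name are assumptions made for illustration, and the standard-library parser stands in for xmlstarlet.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical fragments: identical except for a form feed (U+000C)
# inside the CDATA section of the second one.
clean = "<content><![CDATA[page one page two]]></content>"
dirty = "<content><![CDATA[page one\x0cpage two]]></content>"

print(is_well_formed(clean))
print(is_well_formed(dirty))
```

Wrapping the text in CDATA does not help here, because a CDATA section may still only contain characters that the specification allows.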

Change History (4)

comment:1 @tomdxw 21 months ago

  • Cc tom@… added

comment:2 @GaryJ 11 months ago

How would you propose that invalid characters are stripped / converted?

comment:3 @tomdxw 10 months ago

I'd just iterate through the codepoints in wxr_cdata() and replace disallowed codepoints with U+FFFD (the replacement character). I'm not sure of the best way to iterate through codepoints in PHP, but UTF-8 parsers aren't hard to write if there isn't already a function that does it.
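
A rough sketch of that replacement, written in Python purely for illustration (it is not a patch for wxr_cdata(), and the name replace_disallowed is hypothetical). It assumes the input is already decoded text and uses a regular expression instead of a hand-written codepoint loop. The character class is the complement of the Char production in the XML 1.0 specification.

```python
import re

# XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below matches every codepoint outside that set.
_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_disallowed(text: str) -> str:
    """Replace each codepoint XML 1.0 disallows with U+FFFD."""
    return _DISALLOWED.sub("\uFFFD", text)


# A form feed is replaced; tab, newline and ordinary text pass through.
print(repr(replace_disallowed("page one\x0cpage two")))
print(repr(replace_disallowed("tab\tnewline\n")))
```

In PHP, something similar could presumably be done by passing an equivalent pattern to preg_replace() with the u modifier, though that has not been tested against the exporter.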

comment:4 @mdgl 5 months ago

Related #19998.
