Make WordPress Core

Opened 11 years ago

Last modified 6 years ago

#25872 new defect (bug)

WXR export tool generates XML which is not well-formed

Reported by: tomdxw
Owned by:
Milestone:
Priority: normal
Severity: normal
Version: 3.7.1
Component: Export
Keywords:
Focuses:
Cc:

Description

  1. Paste a form feed character (aka \f or U+000C) into a post
  2. Tools > Export > Download Export File
  3. Validate the exported file (e.g. xmlstarlet validate --well-formed ~/Downloads/test.wordpress.2013-11-07.xml)

The resulting file is not well-formed XML because WordPress does not strip characters that the XML specification disallows (http://www.w3.org/TR/REC-xml/#charsets).
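
The failure can be reproduced without running a full export. The following Python sketch is illustrative only and is not WordPress code: it feeds a CDATA section containing U+000C to the standard library's expat-based parser. The helper name is_well_formed and the two sample strings are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical stand-ins for one exported item, with and without a form feed.
clean = "<item><content><![CDATA[Hello world]]></content></item>"
dirty = "<item><content><![CDATA[Hello\x0cworld]]></content></item>"

print(is_well_formed(clean))
print(is_well_formed(dirty))
```

Wrapping the text in CDATA does not help here: the character restriction in the XML specification applies inside CDATA sections as well.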

Change History (4)

#1 @tomdxw
11 years ago

  • Cc tom@… added

#2 @GaryJ
10 years ago

How would you propose that invalid characters are stripped / converted?

#3 @tomdxw
10 years ago

I'd just iterate through the codepoints in wxr_cdata() and replace disallowed codepoints with U+FFFD (the replacement character). I'm not sure of the best way to iterate through codepoints in PHP - but UTF-8 parsers aren't hard to write if there isn't already a function that does it.
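
One way to sketch the replacement described above, in Python rather than PHP and purely as an illustration: a character class derived from the XML 1.0 Char production matches every disallowed code point, and each match is replaced with U+FFFD. The names XML10_DISALLOWED and replace_disallowed are hypothetical; nothing here is taken from wxr_cdata() itself.

```python
import re

# XML 1.0 allows: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
# The negated class below therefore matches every code point the spec disallows.
XML10_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_disallowed(text: str) -> str:
    """Replace each code point not allowed in XML 1.0 with U+FFFD."""
    return XML10_DISALLOWED.sub("\uFFFD", text)


# A form feed is replaced; tab and newline are left alone.
sample = "page one\x0cpage two\tend\n"
print(repr(replace_disallowed(sample)))
```

A PHP version could probably take the same approach with preg_replace() and the /u modifier, which would avoid hand-writing a UTF-8 parser, though that has not been tested here.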

#4 @mdgl
10 years ago

Related #19998.
