Opened 11 years ago
Last modified 6 years ago
#25872 new defect (bug)
WXR export tool generates XML which is not well-formed
Reported by: | tomdxw | Owned by: | |
---|---|---|---|
Milestone: | | Priority: | normal |
Severity: | normal | Version: | 3.7.1 |
Component: | Export | Keywords: | |
Focuses: | | Cc: | |
Description
- Paste a form feed character (aka \f or U+000C) into a post
- Tools > Export > Download Export File
- Validate the exported file (e.g. xmlstarlet validate --well-formed ~/Downloads/test.wordpress.2013-11-07.xml)
The resulting file is not well-formed XML because WordPress does not strip or replace characters that the XML specification disallows ( http://www.w3.org/TR/REC-xml/#charsets ).
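To make the failure concrete, here is a minimal sketch in Python (not WordPress code) that checks a string against the `Char` production from the XML 1.0 specification linked above and confirms that a standard parser rejects a form feed even inside a CDATA section. The names `is_xml_char`, `find_disallowed` and `is_well_formed` are hypothetical helpers made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp):
    """Return True if code point cp matches the XML 1.0 `Char` production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def find_disallowed(text):
    """List (index, code point label) for every character XML 1.0 forbids."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if not is_xml_char(ord(ch))
    ]


def is_well_formed(xml_text):
    """Return True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical post content containing a form feed (U+000C).
post_body = "before\fafter"
print(find_disallowed(post_body))
print(is_well_formed("<content><![CDATA[" + post_body + "]]></content>"))
```

Wrapping the content in CDATA does not help here: CDATA only changes how markup characters are interpreted, so the characters inside it still have to be legal XML characters.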
Change History (4)
#3 (10 years ago)
I'd just iterate through the code points in wxr_cdata() and replace disallowed code points with U+FFFD (the replacement character). I'm not sure of the best way to iterate through code points in PHP, but UTF-8 parsers aren't hard to write if there isn't already a function that does it.
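The replacement idea in this comment can be sketched as follows. This is an illustration in Python, not a patch for wxr_cdata(); the function name `replace_disallowed` is hypothetical, and the character class is simply the complement of the XML 1.0 `Char` production. It assumes the input is already a decoded Unicode string, so no hand-written UTF-8 parser is needed.

```python
import re

# Everything outside the XML 1.0 `Char` production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_DISALLOWED = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def replace_disallowed(text, replacement="\uFFFD"):
    """Replace each code point XML 1.0 forbids with `replacement`."""
    return _DISALLOWED.sub(replacement, text)


# A form feed becomes U+FFFD; tabs and newlines are left alone.
print(repr(replace_disallowed("page one\fpage two\n")))
```

A PHP version could plausibly take the same shape with preg_replace() and the `u` (UTF-8) pattern modifier, which lets the regex engine handle code point iteration, though that has not been tested against the exporter. Replacing rather than stripping keeps a visible marker where data was lost, at the cost of altering the exported content.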
How would you propose that invalid characters are stripped / converted?