Opened 10 years ago
Last modified 7 years ago
#27896 new defect (bug)
wordpress-importer's lack of understanding of XML Namespaces causing compatibility issues
Reported by: | tomdxw | Owned by: | |
---|---|---|---|
Milestone: | Awaiting Review | Priority: | normal |
Severity: | normal | Version: | 3.9 |
Component: | Import | Keywords: | has-patch |
Focuses: | Cc: |
Description
This plugin doesn't understand namespaces when parsing XML.
Correct me if I'm wrong but I think the following XML documents are equivalent:
<rss xmlns:wp="http://wordpress.org/export/1.2/"> <channel> <wp:wxr_version>1.2</wp:wxr_version> </channel> </rss>
<rss> <channel> <wxr_version xmlns="http://wordpress.org/export/1.2/">1.2</wxr_version> </channel> </rss>
<rss> <channel> <wp:wxr_version xmlns:wp="http://wordpress.org/export/1.2/">1.2</wp:wxr_version> </channel> </rss>
<rss xmlns:ns1="http://wordpress.org/export/1.2/"> <channel> <ns1:wxr_version>1.2</ns1:wxr_version> </channel> </rss>
Importing the first document leads to the next step with the "Download and import file attachments" checkbox and the "Submit" button. The other documents produce "This does not appear to be a WXR file, missing/invalid WXR version number".
This bug makes it difficult to write tools which generate WXR files (for instance when migrating content from an existing site into a WordPress installation).
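The equivalence claimed above can be checked with a namespace-aware lookup. The following is a minimal sketch, not the importer's code: `wxr_version_ns()` and the `WP_NS` constant are hypothetical names introduced for illustration, and the lookup uses SimpleXML's `children()` with a namespace URI so that the prefix chosen by the document does not matter.

```php
<?php
// Namespace URI taken from the example documents above.
const WP_NS = 'http://wordpress.org/export/1.2/';

// Hypothetical helper: find wxr_version by namespace URI, not by prefix.
function wxr_version_ns(string $xml): ?string {
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return null; // not well-formed XML
    }
    // children($uri) selects child elements in that namespace,
    // whatever prefix (or default namespace) the document uses.
    $version = $doc->channel->children(WP_NS)->wxr_version;
    return count($version) > 0 ? (string) $version : null;
}

// The four documents from the ticket description.
$docs = [
    '<rss xmlns:wp="http://wordpress.org/export/1.2/"> <channel> <wp:wxr_version>1.2</wp:wxr_version> </channel> </rss>',
    '<rss> <channel> <wxr_version xmlns="http://wordpress.org/export/1.2/">1.2</wxr_version> </channel> </rss>',
    '<rss> <channel> <wp:wxr_version xmlns:wp="http://wordpress.org/export/1.2/">1.2</wp:wxr_version> </channel> </rss>',
    '<rss xmlns:ns1="http://wordpress.org/export/1.2/"> <channel> <ns1:wxr_version>1.2</ns1:wxr_version> </channel> </rss>',
];

foreach ($docs as $i => $xml) {
    printf("document %d: %s\n", $i + 1, var_export(wxr_version_ns($xml), true));
}
```

A namespace-aware reader should report the same version for all four documents, which is the behaviour the ticket asks the importer to have.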
Attachments (2)
Change History (12)
#2, 10 years ago
- Keywords has-patch added
SimpleXML actually seems to support namespaces. If you add some calls to $xml->registerXPathNamespace() it prevents the "does not appear to be a WXR file" message from appearing. And then if you update the namespaces it actually imports content (the 1.2 namespaces are what are currently produced by WordPress' built-in exporter).
Here's a patch:
https://gist.github.com/tomdxw/ca851f05b088165e25bd
(This may or may not break importing from XML that uses the 1.1 namespaces; I haven't tested.)
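For readers without access to the gist, here is a rough sketch of the idea, not the patch itself. It assumes the version check is an XPath query of the form `/rss/channel/wp:wxr_version`; the variable names are illustrative. `registerXPathNamespace()` binds the prefix used in the query to the namespace URI, so the query no longer depends on the prefix the document happens to declare.

```php
<?php
// A document that uses the prefix "ns1" instead of "wp".
$xml = simplexml_load_string(
    '<rss xmlns:ns1="http://wordpress.org/export/1.2/">'
    . '<channel><ns1:wxr_version>1.2</ns1:wxr_version></channel>'
    . '</rss>'
);

// Bind the query's "wp" prefix to the WXR 1.2 namespace URI.
// The binding is local to XPath queries on $xml.
$xml->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');

// Assumed form of the version lookup; xpath() returns an array of matches.
$matches = $xml->xpath('/rss/channel/wp:wxr_version');
$wxr_version = $matches ? (string) $matches[0] : null;

var_dump($wxr_version);
```

Note that the binding names one specific URI, which is why the 1.1 and 1.2 namespaces would each need their own handling.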
#4 (follow-up: ↓ 5), 7 years ago
- Keywords has-patch added; needs-patch removed
I just uploaded 2 patches against 0.6.3:
- 27896.diff is unrelated to making wordpress-importer namespace aware, but I discovered while writing the namespace awareness patch that termmeta is not imported when the WXR_Parser_XML parser is used. This might justify its own ticket. Just let me know and I'll create that.
- 27896.1.diff makes both WXR_Parser_SimpleXML & WXR_Parser_XML fully namespace aware. It assumes that 27896.diff has been applied. As noted in a comment I added to WXR_Parser_Regex: it is not worth it (or probably even possible) to do fully namespace aware XML parsing with regexes.
The mods in 27896.1.diff that apply to WXR_Parser_SimpleXML are fairly simple (pun intended) to understand and I don't think need any further explanation.
The mods in 27896.1.diff that apply to WXR_Parser_XML deserve a little explanation.
- The parser is created in namespace-aware mode by calling xml_parser_create_ns() instead of xml_parser_create().
- When parsing in namespace-aware mode, XML Parser passes a "namespace-qualified" tag name to the callables registered with xml_set_element_handler() (i.e., WXR_Parser_XML::open_tag() and WXR_Parser_XML::close_tag()).
  - That is, the tag name is of the form URI:tag, e.g. http://wordpress.org/export/1.2/:term (instead of prefix:tag, e.g., wp:term when running in non-namespace-aware mode).
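The difference between the two modes can be seen in a small standalone sketch. This is not code from the patch: `collect_tag_names()` is a hypothetical helper that records the names passed to the start-element handler. Case folding is switched off because it is enabled by default and would uppercase the names.

```php
<?php
$xml = '<rss xmlns:ns1="http://wordpress.org/export/1.2/">'
     . '<channel><ns1:term>x</ns1:term></channel>'
     . '</rss>';

// Hypothetical helper: run a parser over $xml and return the tag names
// exactly as they are handed to the start-element handler.
function collect_tag_names($parser, string $xml): array {
    $names = [];
    // Keep names as written; folding would turn them into uppercase.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$names) { $names[] = $name; },
        function ($p, $name) { /* end tags are ignored in this sketch */ }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

// Non-namespace-aware: names arrive as prefix:tag.
$plain = collect_tag_names(xml_parser_create('UTF-8'), $xml);

// Namespace-aware: names arrive as URI:tag (":" is the separator argument).
$aware = collect_tag_names(xml_parser_create_ns('UTF-8', ':'), $xml);

print_r($plain);
print_r($aware);
```

In namespace-aware mode the handlers can compare against the namespace URI rather than a fixed prefix, so a document that uses `ns1:` instead of `wp:` is treated the same way.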
It might also be useful to write an XMLReader-based parser as well. I can work on that (tho probably not for a couple of weeks) if others think it would be a good thing.
#5 (in reply to: ↑ 4; follow-up: ↓ 6), 7 years ago
Replying to pbiron:
It might also be useful to write an XMLReader-based parser as well. I can work on that (tho probably not for a couple of weeks) if others think it would be a good thing.
FYI, I started work on that a while ago with the WordPress Importer Redux project; would love additional contributions there. :)
#6 (in reply to: ↑ 5), 7 years ago
Replying to rmccue:
FYI, I started work on that a while ago with the WordPress Importer Redux project; would love additional contributions there. :)
I just forked that repo and took a quick look. Should be really easy to make that implementation namespace-aware. Will send a pull request when I get to it.
#7 (follow-up: ↓ 8), 7 years ago
@rmccue p.s. is the importer redux a "feature" plugin that is targeted to replace the current importer when it's complete?
#8 (in reply to: ↑ 7), 7 years ago
Replying to pbiron:
@rmccue p.s. is the importer redux a "feature" plugin that is targeted to replace the current importer when it's complete?
That's the idea, yeah, it's essentially the beta of v2. It's stalled a bit because I don't have time to dedicate to it, but happy to give out commit access :)
And yeah, right now it's not namespace-aware mainly because I was lazy, but pretty easy to fix that. Basically just needs a normalisation step, I think.
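One way to read "a normalisation step" is to map each element's namespace URI back to the prefix the rest of the importer already expects. The sketch below is only a guess at that idea and is not taken from the Importer Redux code: `CANONICAL_PREFIXES` and `normalised_names()` are hypothetical, and the fallback notation for unknown namespaces is an arbitrary choice.

```php
<?php
// Assumed mapping from namespace URI to the canonical prefix.
const CANONICAL_PREFIXES = [
    'http://wordpress.org/export/1.2/' => 'wp',
];

// Hypothetical helper: list element names with canonical prefixes,
// whatever prefixes the document itself declares.
function normalised_names(string $xml): array {
    $reader = new XMLReader();
    $reader->XML($xml);
    $names = [];
    while ($reader->read()) {
        if ($reader->nodeType !== XMLReader::ELEMENT) {
            continue;
        }
        $uri = $reader->namespaceURI;
        $prefix = CANONICAL_PREFIXES[$uri] ?? null;
        if ($uri === '') {
            $names[] = $reader->localName;                 // no namespace
        } elseif ($prefix !== null) {
            $names[] = $prefix . ':' . $reader->localName; // known namespace
        } else {
            $names[] = '{' . $uri . '}' . $reader->localName; // unknown namespace
        }
    }
    $reader->close();
    return $names;
}

$names = normalised_names(
    '<rss xmlns:ns1="http://wordpress.org/export/1.2/">'
    . '<channel><ns1:wxr_version>1.2</ns1:wxr_version></channel>'
    . '</rss>'
);
print_r($names);
```

With a step like this, code that matches on `wp:wxr_version` would keep working for documents that bind the same namespace to a different prefix.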
#9, 7 years ago
As I've started to add namespace awareness to WordPress Importer Redux, a number of issues have come up about how to properly implement namespace awareness in the importer (e.g., Issue 117: detecting the version of WXR in namespace-aware parsing).
Because of that, I think it's best to put the patch I submitted above on hold until those general issues are resolved.
I encourage anyone interested in this topic to join the discussion on that issue over at the redux GitHub repo.
#10, 7 years ago
Some might find the following off-topic for this ticket (and I apologize if it is), but because of the work I did on the patch to this ticket and the Importer Redux I felt it might be helpful to have an XML Schema for WXR.
I have put that XML Schema up at An XML Schema 1.1 schema for WXR. Comments on that schema are greatly appreciated.
Insofar as I can tell, it's using either of PHP's SimpleXML (preferred) or xml extension, so I'm pretty sure this ought to be reported upstream if neither works: