Opened 4 years ago

Last modified 7 months ago

#27896 new defect (bug)

wordpress-importer's lack of understanding of XML Namespaces causing compatibility issues

Reported by: tomdxw
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version: 3.9
Component: Import
Keywords: has-patch
Focuses:
Cc:

Description

This plugin doesn't understand namespaces when parsing XML.

Correct me if I'm wrong but I think the following XML documents are equivalent:

<rss xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:wxr_version>1.2</wp:wxr_version>
  </channel>
</rss>

<rss>
  <channel>
    <wxr_version xmlns="http://wordpress.org/export/1.2/">1.2</wxr_version>
  </channel>
</rss>

<rss>
  <channel>
    <wp:wxr_version xmlns:wp="http://wordpress.org/export/1.2/">1.2</wp:wxr_version>
  </channel>
</rss>

<rss xmlns:ns1="http://wordpress.org/export/1.2/">
  <channel>
    <ns1:wxr_version>1.2</ns1:wxr_version>
  </channel>
</rss>

Importing the first document leads to the next step with the "Download and import file attachments" checkbox and the "Submit" button. The other documents produce "This does not appear to be a WXR file, missing/invalid WXR version number".
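As a rough illustration of why the four documents should be treated alike, the sketch below reads wxr_version from each one by namespace URI rather than by prefix, using SimpleXML's children() method. The helper name wxr_version_of() and the array labels are made up for this example and are not part of the importer.

```php
<?php
// Namespace URI taken from the example documents in this ticket.
const WP_NS = 'http://wordpress.org/export/1.2/';

// The four documents from the description, collapsed onto one line each.
$documents = [
    'wp prefix declared on rss'     => '<rss xmlns:wp="http://wordpress.org/export/1.2/"><channel><wp:wxr_version>1.2</wp:wxr_version></channel></rss>',
    'default namespace on element'  => '<rss><channel><wxr_version xmlns="http://wordpress.org/export/1.2/">1.2</wxr_version></channel></rss>',
    'wp prefix declared on element' => '<rss><channel><wp:wxr_version xmlns:wp="http://wordpress.org/export/1.2/">1.2</wp:wxr_version></channel></rss>',
    'ns1 prefix declared on rss'    => '<rss xmlns:ns1="http://wordpress.org/export/1.2/"><channel><ns1:wxr_version>1.2</ns1:wxr_version></channel></rss>',
];

// Hypothetical helper: look the element up by namespace URI, ignoring the prefix.
function wxr_version_of(string $xml): string
{
    $rss = simplexml_load_string($xml);
    // children() with a URI selects child elements in that namespace,
    // whatever prefix (or default namespace) the document happens to use.
    return (string) $rss->channel->children(WP_NS)->wxr_version;
}

// array_map() keeps the string keys when given a single array.
$versions = array_map('wxr_version_of', $documents);

foreach ($versions as $label => $version) {
    echo $label, ': ', $version, "\n";
}
```

A namespace-aware lookup like this returns the same value for all four inputs, which is the behaviour the report expects from the importer.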

This bug makes it difficult to write tools which generate WXR files (for instance when migrating content from an existing site into a WordPress installation).

Attachments (2)

27896.diff (786 bytes) - added by pbiron 8 months ago.
fixes bug whereby termmeta is not imported when the WXR_Parser_XML is used
27896.1.diff (10.5 KB) - added by pbiron 8 months ago.
make both WXR_Parser_SimpleXML & WXR_Parser_XML fully namespace aware

Download all attachments as: .zip

Change History (12)

#1 @Denis-de-Bernardy
4 years ago

Insofar as I can tell, it's using either PHP's SimpleXML (preferred) or the xml extension, so I'm pretty sure this ought to be reported upstream if neither works:

#2 @tomdxw
4 years ago

  • Keywords has-patch added

SimpleXML actually seems to support namespaces. If you add some calls to $xml->registerXPathNamespace() it prevents the "does not appear to be a WXR file" message from appearing. And then if you update the namespaces it actually imports content (the 1.2 namespaces are what are currently produced by WordPress' built-in exporter).

Here's a patch:

https://gist.github.com/tomdxw/ca851f05b088165e25bd

(This may or may not break importing from XML using the 1.1 namespaces; I haven't tested it.)
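For readers who have not used registerXPathNamespace(), here is a minimal sketch of the idea, not the linked patch itself. It assumes the WXR 1.2 namespace URI from the description and uses the fourth example document, which declares the prefix ns1 rather than wp.

```php
<?php
// Fourth example document from the description: the prefix is "ns1", not "wp".
$xml = simplexml_load_string(
    '<rss xmlns:ns1="http://wordpress.org/export/1.2/">'
    . '<channel><ns1:wxr_version>1.2</ns1:wxr_version></channel>'
    . '</rss>'
);

// Bind the prefix "wp" to the namespace URI for XPath queries.
// The binding belongs to the query, so the document's own prefix does not matter.
$xml->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');

$matches = $xml->xpath('/rss/channel/wp:wxr_version');

// xpath() returns an array of SimpleXMLElement objects (empty if nothing matched).
$wxr_version = $matches ? (string) $matches[0] : null;

echo $wxr_version, "\n";
```

The prefix in the XPath expression only has to match the one registered on the SimpleXMLElement; the URI is what ties the query to the elements in the file.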

#3 @chriscct7
2 years ago

  • Keywords needs-patch added; has-patch removed

@pbiron
8 months ago

fixes bug whereby termmeta is not imported when the WXR_Parser_XML is used

@pbiron
8 months ago

make both WXR_Parser_SimpleXML & WXR_Parser_XML fully namespace aware

#4 follow-up: @pbiron
8 months ago

  • Keywords has-patch added; needs-patch removed

I just uploaded 2 patches against 0.6.3:

  1. 27896.diff is unrelated to making wordpress-importer namespace aware, but I discovered while writing the namespace-awareness patch that termmeta is not imported when the WXR_Parser_XML parser is used. This might justify its own ticket. Just let me know and I'll create that.
  2. 27896.1.diff makes both WXR_Parser_SimpleXML & WXR_Parser_XML fully namespace aware. It assumes that 27896.diff has been applied. As noted in a comment I added to WXR_Parser_Regex: it is not worthwhile (or probably even possible) to do fully namespace-aware XML parsing with regexes.

The mods in 27896.1.diff that apply to WXR_Parser_SimpleXML are fairly simple (pun intended) to understand and I don't think need any further explanation.

The mods in 27896.1.diff that apply to WXR_Parser_XML deserve a little explanation.

  1. The parser is created in namespace-aware mode by calling xml_parser_create_ns() instead of xml_parser_create().
  2. When parsing in namespace-aware mode, XML Parser passes a "namespace-qualified" tag name to the callables registered with xml_set_element_handler() (i.e., WXR_Parser_XML::open_tag() and WXR_Parser_XML::close_tag()).
    1. That is, the tag name is of the form URI:tag, e.g. http://wordpress.org/export/1.2/:term (instead of prefix:tag, e.g., wp:term when running in non-namespace-aware mode).
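The two points above can be seen in isolation with a small standalone script. This is only a sketch, not code from 27896.1.diff: the closures stand in for WXR_Parser_XML::open_tag() and WXR_Parser_XML::close_tag(), and the $seen array is a name invented for the example.

```php
<?php
// Fourth example document from the description (prefix "ns1").
$xml = '<rss xmlns:ns1="http://wordpress.org/export/1.2/">'
     . '<channel><ns1:wxr_version>1.2</ns1:wxr_version></channel>'
     . '</rss>';

$seen = [];

// Namespace-aware parser; ':' is the separator placed between URI and tag name.
$parser = xml_parser_create_ns('UTF-8', ':');

// Keep tag names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // Stand-in for open_tag(): record the name the parser hands over.
    function ($parser, string $name, array $attrs) use (&$seen) {
        $seen[] = $name;
    },
    // Stand-in for close_tag(): nothing to do in this sketch.
    function ($parser, string $name) {
    }
);

xml_parse($parser, $xml, true);

// Elements without a namespace keep their plain name; namespaced ones
// arrive in the URI:tag form described above.
print_r($seen);
```

Because the name is built from the namespace URI, the handler sees the same string whether the file used wp:, ns1: or a default namespace declaration.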

It might also be useful to write an XMLReader-based parser as well. I can work on that (tho probably not for a couple of weeks) if others think it would be a good thing.

#5 in reply to: ↑ 4 ; follow-up: @rmccue
8 months ago

Replying to pbiron:

It might also be useful to write an XMLReader-based parser as well. I can work on that (tho probably not for a couple of weeks) if others think it would be a good thing.

FYI, I started work on that a while ago with the WordPress Importer Redux project; would love additional contributions there. :)

#6 in reply to: ↑ 5 @pbiron
8 months ago

Replying to rmccue:

FYI, I started work on that a while ago with the WordPress Importer Redux project; would love additional contributions there. :)

I just forked that repo and took a quick look. Should be really easy to make that implementation namespace-aware. Will send a pull request when I get to it.

#7 follow-up: @pbiron
8 months ago

@rmccue p.s. is the importer redux a "feature" plugin that is targeted to replace the current importer when it's complete?

#8 in reply to: ↑ 7 @rmccue
8 months ago

Replying to pbiron:

@rmccue p.s. is the importer redux a "feature" plugin that is targeted to replace the current importer when it's complete?

That's the idea, yeah, it's essentially the beta of v2. It's stalled a bit because I don't have time to dedicate to it, but happy to give out commit access :)

And yeah, right now it's not namespace-aware mainly because I was lazy, but pretty easy to fix that. Basically just needs a normalisation step, I think.
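One possible reading of such a normalisation step, offered purely as a sketch: map each namespace-qualified name (in the URI:tag form discussed earlier in this ticket) back to a canonical prefix:tag name so the rest of the parser can keep matching on strings like wp:term. The function name normalise_tag() and the CANONICAL_PREFIXES map are hypothetical and do not come from the Redux code.

```php
<?php
// Assumption: an illustrative map from namespace URI to the prefix the
// parser's existing code expects. Only the URI used in this ticket is listed.
const CANONICAL_PREFIXES = [
    'http://wordpress.org/export/1.2/' => 'wp',
];

// Hypothetical helper: turn "URI:tag" into "prefix:tag" when the URI is known.
function normalise_tag(string $qualified): string
{
    // URIs contain colons themselves, so split on the LAST one;
    // the local tag name cannot contain a colon in namespace-aware mode.
    $pos = strrpos($qualified, ':');
    if ($pos === false) {
        return $qualified; // element is not in any namespace
    }

    $uri   = substr($qualified, 0, $pos);
    $local = substr($qualified, $pos + 1);

    if (!array_key_exists($uri, CANONICAL_PREFIXES)) {
        return $qualified; // unknown namespace: leave the name untouched
    }

    return CANONICAL_PREFIXES[$uri] . ':' . $local;
}

echo normalise_tag('http://wordpress.org/export/1.2/:term'), "\n";
echo normalise_tag('channel'), "\n";
```

Supporting another WXR version under this approach would mean adding its namespace URI to the map, though how version detection should work is a separate question.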

#9 @pbiron
7 months ago

As I've started to add namespace awareness to WordPress Importer Redux, a number of issues have come up about how to properly implement namespace awareness in the importer (e.g., Issue 117: detecting the version of WXR in namespace-aware parsing).

Because of that, I think it's best to put the patch I submitted above on hold until those general issues are resolved.

I encourage anyone interested in this topic to join the discussion on that issue over at the redux GitHub repo.

#10 @pbiron
7 months ago

Some might find the following off-topic for this ticket (and I apologize if it is), but because of the work I did on the patch for this ticket and on the Importer Redux, I felt it might be helpful to have an XML Schema for WXR.

I have put that XML Schema up at An XML Schema 1.1 schema for WXR. Comments on that schema are greatly appreciated.
