Opened 13 years ago

Closed 13 years ago

#15197 closed defect (bug) (fixed)

WXR export/import umbrella ticket

Reported by: duck_
Owned by: duck_
Milestone: 3.1
Priority: normal
Severity: normal
Version: 3.1
Component: Export
Keywords: has-patch, ux-feedback
Focuses:
Cc:

Description (last modified by duck_)

Umbrella ticket for a number of upgrades to the WXR export/import process.

Export

  • Bump WXR version to 1.1
  • Removed filtering for now (see explanation below)
  • Removed wxr_missing_parents (local function), seems to be a remnant from pre-get_categories
  • Added author information to export (for better import UX) - #11118
  • Greater usage of slug-like identifiers, e.g. login instead of name in <dc:creator>
  • Don't export auto-drafts
  • Filled in docs
  • Ignore _edit_lock and _edit_last meta keys
  • Only use the 'forward compatible' term tags, <category domain="foo" nicename="bar">, within post items

Import

  • Use an XML parser (where available). 3 parser options: SimpleXML (yay!), XML Parser (yay!), or regular expressions (boo!)
  • Proper import support for nav menus - #14750
    • Menu items for missing content will be skipped. There should be no problems when an associated object is further down the import file than the menu item
    • Orphaned menu items (e.g. their parent was skipped due to above point) will become top-level
  • Greater usage of slug-like identifiers, e.g. Use <category domain="..." nicename="..."> tags to fix a bunch of category issues
  • Either import author as is (i.e. from information stored in WXR file, this allows us to create a user with more data by default) or map to an existing user - #10319
  • Less direct feedback (ignoring errors, currently none :( !), as it is unwieldy for a large import.
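
The nav menu rules in the list above (skip items whose content is missing, promote orphaned items to top-level) can be illustrated with a minimal sketch. This is not the importer's actual code: the function name, the array keys 'id', 'object_id' and 'parent', and the $processed_posts map are all assumptions made for illustration. It assumes menu items are resolved only after all other content has been processed, which is why the position of an object in the file would not matter.

```php
<?php
/**
 * Hypothetical sketch of the menu item rules described above.
 *
 * @param array $items           Menu items, each with assumed keys 'id', 'object_id', 'parent'.
 * @param array $processed_posts Assumed map of old object ID => new ID, built once
 *                               all other content has been imported.
 * @return array Kept menu items keyed by their original ID.
 */
function sketch_import_menu_items( array $items, array $processed_posts ) {
    $kept = array();

    // Pass 1: skip menu items whose associated content was never imported.
    foreach ( $items as $item ) {
        if ( isset( $processed_posts[ $item['object_id'] ] ) ) {
            $kept[ $item['id'] ] = $item;
        }
    }

    // Pass 2: an item whose parent was skipped becomes top-level (parent 0).
    foreach ( $kept as $id => $item ) {
        if ( 0 !== $item['parent'] && ! isset( $kept[ $item['parent'] ] ) ) {
            $kept[ $id ]['parent'] = 0;
        }
    }

    return $kept;
}
```

Running the parent check as a second pass means an item is only orphaned by a parent that was really skipped, not by one that simply appears later in the list.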

All accompanied by a number of smaller changes and anything I forgot to write down.

Further work

Backwards Compatibility

The main problem for now is ensuring backwards compatibility with WXR 1.0 files. That said, no major faults should occur when importing a 1.0 file. Excluding all the problems you will come across already in an export/import in 3.0.1:

  • No author import (the current importer takes author data from each post)
    SOLVED: if we get an empty author array then loop through posts grabbing unique authors and offering to map them (but not to import)
  • All term menu items will be skipped due to missing term_id XML tags. Possible solution: slugs instead of IDs for the processed_terms mapping? In fact, as far as I can see, filling imported menus is actually impossible with WXR 1.0, since the file doesn't contain custom terms for post items (see #13453 and #14306), so we don't know which menu to assign the menu items to
  • Probably some indexes and vars which need to be checked with isset and fallback provided (for when the XML tag doesn't exist in 1.0 files)
  • ... and possibly more with further testing
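
A rough sketch of the WXR 1.0 author fallback and the isset-with-fallback pattern mentioned in the list above. The function name and the array keys ('post_author', 'author_login') are hypothetical and chosen only for illustration; the real importer's data structures may differ.

```php
<?php
/**
 * Hypothetical sketch: build the author list for a file that has no
 * explicit author entries (as with WXR 1.0).
 *
 * @param array $authors Authors parsed from the file; empty for WXR 1.0.
 * @param array $posts   Parsed posts, each an array that may have a 'post_author' key.
 * @return array Authors keyed by login.
 */
function sketch_collect_authors( array $authors, array $posts ) {
    if ( ! empty( $authors ) ) {
        return $authors; // Author data was exported explicitly; nothing to do.
    }

    foreach ( $posts as $post ) {
        // isset() with a fallback guards against tags missing from 1.0 files.
        $login = isset( $post['post_author'] ) ? trim( $post['post_author'] ) : '';

        if ( '' === $login || isset( $authors[ $login ] ) ) {
            continue; // No author on this post, or already collected.
        }

        // Only the login is known, so the author can be offered for
        // mapping to an existing user but not for import as a new user.
        $authors[ $login ] = array( 'author_login' => $login );
    }

    return $authors;
}
```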

How far should this go back?
Example: 3 years ago [6375] introduced forwards compatible category tags including the slug and taxonomy. These are the only category tags the parsers currently read, is it worth checking the really old style XML tags if no terms are found for a post (should be easy for SimpleXML and regular expressions, but I think will be harder for XML Parser)?

The problem of filtering

  • Potential to export a pretty useless file, e.g. choose Category: Uncategorized and Content Type: Pages
  • Makes reliable importing of nav menus harder (worse UX when importer is creating half made menus)

Moving forward I am currently imagining some sort of grid of post types selectable by checkbox. Each post type lists its taxonomies below, these are only activated/recognised if the post type is selected. But what filters to include and how to show them are probably for another ticket.

See nacin's comment and mockup for the current plan for export filtering.

Other

The feedback from the importer needs to be completed (see above). I was thinking of listing errors (hidden by default, shown with JS?) and a table of results showing the number of successes and failures for each of authors, posts, terms, ...

The can_export property of a post type only enables it to show up in the Content Types dropdown for export filtering, but if "All Content" is selected then all post types are exported including those with can_export set to false. Fix based on export patch here could be something like:

$post_types = get_post_types( array( 'public' => true, 'can_export' => true ) );
$where = "post_type IN ('" . implode( "','", $post_types ) . "') AND post_status != 'auto-draft'";
// grab a snapshot of post IDs, just in case it changes during the export
$post_ids = $wpdb->get_col( "SELECT ID FROM $wpdb->posts WHERE $where ORDER BY post_date_gmt ASC" );

(NB: would need to look into exactly which builtin posts are and should be can_export => false)

Docs in the importer.

Currently I have unit tests for the parsers and hopefully coming soon will be more for the whole process (need to think up a full checklist of tests for edge and problem cases)


This is still partly a work in progress so feedback and a lot of testing please. Thank you.

This ticket aims to fix the following:
#5447 #5460 #7400 #7973 #8471 #9237 #10319 #11118 #11144 #11354 #11574 #12685 #13364 #13394 #13453 #13454 #13627 #14306 #14442 #14524 #14750 #15055 #15091 #15108

Attachments (12)

15197-export.diff (27.0 KB) - added by duck_ 13 years ago.
Export against core trunk
15197-import.diff (75.2 KB) - added by duck_ 13 years ago.
Import patch against wordpress-importer plugin checkout
15197.filtering.png (71.0 KB) - added by nacin 13 years ago.
15197-import.002.diff (76.8 KB) - added by duck_ 13 years ago.
15197.import.keep-trying.diff (791 bytes) - added by nacin 13 years ago.
If WP_Error, fall back to regexp. Very temporary.
testingtrunk.wordpress.2010-10-26.xml.gz (1.6 KB) - added by lloydbudd 13 years ago.
Trivial test case
15197.import.keep-trying.2.diff (1.4 KB) - added by duck_ 13 years ago.
15197.export-filtering.diff (12.8 KB) - added by duck_ 13 years ago.
15197.export-filtering.002.diff (15.4 KB) - added by duck_ 13 years ago.
15197.export-filtering.002.png (51.4 KB) - added by duck_ 13 years ago.
15197.filtering-js.diff (2.3 KB) - added by duck_ 13 years ago.
15941.tt-id.diff (697 bytes) - added by duck_ 13 years ago.

Change History (66)

@duck_
13 years ago

Export against core trunk

@duck_
13 years ago

Import patch against wordpress-importer plugin checkout

#1 @westi
13 years ago

  • Cc westi added

#2 @beaulebens
13 years ago

  • Cc beau@… added

#3 @nacin
13 years ago

Attached a screenshot of a mockup for filtering, devised while talking to duck_ about the patches here.

We do need filtering, but it needs to be done sanely. The 3.0 'improvements' ended up making it a difficult form to use, could produce some choppy XML files, and also posed a huge issue in terms of how to integrate menus.

My suggestion is to take a step back and think about it in terms of post types. First option would be to allow you to export everything. That would mean all post types for which can_export is true, all linked taxonomy data, attachments, all navigation menus and their relationships, etc. This would allow you to export things such as nav menus, attachments, etc. that are too intertwined with other objects to be sanely exported alone.

The next option would be Posts, which you can then filter by author, category, or date. Pages, you can filter by author. Then each subsequent post type, which is can_export && public, can also be selected. These options allow for bite-sized XML files as appropriate.

We can surely drop in some hooks in the right place to allow plugins to reach into here and modify it further. But I think this hierarchical approach is better than what we had in 2.9 (author only) and 3.0 (bunch of fields).

On the code overhaul, +100. Would love to see this in 3.1.

#4 @duck_
13 years ago

  • Description modified (diff)

New import patch coming soon with WXR 1.0 author fix and a few other things.

The current todo:

  • Give better feedback to the user at the end of import
  • Re-implement export filtering
  • Double check for undefined variables/indexes on WXR 1.0 import

#5 @ryan
13 years ago

Looks great.

#6 @duck_
13 years ago

Importer patch take two is up:

  • Fixes author mapping for WXR 1.0
  • Back compat with WXR 1.0 for adding post tags which are only defined in <item> XML tags
  • No longer assume that the 'nav_menu' term appears first in menu items
  • Translation calls for all strings (though still some changing to go I guess)
  • Some other small stuff (e.g. removing cruft comments)

Importer todo:

  • Give better feedback to the user at the end of import (pending ux-feedback)
  • More testing! and attempt to break it

Exporter todo:

  • Reimplement filtering (pending ux-feedback)

#7 follow-up: @nacin
13 years ago

New idea: Hide comment_status = 'spam', the same way we hide post_status = 'auto-draft' and post_type 'revision'. They should never be included in an export.

I'm working on a new mockup for the filtering, based on a discussion with Jane.

#8 in reply to: ↑ 7 @duck_
13 years ago

Replying to nacin:

New idea: Hide comment_status = 'spam', the same way we hide post_status = 'auto-draft' and post_type 'revision'. They should never be included in an export.

I've already done this as comment_approved = 1

#10 @nacin
13 years ago

(In [15961]) Importer and exporter overhaul, mega props duck.

Exporter overhaul:

  • Add author information to export
  • Greater usage of slug identifiers
  • Don't export auto-drafts, spam comments, or edit lock/last meta keys
  • Inline documentation improvements
  • Remove filtering for now (@todo)
  • Bump WXR version to 1.1, but remain back compat in the importer

Importer overhaul (http://plugins.trac.wordpress.org/changeset/304249):

  • Use an XML parser where available (SimpleXML, XML Parser)
  • Proper import support for navigation menus
  • Many bug fixes, specifically improvements to category and custom taxonomy handling
  • Better author/user mapping

Fixes #5447 #5460 #7400 #7973 #8471 #9237 #10319 #11118 #11144 #11354 #11574 #12685 #13364 #13394 #13453 #13454 #13627 #14306 #14442 #14524 #14750 #15055 #15091 #15108.

See #15197.

#11 @Viper007Bond
13 years ago

C- C- C- Combo kill!

#12 @lloydbudd
13 years ago

  • Milestone changed from Awaiting Review to 3.1

#13 follow-up: @lloydbudd
13 years ago

Created #15217 "wp importer trunk regression: no longer accepts gz files"

#14 in reply to: ↑ 13 ; follow-up: @lloydbudd
13 years ago

Replying to lloydbudd:

Created #15217 "wp importer trunk regression: no longer accepts gz files"

Actually, I might be mistaken... or I can't tell because wp importer trunk doesn't seem to want to work for anything I throw at it.

@nacin
13 years ago

If WP_Error, fall back to regexp. Very temporary.

#15 follow-ups: @nacin
13 years ago

  • Owner set to duck_
  • Status changed from new to assigned

Quick patch to the importer allows fallback if an invalid XML file is provided.

One export I made earlier today from 3.1-alpha (pre r15961) did not work on trunk import, and libxml_get_errors() returned quite a few malformation errors. This allows that file to be processed.

Patch is pretty lame. If we're receiving a WP_Error for malformed WXR, we should probably just go straight to regexp instead of stopping at libxml first. Also, the many WP_Errors that get returned in $class::parse() should probably have unique IDs (to the class), that way we can better detect invalid XML versus, say, incorrect format.

#16 in reply to: ↑ 14 ; follow-up: @duck_
13 years ago

Replying to lloydbudd:

Actually, I might be mistaken... or I can't tell because wp importer trunk doesn't seem to want to work for anything I throw at it.

I've tried the testingtrunk.wordpress.2010-10-26.xml attachment above with wordpress-importer trunk on PHP 5.3.2 (WP 3.0.1 and 3.1-alpha) and PHP 4.4.9 (WP 3.0.1) using all the different parsers I could and I got no errors (excluding ones reported by not checking import attachments).

#17 in reply to: ↑ 15 @duck_
13 years ago

Replying to nacin:

One export I made earlier today from 3.1-alpha (pre r15961) did not work on trunk import, and libxml_get_errors() returned quite a few malformation errors. This allows that file to be processed.

In this case there was an error with an encoded character in a CDATA section (the apple command key ⌘ to be precise). See this patch from #14584 for a potential fix. Though would definitely need regex fallback to handle this kind of problem for WXR 1.0 exports (assuming we get a fix in for 3.1).

#18 @duck_
13 years ago

Created #15219, so that no UI stuff gets missed if this ticket becomes clogged with new bugs and fixes.

#19 @jane
13 years ago

  • Keywords ux-feedback removed

Removing the UX feedback tag, as I went through the UI with @nacin the other day, and @duck_ has created a separate ticket for future UI discussion.

#20 in reply to: ↑ 16 @lloydbudd
13 years ago

Replying to duck_:

Replying to lloydbudd:

Actually, I might be mistaken... or I can't tell because wp importer trunk doesn't seem to want to work for anything I throw at it.

I've tried the testingtrunk.wordpress.2010-10-26.xml attachment above with wordpress-importer trunk on PHP 5.3.2 (WP 3.0.1 and 3.1-alpha) and PHP 4.4.9 (WP 3.0.1) using all the different parsers I could and I got no errors (excluding ones reported by not checking import attachments).

Found my issue. It was my bad. Sorry for spinning your wheels.

#21 @lloydbudd
13 years ago

Reopened #15217 "wp importer trunk regression: no longer accepts gz files".

@lloydbudd
13 years ago

Trivial test case

#22 follow-up: @lloydbudd
13 years ago

Importing testingtrunk.wordpress.2010-10-26.xml again doesn't provide any feedback that the post already exists and has not been imported.

#23 in reply to: ↑ 22 @duck_
13 years ago

Replying to lloydbudd:

Importing testingtrunk.wordpress.2010-10-26.xml again doesn't provide any feedback that the post already exists and has not been imported.

http://plugins.trac.wordpress.org/changeset/304726/

Also, as per nacin's suggestion on IRC, introduced IMPORT_DEBUG which, when true, enables verbose errors. It defaults to true whilst the importer is in development. The parsers now also return the line, column and error string for XML parsing failures (only shown when IMPORT_DEBUG is true).

#24 follow-up: @lloydbudd
13 years ago

@duck_ is it expected that the trunk importer will work with current WordPress.com exports?

If so, how do I interpret:

Sorry, there has been an error.

There was an error when reading this WXR file

2102605:41 CData section not finished
Wow,Search " nicegradebags " on google to get the 

2102605:41 PCDATA invalid Char value 31

2102605:42 Sequence ']]>' not allowed in content

2102605:42 internal error
2102605:42 Extra content at the end of the document

#25 in reply to: ↑ 24 @westi
13 years ago

Replying to lloydbudd:

@duck_ is it expected that the trunk importer will work with current WordPress.com exports?

If so, how do I interpret:

Sorry, there has been an error.

There was an error when reading this WXR file

2102605:41 CData section not finished
Wow,Search " nicegradebags " on google to get the 

2102605:41 PCDATA invalid Char value 31

2102605:42 Sequence ']]>' not allowed in content

2102605:42 internal error
2102605:42 Extra content at the end of the document

This sounds like the XML is invalid :-(

#26 follow-up: @duck_
13 years ago

Yeah. Looks like the same error that nacin got, I think that if you were to go to line 2102605 you would find a Unicode character of some kind, if you were to delete that your problems should go away. See my comment above (http://core.trac.wordpress.org/ticket/15197#comment:17) about the previous error involving ⌘. The fix would be to apply encoding to CDATA sections for all new WXR files (if that is the correct thing to do, I'm not big on character encoding and XML etc.) and also to have a fallback to the previous parser which doesn't care about this kind of thing.

#27 in reply to: ↑ 15 ; follow-up: @duck_
13 years ago

Replying to nacin:

Patch is pretty lame. If we're receiving a WP_Error for malformed WXR, we should probably just go straight to regexp instead of stopping at libxml first. Also, the many WP_Errors that get returned in $class::parse() should probably have unique IDs (to the class), that way we can better detect invalid XML versus, say, incorrect format.

Updated version so SimpleXML and XMLParser skip straight to regex if malformed XML error, otherwise return results/other error. Should we print anything about the malformed XML error even though we're gonna continue trying anyway?

#28 in reply to: ↑ 27 @lloydbudd
13 years ago

Replying to duck_:

Should we print anything about the malformed XML error even though we're gonna continue trying anyway?

I don't think so. It's not actionable by the customer, and will worry them unnecessarily.

If WP_DEBUG then a hint seems appropriate.

#29 in reply to: ↑ 26 @lloydbudd
13 years ago

Replying to duck_:

Yeah. Looks like the same error that nacin got, I think that if you were to go to line 2102605 you would find a Unicode character of some kind, if you were to delete that your problems should go away. See my comment above (http://core.trac.wordpress.org/ticket/15197#comment:17) about the previous error involving ⌘. The fix would be to apply encoding to CDATA sections for all new WXR files (if that is the correct thing to do, I'm not big on character encoding and XML etc.) and also to have a fallback to the previous parser which doesn't care about this kind of thing.

Thanks for the detailed info. I fired up a hex editor, and confirmed that is the nature of the issue. A comment has "2102605:41 PCDATA invalid Char value 31" a unit separator.

It seems pretty easy for a comment to invalidate the WXR. I'm very interested in seeing this fixed in the exporter, so we can one day phase out the old importer with confidence.
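
For illustration only, one way an exporter could avoid both errors reported above (an invalid control character such as value 31, and a literal ']]>' inside a CDATA section) is sketched below. The ticket does not say this is how the exporter was or will be fixed; both function names are hypothetical. The character class follows the Char production of the XML 1.0 specification, and the pattern assumes the input is valid UTF-8 (with the /u modifier, preg_replace returns null otherwise).

```php
<?php
/**
 * Hypothetical sketch: drop characters that XML 1.0 does not allow,
 * such as the unit separator (0x1F) found in the comment above.
 */
function sketch_strip_invalid_xml_chars( $text ) {
    // Allowed: tab, LF, CR, and the ranges permitted by the XML 1.0 Char production.
    return preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]/u',
        '',
        $text
    );
}

/**
 * Hypothetical sketch: wrap text in a CDATA section, splitting any
 * literal "]]>" so it cannot terminate the section early.
 */
function sketch_cdata( $text ) {
    $text = sketch_strip_invalid_xml_chars( $text );
    // "]]>" becomes "]]" + end of section + new section + ">".
    return '<![CDATA[' . str_replace( ']]>', ']]]]><![CDATA[>', $text ) . ']]>';
}
```

Legitimate non-ASCII characters such as ⌘ fall inside the allowed ranges, so they pass through unchanged when the input is valid UTF-8.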

#30 @duck_
13 years ago

http://plugins.trac.wordpress.org/changeset/306039

Fall back to regular expressions for malformed XML files, display a note with details about the error if IMPORT_DEBUG is true (may need to change the wording).
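
The fallback behaviour described here can be sketched roughly as follows. This is not the code from the changeset: the function name, the returned array shape and the SKETCH_IMPORT_DEBUG constant (a stand-in for IMPORT_DEBUG) are assumptions, and the sketch only extracts item titles to keep it short. The libxml calls are standard PHP functions.

```php
<?php
// Hypothetical stand-in for the importer's IMPORT_DEBUG constant.
define( 'SKETCH_IMPORT_DEBUG', true );

/**
 * Hypothetical sketch: extract item titles with SimpleXML, falling back
 * to regular expressions when the XML is malformed.
 */
function sketch_parse_item_titles( $xml ) {
    $notes = array();

    if ( function_exists( 'simplexml_load_string' ) ) {
        // Collect libxml errors instead of emitting PHP warnings.
        $previous = libxml_use_internal_errors( true );
        $doc      = simplexml_load_string( $xml );
        $errors   = libxml_get_errors();
        libxml_clear_errors();
        libxml_use_internal_errors( $previous );

        if ( false !== $doc ) {
            $titles = array();
            foreach ( $doc->xpath( '//item/title' ) as $title ) {
                $titles[] = (string) $title;
            }
            return array( 'parser' => 'simplexml', 'titles' => $titles, 'notes' => $notes );
        }

        // Malformed XML: record line, column and message only in debug mode.
        if ( SKETCH_IMPORT_DEBUG ) {
            foreach ( $errors as $error ) {
                $notes[] = sprintf( '%d:%d %s', $error->line, $error->column, trim( $error->message ) );
            }
        }
    }

    // Regex fallback: tolerant of bytes that a strict XML parser rejects.
    $titles = array();
    preg_match_all( '|<item>(.*?)</item>|s', $xml, $items );
    foreach ( $items[1] as $item ) {
        if ( preg_match( '|<title>(.*?)</title>|s', $item, $match ) ) {
            $titles[] = $match[1];
        }
    }

    return array( 'parser' => 'regex', 'titles' => $titles, 'notes' => $notes );
}
```

Keeping the error details in a separate 'notes' list lets the caller decide whether to show them, in line with the suggestion above not to worry users with errors they cannot act on.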

#31 @lloydbudd
13 years ago

Created related Ticket #15272 wp importer trunk regression: no longer can create users

#32 @lloydbudd
13 years ago

Created related Ticket #15274 "wp importer trunk: provides no detailed feedback of success"

#33 @lloydbudd
13 years ago

Created new enhancement: Ticket #15275 "wordpress-importer trunk would benefit from listing: <display name> (<username>)"

#34 follow-up: @cailen
13 years ago

  • Version set to 3.1

Tested the 3.1 alpha trunk and imported a large database (about 1000 posts) largely successfully; my custom hierarchical taxonomy is intact, and all associations seem to be drawn properly. Users were created properly.

One problem I noticed: post thumbnails are not imported (I have defined a post thumbnail featured-image).

#35 in reply to: ↑ 34 @duck_
13 years ago

Replying to cailen:

One problem I noticed: post thumbnails are not imported (I have defined a post thumbnail featured-image).

#14847

#36 @Viper007Bond
13 years ago

Opened a new ticket. Windows hates the importer: #15325

#37 follow-up: @duck_
13 years ago

Remember #14058 when redoing export filtering.

#38 @markel
13 years ago

  • Cc rmarkel@… added

#39 @jshreve
13 years ago

  • Cc justin.shreve@… added

#40 @duck_
13 years ago

First pass at export filtering.

Sorry, it uses tables for the form elements. Was getting the UI done quickly without CSS and focusing more on the backend part.

I introduced a function to build the date dropdowns as I wasn't sure on the best route for this. Thoughts? Should this be inline for now? Moved and made more general and also used in class-wp-list-table.php?

If you filter by post and category then the export file will now only export the selected category (previously all cats, tags and terms were still in the file). However, posts will bring any extra cats, tags and terms with them and the importer will create these as well. Thoughts on this?

Should there be restrictions on the authors exported when filtering posts/pages by author?

Got rid of the ORDER BY parts of the post table DB queries since it doesn't matter for an export. Also changed the pubDate tag just to contain the time of export file generation, because we don't need the precision of the previous approach in this context and it gets rid of two queries.

Other thoughts:

  • Should there be a check for post_type_supports comments to save on a query for every single item in the export?

#41 @duck_
13 years ago

  • Keywords ux-feedback added

Filtering take two.

As above but also:

  • Only include can_export == true for all content exports
  • Removed can_export => false from attachments, added it to revisions
  • No more tables
  • No redundant checkboxes
  • Dropdown of all statuses, not just publish checkbox
  • Same filters for posts and pages (except category filtering)
  • Changed label format: "Categories: [All]"

Todo:

  • JavaScript hiding/showing the extra post/page filters (basically done, just need to decide how to include it)

Feedback on the following please:

  • I introduced a function to build the date dropdowns as I wasn't sure on the best route for this. Thoughts? Should this be inline for now? Moved and made more general and also used in class-wp-list-table.php?
  • Moved from can_export AND public post types to can_export because of nav_menu_items
  • Should there be a check for post_type_supports comments to save on a query for every single item in the export?

Patch and screenshot attached.

#42 @duck_
13 years ago

Also need to decide what to do with the descriptive text next to the all content radio. In my opinion it looks bad in the position shown in my screenshot.

#43 @mikeschinkel
13 years ago

  • Cc mikeschinkel@… added

#44 @dougwrites
13 years ago

  • Cc heymrpro@… added
  • Keywords ux-feedback removed

#45 @dougwrites
13 years ago

  • Keywords ux-feedback added

Sorry about keyword.

#46 @dougwrites
13 years ago

  • Keywords added; ux-feedback removed

#47 @dougwrites
13 years ago

  • Keywords ux-feedback added; removed

#48 @JohnONolan
13 years ago

Screenshot looks pretty good - out of interest, how would I export just posts and pages, but not comments or anything else?

#49 @ryan
13 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

(In [16652]) Export filtering. Props duck_. fixes #15197

#50 @nacin
13 years ago

(In [16733]) Export filtering JS and minor tweaks. props duck_, see #15197.

@duck_
13 years ago

#51 in reply to: ↑ 37 @duck_
13 years ago

Replying to duck_:

Remember #14058 when redoing export filtering.

I cannot believe I didn't... 15941.tt-id.diff

#52 follow-up: @kbiglione
13 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Still having problems importing custom taxonomy in version 3.1 RC3. I have two custom taxonomies that are associated with standard Posts. I've created both custom taxonomies on the WP installation that I'm importing to. The configuration is identical on both the export and import sites.

Here's what happens.

  • Posts are imported
  • Terms for both taxonomies are imported
  • Terms are *not* associated with posts

I've verified that the terms are present for each item record in the export file. Example:

<category domain="book_publisher" nicename="hachette"><![CDATA[Hachette]]></category>

This is becoming a problem as custom taxonomies become more widely used.

#53 in reply to: ↑ 52 @duck_
13 years ago

Replying to kbiglione:

Still having problems importing custom taxonomy in version 3.1 RC3.

Could you please confirm that you are using the development version of the importer which is currently 0.3-beta5 (the plugin screen will show you the version number). If it's not working with 0.3-beta5 could you please open another ticket rather than continuing to use this one.

#54 @kbiglione
13 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Success! The beta importer works as anticipated.

Sorry about the confusion.