WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 3 months ago

#22435 new enhancement

Export API

Reported by: nbachiyski Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Export Keywords: dev-feedback has-patch
Focuses: Cc:

Description (last modified by nbachiyski)

From experience and from tickets (#19864, #19307, #17379) it's evident that we need to update the export API.

High level goals:

  • To be usable from different parts of the code. From the web backend, from a CLI script, from an async job.
  • To allow more control of the output format – serve over HTTP, write a single XML file to disk, split it and write many smaller XML files, write a big zip with many XML files, etc.
  • To allow exporting the data without querying all the posts at once, so that we can fit the exports to memory.
  • Keep export_wp() for backwards compatibility without the need to keep all (even any) of its code.

Here's my idea for the part of the API 99% of the developers touching export would use and be happy:

<?php
// WP_WXR_Export is an aimmutable representing all the data needed for the export and allows us to have it in multiple formats
$export = new WP_WXR_Export( array( 'start_date' => '2011-10-10', 'post_type' => 'event',  ) );

backup( $export->get_xml() ); // string

$export->export_to_xml_file( 'mom.xml' );
send_to_mom_to_import( 'mom.xml');

$export->serve_xml(); // with all the headers and stuff

$export->export_to_xml_files( '/files/exports-for-my-awesome-website/', 'export-%02d.wxr.xml', 5 * MB_IN_BYTES );

Before I dive into implementation details (in the comments, not to pollute the ticket), I'd like to hear what use cases for extending this code you have in mind and where should we draw the line. Adding more output writers? Adding custom export data? Adding formats different from WXR?

Attachments (5)

export-api.diff (36.2 KB) - added by nbachiyski 2 years ago.
export.diff (3.3 KB) - added by dllh 17 months ago.
A modified approach to error handling, plus tax loop handling
export.2.diff (4.7 KB) - added by dllh 16 months ago.
export.3.diff (1.4 KB) - added by rodrigosprimo 3 months ago.
Fixes categories tags when exporting posts from a single category
22435-unittests.patch (16.9 KB) - added by boonebgorges 6 weeks ago.
Unit tests, previously in the repo but removed as per #30284.

Download all attachments as: .zip

Change History (54)

comment:1 @nbachiyski2 years ago

  • Description modified (diff)

comment:2 @DrewAPicture2 years ago

  • Cc xoodrew@… added

comment:3 @scribu2 years ago

The only other format that would make sense is JSON, but the difference to WXR would be superficial.

I think the ability to add custom data to the file is pretty important, particularly when multiple export files are used.

In general, we should be very careful how we split the data across files; each one should be independent from all the rest.

Version 0, edited 2 years ago by scribu (next)

comment:4 @nbachiyski2 years ago

Going deeper, here are some notes on a sample implementations: classes and public methods are listed here with some reasoning behind them.

There are three layers:

  • Export Data
  • XML Generation – takes data and turns it into WXR XML
  • Writer – takes XML and gives it to the user

In the beginning I tried to make them only 2, but the upper layer class was getting huge and also was doing two very different tasks, so I split them up.

class WP_WXR_Export – represents a set of posts and other site data to be exported

Public methods:

  • __construct( $filters ) – creates the object and queries for the data specified by the filters.
  • get_xml() – returns the export as a string of XML
  • export_to_xml_file( $file_name )
  • export_to_xml_files( $destination_directory, $filename_template, $max_file_size = null )
  • serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
  • export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
  • export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, we can pass the writer object here directly we have also some methods to get the raw, but structured data
  • post_ids() – the post ids of the posts, which will be exported, based on the filters
  • posts() – returns an iterator over the posts, not an array
  • charset()
  • site_metadata() – title, description, language, URLs, etc.
  • authors()
  • categories()
  • tags()
  • custom_taxonomies_terms()
  • nav_menu_terms()

class WP_WXR_XML_Generator – responsible for generating WXR XML from export data

Public methods:

  • __construct( $export ) – it needs to know where to get the export data from
  • before_posts() – returns the XML from the start to the posts definitions
  • posts() – returns an iterator, which on each iteration returns the XML for a post
  • after_posts – returns the XML after the posts definitions
  • more fine-grained access to the XML of different parts in case the writer needs it: header, site_metadata, authors, categories, etc.

For most of the writers (even for all I've written) the before/after posts distinction is good enough, but we'll have the more fine-grained methods, so it wouldn't hurt to expose them.

class WP_WXR_*_Writer – responsible for putting the XML from the generator to some useful place – STDOUT, a file, mutliple files, zip file, network, whatever.

class WP_WXR_Base_Writer – abstract base class with default export functionality

  • __construct( $xml_generator ) – it needs to know where to get the XML from
  • abstract protected write( $xml ) – writes a small piece of XML somewhere
  • export() – passes all the data from the XML generator to the write method

After writing a couple of writers, I found I wanted to be able to change these two main aspects: where am I writing (to a file, to STDOUT, etc.) and what's the logic of what I'm writing (just at once, stop when the file is too big and start a new one, etc.)

Feedback on the classes, their relationships, naming, etc. would be very much appreciated.

Last edited 2 years ago by nbachiyski (previous) (diff)

comment:5 @scribu2 years ago

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

comment:6 @danielbachhuber2 years ago

  • Cc danielbachhuber added

comment:7 follow-up: @nbachiyski2 years ago

Replying to scribu:

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

Because they have subtle differences. For example there is a get_tags filter, which isn't run if we just run get_terms( 'post_tag' ). Also for the custom taxonomies, we need to list all the taxonomies.

I don't have a strong opinion on this, though. I just didn't spend any time trying to combine them into one method.

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

Sure, it makes sense.

comment:8 follow-up: @nbachiyski2 years ago

Last, but not least, some implementation :-)

Notes:

  • It's not 100% complete, some of the post XML generation and data retrieval is missing. It's tedious and I left it for later :-)
  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
  • I had to use exceptions in a couple of places. I know they are not the WordPress thing, but otherwise they are practically unavoidable. For example, if the writer issues a error and I need to catch it in the export class, I would need to have probably a hundred if ( is_wp_error() checks on every single call for virtually anything.
  • I haven't looked into having minimal export when splitting, so I am just repeating everything before the posts in each export file.
  • It'd be nice to have a _gmt time filter.
  • In addittion to the patches, you can follow the development here: https://github.com/nb/WordPress/tree/export-api

comment:9 @nbachiyski2 years ago

  • Keywords has-patch added

comment:10 @batmoo2 years ago

  • Cc batmoo@… added

comment:11 @jkudish2 years ago

  • Cc jkudish@… added

comment:12 in reply to: ↑ 8 @rmccue2 years ago

Replying to nbachiyski:

  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.

This should definitely be using a proper XML serializer. At the moment, the exporter (along with the various feed endpoints) can produce invalid XML since it uses concatenation. If we're going to be redoing this, it should be done properly.

comment:13 @toscho2 years ago

  • Cc info@… added

Please do not omit users without posts. This is a very annoying bug in our current exporter. Especially when WordPress is used more as a CMS you need users without posts quite often.

comment:14 in reply to: ↑ 7 @scribu2 years ago

Replying to nbachiyski:

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

Because you need to generate the file before serving it and to serve a XML file you don't need any of the information you passed to WP_WXR_Export; you could just pass it directly to nginx. That's why it's different.

comment:15 @scribu2 years ago

I knew there was something else I didn't like about the method names: they contain the format name, so they're not generic.

export_to_xml_file() should be export_to_file(), serve_xml() should be serve() etc.

Otherwise, you'd end up with weird things like this:

$exporter = new WP_JSON_Writer();
$exporter->export_to_xml_file();

comment:16 @rmccue2 years ago

Regarding serialization, here's justification for why. WordPress' current generator is horrible with that, and will generate invalid XML fairly easily.

comment:17 @scribu2 years ago

Nikolay, rmccue and I had a nice chat in IRC about this thing.

Instead of having helper methods like export_to_xml_file(), we could have an even simpler interface:

function wp_export( $filters = array(), $additional_args = array() ) {
  $filters_defaults = array(
    'post_type' => 'post',
    'posts_per_page' => -1
    ...
  );

  $additional_args_defaults = array(
    'format' => 'xml',
    'writer' => 'WP_WXR_File_Writer',
    ...
  );

  // instantiate things, etc.
}

and of course devs can skip wp_export() and instantiate the classes themselves.

comment:18 @sirzooro2 years ago

  • Cc sirzooro added

+1 on this. Please make it flexible to allow exporting to any custom-defined file format - I think specifically about using it to generate XML Sitemap.

comment:19 @ocean902 years ago

  • Cc ocean90 added

comment:20 @beaulebens2 years ago

  • Cc beau@… added

comment:21 @Viper007Bond2 years ago

Awesome.

Date/time: You might want to use #18694 to generate the WHERE values for added flexibility.

XML: It'll likely take more memory but it might be worth using DOMDocument to generate the HTML instead of doing it by hand. For example: http://www.viper007bond.com/2011/06/29/easily-create-xml-in-php-using-a-data-array/

comment:22 @jghazally2 years ago

  • Cc jghazally@… added

comment:23 @scribu2 years ago

Regarding [UT1195]: array( self, '_get_term_ids_cb' ) is not a valid way to specify a callback.

You probably meant array( __CLASS__, '_get_term_ids_cb' ).

comment:24 @nbachiyski2 years ago

You're right, fixed in [UT1198].

@nbachiyski2 years ago

comment:25 @nbachiyski2 years ago

Here is an updated patch using an XML builder and tested to make sure t gets the same XML as the old importer.

For those interested in the XML builder, here it is: https://github.com/nb/oxymel

And as always, it's much easier to follow at https://github.com/nb/WordPress/commits/export-api

comment:26 @rmccue2 years ago

Looking good! One thing on the XML serialization aspect: I see tags are getting passed in as 'prefix:tag'. To be technical, they should take 'namespace:tag' with the namespace being the IRI identifying the namespace, and the serializer working out internally what the prefix part is. It's not as great to use, but it's the correct way to work with XML, and avoids the possibility of having conflicting namespaces.

comment:27 @nbachiyski2 years ago

mccue, I agree, using true XML namespaces makes a lot of sense, but it fell out of the initial scope. Patches welcome! :-)

Last edited 2 years ago by nbachiyski (previous) (diff)

comment:28 @idealien21 months ago

That I can tell, one can't currently filter or augment the export query based on meta_key/values with the existing filter / action.

Can:

  • Add to the form via export_filters actions
  • Ensure args are validated via export_args filter

Cannot

  • Build the query which limits results to those which have a matching meta field.

The action export_wp exists at the top of the function before join / where conditions are created. It would be good to have similar functionality as categories and taxonomies as a part of the options for export if it is being enhanced.

comment:29 @idealien21 months ago

  • Cc jamie@… added

comment:30 @scribu21 months ago

I don't think wpdb is the best place for the build_IN_condition() helper:

  • it doesn't use any variables or methods from wpdb
    • it uses $wpdb->prepare()
  • custom wpdb implementations would all need to add it
Last edited 21 months ago by scribu (previous) (diff)

comment:31 @scribu21 months ago

I started using the export API in WP-CLI: https://github.com/wp-cli/wp-cli/pull/525

The only complaint I have (so far) is that WP_Export_Split_Files_Writer is a black box.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

It would be nice to support "skin" classes, similar to WP_Upgrader_Skin.

Last edited 21 months ago by scribu (previous) (diff)

comment:32 @nbachiyski21 months ago

I don't think wpdb is the best place for the build_IN_condition() helper.

I can’t imagine a custom wpdb implementation, which wouldn’t want to inherit from wpdb. Can you think of any? Currently an implementation like this would need to define a lot of helper methods like replace, insert, update, delete, get_var. All of those probably won’t change in a custom wpdb implementation.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

I agree this isn’t cool.

Since logging is a cross-cutting concern, I would implement it via actions, not via a “skin” class.
Pros:

  • We won’t introduce another class dependency.
  • In the default implementation we won’t need an empty skin.
  • Individual actions could be helpful separately.
  • Better forward compatibility – if we add a step, older skins will be incomplete, but still working.

Cons:

  • In cases where one needs to log every step, you will need to hook to a lot of actions, instead of just writing specific methods in a class.

comment:33 @dllh17 months ago

  • Cc daryl@… added

In working out how to handle taxonomy loops that can cause a WSOD, I encountered some problems with the approach given here. For example, the try/catch stuff in wp_export() returns things, but we never do anything with it, which means that the error handling is basically moot. Also, WP_Export_XML_Over_HTTP::write() just spews headers and whatever output there is (even if it's empty because of a silent error), which makes error handling problematic. The forthcoming patch includes some tax loop handling but also rejiggers a couple of things to provide better error handling (I hope). I'm not sure it's 100% of the way there, but I thought I'd submit for consideration. It's a diff of the diff, to isolate the changes I'm proposing.

Last edited 17 months ago by nbachiyski (previous) (diff)

@dllh17 months ago

A modified approach to error handling, plus tax loop handling

comment:34 follow-ups: @nbachiyski17 months ago

Oh, we should have definitely done something with the error result of wp_export() and also, we should’ve buffered the output, I totally agree.

Few notes on the patch:

  • Instead of adding wp_die()s to wp_export(), we should process the return value of wp_export() in wp-admin/export.php and add the possible wp_die() there. One of the ideas of wp_export() is that it can be used in any context: web, cli, other libraries. If we add a wp_die() there, we will deprive the callers of control what to do in case of an error.
  • It’s strange that in WP_Export_XML_Over_HTTP we are both explicitly echoing to standard output and then using output buffering to collect the result. This commit fixes that and makes sure we don't output anything before the export process is over. (a diff file, if needed).
  • I am on the fence if we should be including the tax loop handling in the export, at all. My intuition tells me that we are fixing the problem at the wrong level. I have a few questions:
    • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
    • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
    • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

comment:35 in reply to: ↑ 34 @dllh17 months ago

Replying to nbachiyski:

  • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.

I'm not aware of any other places where it causes trouble; I've only ever seen it affect exports. But I can't say that it's not causing issues elsewhere.

  • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.

Seems like it would affect any WP install, but then we do have some weird tax stuff going on on wpcom. I've never managed to reproduce the natural creation of a tax loop (that is, without hacking one in), so my impression is that these are mostly legacy issues. Maybe it's possible to import incorrectly parented terms, but I haven't repro'd such a case.

  • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

This probably isn't practical for wpcom. If we're reluctant to add tax loop stuff in core, I'd be cool with just adding an action we could fire instead and hook onto that on wpcom.

comment:36 in reply to: ↑ 34 ; follow-up: @westi16 months ago

Replying to nbachiyski:

I have a few questions:

  • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
  • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
  • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

Orphaned terms are possible to create in core WordPress too, we have some code which tries to help prevent orphans and loops but it isn't perfect.

From what I recall there isn't a single root cause.

There are other tickets open around improving the core code to prevent more loops/orphans but I'm not sure we will ever be able to 100% prevent them and we should code defensively for there existence IMHO.

comment:37 in reply to: ↑ 36 @SergeyBiryukov16 months ago

Replying to westi:

There are other tickets open around improving the core code to prevent more loops/orphans

Related: #20903, #20635, #24461.

@dllh16 months ago

comment:38 @dllh16 months ago

I've added export.2.diff, which incorporates nbachiyski's feedback re maintaining caller control. Basically:

  1. The HTTP writer now handles the die() statements when errors are caught.
  2. The returner writer returns WP_Error objects when an exception is caught.
  3. The base writer returns WP_Error objects when an exception is caught.
  4. The file writer re-throws any caught exceptions so that they can be handled further up the stack.


The patch also incorporates the echo/output buffering issue that nbachiyski highlighted and linked to a github commit for.

Based on westi's feedback, I'm leaving the tax loop stuff in for now.

I've tested these changes both with the web exporter and with a script that calls wp_export() at the command line, and so far so good. I'd gratefully receive further suggestions for improvements.

comment:39 @rodrigosprimo15 months ago

  • Cc rodrigosprimo@… added

comment:40 @ircbot10 months ago

This ticket was mentioned in IRC in #wordpress-dev by netweb. View the logs.

comment:41 @danielbachhuber10 months ago

  • Milestone changed from Awaiting Review to Future Release

Working to get Daryl's patch into WP-CLI in https://github.com/wp-cli/wp-cli/pull/1153

I too would love to see movement on this, but it probably fits within a more holistic conversation about importing and exporting.

comment:42 @ircbot10 months ago

This ticket was mentioned in IRC in #wordpress-dev by danielbachhuber. View the logs.

comment:43 @ircbot5 months ago

This ticket was mentioned in IRC in #wordpress-dev by netweb. View the logs.

comment:44 follow-up: @danielbachhuber5 months ago

Nacin and I chatted about this a while back in a private Skype channel. Apologies for not posting a summary earlier.

WP-CLI has been using Nikolay's exporter for 10 months and 3 major releases (ref). From a functional perspective, it does the trick — no blocking bugs discovered. It solved performance issues when exporting large sites, and offers a degree of run-time flexibility. However, from a "WordPress architecture" perspective, which will always be hotly debated, it remains an open question as to whether the code fits.

Specifically, this approach includes a new, one-off library for writing XML, dubbed Oxymel. It uses this new library by chaining, a foreign-to-WordPress code pattern. These two considerations, plus the other feedback in this thread, mean committing the patch would not be an inconsequential decision.

In our chat, Nacin said "needs to be significantly simplified while still maintaining some good level of flexibility and extendability". Given the legacy of the exporter (which produces an invented, inconsistent XML format), it remains to be seen whether this is possible. Consider it a challenge to anyone interested.

Personally, I'm inclined to revisit the high-level goals of export & import. In part because any enhancement to one will need to be an enhancement to another, but also because a given enhancement will need to be *really good* to overcome the problems inherent in everyone's minor differences of opinion. To me, the problem space really is "how can sync my two WordPress sites?", either as a one-off, one-way sync, a two-way, but manual sync, an automatic sync, etc. etc. If we were to embark on this space mission, I'm sure we'd make sure to solve this particular issue's high-level goals along the way.

comment:45 @slackbot3 months ago

This ticket was mentioned in Slack in #core by boone. View the logs.

comment:46 @slackbot3 months ago

This ticket was mentioned in Slack in #core by netweb. View the logs.

comment:47 @rodrigosprimo3 months ago

When using WP-CLI's export command with --category parameter I was getting the PHP notice below:

PHP Notice: Trying to get property of non-object in php/export/class-wp-export-wxr-formatter.php on line 117

This resulted in several empty <wp:category> tags in the generated file. The attached patch fixes this problem by making sure that WP_Export_Query::categories() always return an array as expected by WP_Export_WXR_Formatter::categories().

Note that line 117 of class-wp-export-wxr-formatter.php in WP-CLI is equivalent to line 135 of the same file in this ticket.

See the link below for the related WP-CLI pull request:

https://github.com/wp-cli/wp-cli/pull/1568

@rodrigosprimo3 months ago

Fixes categories tags when exporting posts from a single category

comment:48 in reply to: ↑ 44 @rmccue3 months ago

Replying to danielbachhuber:

Specifically, this approach includes a new, one-off library for writing XML, dubbed Oxymel. It uses this new library by chaining, a foreign-to-WordPress code pattern. These two considerations, plus the other feedback in this thread, mean committing the patch would not be an inconsequential decision.

Just poking my head in here to say that another huge benefit of Oyxmel is that AFAIK it uses a proper XML serializer rather than string concatenation. XML is one of the things that's super easy to output wrong, and the spec on parsing means that you pretty much have to be super strict when parsing. Right now, it's easy to output invalid XML unintentionally, and have parsers fail.

(And I don't mean this in a theoretical way either, you can easily export WXR from a site that fails when you try to import it back into the same version of WP, thanks to PHP's XML parser following the correct rules.)

comment:49 @dllh3 months ago

I noticed just recently that we're exporting the author info with incorrect xml. The importer step that creates author mappings expects an XML node named <wp:author> (citation) but we're creating <wp:wp_author>. This is easily corrected by updating the authors() function in wp-includes/export/class-wp-export-wxr-formatter.php to output $oxymel->tag( 'wp:author' )->contains rather than $oxymel->tag( 'wp:wp_author' )->contains.

@boonebgorges6 weeks ago

Unit tests, previously in the repo but removed as per #30284.

Note: See TracTickets for help on using tickets.