WordPress.org

Make WordPress Core

Opened 18 months ago

Last modified 4 hours ago

#22435 new enhancement

Export API

Reported by: nbachiyski Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: Export Keywords: dev-feedback has-patch
Focuses: Cc:

Description (last modified by nbachiyski)

From experience and from tickets (#19864, #19307, #17379) it's evident that we need to update the export API.

High level goals:

  • To be usable from different parts of the code. From the web backend, from a CLI script, from an async job.
  • To allow more control of the output format – serve over HTTP, write a single XML file to disk, split it and write many smaller XML files, write a big zip with many XML files, etc.
  • To allow exporting the data without querying all the posts at once, so that we can fit the exports to memory.
  • Keep export_wp() for backwards compatibility without the need to keep all (even any) of its code.

Here's my idea for the part of the API 99% of the developers touching export would use and be happy:

<?php
// WP_WXR_Export is an aimmutable representing all the data needed for the export and allows us to have it in multiple formats
$export = new WP_WXR_Export( array( 'start_date' => '2011-10-10', 'post_type' => 'event',  ) );

backup( $export->get_xml() ); // string

$export->export_to_xml_file( 'mom.xml' );
send_to_mom_to_import( 'mom.xml');

$export->serve_xml(); // with all the headers and stuff

$export->export_to_xml_files( '/files/exports-for-my-awesome-website/', 'export-%02d.wxr.xml', 5 * MB_IN_BYTES );

Before I dive into implementation details (in the comments, not to pollute the ticket), I'd like to hear what use cases for extending this code you have in mind and where should we draw the line. Adding more output writers? Adding custom export data? Adding formats different from WXR?

Attachments (3)

export-api.diff (36.2 KB) - added by nbachiyski 14 months ago.
export.diff (3.3 KB) - added by dllh 6 months ago.
A modified approach to error handling, plus tax loop handling
export.2.diff (4.7 KB) - added by dllh 6 months ago.

Download all attachments as: .zip

Change History (43)

comment:1 nbachiyski18 months ago

  • Description modified (diff)

comment:2 DrewAPicture18 months ago

  • Cc xoodrew@… added

comment:3 scribu18 months ago

The only other format that would make sense is JSON, but the difference to WXR would be superficial.

I think the ability to add custom data to the file is pretty important, and it's tricky to get right when multiple export files are used.

In general, we should be very careful how we split the data across files; each one should be independent from all the rest.

Last edited 18 months ago by scribu (previous) (diff)

comment:4 nbachiyski18 months ago

Going deeper, here are some notes on a sample implementations: classes and public methods are listed here with some reasoning behind them.

There are three layers:

  • Export Data
  • XML Generation – takes data and turns it into WXR XML
  • Writer – takes XML and gives it to the user

In the beginning I tried to make them only 2, but the upper layer class was getting huge and also was doing two very different tasks, so I split them up.

class WP_WXR_Export – represents a set of posts and other site data to be exported

Public methods:

  • __construct( $filters ) – creates the object and queries for the data specified by the filters.
  • get_xml() – returns the export as a string of XML
  • export_to_xml_file( $file_name )
  • export_to_xml_files( $destination_directory, $filename_template, $max_file_size = null )
  • serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
  • export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
  • export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, we can pass the writer object here directly we have also some methods to get the raw, but structured data
  • post_ids() – the post ids of the posts, which will be exported, based on the filters
  • posts() – returns an iterator over the posts, not an array
  • charset()
  • site_metadata() – title, description, language, URLs, etc.
  • authors()
  • categories()
  • tags()
  • custom_taxonomies_terms()
  • nav_menu_terms()

class WP_WXR_XML_Generator – responsible for generating WXR XML from export data

Public methods:

  • __construct( $export ) – it needs to know where to get the export data from
  • before_posts() – returns the XML from the start to the posts definitions
  • posts() – returns an iterator, which on each iteration returns the XML for a post
  • after_posts – returns the XML after the posts definitions
  • more fine-grained access to the XML of different parts in case the writer needs it: header, site_metadata, authors, categories, etc.

For most of the writers (even for all I've written) the before/after posts distinction is good enough, but we'll have the more fine-grained methods, so it wouldn't hurt to expose them.

class WP_WXR_*_Writer – responsible for putting the XML from the generator to some useful place – STDOUT, a file, mutliple files, zip file, network, whatever.

class WP_WXR_Base_Writer – abstract base class with default export functionality

  • __construct( $xml_generator ) – it needs to know where to get the XML from
  • abstract protected write( $xml ) – writes a small piece of XML somewhere
  • export() – passes all the data from the XML generator to the write method

After writing a couple of writers, I found I wanted to be able to change these two main aspects: where am I writing (to a file, to STDOUT, etc.) and what's the logic of what I'm writing (just at once, stop when the file is too big and start a new one, etc.)

Feedback on the classes, their relationships, naming, etc. would be very much appreciated.

Last edited 18 months ago by nbachiyski (previous) (diff)

comment:5 scribu18 months ago

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

comment:6 danielbachhuber18 months ago

  • Cc danielbachhuber added

comment:7 follow-up: nbachiyski18 months ago

Replying to scribu:

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

Because they have subtle differences. For example there is a get_tags filter, which isn't run if we just run get_terms( 'post_tag' ). Also for the custom taxonomies, we need to list all the taxonomies.

I don't have a strong opinion on this, though. I just didn't spend any time trying to combine them into one method.

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

Sure, it makes sense.

comment:8 follow-up: nbachiyski18 months ago

Last, but not least, some implementation :-)

Notes:

  • It's not 100% complete, some of the post XML generation and data retrieval is missing. It's tedious and I left it for later :-)
  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
  • I had to use exceptions in a couple of places. I know they are not the WordPress thing, but otherwise they are practically unavoidable. For example, if the writer issues a error and I need to catch it in the export class, I would need to have probably a hundred if ( is_wp_error() checks on every single call for virtually anything.
  • I haven't looked into having minimal export when splitting, so I am just repeating everything before the posts in each export file.
  • It'd be nice to have a _gmt time filter.
  • In addittion to the patches, you can follow the development here: https://github.com/nb/WordPress/tree/export-api

comment:9 nbachiyski18 months ago

  • Keywords has-patch added

comment:10 batmoo18 months ago

  • Cc batmoo@… added

comment:11 jkudish18 months ago

  • Cc jkudish@… added

comment:12 in reply to: ↑ 8 rmccue18 months ago

Replying to nbachiyski:

  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.

This should definitely be using a proper XML serializer. At the moment, the exporter (along with the various feed endpoints) can produce invalid XML since it uses concatenation. If we're going to be redoing this, it should be done properly.

comment:13 toscho18 months ago

  • Cc info@… added

Please do not omit users without posts. This is a very annoying bug in our current exporter. Especially when WordPress is used more as a CMS you need users without posts quite often.

comment:14 in reply to: ↑ 7 scribu18 months ago

Replying to nbachiyski:

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

Because you need to generate the file before serving it and to serve a XML file you don't need any of the information you passed to WP_WXR_Export; you could just pass it directly to nginx. That's why it's different.

comment:15 scribu18 months ago

I knew there was something else I didn't like about the method names: they contain the format name, so they're not generic.

export_to_xml_file() should be export_to_file(), serve_xml() should be serve() etc.

Otherwise, you'd end up with weird things like this:

$exporter = new WP_JSON_Writer();
$exporter->export_to_xml_file();

comment:16 rmccue18 months ago

Regarding serialization, here's justification for why. WordPress' current generator is horrible with that, and will generate invalid XML fairly easily.

comment:17 scribu18 months ago

Nikolay, rmccue and I had a nice chat in IRC about this thing.

Instead of having helper methods like export_to_xml_file(), we could have an even simpler interface:

function wp_export( $filters = array(), $additional_args = array() ) {
  $filters_defaults = array(
    'post_type' => 'post',
    'posts_per_page' => -1
    ...
  );

  $additional_args_defaults = array(
    'format' => 'xml',
    'writer' => 'WP_WXR_File_Writer',
    ...
  );

  // instantiate things, etc.
}

and of course devs can skip wp_export() and instantiate the classes themselves.

comment:18 sirzooro18 months ago

  • Cc sirzooro added

+1 on this. Please make it flexible to allow exporting to any custom-defined file format - I think specifically about using it to generate XML Sitemap.

comment:19 ocean9018 months ago

  • Cc ocean90 added

comment:20 beaulebens18 months ago

  • Cc beau@… added

comment:21 Viper007Bond18 months ago

Awesome.

Date/time: You might want to use #18694 to generate the WHERE values for added flexibility.

XML: It'll likely take more memory but it might be worth using DOMDocument to generate the HTML instead of doing it by hand. For example: http://www.viper007bond.com/2011/06/29/easily-create-xml-in-php-using-a-data-array/

comment:22 jghazally15 months ago

  • Cc jghazally@… added

comment:23 scribu15 months ago

Regarding [UT1195]: array( self, '_get_term_ids_cb' ) is not a valid way to specify a callback.

You probably meant array( __CLASS__, '_get_term_ids_cb' ).

comment:24 nbachiyski15 months ago

You're right, fixed in [UT1198].

nbachiyski14 months ago

comment:25 nbachiyski14 months ago

Here is an updated patch using an XML builder and tested to make sure t gets the same XML as the old importer.

For those interested in the XML builder, here it is: https://github.com/nb/oxymel

And as always, it's much easier to follow at https://github.com/nb/WordPress/commits/export-api

comment:26 rmccue14 months ago

Looking good! One thing on the XML serialization aspect: I see tags are getting passed in as 'prefix:tag'. To be technical, they should take 'namespace:tag' with the namespace being the IRI identifying the namespace, and the serializer working out internally what the prefix part is. It's not as great to use, but it's the correct way to work with XML, and avoids the possibility of having conflicting namespaces.

comment:27 nbachiyski14 months ago

mccue, I agree, using true XML namespaces makes a lot of sense, but it fell out of the initial scope. Patches welcome! :-)

Last edited 14 months ago by nbachiyski (previous) (diff)

comment:28 idealien11 months ago

That I can tell, one can't currently filter or augment the export query based on meta_key/values with the existing filter / action.

Can:

  • Add to the form via export_filters actions
  • Ensure args are validated via export_args filter

Cannot

  • Build the query which limits results to those which have a matching meta field.

The action export_wp exists at the top of the function before join / where conditions are created. It would be good to have similar functionality as categories and taxonomies as a part of the options for export if it is being enhanced.

comment:29 idealien11 months ago

  • Cc jamie@… added

comment:30 scribu10 months ago

I don't think wpdb is the best place for the build_IN_condition() helper:

  • it doesn't use any variables or methods from wpdb
    • it uses $wpdb->prepare()
  • custom wpdb implementations would all need to add it
Last edited 10 months ago by scribu (previous) (diff)

comment:31 scribu10 months ago

I started using the export API in WP-CLI: https://github.com/wp-cli/wp-cli/pull/525

The only complaint I have (so far) is that WP_Export_Split_Files_Writer is a black box.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

It would be nice to support "skin" classes, similar to WP_Upgrader_Skin.

Last edited 10 months ago by scribu (previous) (diff)

comment:32 nbachiyski10 months ago

I don't think wpdb is the best place for the build_IN_condition() helper.

I can’t imagine a custom wpdb implementation, which wouldn’t want to inherit from wpdb. Can you think of any? Currently an implementation like this would need to define a lot of helper methods like replace, insert, update, delete, get_var. All of those probably won’t change in a custom wpdb implementation.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

I agree this isn’t cool.

Since logging is a cross-cutting concern, I would implement it via actions, not via a “skin” class.
Pros:

  • We won’t introduce another class dependency.
  • In the default implementation we won’t need an empty skin.
  • Individual actions could be helpful separately.
  • Better forward compatibility – if we add a step, older skins will be incomplete, but still working.

Cons:

  • In cases where one needs to log every step, you will need to hook to a lot of actions, instead of just writing specific methods in a class.

comment:33 dllh6 months ago

  • Cc daryl@… added

In working out how to handle taxonomy loops that can cause a WSOD, I encountered some problems with the approach given here. For example, the try/catch stuff in wp_export() returns things, but we never do anything with it, which means that the error handling is basically moot. Also, WP_Export_XML_Over_HTTP::write() just spews headers and whatever output there is (even if it's empty because of a silent error), which makes error handling problematic. The forthcoming patch includes some tax loop handling but also rejiggers a couple of things to provide better error handling (I hope). I'm not sure it's 100% of the way there, but I thought I'd submit for consideration. It's a diff of the diff, to isolate the changes I'm proposing.

Last edited 6 months ago by nbachiyski (previous) (diff)

dllh6 months ago

A modified approach to error handling, plus tax loop handling

comment:34 follow-ups: nbachiyski6 months ago

Oh, we should have definitely done something with the error result of wp_export() and also, we should’ve buffered the output, I totally agree.

Few notes on the patch:

  • Instead of adding wp_die()s to wp_export(), we should process the return value of wp_export() in wp-admin/export.php and add the possible wp_die() there. One of the ideas of wp_export() is that it can be used in any context: web, cli, other libraries. If we add a wp_die() there, we will deprive the callers of control what to do in case of an error.
  • It’s strange that in WP_Export_XML_Over_HTTP we are both explicitly echoing to standard output and then using output buffering to collect the result. This commit fixes that and makes sure we don't output anything before the export process is over. (a diff file, if needed).
  • I am on the fence if we should be including the tax loop handling in the export, at all. My intuition tells me that we are fixing the problem at the wrong level. I have a few questions:
    • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
    • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
    • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

comment:35 in reply to: ↑ 34 dllh6 months ago

Replying to nbachiyski:

  • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.

I'm not aware of any other places where it causes trouble; I've only ever seen it affect exports. But I can't say that it's not causing issues elsewhere.

  • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.

Seems like it would affect any WP install, but then we do have some weird tax stuff going on on wpcom. I've never managed to reproduce the natural creation of a tax loop (that is, without hacking one in), so my impression is that these are mostly legacy issues. Maybe it's possible to import incorrectly parented terms, but I haven't repro'd such a case.

  • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

This probably isn't practical for wpcom. If we're reluctant to add tax loop stuff in core, I'd be cool with just adding an action we could fire instead and hook onto that on wpcom.

comment:36 in reply to: ↑ 34 ; follow-up: westi6 months ago

Replying to nbachiyski:

I have a few questions:

  • Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
  • Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
  • What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?

Orphaned terms are possible to create in core WordPress too, we have some code which tries to help prevent orphans and loops but it isn't perfect.

From what I recall there isn't a single root cause.

There are other tickets open around improving the core code to prevent more loops/orphans but I'm not sure we will ever be able to 100% prevent them and we should code defensively for there existence IMHO.

comment:37 in reply to: ↑ 36 SergeyBiryukov6 months ago

Replying to westi:

There are other tickets open around improving the core code to prevent more loops/orphans

Related: #20903, #20635, #24461.

dllh6 months ago

comment:38 dllh6 months ago

I've added export.2.diff, which incorporates nbachiyski's feedback re maintaining caller control. Basically:

  1. The HTTP writer now handles the die() statements when errors are caught.
  2. The returner writer returns WP_Error objects when an exception is caught.
  3. The base writer returns WP_Error objects when an exception is caught.
  4. The file writer re-throws any caught exceptions so that they can be handled further up the stack.


The patch also incorporates the echo/output buffering issue that nbachiyski highlighted and linked to a github commit for.

Based on westi's feedback, I'm leaving the tax loop stuff in for now.

I've tested these changes both with the web exporter and with a script that calls wp_export() at the command line, and so far so good. I'd gratefully receive further suggestions for improvements.

comment:39 rodrigosprimo5 months ago

  • Cc rodrigosprimo@… added

comment:40 ircbot4 hours ago

This ticket was mentioned in IRC in #wordpress-dev by netweb. View the logs.

Note: See TracTickets for help on using tickets.