WordPress.org

Make WordPress Core

Opened 7 months ago

Last modified 29 hours ago

#22435 new enhancement

Export API

Reported by: nbachiyski Owned by:
Priority: normal Milestone: Awaiting Review
Component: Export Version:
Severity: normal Keywords: dev-feedback has-patch
Cc: ocean90, xoodrew@…, danielbachhuber, batmoo@…, jkudish@…, info@…, sirzooro, beau@…, jghazally@…, jamie@…

Description (last modified by nbachiyski)

From experience and from tickets (#19864, #19307, #17379) it's evident that we need to update the export API.

High level goals:

  • To be usable from different parts of the code. From the web backend, from a CLI script, from an async job.
  • To allow more control of the output format – serve over HTTP, write a single XML file to disk, split it and write many smaller XML files, write a big zip with many XML files, etc.
  • To allow exporting the data without querying all the posts at once, so that we can fit the exports to memory.
  • Keep export_wp() for backwards compatibility without the need to keep all (even any) of its code.

Here's my idea for the part of the API 99% of the developers touching export would use and be happy:

<?php
// WP_WXR_Export is an aimmutable representing all the data needed for the export and allows us to have it in multiple formats
$export = new WP_WXR_Export( array( 'start_date' => '2011-10-10', 'post_type' => 'event',  ) );

backup( $export->get_xml() ); // string

$export->export_to_xml_file( 'mom.xml' );
send_to_mom_to_import( 'mom.xml');

$export->serve_xml(); // with all the headers and stuff

$export->export_to_xml_files( '/files/exports-for-my-awesome-website/', 'export-%02d.wxr.xml', 5 * MB_IN_BYTES );

Before I dive into implementation details (in the comments, not to pollute the ticket), I'd like to hear what use cases for extending this code you have in mind and where should we draw the line. Adding more output writers? Adding custom export data? Adding formats different from WXR?

Attachments (1)

export-api.diff (36.2 KB) - added by nbachiyski 4 months ago.

Download all attachments as: .zip

Change History (33)

comment:1 nbachiyski7 months ago

  • Description modified (diff)

comment:2 DrewAPicture7 months ago

  • Cc xoodrew@… added

comment:3 scribu7 months ago

The only other format that would make sense is JSON, but the difference to WXR would be superficial.

I think the ability to add custom data to the file is pretty important, and it's tricky to get right when multiple export files are used.

In general, we should be very careful how we split the data across files; each one should be independent from all the rest.

Last edited 7 months ago by scribu (previous) (diff)

comment:4 nbachiyski7 months ago

Going deeper, here are some notes on a sample implementations: classes and public methods are listed here with some reasoning behind them.

There are three layers:

  • Export Data
  • XML Generation – takes data and turns it into WXR XML
  • Writer – takes XML and gives it to the user

In the beginning I tried to make them only 2, but the upper layer class was getting huge and also was doing two very different tasks, so I split them up.

class WP_WXR_Export – represents a set of posts and other site data to be exported

Public methods:

  • __construct( $filters ) – creates the object and queries for the data specified by the filters.
  • get_xml() – returns the export as a string of XML
  • export_to_xml_file( $file_name )
  • export_to_xml_files( $destination_directory, $filename_template, $max_file_size = null )
  • serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
  • export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
  • export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, we can pass the writer object here directly we have also some methods to get the raw, but structured data
  • post_ids() – the post ids of the posts, which will be exported, based on the filters
  • posts() – returns an iterator over the posts, not an array
  • charset()
  • site_metadata() – title, description, language, URLs, etc.
  • authors()
  • categories()
  • tags()
  • custom_taxonomies_terms()
  • nav_menu_terms()

class WP_WXR_XML_Generator – responsible for generating WXR XML from export data

Public methods:

  • __construct( $export ) – it needs to know where to get the export data from
  • before_posts() – returns the XML from the start to the posts definitions
  • posts() – returns an iterator, which on each iteration returns the XML for a post
  • after_posts – returns the XML after the posts definitions
  • more fine-grained access to the XML of different parts in case the writer needs it: header, site_metadata, authors, categories, etc.

For most of the writers (even for all I've written) the before/after posts distinction is good enough, but we'll have the more fine-grained methods, so it wouldn't hurt to expose them.

class WP_WXR_*_Writer – responsible for putting the XML from the generator to some useful place – STDOUT, a file, mutliple files, zip file, network, whatever.

class WP_WXR_Base_Writer – abstract base class with default export functionality

  • __construct( $xml_generator ) – it needs to know where to get the XML from
  • abstract protected write( $xml ) – writes a small piece of XML somewhere
  • export() – passes all the data from the XML generator to the write method

After writing a couple of writers, I found I wanted to be able to change these two main aspects: where am I writing (to a file, to STDOUT, etc.) and what's the logic of what I'm writing (just at once, stop when the file is too big and start a new one, etc.)

Feedback on the classes, their relationships, naming, etc. would be very much appreciated.

Last edited 7 months ago by nbachiyski (previous) (diff)

comment:5 scribu7 months ago

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

comment:6 danielbachhuber7 months ago

  • Cc danielbachhuber added

comment:7 follow-up: nbachiyski7 months ago

Replying to scribu:

categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()

What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?

Because they have subtle differences. For example there is a get_tags filter, which isn't run if we just run get_terms( 'post_tag' ). Also for the custom taxonomies, we need to list all the taxonomies.

I don't have a strong opinion on this, though. I just didn't spend any time trying to combine them into one method.

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, 

I think export_using_writer() should be enough.

Sure, it makes sense.

comment:8 follow-up: nbachiyski7 months ago

Last, but not least, some implementation :-)

Notes:

  • It's not 100% complete, some of the post XML generation and data retrieval is missing. It's tedious and I left it for later :-)
  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
  • I had to use exceptions in a couple of places. I know they are not the WordPress thing, but otherwise they are practically unavoidable. For example, if the writer issues a error and I need to catch it in the export class, I would need to have probably a hundred if ( is_wp_error() checks on every single call for virtually anything.
  • I haven't looked into having minimal export when splitting, so I am just repeating everything before the posts in each export file.
  • It'd be nice to have a _gmt time filter.
  • In addittion to the patches, you can follow the development here: https://github.com/nb/WordPress/tree/export-api

comment:9 nbachiyski7 months ago

  • Keywords has-patch added

comment:10 batmoo7 months ago

  • Cc batmoo@… added

comment:11 jkudish7 months ago

  • Cc jkudish@… added

comment:12 in reply to: ↑ 8 rmccue7 months ago

Replying to nbachiyski:

  • I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.

This should definitely be using a proper XML serializer. At the moment, the exporter (along with the various feed endpoints) can produce invalid XML since it uses concatenation. If we're going to be redoing this, it should be done properly.

comment:13 toscho7 months ago

  • Cc info@… added

Please do not omit users without posts. This is a very annoying bug in our current exporter. Especially when WordPress is used more as a CMS you need users without posts quite often.

comment:14 in reply to: ↑ 7 scribu7 months ago

Replying to nbachiyski:

serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML

I think this should be split out into a utility function or something. It's an operation that you do after the actual export.

To me it seems totally equal to the rest of the export methods. How is it different?

Because you need to generate the file before serving it and to serve a XML file you don't need any of the information you passed to WP_WXR_Export; you could just pass it directly to nginx. That's why it's different.

comment:15 scribu7 months ago

I knew there was something else I didn't like about the method names: they contain the format name, so they're not generic.

export_to_xml_file() should be export_to_file(), serve_xml() should be serve() etc.

Otherwise, you'd end up with weird things like this:

$exporter = new WP_JSON_Writer();
$exporter->export_to_xml_file();

comment:16 rmccue7 months ago

Regarding serialization, here's justification for why. WordPress' current generator is horrible with that, and will generate invalid XML fairly easily.

comment:17 scribu7 months ago

Nikolay, rmccue and I had a nice chat in IRC about this thing.

Instead of having helper methods like export_to_xml_file(), we could have an even simpler interface:

function wp_export( $filters = array(), $additional_args = array() ) {
  $filters_defaults = array(
    'post_type' => 'post',
    'posts_per_page' => -1
    ...
  );

  $additional_args_defaults = array(
    'format' => 'xml',
    'writer' => 'WP_WXR_File_Writer',
    ...
  );

  // instantiate things, etc.
}

and of course devs can skip wp_export() and instantiate the classes themselves.

comment:18 sirzooro7 months ago

  • Cc sirzooro added

+1 on this. Please make it flexible to allow exporting to any custom-defined file format - I think specifically about using it to generate XML Sitemap.

comment:19 ocean907 months ago

  • Cc ocean90 added

comment:20 beaulebens7 months ago

  • Cc beau@… added

comment:21 Viper007Bond7 months ago

Awesome.

Date/time: You might want to use #18694 to generate the WHERE values for added flexibility.

XML: It'll likely take more memory but it might be worth using DOMDocument to generate the HTML instead of doing it by hand. For example: http://www.viper007bond.com/2011/06/29/easily-create-xml-in-php-using-a-data-array/

comment:22 jghazally5 months ago

  • Cc jghazally@… added

comment:23 scribu5 months ago

Regarding [UT1195]: array( self, '_get_term_ids_cb' ) is not a valid way to specify a callback.

You probably meant array( __CLASS__, '_get_term_ids_cb' ).

comment:24 nbachiyski5 months ago

You're right, fixed in [UT1198].

nbachiyski4 months ago

comment:25 nbachiyski4 months ago

Here is an updated patch using an XML builder and tested to make sure t gets the same XML as the old importer.

For those interested in the XML builder, here it is: https://github.com/nb/oxymel

And as always, it's much easier to follow at https://github.com/nb/WordPress/commits/export-api

comment:26 rmccue4 months ago

Looking good! One thing on the XML serialization aspect: I see tags are getting passed in as 'prefix:tag'. To be technical, they should take 'namespace:tag' with the namespace being the IRI identifying the namespace, and the serializer working out internally what the prefix part is. It's not as great to use, but it's the correct way to work with XML, and avoids the possibility of having conflicting namespaces.

comment:27 nbachiyski4 months ago

mccue, I agree, using true XML namespaces makes a lot of sense, but it fell out of the initial scope. Patches welcome! :-)

Last edited 4 months ago by nbachiyski (previous) (diff)

comment:28 idealien3 weeks ago

That I can tell, one can't currently filter or augment the export query based on meta_key/values with the existing filter / action.

Can:

  • Add to the form via export_filters actions
  • Ensure args are validated via export_args filter

Cannot

  • Build the query which limits results to those which have a matching meta field.

The action export_wp exists at the top of the function before join / where conditions are created. It would be good to have similar functionality as categories and taxonomies as a part of the options for export if it is being enhanced.

comment:29 idealien3 weeks ago

  • Cc jamie@… added

comment:30 scribu2 days ago

I don't think wpdb is the best place for the build_IN_condition() helper:

  • it doesn't use any variables or methods from wpdb
    • it uses $wpdb->prepare()
  • custom wpdb implementations would all need to add it
Last edited 2 days ago by scribu (previous) (diff)

comment:31 scribu2 days ago

I started using the export API in WP-CLI: https://github.com/wp-cli/wp-cli/pull/525

The only complaint I have (so far) is that WP_Export_Split_Files_Writer is a black box.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

It would be nice to support "skin" classes, similar to WP_Upgrader_Skin.

Last edited 2 days ago by scribu (previous) (diff)

comment:32 nbachiyski29 hours ago

I don't think wpdb is the best place for the build_IN_condition() helper.

I can’t imagine a custom wpdb implementation, which wouldn’t want to inherit from wpdb. Can you think of any? Currently an implementation like this would need to define a lot of helper methods like replace, insert, update, delete, get_var. All of those probably won’t change in a custom wpdb implementation.

If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.

I agree this isn’t cool.

Since logging is a cross-cutting concern, I would implement it via actions, not via a “skin” class.
Pros:

  • We won’t introduce another class dependency.
  • In the default implementation we won’t need an empty skin.
  • Individual actions could be helpful separately.
  • Better forward compatibility – if we add a step, older skins will be incomplete, but still working.

Cons:

  • In cases where one needs to log every step, you will need to hook to a lot of actions, instead of just writing specific methods in a class.
Note: See TracTickets for help on using tickets.