Opened 6 months ago
Last modified 3 months ago
#22435 new enhancement
Export API
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | Export | Version: | |
| Severity: | normal | Keywords: | dev-feedback has-patch |
| Cc: | ocean90, xoodrew@…, danielbachhuber, batmoo@…, jkudish@…, info@…, sirzooro, beau@…, jghazally@… |
Description (last modified by nbachiyski)
From experience and from tickets (#19864, #19307, #17379) it's evident that we need to update the export API.
High level goals:
- To be usable from different parts of the code: from the web backend, from a CLI script, from an async job.
- To allow more control of the output format – serve over HTTP, write a single XML file to disk, split it and write many smaller XML files, write a big zip with many XML files, etc.
- To allow exporting the data without querying all the posts at once, so that the exports fit in memory.
- Keep export_wp() for backwards compatibility without the need to keep all (even any) of its code.
Here's my idea for the part of the API that 99% of the developers touching export would use and be happy with:
<?php
// WP_WXR_Export is an immutable object representing all the data needed for the export;
// it allows us to have the export in multiple formats.
$export = new WP_WXR_Export( array(
    'start_date' => '2011-10-10',
    'post_type' => 'event',
    …
) );

backup( $export->get_xml() ); // string

$export->export_to_xml_file( 'mom.xml' );
send_to_mom_to_import( 'mom.xml' );

$export->serve_xml(); // with all the headers and stuff

$export->export_to_xml_files( '/files/exports-for-my-awesome-website/', 'export-%02d.wxr.xml', 5 * MB_IN_BYTES );
Before I dive into implementation details (in the comments, so as not to pollute the ticket), I'd like to hear what use cases for extending this code you have in mind and where we should draw the line. Adding more output writers? Adding custom export data? Adding formats different from WXR?
Attachments (1)
Change History (28)
comment:1
nbachiyski — 6 months ago
- Description modified (diff)
comment:2
DrewAPicture — 6 months ago
- Cc xoodrew@… added
comment:4
nbachiyski — 6 months ago
Going deeper, here are some notes on a sample implementation: classes and public methods are listed here with some reasoning behind them.
There are three layers:
- Export Data
- XML Generation – takes data and turns it into WXR XML
- Writer – takes XML and gives it to the user
In the beginning I tried to make them only 2, but the upper layer class was getting huge and also was doing two very different tasks, so I split them up.
class WP_WXR_Export – represents a set of posts and other site data to be exported
Public methods:
- __construct( $filters ) – creates the object and queries for the data specified by the filters.
- get_xml() – returns the export as a string of XML
- export_to_xml_file( $file_name )
- export_to_xml_files( $destination_directory, $filename_template, $max_file_size = null )
- serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
- export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
- export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator, we can pass the writer object here directly
We also have some methods to get the raw, but structured, data:
- post_ids() – the post ids of the posts, which will be exported, based on the filters
- posts() – returns an iterator over the posts, not an array
- charset()
- site_metadata() – title, description, language, URLs, etc.
- authors()
- categories()
- tags()
- custom_taxonomies_terms()
- nav_menu_terms()
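A minimal sketch of how posts() could return an iterator instead of an array, so that only one batch of posts is held in memory at a time. The function name export_iterate_posts(), the $fetch_batch callback, and the batch size are hypothetical and are not part of the patch; the callback stands in for whatever query would load full posts for a list of IDs. It also assumes a PHP version with generators.

```php
<?php
/**
 * Hypothetical helper: yields posts one at a time while fetching them in batches.
 * $fetch_batch is an assumed stand-in for the real database query.
 */
function export_iterate_posts( array $post_ids, callable $fetch_batch, $batch_size = 100 ) {
    foreach ( array_chunk( $post_ids, $batch_size ) as $batch_ids ) {
        // Only the current batch of full post rows exists in memory here.
        foreach ( $fetch_batch( $batch_ids ) as $post ) {
            yield $post;
        }
    }
}

// Fake data source that records how many IDs each call asked for.
$batch_sizes = array();
$fake_fetch  = function ( array $ids ) use ( &$batch_sizes ) {
    $batch_sizes[] = count( $ids );
    return array_map( function ( $id ) {
        return array( 'ID' => $id, 'post_title' => "Post $id" );
    }, $ids );
};

$titles = array();
foreach ( export_iterate_posts( range( 1, 5 ), $fake_fetch, 2 ) as $post ) {
    $titles[] = $post['post_title'];
}
```

The caller sees a plain foreach over posts, while the data source is only ever asked for small groups of IDs.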
class WP_WXR_XML_Generator – responsible for generating WXR XML from export data
Public methods:
- __construct( $export ) – it needs to know where to get the export data from
- before_posts() – returns the XML from the start to the posts definitions
- posts() – returns an iterator, which on each iteration returns the XML for a post
- after_posts() – returns the XML after the posts definitions
- more fine-grained access to the XML of different parts in case the writer needs it: header, site_metadata, authors, categories, etc.
For most of the writers (even for all I've written) the before/after posts distinction is good enough, but we'll have the more fine-grained methods, so it wouldn't hurt to expose them.
class WP_WXR_*_Writer – responsible for putting the XML from the generator to some useful place – STDOUT, a file, multiple files, zip file, network, whatever.
class WP_WXR_Base_Writer – abstract base class with default export functionality
- __construct( $xml_generator ) – it needs to know where to get the XML from
- abstract protected write( $xml ) – writes a small piece of XML somewhere
- export() – passes all the data from the XML generator to the write method
After writing a couple of writers, I found I wanted to be able to change these two main aspects: where am I writing (to a file, to STDOUT, etc.) and what's the logic of what I'm writing (just at once, stop when the file is too big and start a new one, etc.)
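To make the two aspects concrete, here is a rough sketch of the writer layer under stated assumptions. All Sketch_* names are hypothetical, the generator is a stub returning fixed strings, and the "files" are kept in an array rather than written to disk. One subclass only changes where the XML goes (by implementing write()); the other changes the logic (by overriding export() to start a new chunk when the current one would grow too big, repeating everything before the posts in each chunk).

```php
<?php
// Stub for the XML generator layer; returns fixed strings for illustration only.
class Sketch_XML_Generator {
    public function before_posts() { return '<rss><channel>'; }
    public function posts() { return array( '<item>1</item>', '<item>2</item>', '<item>3</item>' ); }
    public function after_posts() { return '</channel></rss>'; }
}

abstract class Sketch_Base_Writer {
    protected $xml_generator;

    public function __construct( $xml_generator ) {
        $this->xml_generator = $xml_generator;
    }

    // Writes a small piece of XML somewhere.
    abstract protected function write( $xml );

    // Default logic: everything goes to one destination, in order.
    public function export() {
        $this->write( $this->xml_generator->before_posts() );
        foreach ( $this->xml_generator->posts() as $post_xml ) {
            $this->write( $post_xml );
        }
        $this->write( $this->xml_generator->after_posts() );
    }
}

// Changes only WHERE the XML goes: collects it in a string.
class Sketch_String_Writer extends Sketch_Base_Writer {
    public $result = '';

    protected function write( $xml ) {
        $this->result .= $xml;
    }
}

// Changes the LOGIC: splits into chunks, each one a complete document.
class Sketch_Split_Writer extends Sketch_Base_Writer {
    public $files = array();
    private $max_size;

    public function __construct( $xml_generator, $max_size ) {
        parent::__construct( $xml_generator );
        $this->max_size = $max_size;
    }

    protected function write( $xml ) {
        $this->files[ count( $this->files ) - 1 ] .= $xml;
    }

    private function start_file() {
        $this->files[] = '';
        $this->write( $this->xml_generator->before_posts() );
    }

    public function export() {
        $this->start_file();
        $posts_in_file = 0;
        foreach ( $this->xml_generator->posts() as $post_xml ) {
            $current = $this->files[ count( $this->files ) - 1 ];
            // The size check ignores the closing XML, so chunks can slightly exceed the limit.
            if ( $posts_in_file > 0 && strlen( $current ) + strlen( $post_xml ) > $this->max_size ) {
                $this->write( $this->xml_generator->after_posts() );
                $this->start_file();
                $posts_in_file = 0;
            }
            $this->write( $post_xml );
            $posts_in_file++;
        }
        $this->write( $this->xml_generator->after_posts() );
    }
}

$generator = new Sketch_XML_Generator();

$single = new Sketch_String_Writer( $generator );
$single->export();

$split = new Sketch_Split_Writer( $generator, 45 );
$split->export();
```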
Feedback on the classes, their relationships, naming, etc. would be very much appreciated.
categories() tags() custom_taxonomies_terms() nav_menu_terms()
What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
I think this should be split out into a utility function or something. It's an operation that you do after the actual export.
export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator,
I think export_using_writer() should be enough.
comment:6
danielbachhuber — 6 months ago
- Cc danielbachhuber added
comment:7
follow-up: ↓ 14
nbachiyski — 6 months ago
Replying to scribu:
categories() tags() custom_taxonomies_terms() nav_menu_terms()
What's the difference between these 4? Why not have a generic terms( $taxonomy ) method?
Because they have subtle differences. For example there is a get_tags filter, which isn't run if we just run get_terms( 'post_tag' ). Also for the custom taxonomies, we need to list all the taxonomies.
I don't have a strong opinion on this, though. I just didn't spend any time trying to combine them into one method.
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
I think this should be split out into a utility function or something. It's an operation that you do after the actual export.
To me it seems totally equal to the rest of the export methods. How is it different?
export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above
export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator,
I think export_using_writer() should be enough.
Sure, it makes sense.
comment:8
follow-up: ↓ 12
nbachiyski — 6 months ago
Last, but not least, some implementation :-)
Notes:
- It's not 100% complete, some of the post XML generation and data retrieval is missing. It's tedious and I left it for later :-)
- I'm not at all happy with the XML generator. The here-doc approach didn't turn out very well. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
- I had to use exceptions in a couple of places. I know they are not the WordPress thing, but otherwise they are practically unavoidable. For example, if the writer raises an error and I need to catch it in the export class, I would otherwise need something like a hundred is_wp_error() checks, one on every single call for virtually anything.
- I haven't looked into having minimal export when splitting, so I am just repeating everything before the posts in each export file.
- It'd be nice to have a _gmt time filter.
- In addition to the patches, you can follow the development here: https://github.com/nb/WordPress/tree/export-api
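The exceptions point can be illustrated with a small sketch. The class and function names below are hypothetical and the error array is only a stand-in for what could be a WP_Error in WordPress; the idea is that the writer throws, and the export code translates the failure in a single place instead of checking a return value after every call.

```php
<?php
// Hypothetical exception type for export failures.
class Sketch_Export_Exception extends RuntimeException {}

// Hypothetical writer that fails on its third write, e.g. a full disk.
class Sketch_Failing_Writer {
    private $writes = 0;

    public function write( $xml ) {
        $this->writes++;
        if ( $this->writes > 2 ) {
            throw new Sketch_Export_Exception( 'disk full' );
        }
    }
}

function sketch_export( $writer, array $chunks ) {
    try {
        foreach ( $chunks as $chunk ) {
            // No per-call error check is needed here.
            $writer->write( $chunk );
        }
        return true;
    } catch ( Sketch_Export_Exception $e ) {
        // The one place where a failure becomes a return value.
        return array( 'error' => $e->getMessage() );
    }
}

$ok     = sketch_export( new Sketch_Failing_Writer(), array( 'a', 'b' ) );
$failed = sketch_export( new Sketch_Failing_Writer(), array( 'a', 'b', 'c' ) );
```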
comment:9
nbachiyski — 6 months ago
- Keywords has-patch added
comment:10
batmoo — 6 months ago
- Cc batmoo@… added
comment:11
jkudish — 6 months ago
- Cc jkudish@… added
comment:12
in reply to: ↑ 8
rmccue — 6 months ago
Replying to nbachiyski:
- I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
This should definitely be using a proper XML serializer. At the moment, the exporter (along with the various feed endpoints) can produce invalid XML since it uses concatenation. If we're going to be redoing this, it should be done properly.
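A short illustration of the difference, using PHP's bundled XMLWriter and DOMDocument extensions. The element names and the helper sketch_is_well_formed() are made up for the example and are not taken from the patch. Concatenation lets a raw ampersand and angle brackets through, while the serializer escapes text content.

```php
<?php
$title = 'Tom & Jerry <live>';

// Naive concatenation: the raw "&" and "<live>" end up in the markup.
$concatenated = '<item><title>' . $title . '</title></item>';

// XMLWriter escapes text content when it writes the element.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startElement( 'item' );
$writer->writeElement( 'title', $title );
$writer->endElement();
$serialized = $writer->outputMemory();

// Hypothetical helper: checks well-formedness by trying to parse the string.
function sketch_is_well_formed( $xml ) {
    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    $ok       = $doc->loadXML( $xml );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );
    return $ok;
}

$concatenated_ok = sketch_is_well_formed( $concatenated );
$serialized_ok   = sketch_is_well_formed( $serialized );
```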
comment:13
toscho — 6 months ago
- Cc info@… added
Please do not omit users without posts. This is a very annoying bug in our current exporter. Especially when WordPress is used more as a CMS you need users without posts quite often.
comment:14
in reply to: ↑ 7
scribu — 6 months ago
Replying to nbachiyski:
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
I think this should be split out into a utility function or something. It's an operation that you do after the actual export.
To me it seems totally equal to the rest of the export methods. How is it different?
Because you need to generate the file before serving it, and to serve an XML file you don't need any of the information you passed to WP_WXR_Export; you could just pass it directly to nginx. That's why it's different.
comment:15
scribu — 6 months ago
I knew there was something else I didn't like about the method names: they contain the format name, so they're not generic.
export_to_xml_file() should be export_to_file(), serve_xml() should be serve() etc.
Otherwise, you'd end up with weird things like this:
$exporter = new WP_JSON_Writer();
$exporter->export_to_xml_file();
comment:16
rmccue — 6 months ago
Regarding serialization, here's justification for why. WordPress' current generator is horrible with that, and will generate invalid XML fairly easily.
comment:17
scribu — 6 months ago
Nikolay, rmccue and I had a nice chat in IRC about this thing.
Instead of having helper methods like export_to_xml_file(), we could have an even simpler interface:
function wp_export( $filters = array(), $additional_args = array() ) {
    $filters_defaults = array(
        'post_type' => 'post',
        'posts_per_page' => -1,
        // ...
    );
    $additional_args_defaults = array(
        'format' => 'xml',
        'writer' => 'WP_WXR_File_Writer',
        // ...
    );
    // instantiate things, etc.
}
and of course devs can skip wp_export() and instantiate the classes themselves.
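One way the "instantiate things" step might look, as a sketch only. The names sketch_wp_export() and Sketch_Memory_Writer are hypothetical, array_merge() stands in for WordPress's argument-merging helper so the example runs on its own, and the writer simply reports what it was asked to do instead of producing XML.

```php
<?php
// Hypothetical writer used as the default; it only reports what it would export.
class Sketch_Memory_Writer {
    private $filters;
    private $args;

    public function __construct( array $filters, array $args ) {
        $this->filters = $filters;
        $this->args    = $args;
    }

    public function export() {
        return sprintf( 'exported %s as %s', $this->filters['post_type'], $this->args['format'] );
    }
}

function sketch_wp_export( $filters = array(), $additional_args = array() ) {
    // Caller-supplied values override the defaults.
    $filters = array_merge(
        array( 'post_type' => 'post', 'posts_per_page' => -1 ),
        $filters
    );
    $args = array_merge(
        array( 'format' => 'xml', 'writer' => 'Sketch_Memory_Writer' ),
        $additional_args
    );

    $writer_class = $args['writer'];
    if ( ! class_exists( $writer_class ) ) {
        throw new InvalidArgumentException( "Unknown writer class: $writer_class" );
    }

    // The writer is chosen by class name, so callers can plug in their own.
    $writer = new $writer_class( $filters, $args );
    return $writer->export();
}

$default_result = sketch_wp_export();
$event_result   = sketch_wp_export( array( 'post_type' => 'event' ) );
```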
comment:18
sirzooro — 6 months ago
- Cc sirzooro added
+1 on this. Please make it flexible enough to allow exporting to any custom-defined file format. I'm thinking specifically of using it to generate an XML Sitemap.
comment:19
ocean90 — 6 months ago
- Cc ocean90 added
comment:20
beaulebens — 6 months ago
- Cc beau@… added
comment:21
Viper007Bond — 6 months ago
Awesome.
Date/time: You might want to use #18694 to generate the WHERE values for added flexibility.
XML: It'll likely take more memory but it might be worth using DOMDocument to generate the XML instead of doing it by hand. For example: http://www.viper007bond.com/2011/06/29/easily-create-xml-in-php-using-a-data-array/
comment:22
jghazally — 4 months ago
- Cc jghazally@… added
comment:23
scribu — 4 months ago
Regarding [UT1195]: array( self, '_get_term_ids_cb' ) is not a valid way to specify a callback.
You probably meant array( __CLASS__, '_get_term_ids_cb' ).
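For illustration, a self-contained sketch of the suggested form. The class Sketch_Term_Export and its data are hypothetical; only the callback shape comes from the comment above. A bare self inside array( ... ) is read as a constant name rather than as a class reference, whereas __CLASS__ is a string holding the class name, which is what a static callback needs.

```php
<?php
class Sketch_Term_Export {
    public static function _get_term_ids_cb( $term ) {
        return $term['term_id'];
    }

    public static function term_ids( array $terms ) {
        // array( class name string, method name ) is a valid static callback.
        return array_map( array( __CLASS__, '_get_term_ids_cb' ), $terms );
    }
}

$ids = Sketch_Term_Export::term_ids(
    array(
        array( 'term_id' => 3 ),
        array( 'term_id' => 7 ),
    )
);

$is_valid = is_callable( array( 'Sketch_Term_Export', '_get_term_ids_cb' ) );
```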
comment:24
nbachiyski — 4 months ago
You're right, fixed in [UT1198].
comment:25
nbachiyski — 3 months ago
Here is an updated patch using an XML builder and tested to make sure it produces the same XML as the old exporter.
For those interested in the XML builder, here it is: https://github.com/nb/oxymel
And as always, it's much easier to follow at https://github.com/nb/WordPress/commits/export-api
comment:26
rmccue — 3 months ago
Looking good! One thing on the XML serialization aspect: I see tags are getting passed in as 'prefix:tag'. To be technical, they should take 'namespace:tag' with the namespace being the IRI identifying the namespace, and the serializer working out internally what the prefix part is. It's not as great to use, but it's the correct way to work with XML, and avoids the possibility of having conflicting namespaces.
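A sketch of what namespace-based tags could look like, using DOMDocument rather than the XML builder from the patch. The helper sketch_ns_element() and the IRI-to-prefix table are assumptions made for this example, including the exact namespace IRIs. The caller names the namespace by its IRI and the serializer side decides which prefix to emit.

```php
<?php
// Assumed mapping from namespace IRI to the prefix used in the output.
$prefixes = array(
    'http://wordpress.org/export/1.2/' => 'wp',
    'http://purl.org/dc/elements/1.1/' => 'dc',
);

// Hypothetical helper: the caller passes the namespace IRI, never the prefix.
function sketch_ns_element( DOMDocument $doc, array $prefixes, $namespace, $tag, $value ) {
    if ( ! isset( $prefixes[ $namespace ] ) ) {
        throw new InvalidArgumentException( "No prefix registered for $namespace" );
    }
    $element = $doc->createElementNS( $namespace, $prefixes[ $namespace ] . ':' . $tag );
    $element->appendChild( $doc->createTextNode( $value ) );
    return $element;
}

$doc  = new DOMDocument( '1.0', 'UTF-8' );
$item = $doc->createElement( 'item' );
$doc->appendChild( $item );

$post_id = sketch_ns_element( $doc, $prefixes, 'http://wordpress.org/export/1.2/', 'post_id', '42' );
$item->appendChild( $post_id );

$xml = $doc->saveXML( $item );
```

Because the prefix is looked up in one table, two pieces of code that happen to pick the same prefix for different namespaces cannot silently collide.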
comment:27
nbachiyski — 3 months ago
rmccue, I agree, using true XML prefixes makes a lot of sense, but it fell out of the initial scope. Patches welcome! :-)

The only other format that would make sense is JSON, but the difference to WXR would be superficial.
I think the ability to add custom data to the file is pretty important, and it's tricky to get right when multiple export files are used.
In general, we should be very careful how we split the data across files; each one should be independent from all the rest.