Opened 12 years ago
Last modified 7 years ago
#22435 new enhancement
Export API
Reported by: | nbachiyski | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | |
Component: | Export | Keywords: | dev-feedback has-patch |
Focuses: | Cc: |
Description (last modified by )
From experience and from tickets (#19864, #19307, #17379) it's evident that we need to update the export API.
High level goals:
- To be usable from different parts of the code. From the web backend, from a CLI script, from an async job.
- To allow more control of the output format – serve over HTTP, write a single XML file to disk, split it and write many smaller XML files, write a big zip with many XML files, etc.
- To allow exporting the data without querying all the posts at once, so that we can fit the exports to memory.
- Keep
export_wp()
for backwards compatibility without the need to keep all (even any) of its code.
Here's my idea for the part of the API 99% of the developers touching export would use and be happy:
<?php // WP_WXR_Export is an aimmutable representing all the data needed for the export and allows us to have it in multiple formats $export = new WP_WXR_Export( array( 'start_date' => '2011-10-10', 'post_type' => 'event', … ) ); backup( $export->get_xml() ); // string $export->export_to_xml_file( 'mom.xml' ); send_to_mom_to_import( 'mom.xml'); $export->serve_xml(); // with all the headers and stuff $export->export_to_xml_files( '/files/exports-for-my-awesome-website/', 'export-%02d.wxr.xml', 5 * MB_IN_BYTES );
Before I dive into implementation details (in the comments, not to pollute the ticket), I'd like to hear what use cases for extending this code you have in mind and where should we draw the line. Adding more output writers? Adding custom export data? Adding formats different from WXR?
Attachments (7)
Change History (66)
#4
@
12 years ago
Going deeper, here are some notes on a sample implementations: classes and public methods are listed here with some reasoning behind them.
There are three layers:
- Export Data
- XML Generation – takes data and turns it into WXR XML
- Writer – takes XML and gives it to the user
In the beginning I tried to make them only 2, but the upper layer class was getting huge and also was doing two very different tasks, so I split them up.
class WP_WXR_Export
– represents a set of posts and other site data to be exported
Public methods:
__construct( $filters )
– creates the object and queries for the data specified by the filters.get_xml()
– returns the export as a string of XMLexport_to_xml_file( $file_name )
export_to_xml_files( $destination_directory, $filename_template, $max_file_size = null )
serve_xml( $file_name )
– outputs the necessary HTTP headers and then the export as XMLexport_xml_using_writer_class( $writer_class_name, $writer_args )
– exports the XML data, but uses a custom writer, not one of the aboveexport_using_writer( $writer )
– if we want to use a writer, which isn't coupled with the XML generator, we can pass the writer object here directly we have also some methods to get the raw, but structured datapost_ids()
– the post ids of the posts, which will be exported, based on the filtersposts()
– returns an iterator over the posts, not an arraycharset()
site_metadata()
– title, description, language, URLs, etc.authors()
categories()
tags()
custom_taxonomies_terms()
nav_menu_terms()
class WP_WXR_XML_Generator
– responsible for generating WXR XML from export data
Public methods:
__construct( $export )
– it needs to know where to get the export data frombefore_posts()
– returns the XML from the start to the posts definitionsposts()
– returns an iterator, which on each iteration returns the XML for a postafter_posts
– returns the XML after the posts definitions- more fine-grained access to the XML of different parts in case the writer needs it:
header
,site_metadata
,authors
,categories
, etc.
For most of the writers (even for all I've written) the before/after posts distinction is good enough, but we'll have the more fine-grained methods, so it wouldn't hurt to expose them.
class WP_WXR_*_Writer
– responsible for putting the XML from the generator to some useful place – STDOUT, a file, mutliple files, zip file, network, whatever.
class WP_WXR_Base_Writer
– abstract base class with default export functionality
__construct( $xml_generator )
– it needs to know where to get the XML fromabstract protected write( $xml )
– writes a small piece of XML somewhereexport()
– passes all the data from the XML generator to the write method
After writing a couple of writers, I found I wanted to be able to change these two main aspects: where am I writing (to a file, to STDOUT, etc.) and what's the logic of what I'm writing (just at once, stop when the file is too big and start a new one, etc.)
Feedback on the classes, their relationships, naming, etc. would be very much appreciated.
#5
@
12 years ago
categories() tags() custom_taxonomies_terms() nav_menu_terms()
What's the difference between these 4? Why not have a generic terms( $taxonomy )
method?
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XML
I think this should be split out into a utility function or something. It's an operation that you do after the actual export.
export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator,
I think export_using_writer()
should be enough.
#7
follow-up:
↓ 14
@
12 years ago
Replying to scribu:
categories() tags() custom_taxonomies_terms() nav_menu_terms()What's the difference between these 4? Why not have a generic
terms( $taxonomy )
method?
Because they have subtle differences. For example there is a get_tags
filter, which isn't run if we just run get_terms( 'post_tag' )
. Also for the custom taxonomies, we need to list all the taxonomies.
I don't have a strong opinion on this, though. I just didn't spend any time trying to combine them into one method.
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XMLI think this should be split out into a utility function or something. It's an operation that you do after the actual export.
To me it seems totally equal to the rest of the export methods. How is it different?
export_xml_using_writer_class( $writer_class_name, $writer_args ) – exports the XML data, but uses a custom writer, not one of the above export_using_writer( $writer ) – if we want to use a writer, which isn't coupled with the XML generator,I think
export_using_writer()
should be enough.
Sure, it makes sense.
#8
follow-up:
↓ 12
@
12 years ago
Last, but not least, some implementation :-)
Notes:
- It's not 100% complete, some of the post XML generation and data retrieval is missing. It's tedious and I left it for later :-)
- I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
- I had to use exceptions in a couple of places. I know they are not the WordPress thing, but otherwise they are practically unavoidable. For example, if the writer issues a error and I need to catch it in the export class, I would need to have probably a hundred
if ( is_wp_error()
checks on every single call for virtually anything. - I haven't looked into having minimal export when splitting, so I am just repeating everything before the posts in each export file.
- It'd be nice to have a
_gmt
time filter. - In addittion to the patches, you can follow the development here: https://github.com/nb/WordPress/tree/export-api
#12
in reply to:
↑ 8
@
12 years ago
Replying to nbachiyski:
- I'm not at all happy with the XML generator. The here-doc approach didn't turn out very good. I have two ideas for it: to use output buffering and native PHP templates or to use/build a generic XML generator.
This should definitely be using a proper XML serializer. At the moment, the exporter (along with the various feed endpoints) can produce invalid XML since it uses concatenation. If we're going to be redoing this, it should be done properly.
#13
@
12 years ago
- Cc info@… added
Please do not omit users without posts. This is a very annoying bug in our current exporter. Especially when WordPress is used more as a CMS you need users without posts quite often.
#14
in reply to:
↑ 7
@
12 years ago
Replying to nbachiyski:
serve_xml( $file_name ) – outputs the necessary HTTP headers and then the export as XMLI think this should be split out into a utility function or something. It's an operation that you do after the actual export.
To me it seems totally equal to the rest of the export methods. How is it different?
Because you need to generate the file before serving it and to serve a XML file you don't need any of the information you passed to WP_WXR_Export; you could just pass it directly to nginx. That's why it's different.
#15
@
12 years ago
I knew there was something else I didn't like about the method names: they contain the format name, so they're not generic.
export_to_xml_file()
should be export_to_file()
, serve_xml()
should be serve()
etc.
Otherwise, you'd end up with weird things like this:
$exporter = new WP_JSON_Writer(); $exporter->export_to_xml_file();
#16
@
12 years ago
Regarding serialization, here's justification for why. WordPress' current generator is horrible with that, and will generate invalid XML fairly easily.
#17
@
12 years ago
Nikolay, rmccue and I had a nice chat in IRC about this thing.
Instead of having helper methods like export_to_xml_file(), we could have an even simpler interface:
function wp_export( $filters = array(), $additional_args = array() ) { $filters_defaults = array( 'post_type' => 'post', 'posts_per_page' => -1 ... ); $additional_args_defaults = array( 'format' => 'xml', 'writer' => 'WP_WXR_File_Writer', ... ); // instantiate things, etc. }
and of course devs can skip wp_export() and instantiate the classes themselves.
#18
@
12 years ago
- Cc sirzooro added
+1 on this. Please make it flexible to allow exporting to any custom-defined file format - I think specifically about using it to generate XML Sitemap.
#21
@
12 years ago
Awesome.
Date/time: You might want to use #18694 to generate the WHERE
values for added flexibility.
XML: It'll likely take more memory but it might be worth using DOMDocument
to generate the HTML instead of doing it by hand. For example: http://www.viper007bond.com/2011/06/29/easily-create-xml-in-php-using-a-data-array/
#23
@
12 years ago
Regarding [UT1195]: array( self, '_get_term_ids_cb' )
is not a valid way to specify a callback.
You probably meant array( __CLASS__, '_get_term_ids_cb' )
.
#25
@
12 years ago
Here is an updated patch using an XML builder and tested to make sure t gets the same XML as the old importer.
For those interested in the XML builder, here it is: https://github.com/nb/oxymel
And as always, it's much easier to follow at https://github.com/nb/WordPress/commits/export-api
#26
@
12 years ago
Looking good! One thing on the XML serialization aspect: I see tags are getting passed in as 'prefix:tag'. To be technical, they should take 'namespace:tag' with the namespace being the IRI identifying the namespace, and the serializer working out internally what the prefix part is. It's not as great to use, but it's the correct way to work with XML, and avoids the possibility of having conflicting namespaces.
#27
@
12 years ago
mccue, I agree, using true XML namespaces makes a lot of sense, but it fell out of the initial scope. Patches welcome! :-)
#28
@
11 years ago
That I can tell, one can't currently filter or augment the export query based on meta_key/values with the existing filter / action.
Can:
- Add to the form via export_filters actions
- Ensure args are validated via export_args filter
Cannot
- Build the query which limits results to those which have a matching meta field.
The action export_wp exists at the top of the function before join / where conditions are created. It would be good to have similar functionality as categories and taxonomies as a part of the options for export if it is being enhanced.
#30
@
11 years ago
I don't think wpdb is the best place for the build_IN_condition()
helper:
it doesn't use any variables or methods from wpdb- it uses
$wpdb->prepare()
- it uses
- custom wpdb implementations would all need to add it
#31
@
11 years ago
I started using the export API in WP-CLI: https://github.com/wp-cli/wp-cli/pull/525
The only complaint I have (so far) is that WP_Export_Split_Files_Writer is a black box.
If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.
It would be nice to support "skin" classes, similar to WP_Upgrader_Skin.
#32
@
11 years ago
I don't think wpdb is the best place for the build_IN_condition() helper.
I can’t imagine a custom wpdb
implementation, which wouldn’t want to inherit from wpdb
. Can you think of any? Currently an implementation like this would need to define a lot of helper methods like replace
, insert
, update
, delete
, get_var
. All of those probably won’t change in a custom wpdb
implementation.
If I want to log any info while the export is happening, I have to implement my own WP_Export_Base_Writer child class.
I agree this isn’t cool.
Since logging is a cross-cutting concern, I would implement it via actions, not via a “skin” class.
Pros:
- We won’t introduce another class dependency.
- In the default implementation we won’t need an empty skin.
- Individual actions could be helpful separately.
- Better forward compatibility – if we add a step, older skins will be incomplete, but still working.
Cons:
- In cases where one needs to log every step, you will need to hook to a lot of actions, instead of just writing specific methods in a class.
#33
@
11 years ago
- Cc daryl@… added
In working out how to handle taxonomy loops that can cause a WSOD, I encountered some problems with the approach given here. For example, the try/catch stuff in wp_export()
returns things, but we never do anything with it, which means that the error handling is basically moot. Also, WP_Export_XML_Over_HTTP::write()
just spews headers and whatever output there is (even if it's empty because of a silent error), which makes error handling problematic. The forthcoming patch includes some tax loop handling but also rejiggers a couple of things to provide better error handling (I hope). I'm not sure it's 100% of the way there, but I thought I'd submit for consideration. It's a diff of the diff, to isolate the changes I'm proposing.
#34
follow-ups:
↓ 35
↓ 36
@
11 years ago
Oh, we should have definitely done something with the error result of wp_export()
and also, we should’ve buffered the output, I totally agree.
Few notes on the patch:
- Instead of adding
wp_die()
s towp_export()
, we should process the return value ofwp_export()
inwp-admin/export.php
and add the possiblewp_die()
there. One of the ideas ofwp_export()
is that it can be used in any context: web, cli, other libraries. If we add awp_die()
there, we will deprive the callers of control what to do in case of an error. - It’s strange that in
WP_Export_XML_Over_HTTP
we are both explicitly echoing to standard output and then using output buffering to collect the result. This commit fixes that and makes sure we don't output anything before the export process is over. (a diff file, if needed). - I am on the fence if we should be including the tax loop handling in the export, at all. My intuition tells me that we are fixing the problem at the wrong level. I have a few questions:
- Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
- Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
- What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?
#35
in reply to:
↑ 34
@
11 years ago
Replying to nbachiyski:
- Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
I'm not aware of any other places where it causes trouble; I've only ever seen it affect exports. But I can't say that it's not causing issues elsewhere.
- Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
Seems like it would affect any WP install, but then we do have some weird tax stuff going on on wpcom. I've never managed to reproduce the natural creation of a tax loop (that is, without hacking one in), so my impression is that these are mostly legacy issues. Maybe it's possible to import incorrectly parented terms, but I haven't repro'd such a case.
- What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?
This probably isn't practical for wpcom. If we're reluctant to add tax loop stuff in core, I'd be cool with just adding an action we could fire instead and hook onto that on wpcom.
#36
in reply to:
↑ 34
;
follow-up:
↓ 37
@
11 years ago
Replying to nbachiyski:
I have a few questions:
- Are orphaned terms causing trouble in other parts of WordPress? If yes, we should solve this at a higher level.
- Is this problem known outside of WordPress.com? If not, can we just add a filter/action in the export and deal with this locally on WordPress.com, instead of adding more code and complexity for everybody.
- What is the root cause? Can we just fix it, fix all the sites, and not bother with adding more code here?
Orphaned terms are possible to create in core WordPress too, we have some code which tries to help prevent orphans and loops but it isn't perfect.
From what I recall there isn't a single root cause.
There are other tickets open around improving the core code to prevent more loops/orphans but I'm not sure we will ever be able to 100% prevent them and we should code defensively for there existence IMHO.
#38
@
11 years ago
I've added export.2.diff, which incorporates nbachiyski's feedback re maintaining caller control. Basically:
- The HTTP writer now handles the die() statements when errors are caught.
- The returner writer returns WP_Error objects when an exception is caught.
- The base writer returns WP_Error objects when an exception is caught.
- The file writer re-throws any caught exceptions so that they can be handled further up the stack.
The patch also incorporates the echo/output buffering issue that nbachiyski highlighted and linked to a github commit for.
Based on westi's feedback, I'm leaving the tax loop stuff in for now.
I've tested these changes both with the web exporter and with a script that calls wp_export() at the command line, and so far so good. I'd gratefully receive further suggestions for improvements.
This ticket was mentioned in IRC in #wordpress-dev by netweb. View the logs.
10 years ago
#41
@
10 years ago
- Milestone changed from Awaiting Review to Future Release
Working to get Daryl's patch into WP-CLI in https://github.com/wp-cli/wp-cli/pull/1153
I too would love to see movement on this, but it probably fits within a more holistic conversation about importing and exporting.
This ticket was mentioned in IRC in #wordpress-dev by danielbachhuber. View the logs.
10 years ago
This ticket was mentioned in IRC in #wordpress-dev by netweb. View the logs.
10 years ago
#44
follow-ups:
↓ 48
↓ 50
@
10 years ago
Nacin and I chatted about this a while back in a private Skype channel. Apologies for not posting a summary earlier.
WP-CLI has been using Nikolay's exporter for 10 months and 3 major releases (ref). From a functional perspective, it does the trick — no blocking bugs discovered. It solved performance issues when exporting large sites, and offers a degree of run-time flexibility. However, from a "WordPress architecture" perspective, which will always be hotly debated, it remains an open question as to whether the code fits.
Specifically, this approach includes a new, one-off library for writing XML, dubbed Oxymel
. It uses this new library by chaining, a foreign-to-WordPress code pattern. These two considerations, plus the other feedback in this thread, mean committing the patch would not be an inconsequential decision.
In our chat, Nacin said "needs to be significantly simplified while still maintaining some good level of flexibility and extendability". Given the legacy of the exporter (which produces an invented, inconsistent XML format), it remains to be seen whether this is possible. Consider it a challenge to anyone interested.
Personally, I'm inclined to revisit the high-level goals of export & import. In part because any enhancement to one will need to be an enhancement to another, but also because a given enhancement will need to be *really good* to overcome the problems inherent in everyone's minor differences of opinion. To me, the problem space really is "how can sync my two WordPress sites?", either as a one-off, one-way sync, a two-way, but manual sync, an automatic sync, etc. etc. If we were to embark on this space mission, I'm sure we'd make sure to solve this particular issue's high-level goals along the way.
This ticket was mentioned in Slack in #core by boone. View the logs.
10 years ago
This ticket was mentioned in Slack in #core by netweb. View the logs.
10 years ago
#47
@
10 years ago
When using WP-CLI's export command with --category parameter I was getting the PHP notice below:
PHP Notice: Trying to get property of non-object in php/export/class-wp-export-wxr-formatter.php on line 117
This resulted in several empty <wp:category> tags in the generated file. The attached patch fixes this problem by making sure that WP_Export_Query::categories() always return an array as expected by WP_Export_WXR_Formatter::categories().
Note that line 117 of class-wp-export-wxr-formatter.php in WP-CLI is equivalent to line 135 of the same file in this ticket.
See the link below for the related WP-CLI pull request:
#48
in reply to:
↑ 44
@
10 years ago
Replying to danielbachhuber:
Specifically, this approach includes a new, one-off library for writing XML, dubbed
Oxymel
. It uses this new library by chaining, a foreign-to-WordPress code pattern. These two considerations, plus the other feedback in this thread, mean committing the patch would not be an inconsequential decision.
Just poking my head in here to say that another huge benefit of Oyxmel is that AFAIK it uses a proper XML serializer rather than string concatenation. XML is one of the things that's super easy to output wrong, and the spec on parsing means that you pretty much have to be super strict when parsing. Right now, it's easy to output invalid XML unintentionally, and have parsers fail.
(And I don't mean this in a theoretical way either, you can easily export WXR from a site that fails when you try to import it back into the same version of WP, thanks to PHP's XML parser following the correct rules.)
#49
@
10 years ago
I noticed just recently that we're exporting the author info with incorrect xml. The importer step that creates author mappings expects an XML node named <wp:author>
(citation) but we're creating <wp:wp_author>
. This is easily corrected by updating the authors()
function in wp-includes/export/class-wp-export-wxr-formatter.php
to output $oxymel->tag( 'wp:author' )->contains
rather than $oxymel->tag( 'wp:wp_author' )->contains
.
@
10 years ago
Use the right name for the author XML node (see https://core.trac.wordpress.org/ticket/22435#comment:49)
@
10 years ago
Consolidate all the previous patches (except export.diff) into a single patch to make it easier to test and use the new exporter
#50
in reply to:
↑ 44
@
9 years ago
Replying to danielbachhuber:
In our chat, Nacin said "needs to be significantly simplified while still maintaining some good level of flexibility and extendability". Given the legacy of the exporter (which produces an invented, inconsistent XML format), it remains to be seen whether this is possible. Consider it a challenge to anyone interested.
Personally, I'm inclined to revisit the high-level goals of export & import. In part because any enhancement to one will need to be an enhancement to another, but also because a given enhancement will need to be *really good* to overcome the problems inherent in everyone's minor differences of opinion. To me, the problem space really is "how can sync my two WordPress sites?", either as a one-off, one-way sync, a two-way, but manual sync, an automatic sync, etc. etc. If we were to embark on this space mission, I'm sure we'd make sure to solve this particular issue's high-level goals along the way.
I tend to agree With Daniel's comment and that this remains the situation. I think the best course is for a team to really explore the entire problem area and come up with a solution instead of just trying an incremental improvement.
#51
@
9 years ago
I think the best course is for a team to really explore the entire problem area and come up with a solution instead of just trying an incremental improvement.
Worth mentioning too the REST API will provide a much better foundation for this than we've had in the past, as we'll have a much more consistent interface for CRUD operations to WP.
This ticket was mentioned in Slack in #core by jorbin. View the logs.
9 years ago
#54
@
8 years ago
As far as a solid UI and UX, WP All Export does a really good job for exporting out data from WP.
If there was a way of taking elements that work to apply for however the UI ends up being for the updated WordPress Exporter.
#55
@
8 years ago
So would there be an max file size option for a larger export XML file?
As well as say so many records per file?
http://www.rangerpretzel.com/content/view/20/1/
As well as maybe automatically compressing larger files into one Zip or backup the export to external source?
#56
@
8 years ago
So would would be an max file size option for a larger export XML file?
As well as say so many records per file?
http://www.rangerpretzel.com/content/view/20/1/
As well as maybe automatically compressing larger files into one Zip or backup the export to external source?
The only other format that would make sense is JSON, but the difference to WXR would be superficial.
I think the ability to add custom data to the file is pretty important, and it's tricky to get right when multiple export files are used.
In general, we should be very careful how we split the data across files; each one should be independent from all the rest.