Make WordPress Core

Opened 4 months ago

Last modified 6 weeks ago

#60375 new enhancement

Site Transfer Protocol

Reported by: zieladam's profile zieladam Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version:
Component: Import Keywords:
Focuses: Cc:

Description (last modified by zieladam)

Migrating WordPress sites involves custom, error-prone logic. There are no canonical tools and the guidelines seem lacking.

Let's:

  1. Formalize a list of steps involved in transferring a WordPress site between hosts
  2. Build a canonical plugin that implements those steps and enables easy site migrations
  3. Merge it into WordPress core once its stable

This is relevant for:

  • Site migrations
  • Creating and restoring site backups
  • Staging and development environments
  • WordPress Playground imports and exports
  • Moving live sites into Playground and vice versa

... probably a lot more.

ZIP bundle as the export format

The Data Liberation proposal makes a great argument for a ".zip" bundle as the export format. I would love to leverage it here. A wordpress.zip file with all the site files and the data in an .sqlite (or plaintext .sql) format sounds like the most natural and convenient way of moving WordPress sites around.

Large sites may seem problematic at the first glance, as 300GB zip archives are difficult to manage. However, the ZIP format was built with streaming, compression, chunking, checksums, and seeking in mind. It is a good fit for handling imports that are many GBs large even on a host with 64 MB of ram allocated and not enough hard drive space to hold the import file itself.

To support that last point – I’ve built a streaming zip encoder and decoder in JavaScript for Playground. It can cherry-pick a single file from https://downloads.wordpress.org/plugin/gutenberg.17.5.2.zip by transferring only a few kilobytes and without downloading the entire 10+MB archive. It works with zip files, and it would work with a Synchronization API endpoint where the zipped fragments are generated on demand.

Differences with WXR

Unlike WXR imports this is looking to transfer a site in its entirely with the Transfer Protocol. The export bundle should include every database table, every installed plugin, every asset and file in the wp-content directory. It must also include meta information such as the domain from which the site is being exported and all custom wp-config.php settings. This will be necessary in order to automate the transfer.

Tasks involved in site transfer

  • Set IMPORTING constant so things shut down:
    • Stop sending emails
    • Database replication
    • Cleanup jobs/CRON jobs that might filter on post creation
  • Communicate source and destination site domains/base URLs
  • Rewrite URLs in the database to match new site URL
  • Rewrite URLs in all files including wp-config.php, wp-content, sunrise.php, mu-plugins, etc.
  • Communicate wp-config.php settings, including things like WP_SITEURL and plugins directory, theme directory, content directory, memory limits, and other settings.
  • Let the target site set the database credentials.
  • Copy all content from source to destination site, including users, site options, database * tables.
  • Bonus if there's no post-processing via tools like wp search-replace. The transferred data would rewritten as the transfer happens (e.g. to adjust the site URL).
  • Bonus if we can cryptographically secure the conduit through which the transfer takes place to prevent someone intercepting a transfer (e.g. create a private/public keypair, only allow a single transfer at a time, use that certificate to authenticate the transfer.
  • Bonus to track transfer state, communicate progress on it, and allow for pausing and resuming a transfer.
  • Bonus if we can start a database transaction log via $wpdb or similar system when starting a transfer so that the source site can continue to serve requests and ensure that the destination site gets a full concurrent update to its data.

Challenges

  • This assumes a blank slate on the target site otherwise we risk overwriting ids or mismatching ids.
  • The right design could become a foundation for live synchronization between WordPress sites.

Related efforts

Co-authored with @dmsnell

cc @dufresnesteven @berislavgrgicak @tellyworth @dd32 @barry @payton @peterwilsoncc @swissspidy @miyarakira @matt @youknowriad @mamaduka @aristath

Attachments (2)

Sync Protocol Flow.drawio.png (152.5 KB) - added by dmsnell 6 weeks ago.
Rough outline of synchronizing resources via vector clock
Sync Protocol Dashboard.drawio(1).png (133.3 KB) - added by dmsnell 6 weeks ago.
A hypothetical dashboard showing which resources are ready to be sent, optionally withheld from syncing, and an ongoing transfer that can be paused.

Download all attachments as: .zip

Change History (27)

#1 @zieladam
4 months ago

  • Description modified (diff)

#2 @zieladam
4 months ago

  • Description modified (diff)

#3 @zieladam
4 months ago

  • Description modified (diff)

#4 @zieladam
4 months ago

  • Description modified (diff)

#5 @zieladam
4 months ago

  • Description modified (diff)

#6 @zieladam
4 months ago

  • Summary changed from Site Transfer protocol to Site Transfer Protocol

#7 @dmsnell
4 months ago

Unlike WXR imports

Lately I've been thinking that we could augment WXR since it's inherently expandable, to communicate much of the extra site meta. In fact, existing imports would/should ignore those sections, but transfers could use the information to aid in the import.

Communicate source and destination site domains/base URLs

One challenge in recording source and destination is that either can disappear. If we normalize all deployment-related information to the Playground and its URLs, then presumably we could export from any site, and that same export file could be imported into any other deployment, regardless of where it originated or where it's bound.

This could also easy the identification of assets within the database since there would be a common Playground prefix.

Communicate wp-config.php settings, including things like WP_SITEURL and plugins directory, theme directory, content directory, memory limits, and other settings.

Extending the previous point, a transform to the Playground as an intermediary would give us the chance to not only record those options which are specifically called out on the source site, but potentially other available settings so that we can ensure proper reception.

Consider if we normalize to a Playground environment where every custom setting is adjusted. Someone could load that export file into the Playground and ensure it will work wherever it's going. For example, maybe there's a custom location for plugins; if the source site is hard-coding values then it might break on import. This preview step would provide an opportunity to catch that error.

#8 @zieladam
4 months ago

@dmsnell I just realized we need to answer this fundamental question before going further:

Is Site Transfer a direct Host <-> Host operation with optional support for .zip uploads? Or is it an export&download -> upload&import operation built with future Host<->Host exchange in mind?

I assumed this ticket is about a "download&upload" flow, and the direct Host <-> Host exchange will be another discussion related to the Sync Protocol.

Now I am not so sure. Perhaps the architecture should already prioritize a conversational and asynchronous data exchange between sites, with ZIP downloads as a cool-but-secondary feature.

Lately I've been thinking that we could augment WXR since it's inherently expandable, to communicate much of the extra site meta. In fact, existing imports would/should ignore those sections, but transfers could use the information to aid in the import.

I can see three "natural" ways of encoding that metadata:

  • WXR. However, when would WXR hold both content AND metadata? On site export the content would be in the database so the WXR file would only carry the metadata – at which point it's wouldn't have almost anything in common with WXR as we know it. On the upside, WXR can be streamed with the upcoming XML API and you can also edit them with a text editor.
  • JSON. Either package.json, wp.json, or blueprint.json. I am not sure what would it carry besides that metadata, though. A wp.json or blueprint.json file could be the standard bootstrap format for WordPress sites and it would play nicely with Blueprints being encoded as JSON.
  • SQLite. There could be a table like wp_export_metadata where all the relevant information would be stored. Or perhaps we could even store that as site options. It would be readable and writable via $wpdb.

For ZIP exports, I would love the JSON option as it would be aligned with Blueprints JSON files. For streaming data exchange, WXR seems better suited. I'd scope this down to a single data format for starters and WXR – or XML in general – seems to have the most upsides. Later on we can chat about JSON support to repurpose the same logic in development environments.

One challenge in recording source and destination is that either can disappear.

In ZIP exports, this is much easier as the archive contains all the static assets and not just their remote URIs.

In direct Host <-> Host transfer we need an entire world of error handling logic.

Now I question that fundamental question again. An initial implementation that only works with a ZIP export file seems like a much more attainable target. It would already help developers, small-to-medium sites, and Playground while enabling learning and exploring. The Host<->Host conversation could still drive all the major architectural decisions to enable smooth transition into implementing that API in V2.

Consider if we normalize to a Playground environment where every custom setting is adjusted

I love this. My only concern are large, 10GB exports that wouldn't fit into memory today. Playground could support remote ZIP files as virtual filesystems, though, which would unlock working even with those huge sites. Writing the changes back into these ZIPs would mean simply appending updated files at the end of the stream.

Last edited 4 months ago by zieladam (previous) (diff)

#9 @dmsnell
4 months ago

Is Site Transfer a direct Host <-> Host operation with optional support for .zip uploads? Or is it an export&download -> upload&import operation built with future Host<->Host exchange in mind?

The more I consider it the more I see these as the same thing, whereby the ZIP format is the means through which we normalize the transfer. I could be totally overlooking obvious things here though, so I would like to know where this idea makes no sense.

Given the VFS-like interface we have with ZIPs, I imagine that if a site only wants to import posts and not media then it will skip the part of the ZIP containing the wp-contents file.

Maybe this is asking too much of the remote site, to regenerate a ZIP on the fly for specific parts. The challenges I'm capable of seeing at the moment are all more related to whether we ship wp-content assets the same way we ship database and config values. It's all about bulk data and less about the destination of the transfer.

WXR. However, when would WXR hold both content AND metadata? On site export the content would be in the database so the WXR file would only carry the metadata – at which point it's wouldn't have almost anything in common with WXR as we know it. On the upside, WXR can be streamed with the upcoming XML API and you can also edit them with a text editor.

I'm still completely on the fence about this too. Of course there'd be duplication of content in the WXR vs. the tables, but I see the database as the authoritative source for non-asset content while the WXR could be a reasonable signaling protocol to guide the import.

Some part of me wants to remove the content from the WXR, but if we do that we potentially lose a lot for older systems and for our ability to easily inspect the export. 🤔

Even if we have post content in the WXR it will lack the meta information unless we also export it there as well, which I guess we could do, and even remove those rows from the sqlite database 🤷‍♂️

Something big still seems to be missing that I haven't seen yet on all this, but I think we're starting to get a better handle of the space by asking all these questions and figuring out how it could all go wrong.

In direct Host <-> Host transfer we need an entire world of error handling logic.

Yes, but also if "the Playground ZIP" is the transfer format then it's indistinguishable from importing a ZIP from a local disk, other than the bytes are arriving over the network. yeah we'll need another layer of error handling, but we should be able to restart the ZIP mid-sequence on the source site.

makes me thing that one preliminary step we'd need for this, to make it reentrant, is to create a relatively small manifest on the source site to start the process. this could do a number of things:

  • generate content hashes for all relevant media or database tables. this might involve some way of snapshotting the data.
  • generate a list of media files and their content hashes
  • sequence the files for the ZIP stream.

after this the source site can reference that manifest to virtually deliver the ZIP stream mid-sequence without having to scan all the data on its own disk. this manifest would roughly correspond in size to the number of files and database objects, but it could itself be a kind of journaling snapshot of a site - maybe there's a tie-in with other snapshotting/concurrent work on this

#10 @zieladam
4 months ago

The more I consider it the more I see these as the same thing, whereby the ZIP format is the means through which we normalize the transfer. I could be totally overlooking obvious things here though, so I would like to know where this idea makes no sense.

I think you're right here. To me, Host <-> to ZIP is a version of the Host <-> Host scenario where one of the Hosts passively accepts the stream and never asks to pause, rewind, resume, compute delta etc.

The challenges I'm capable of seeing at the moment are all more related to whether we ship wp-content assets the same way we ship database and config values. It's all about bulk data and less about the destination of the transfer.

I consider both bulk data. Database dumps may be as large as 300GB. Harmonizing the transfer infrastructure for both seems like more reliably and less code.

Of course there'd be duplication of content in the WXR vs. the tables, but I see the database as the authoritative source for non-asset content while the WXR could be a reasonable signaling protocol to guide the import.

We should bet a clearer idea of what's worth including in WXR and duplicating from the database once we start shaping the code. I remain open minded.

Even if we have post content in the WXR it will lack the meta information unless we also export it there as well, which I guess we could do, and even remove those rows from the sqlite database 🤷‍♂️

I'm hesitant about this. We'd have to preserve IDs so it would be just moving the data for the sake of moving it.

It should be fine to ship a WXR file that has no content at all, only the guiding metadata. I even wonder whether Blueprints could be expressed as WXR, in which case the exported WXR file would be a Blueprint.

yeah we'll need another layer of error handling, but we should be able to restart the ZIP mid-sequence on the source site.

Agreed! Here's a fancy idea – we could support partial uploads. Instead of uploading the entire 1GB ZIP all at once, it would become a series of requests, each transmitting a part of that ZIP. This way no crash would invalidate all the work. Even if the internet goes down, you'd just "upload" the ZIP file again and only the unprocessed bytes would be uploaded – and stream-processed during the upload.

is to create a relatively small manifest on the source site to start the process. this could do a number of things:

It could be bundled in WXR perhaps.

#11 @zieladam
4 months ago

A proper WXR importer would be a good first step here. Brownie points if it supports stream processing. WXR carries post data where site URLs need to be rewritten in the block attributes and inside the HTML markup. Some of these posts may be larger than the available memory size, hence the need for stream processing. All of that also applies to the full site transfer.

The only blocker I can see for that WXR importer is lack of an XML_Tag_Processor in core. We do have HTML_Tag_Processor, but it can't be used to process XML documents. Luckily, @dmsnell is working on the XML machinery. Once it is available, it will unlock a lot of exploratory avenues here.

Last edited 4 months ago by zieladam (previous) (diff)

#12 @zieladam
4 months ago

The export bundle could rewrite all site URLs to wordpress.internal if this proposal goes through:

https://www.icann.org/en/public-comment/proceeding/proposed-top-level-domain-string-for-private-use-24-01-2024

@georgestephanis noted that:

My understanding is that by designating .internal this way, networking hardware can short circuit dns requests and the like

Here's a related HN discussion: https://news.ycombinator.com/item?id=39206503

Alternatively, perhaps a .local domain would also be a good candidate.

If neither of the two would make sense, playground.wordpress.net sounds like the third best option.

#13 @dmsnell
4 months ago

Some of these posts may be larger than the available memory size, hence the need for stream processing

this may be splitting hairs, but I don't believe that this is a constraint. ultimately our minimum chunk of content that must be able to fit into memory atomically is a post's contents. while a WXR may be much bigger than the available memory, any post whose post_content is too large will fail at every stage of the pipeline and cannot be imported, created, or rendered even if it were in the database.

the discussion for .internal is interesting. might be a ripe time for us to share some input to represent WordPress' and the Playground's needs.

#14 @zieladam
3 months ago

Surfacing the Stream Everything issue in the PHP Blueprints project – we could use those explorations to build the data processing primitives for the live sync:

https://github.com/WordPress/blueprints/issues/2

#15 @zieladam
3 months ago

About site-to-site sync, I want to surface a future access control problem to discuss. Imagine syncing a WordPress site into a mobile app where a collaborator of that site is logged in. Is it possible to build an access control layer that limits the sync to only the data that subscriber is supposed to read and modify?

My gut feeling is: no. The sync needs to be restricted to site admins.

Let's discuss the two synchronization angles here:

  • Writing the data. WordPress uses use-case oriented capabilities. Sync would use database rows and columns. The two models don’t work together. This can theoretically be solved, though, by implementing row– and column– level write restrictions.
  • Reading the data. From the database perspective, WordPress demands full access even if it limits what the users may read. To render the site, WordPress uses site options, post meta, plugin tables, and so on.

I don’t think reading permissions can be solved. I just don’t see a way to run a WordPress site using just the subset of data subscribers have access to. If anyone can think of a solution, let's discuss.

#16 @zieladam
2 months ago

About live sync, @berislav.grgicak said the CouchDB Replication Protocol and CouchDB API are both noteworthy and could potentially be useful to recreate in PHP:

#17 @zieladam
2 months ago

Dolt also looks interesting:

https://github.com/dolthub/dolt

It’s a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.

MySQL->Dolt replication, push/pull mechanics, and the conflict resolution approach could give us a lot for free. PostgreSQL support is on their roadmap, too, and perhaps that’s how WP could get Postgres support. A few things that worry me are the ~103mb binary size, no plans to support WebAssembly (in 2019 at least), and no timeline for Dolt->MySQL replication.

#18 @pradeepvr
2 months ago

To make this thing work across the board, should we specify and pre declare entities like Code config, Site Settings Data and user generated content separately to make a clean migration.

Also this has to be done with a strategy where the migration code should consider the post to attachement and post meta linking in a ID agnostic way. Is this possible to get into this scope here?

Last edited 2 months ago by pradeepvr (previous) (diff)

#19 follow-up: @zieladam
7 weeks ago

@pradeepvr this issue is specifically about a transfer technique that preserves all the IDs. I don't think IDs can be reliably abstracted given that they're often serialized in JSON or serialize() output. Would you be able to elaborate on why do you need an ID agnostic way of linking database data? FWIW, a VersionPress-like approach could be useful for that.

#20 @zieladam
7 weeks ago

@dmsnell had an idea of setting up a state table on both the "source" and the "target" site. It would consist at least of a "Connection ID", "Version number", and "Record identifier". Potentially it could also include a "Last updated" timestamp.

A connection ID would be different for each connected site – I could have a single WordPress propagating content to three different sites.

A Version Number would be a simple counter. Anytime a database record changes, it's incremented. The actual data revision wouldn't be stored here, just the number. The only purpose is to check whether a given record changed since the last sync.

A Record identifier would uniquely identify a record. It could be two columns – one for the table, and one for the record ID.

The timestamp would enable querying changes made since a specific point in time, e.g. "the last sync" but also "yesterday".

#21 in reply to: ↑ 19 @pradeepvr
7 weeks ago

Replying to zieladam:

Would you be able to elaborate on why do you need an ID agnostic way of linking database data? FWIW, a VersionPress-like approach could be useful for that.

I thought the main thing about the ID agnostic way would be it will help import post, pages and attachments separately in out of order and doesn't have to worry about linking between them manually.

I was thinking in the lines of separating the configuration settings and user generated content before finalizing the STP and using that, we could copy only the user content from one site to another. With the Versionpress and connectionID approach you have mentioned in other comment i infer now that its not needed to do it in a post ID agnostic way. I don't fully understand this STP thing and thus i'm blabbering little bit it seems :)

#22 @zieladam
7 weeks ago

Thank you for explaining @pradeepvr! Indeed ID agnosticism is not required to reliably transfer the user content, in fact it will be more reliable if the IDs are preserved :-) However, for ID-agnostic approach, one exploratory avenue would be to start IDs with something arbitrary like 12345600001, and then try to detect them in text blobs. It wouldn't be 100% reliable, but perhaps it would work well enough.

#23 @zieladam
7 weeks ago

Surfacing this interesting bit from @artpi:

I found a very interesting alternative to PouchDB:
I am brainstorming of ways to use WP REST API as Firebase replacement and stumbled upon Watermelon DB:
https://watermelondb.dev/docs
It uses SQLite under the hood, just as playground. But it also has a sync engine that does not require a specific software on the server side. It only requires 2 endpoints: push and pull.
https://watermelondb.dev/docs/Sync/Backend
Not only that, but I’m thinking it can be made use existing WP REST API endpoints by overriding synchronize function
https://watermelondb.dev/docs/Sync/Frontend
This is something more interesting for mobile apps.
But here is what Im thinking:
Replicate WP Schema in Watermelon DB on the playground end
Write Push/Pull endpoints in a plugin
That way we would have fully decentralized sync where playground instances can treat “real” wp site as a server

@dmsnell
6 weeks ago

Rough outline of synchronizing resources via vector clock

#24 @dmsnell
6 weeks ago

A long post follows; please bear with me.

In my proposal, I considered using a form of vector clock to track potentially-unsynchronized state between connected WordPresses. I've tried to convey an extremely rough sketch in the attachment above. This does not address the conflated ID problem, but I can hopefully speak to that at the end.

I propose a best-effort system for ensuring that updated resources are detected and shared between connected sites, where connected sites are admin-level connections communicating via a "backdoor" secure connection, established by exchange of private/public key pairs.

For each connection, both sites will store a new record in their synchronization state table indicating the identify of the connected WordPress. This will be important for the UX of the system.

When resources are updated, they have inherent dependencies. These could be files or related database records. By instrumenting $wpdp properly, we can build associations and dependency chains automatically (or choose to keep all resources in sync between sites and record everything). Every time a record is updated, we track in a state table a version number for that resource. This is a simple system: a write increments the version by one, even if the data is the same as before the update.

A site will then have a new table tracking every uploaded file, every plugin, every database record, and every of any other resource it has, as well as a single number for each of those. This table will be much smaller than the tables containing those resources. Deleting a resource can be represented through NULL or 0 or some other tombstone.

When sites connect, a primary site can transfer all its records (the Transfer) to the secondary site. It will record in the sync state tracking which version of each resource it sent during the transfer (and it can wait for acknowledgement from the receiving site). From this point on it will have a sound guess at what content the secondary site has.

When sites continue to communicate, the primary site can compare the version of each resource it has updated against the version it last sent to the secondary site. Any new, deleted, or updated resources are expected to be stale on the secondary site and thus need to be transferred over.

User flows

It's at this point we can see some high-level designs in this approach. For minimal additional work and storage we can track what content needs to be transferred. This can be presented to a user in a dashboard, and we can even create "recognizers" to further classify the resources. For example, a plugin can give a name and description to an otherwise unknown database row. The primary site can perform a quick computation to estimate the total number of resources needing a transfer, as well as their approximate byte size.

This method also depends on establishing two-way communication via the "backdoor" channel. This can be achieved on standard WordPress hosts using a combination of long-polling and stream_select() and some other communication on the server, but does not require long-running PHP processes or threads or forking processes. See the next attached image for a preview of the dashboard.

This is a direct synchronization protocol, whereby two connected sites trust each other, and the receiving site will import received content into its database. Things is currently lacks is a sense of provenance. It would be favorable to store the source and timestamp of all imported resources in order to be able to show what has been sync'd vs. what was created locally.

Because of the sync-state table all transfers are interruptable and trackable. They can fail and be retried. Also, through the use of the HTML API and dependency inference, it's possible to prioritize resource transfer, such that dependent resources exist on the receiving end before the resource itself. This leads to zero-downtime transfers where an imported post is immediately complete upon import, since any linked content exists first and the post can be rewritten upon arrival with the HTML API to update those links.

Discussion

I apologize for how lengthy and simultaneously rough and prescribed this is. I'm trying to dump some ideas "onto paper" since @zieladam and I have spoken about this many times. It's a big-picture idea for a technical design that powers a specific user flow, which is all about visibility into a reliable and interruptable synchronization process.

@dmsnell
6 weeks ago

A hypothetical dashboard showing which resources are ready to be sent, optionally withheld from syncing, and an ongoing transfer that can be paused.

#25 @zieladam
6 weeks ago

Downloading media files sequentially can be extremely slow – see this network graph from Playground: https://github.com/WordPress/blueprints-library/issues/100

The transfer protocol should parallelize the transfers using stream_select – a native PHP function supported on all PHP versions supported by WordPress. I've built this small, dependency-free AsyncHTTP\Client wrapper for the Blueprints library that we could readily reuse here:

https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php

Note: See TracTickets for help on using tickets.