Make WordPress Core

Opened 3 weeks ago

Last modified 2 days ago

#64696 new defect (bug)

Real time collaboration effectively disables persistent post caches while anyone edits a post

Reported by: peterwilsoncc's profile peterwilsoncc Owned by:
Milestone: 7.0 Priority: normal
Severity: normal Version: trunk
Component: Posts, Post Types Keywords: has-patch has-unit-tests
Focuses: performance Cc:

Description

When real time collaboration is enabled, any time the post editor is open the persistent caching for WP_Query is effectively turned off due to the frequent updates to the wp_sync_awareness meta key.

Any time post data is updated, the post's last_changed time is updated in persistent caches to indicate that the entire WP_Query cache (and various other post caches) needs to be invalidated.

To reproduce:

  1. Configure a site with a persistent cache
  2. Enable real time collaboration
  3. Without an editor open, run the WP CLI command wp eval "var_dump( wp_cache_get_last_changed( 'posts' ) );" a couple of times, a few seconds apart
  4. Observe that the value does not change between calls.
  5. Edit a post, any post type will work.
  6. Run the WP CLI command wp eval "var_dump( wp_cache_get_last_changed( 'posts' ) );" a couple of times, a second apart
  7. Observe that the value changes every second

As objects cached in the post-queries group use the outcome of wp_cache_get_last_changed( 'posts' ) to salt their caches, this means that leaving the editor open will effectively prevent the caching of post queries throughout the site.
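The salting mechanism can be sketched roughly as follows (a simplified illustration of how post query caches are keyed; variable names are illustrative, not the exact core implementation):

```php
// Simplified sketch of how post query caches are salted; names are
// illustrative, not the exact core implementation.
$last_changed = wp_cache_get_last_changed( 'posts' ); // microtime-based value
$cache_key    = 'wp_query:' . md5( serialize( $query_vars ) ) . ':' . $last_changed;

// Any bump of last_changed makes every previously stored key unreachable,
// effectively invalidating all cached post queries at once.
$cached = wp_cache_get( $cache_key, 'post-queries' );
```

Because the salt is shared by the whole group, a once-per-second bump means no cached post query survives longer than a second.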

Reducing the frequency at which the wp_sync_awareness meta data is updated would reduce the severity of the problem, if that's possible.

To an extent this is an inevitable effect whenever the RTC syncing data is stored in the post and post meta tables. Ideally it would be limited to occur only when an edit is made, so that leaving a browser tab open doesn't have negative implications for the performance of a site.

I've put this on the 7.0 milestone for the purposes of investigating the issue to see if mitigation is an option.

Change History (62)

#1 follow-up: @dd32
3 weeks ago

For discussion purposes, how much does this differ from the previous behaviour?

I believe the previous editor / post-locking heartbeat also resulted in a post edit, but I recall that being every 30-60 seconds instead of almost every second.

#2 @westonruter
3 weeks ago

Oh, that's not good. It seems this is due to WP_Sync_Post_Meta_Storage::add_update() calling add_post_meta(), which ends up triggering an added_post_meta action, and in default-filters.php the wp_cache_set_posts_last_changed() function is hooked to run when added_post_meta fires.

Similarly, when WP_Sync_Post_Meta_Storage::set_awareness_state() is called, it calls update_post_meta() which will end up triggering the updated_post_meta action, and wp_cache_set_posts_last_changed() also runs when that happens.

To work around this, we could temporarily unhook these (as registered in default-filters.php) while the sync post meta is added or updated:

add_action( 'added_post_meta', 'wp_cache_set_posts_last_changed' );
add_action( 'updated_post_meta', 'wp_cache_set_posts_last_changed' );

This presumes that changes to this postmeta will not impact any related caches (which I presume they wouldn't).
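A minimal sketch of the temporary unhook being described (assuming the write happens inside the storage class; this is not the actual PR implementation):

```php
// Hypothetical sketch: suppress the last_changed bump for this one
// write only, then restore the hook afterwards.
remove_action( 'updated_post_meta', 'wp_cache_set_posts_last_changed' );
update_post_meta( $post_id, 'wp_sync_awareness', $awareness_state );
add_action( 'updated_post_meta', 'wp_cache_set_posts_last_changed' );
```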

#3 in reply to: ↑ 1 ; follow-up: @peterwilsoncc
3 weeks ago

Replying to dd32:

For discussion purposes, how much does this differ from the previous behaviour?

I believe the previous editor / post-locking heartbeat also resulted in a post edit, but I recall that being every 30-60 seconds instead of almost every second.

Heartbeat:

  • fires every ten seconds
  • throttles to two minutes once the browser loses focus, via the visibility API
  • throttles to two minutes if the browser has focus but there's no mouse/keyboard activity for five minutes
  • suspends itself after ten minutes if suspension is enabled; suspends itself after sixty minutes if suspension is disabled

RTC:

  • fires once per second without collaborators active
  • fires four times per second with collaborators active
  • backs off after network errors; once a network request succeeds it returns to full speed
  • doesn't slow down if the browser window loses focus
  • I can't find any code indicating a slow-down due to a lack of activity; I've had a browser window open in the background for around ten minutes while typing this and it's still firing each second

Replying to westonruter:

To work around this, we could temporarily unhook these when the post meta is added or updated:

That will end up breaking the cache for post queries using meta queries, eg WP_Query( [ 'meta_key' => 'looney_tunes', 'meta_value' => 'tom and jerry' ] );.

Last edited 3 weeks ago by peterwilsoncc (previous) (diff)

This ticket was mentioned in PR #11002 on WordPress/wordpress-develop by @westonruter.


3 weeks ago
#4

  • Keywords has-patch added

Trac ticket: https://core.trac.wordpress.org/ticket/64696

## Use of AI Tools

None

#5 in reply to: ↑ 3 ; follow-up: @westonruter
3 weeks ago

Replying to peterwilsoncc:

Replying to westonruter:

To work around this, we could temporarily unhook these when the post meta is added or updated:

That will end up breaking the cache for post queries using meta queries, eg WP_Query( [ 'meta_key' => 'looney_tunes', 'meta_value' => 'tom and jerry' ] );.

I don't mean for all post meta additions and updates. I mean specifically in WP_Sync_Post_Meta_Storage::set_awareness_state() and WP_Sync_Post_Meta_Storage::add_update().

Here's what I have in mind: https://github.com/WordPress/wordpress-develop/pull/11002

The only post meta involved would be wp_sync_awareness and wp_sync_update. I don't believe there are any queries introduced which involve querying by these post meta.

@czarate commented on PR #11002:


3 weeks ago
#6

I can't speak to whether this is the correct performance mitigation, but the code change looks good in the context of the sync provider. 👍 Thank you!

#7 follow-up: @mindctrl
3 weeks ago

I'll have to measure before and after to confirm, but a cursory look at Weston's PR seems to reduce meta db ops by a meaningful amount.

  • doesn't slow down if the browser window loses focus
  • I can't find any code indicating a slow-down due to a lack of activity; I've had a browser window open in the background for around ten minutes while typing this and it's still firing each second

Probably should be a separate ticket, but I can confirm this behavior. It would be nice to include some throttling detection for when browsers lose focus and/or don't have typing/activity in some period of time. I left a couple clients open for ~7 hours for testing and this caused over 230k post meta writes.
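As a rough illustration of the kind of throttling being suggested (a hypothetical sketch; function names and intervals are assumptions, not Gutenberg code):

```javascript
// Hypothetical sketch: choose a sync polling interval from page
// visibility, idle time, and collaborator count. Intervals are
// illustrative only.
function pollingInterval( { hidden, msSinceActivity, collaborators } ) {
	const IDLE_MS = 5 * 60 * 1000; // treat five idle minutes as inactive
	if ( hidden || msSinceActivity > IDLE_MS ) {
		return 120000; // two minutes, mirroring Heartbeat's back-off
	}
	return collaborators > 0 ? 250 : 1000; // 4 Hz with peers, else 1 Hz
}
```

A client could recompute this on visibilitychange and input events, so background tabs stop hammering the server.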

#8 in reply to: ↑ 7 @czarate
3 weeks ago

Replying to mindctrl:

Probably should be a separate ticket, but I can confirm this behavior. It would be nice to include some throttling detection for when browsers lose focus and/or don't have typing/activity in some period of time. I left a couple clients open for ~7 hours for testing and this caused over 230k post meta writes.

I created this Gutenberg issue since polling intervals are controlled client-side.

https://github.com/WordPress/gutenberg/issues/75829

@mindctrl commented on PR #11002:


3 weeks ago
#9

I'll try to confirm the level of accuracy of the screenshots included here to be sure there are no bugs in the way I'm tracking this, but here's a quick look at the difference before and after this PR:

## Before
https://github.com/user-attachments/assets/523a411b-4a77-482e-ab53-95a497fb475c

## After
https://github.com/user-attachments/assets/a30bd562-94bc-4ebd-8976-ac78500ea8c5

@westonruter commented on PR #11002:


3 weeks ago
#10

@mindctrl it might be helpful to also have a client that is continuously calling a function that uses wp_cache_get_last_changed( 'posts' ) as the salt for a return value cached in the object cache, to simulate the impact of having this on a live site getting a lot of traffic. This would be in addition to having sync clients open. For example, get_archives() would continuously have its cache impacted by sync changes.

It seems like there are a lot of examples where unrelated post and post meta changes are causing object cache invalidations needlessly. (Cache invalidation is hard!)

#11 in reply to: ↑ 5 ; follow-up: @peterwilsoncc
3 weeks ago

Replying to westonruter:

I don't mean for all post meta additions and updates. I mean specifically in WP_Sync_Post_Meta_Storage::set_awareness_state() and WP_Sync_Post_Meta_Storage::add_update().

Here's what I have in mind: https://github.com/WordPress/wordpress-develop/pull/11002

The only post meta involved would be wp_sync_awareness and wp_sync_update. I don't believe there are any queries introduced which involve querying by these post meta.

Yep, that makes more sense, but I'd be hesitant to do so for a couple of reasons:

  • it adds a bug trap for our future selves, ie WordPress contributors, due to the snowflake code
  • it blocks third party developers from using the proper APIs to monitor usage

Initially I think it would be better to investigate back-off in the Gutenberg repo and monitor how that goes before messing with caching algorithms

#12 in reply to: ↑ 11 ; follow-up: @westonruter
3 weeks ago

Replying to peterwilsoncc:

Initially I think it would be better to investigate back-off in the Gutenberg repo and monitor how that goes before messing with caching algorithms

That too, but even with back-off we definitely don't want syncing logic to have any impact on the frontend. Any changes to a wp_sync_storage post type or any of its post meta should be isolated from impacting any frontend queries since they are totally unrelated. The only thing which should impact the frontend is when the post being modified actually gets an update.

#13 @peterwilsoncc
3 weeks ago

I've requested an assist on GB#75829 as my PR, GB PR#75843, works fine until embeds are added to the post.

#14 in reply to: ↑ 12 ; follow-up: @peterwilsoncc
3 weeks ago

Replying to westonruter:

...even with back-off we definitely don't want syncing logic to have any impact on the frontend.

As Dion indicates above, with the heartbeat API there is some amount of effect on the front end while editing. The same is true for revisions, autosaves and probably a few more things.

There may be a solution for this by changing how we salt the cache for WP_Query to reduce invalidation to a subset of queries. A problem for another ticket.

Under no circumstances do I think the solution is to deliberately store out of date data in the cache. To do so would replace one bug with another.

#15 in reply to: ↑ 14 @westonruter
3 weeks ago

Replying to peterwilsoncc:

Replying to westonruter:

...even with back-off we definitely don't want syncing logic to have any impact on the frontend.

As Dion indicates above, with the heartbeat API there is some amount of effect on the front end while editing. The same is true for revisions, autosaves and probably a few more things.

True, but the problem now seems exponentially worse, since the cache invalidations can happen four times a second for every user syncing changes. In contrast, the most frequent heartbeat interval is once every 15 seconds, and an autosave only happens once every 60. So while a user is typing continuously with real time collaboration enabled, this can result in 240 invalidations of posts in the object cache per minute, whereas with the autosave heartbeat it only happens once a minute. On a highly dynamic, heavily trafficked site without page caching, but with persistent object caching to deal with scaling issues, this can result in the object cache essentially being disabled and the database load being increased substantially.

I don't necessarily like my workaround for this issue and I would like there to be a deeper solution to fix the issue of cached posts in the object cache being invalidated needlessly. But I'm also mindful that we're now in beta, and solving this much bigger performance problem for the general case may be more appropriate for 7.1.

This ticket was mentioned in Slack in #core by johnparris. View the logs.


3 weeks ago

This ticket was mentioned in Slack in #hosting by johnparris. View the logs.


2 weeks ago

This ticket was mentioned in PR #11067 on WordPress/wordpress-develop by @mindctrl.


2 weeks ago
#18

  • Keywords has-unit-tests added

This is a proof of concept for moving awareness state out of wp_postmeta and into the cache, using the Transients API, which will use the object cache if available and fall back to wp_options if not.

I'm looking for feedback on whether this is a viable idea in general. If so, I can iterate on this to get it in better shape.

Trac ticket: https://core.trac.wordpress.org/ticket/64696

## Use of AI Tools

I used Claude Opus 4.6 to help with this, mostly for the phpunit test.

This ticket was mentioned in PR #11068 on WordPress/wordpress-develop by @JoeFusco.


2 weeks ago
#19

The real-time collaboration sync layer currently stores messages as post meta, which works but creates side effects at scale. This moves it to a dedicated wp_sync_updates table purpose-built for the workload.

The beta1 implementation stores sync messages as post meta on a private wp_sync_storage post type. Post meta is designed for static key-value data, not high-frequency transient message passing. This mismatch causes:

  1. Cache thrashing — Every sync write triggers wp_cache_set_posts_last_changed(), invalidating site-wide post query caches unrelated to collaboration.
  2. Compaction race condition — The "delete all, re-add some" pattern in remove_updates_before_cursor() loses messages under concurrent writes. The window between delete_post_meta() and the add_post_meta() loop is unprotected.
  3. Cursor race condition — Timestamp-based cursors (microtime() * 1000) miss updates when two writes land within the same millisecond.

A purpose-built table with auto-increment IDs eliminates all three at the root: no post meta hooks fire, compaction is a single atomic DELETE, and auto-increment IDs guarantee unique ordering. The WP_Sync_Storage interface and WP_HTTP_Polling_Sync_Server are unchanged.
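For example, compaction against such a table could be a single statement (a sketch only; the table and column names here are assumptions based on the description above, not the PR's exact code):

```php
// Hypothetical compaction sketch: one atomic DELETE replaces the
// unprotected delete-all/re-add pattern used with post meta.
$wpdb->query(
	$wpdb->prepare(
		"DELETE FROM {$wpdb->prefix}sync_updates WHERE room = %s AND id < %d",
		$room,
		$cursor
	)
);
```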

Also adds a wp_sync_storage filter so hosts can substitute alternative backends (Redis, WebSocket) without patching core, and includes a beta1 upgrade path that cleans up orphaned wp_sync_storage posts.

### Credits

Trac ticket: https://core.trac.wordpress.org/ticket/64696

## Use of AI Tools

Co-authored with Claude Code (Opus 4.6), used to synthesize discussion across related tickets and PRs into a single implementation. All code was reviewed and tested before submission.

@peterwilsoncc commented on PR #11067:


2 weeks ago
#20

@mindctrl The transients API is designed in such a way that persistence isn't guaranteed. Data can be removed at any time, either by the cache being flushed or by the transients being cleaned up.

Sites with multiple servers may use a shared cache or a per-server cache, for example, but even on single-server sites the data may be dropped during a deploy.

As it's inherently unreliable, I don't think this is an approach we can take.

#21 @JoeFusco
2 weeks ago

I opened a PR which moves sync storage out of post meta entirely into a dedicated table. Details can be found on GitHub.

https://github.com/WordPress/wordpress-develop/pull/11068

@JoeFusco commented on PR #11067:


2 weeks ago
#22

Hey @mindctrl, I opened #11068 which builds on this same direction. Moved all sync storage to a dedicated table, with awareness in transients like your approach here.

@peterwilsoncc's point about transient reliability is worth discussing — awareness data is transient by nature (cursor positions, selections) and repopulates on the next poll, so a cache flush just means a brief flicker rather than data loss. Happy to explore alternatives if that's still a concern.

#23 @peterwilsoncc
2 weeks ago

WordPress/Gutenberg@0013814ce has been merged to throttle syncing when the browser tab is inactive: ie, when the screen saver is active or the user has switched to another tab.

I initially attempted to use document.hasFocus() to slow down too, but that became super flaky with embeds, as it's not possible to determine whether a cross-origin iframe has focus.

@mindctrl commented on PR #11067:


2 weeks ago
#24

@peterwilsoncc thanks for reviewing. To be clear, I don't love this idea, but given how late we are in the cycle, I wanted to try to reduce the amount of postmeta db operations without refactoring a lot of code. I'd much prefer a custom table.

Awareness syncing is the biggest "offender" here, because even when no content is changing and clients are idle, it's still updating the updated_at value for each client.

Since this is awareness state and not content data, it wouldn't be a lossy, destructive thing if the cache were cleared temporarily before the next awareness state sync, which would happen less than 1 second later. It would just appear that someone dropped offline very briefly.

@JoeFusco commented on PR #11068:


2 weeks ago
#25

@peterwilsoncc Thanks for the thorough review — please don't apologize, this is exactly the kind of institutional knowledge that's invaluable.

As not all sites run the database upgrade routine on a regular basis (I think wordpress.org is often behind) as it can be quite a burden on a site, if this approach is to be taken it would need to include a check for the table's existence in a couple of locations:

Completely agree. I believe I've audited every code path that reads wp_enable_real_time_collaboration to map out what actually needs protection.

  • The option_{$option} hook

When the setting is saved, the flow through options.php → update_option() writes to the wp_options table. No code is hooked onto that option change that would interact with the sync_updates table, so toggling the setting is safe regardless of table state.

  • Before displaying the options field on the writing page

The checkbox in options-writing.php calls get_option() to check/uncheck itself — it doesn't query the sync_updates table, so it renders safely whether or not the upgrade has run.

The code path that touches the table is the REST route registration in rest-api.php:433, which instantiates WP_Sync_Table_Storage. That's already gated with a db_version >= 61697 check (c13f1cd), so the table is never accessed on sites that haven't run the upgrade.

Some form of filtering would need to be available for sites that are using custom data stores.

The wp_sync_storage filter (line 443) should cover this — it lets sites swap out WP_Sync_Table_Storage for any custom WP_Sync_Storage implementation (Redis, etc.).
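For instance, a host could swap in a custom backend along these lines (the filter name comes from the PR; the class here is hypothetical):

```php
// Hypothetical sketch: substitute a custom WP_Sync_Storage
// implementation via the PR's proposed wp_sync_storage filter.
add_filter(
	'wp_sync_storage',
	function ( $storage ) {
		// My_Redis_Sync_Storage is an illustrative custom class that
		// would implement the WP_Sync_Storage interface.
		return new My_Redis_Sync_Storage();
	}
);
```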

Let me know if I've missed anything or if you think additional safeguards would be worthwhile!

@westonruter commented on PR #11068:


2 weeks ago
#26

@peterwilsoncc:

As not all sites run the database upgrade routine on a regular basis (I think wordpress.org is often behind) as it can be quite a burden on a site, if this approach is to be taken it would need to include a check for the table's existence in a couple of locations:

  • The option_{$option} hook
  • Before displaying the options field on the writing page

Some form of filtering would need to be available for sites that are using custom data stores.

Yeah, this has been an issue, but I think it's primarily/only really an issue for database tables that are used for frontend features. Given that the proposed wp_sync_updates table here is exclusively used in the post editor, which is in the admin, and access to the admin requires first going through the DB upgrade screen, I think this wouldn't be a concern.

I know Matt has been reluctant in the past to add new tables to the database schema. I guess (and Matt will be able to correct me) that's to avoid ending up with a database schema of dozens and dozens of tables.

I haven't touched custom tables for a long time, so I'm not aware of the state of the art here. But I do know that WooCommerce made the switch to using custom tables instead of postmeta for the sake of scalability, simplicity, and reliability. Doing a quick search in WPDirectory, I see that Jetpack, Yoast, Elementor, WPForms, WordFence, and many others use custom tables as well.

#28 @JoeFusco
2 weeks ago

Thanks everyone for the reviews and discussion here. The feedback has been clear and easy to act on.

Worth noting that the dedicated table approach also benefits hosts running alternative transports like WebSockets. Beta1's post meta storage caused cache thrashing regardless of how updates were delivered. Moving to a purpose-built table with the wp_sync_storage filter means any transport gets clean storage with zero cache side effects, rather than having to work around post meta internally.

@peterwilsoncc's visibility throttling in the client complements this nicely too - it reduces request frequency on the client side while the table change makes each request cheaper on the server side. Both fixes stack.

The PR includes full test coverage but would of course love more eyes testing this across different hosting setups.

@peterwilsoncc commented on PR #11068:


13 days ago
#29

Yeah, this has been an issue, but I think it's primarily/only really an issue for database tables that are used for frontend features. Given that the proposed wp_sync_updates table here is exclusively used in the post editor which is in the admin, and access to the admin requires first doing a DB upgrade screen, then I think this wouldn't be a concern.

This isn't quite correct: on multisite installs, site admins (ie, non super admins) don't get prompted to upgrade the database tables, and globally upgrading tables can be prevented using a filter:

add_filter( 'map_meta_cap', function( $caps, $cap ) {
        if ( 'upgrade_network' === $cap ) {
                $caps[] = 'do_not_allow';
        }
        return $caps;
}, 10, 4 );

To reproduce:

  1. Install MS
  2. Create a sub site
  3. Add a user to the network
  4. Add the new user as an admin on the sub site
  5. Login as sub site admin
  6. Bump DB version in version.php
  7. Visit dashboard as sub site admin
  8. Observe lack of db upgrade prompt
  9. Login as super admin
  10. Observe but don't click db upgrade prompt
  11. You can ignore the prompt as long as you wish as a super admin
  12. Add the code above as a mu-plugin
  13. Prompt disappears
  14. Visit /wp-admin/network/upgrade.php
  15. Observe permission denied message
  16. Remove mu-plugin added above to avoid confusion later.

Keep in mind that the sub site admin can enable site RTC on the writing page, even though they can't update the tables.

I haven't touched custom tables for a long time, so I'm not aware of the state of the art here. But I do know that WooCommerce made the switch to using custom tables instead of postmeta for the sake of scalabilitiy, simplicity, and reliability. Doing a quick search in WPDirectory and I see that Jetpack, Yoast, Elementor, WPForms, WordFence, and many others use custom tables as well.

Not really relevant for the core schema.

@westonruter commented on PR #11068:


13 days ago
#30

Not really relevant for the core schema.

Seems relevant to me, as plugins may opt to introduce new tables if the core schema doesn't provide an efficient way to store the desired data. In the same way, if a new feature for core can't be represented efficiently in the core schema, then so too should a new table be considered.

@peterwilsoncc commented on PR #11068:


13 days ago
#31

@westonruter Well, no, because my point wasn't about whether the tables would be more performant (I said it's probably a good idea in my comment); my point was that Matt has a reluctance to add more tables to the database schema. Whether that reluctance still remains is worth asking, but if it does then the plugin actions aren't relevant to core.

This ticket was mentioned in PR #11119 on WordPress/wordpress-develop by @peterwilsoncc.


12 days ago
#32

Trac ticket: https://core.trac.wordpress.org/ticket/64696

## Use of AI Tools

Nil. Nada. Null.

#33 @peterwilsoncc
12 days ago

I've created PR#11119 for moving the awareness state to transients as a much simplified approach:

  • WP_Query cache accuracy is retained
  • transient cache key uses the post ID rather than the room ID; the existing post meta is single use only (ie, only one item of that name is stored per post), so I think the post ID is fine
  • I've set the transient expiry to one hour as a pretty arbitrary number.
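Under that approach, storing the awareness state might look something like this (a sketch based on the description above; the key format is an assumption, not the PR's code):

```php
// Hypothetical sketch of transient-backed awareness storage: keyed by
// post ID, with the arbitrary one-hour expiry mentioned above.
set_transient( 'wp_sync_awareness_' . $post_id, $awareness_state, HOUR_IN_SECONDS );

// Reading it back; returns false once expired or flushed, so clients
// must tolerate the state briefly disappearing.
$state = get_transient( 'wp_sync_awareness_' . $post_id );
```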

This reduces the WP_Query cache invalidation back to the following:

  • actual edits are made to a document
  • heartbeat triggers

#34 @matt
12 days ago

I'm generally against new tables, but this is a useful primitive for all our future real-time collab and sync work. I like the plain room name instead of hash, which also makes LIKE possible. Hopefully we can leverage this for Playground sync in the future as well.

@peterwilsoncc commented on PR #11068:


11 days ago
#35

@westonruter @josephfusco As mentioned in Slack and on the ticket, Matt has approved an additional table for RTC.

As this is a first design and wasn't discussed architecturally, I would like to slow right down and consider the architecture on the ticket prior to proceeding here. As you'll have seen, I've posted in slack but will comment on the ticket once I've thought about things further.

Prior to the discussion, I think it's best if this PR is put on hold so we can focus on other RTC issues. I think there are a few things in the JavaScript client and the polling class that need work in the meantime.

#36 @peterwilsoncc
11 days ago

As Matt has authorized a new table for RTC to resolve the performance impacts, we (contributors working on this ticket) can move on to an architectural discussion.

Primarily, we need to figure out the table and index structure:

  • WP_HTTP_Polling_Sync_Server has been written for a future in which additional object types have real time collaboration
  • Core objects are posts, comments, terms and users. Posts include post types as sub objects
  • Plugins frequently add their own objects, eg WooCommerce has orders; email plugins have email campaigns, etc
  • What data types are needed for various aspects of the table
  • What data currently stored in the meta_value blob ought to be stored in its own column

Miscellaneous questions for consideration:

  • Should awareness be managed via the table or just syncing?
  • Related to the above, on the product call (I couldn't attend), I understand presence indicators were discussed for other pages, eg list tables. Is this something that can be stored here too?
  • MS requires manual intervention to upgrade tables; how best to manage RTC prior to the tables being added?
  • What is the permissions model: is it based on the object being collaborated on, new meta caps, some combination of the two?

@czarate commented on PR #11067:


11 days ago
#37

@mindctrl @westonruter Even as we explore a new table as our path forward, resolving the race condition in get_updates_after_cursor is extremely valuable to increase confidence in beta testing. Do you think we can merge this for beta 3?

This ticket was mentioned in Slack in #core by amykamala. View the logs.


11 days ago

This ticket was mentioned in Slack in #hosting by amykamala. View the logs.


10 days ago

@peterwilsoncc commented on PR #11068:


10 days ago
#40

As I mentioned yesterday, I'd strongly prefer, and think it essential, that the new table structure be discussed on the ticket rather than on this pull request. Basing the architecture discussion on code leads to a code-first decision, but it's essential we make an architecture-first decision.

We have one chance to get this right; following best practices will make this far more likely to happen.

@westonruter commented on PR #11068:


10 days ago
#41

We have one chance to get this right; following best practices will make this far more likely to happen.

Is the concern so grave? The data here is ephemeral. If a WordPress upgrade requires a table schema change and the existing sync updates are dropped, this doesn't seem like such a big deal to me.

@peterwilsoncc commented on PR #11068:


10 days ago
#42

Is the concern so grave? The data here is ephemeral. If a WordPress upgrade requires a table schema change and the existing sync updates are dropped, this doesn't seem like such a big deal to me.

dbDelta() runs prior to the upgrade routine so there is no opportunity to truncate before altering the table. The table will have A LOT of data very quickly, so any changes will lock the table for a while (I ended up with about 100 rows of posts/post meta data from editing Hello World for about five minutes).

#43 follow-up: @peterwilsoncc
10 days ago

Here are my thoughts after Chris caught me up on the background and technical requirements of RTC.

---

Items to move to the table:

  • wp_sync_update post meta for rooms
  • Awareness

Remaining where they currently are:

  • CRDT document - it's only updated when a user saves, not when a sync occurs

Table name: wp_collaboration

Columns:

  • ID (BIGINT, Primary key) -- Used as cursor in Post Meta storage class
  • room_id (VARCHAR, 60, Indexed) - hash, with a bit of room for plugin authors to do things
  • event_type (VARCHAR, 255, Indexed) -- "awareness" OR "sync_update" OR future type
  • gmt_timestamp (date/time, Indexed) - when the event happened, this may be updated for items such as the awareness api when it updates on an ongoing basis
  • event_data (LONGTEXT) -- blob, client id, etc as appropriate

WP_Sync_Post_Meta_Storage to be renamed as appropriate for the new dedicated table.

The WP_Sync_Storage interface to be updated to allow for awareness to be stored in multiple rows:

public function get_awareness_states_since( string $room, int $timestamp ): array;
public function add_awareness_state( string $room, array $awareness_state ): void;
public function remove_awareness_states_before( string $room, int $timestamp ): array;

Props @chriszarate for giving this an edit and brainstorming prior to publication.

#44 in reply to: ↑ 43 @JoeFusco
9 days ago

The pull request we've been working on renames everything from “sync” to “collaboration” - classes, REST routes, table name. A deprecated wp-sync/v1 route alias sticks around for backward compatibility with the Gutenberg plugin during its transition to wp-collaboration/v1.

Here is how the pull request lines up with the proposed schema:


Items moved to the table:

  • wp_sync_update post meta - done, sync updates go into wp_collaboration rows instead of post meta. This gets rid of the cache invalidation noise from wp_cache_set_posts_last_changed() firing on every write.

Remaining where they currently are:

  • CRDT document - agreed, it’s only written on save, not on sync. The pull request doesn’t touch that path.
  • Awareness - stored in transients rather than in the table. Awareness data (cursor positions, selections) is high-frequency and short-lived - every active user writes it every few seconds, and entries go stale after 30 seconds. Transients handle this well: 60 second expiry, automatic cleanup when someone closes their tab. Moving awareness into the table and adding an event_type column is worth exploring, but it adds write volume and index considerations we’d want to benchmark first. Proposing deferment for 7.1.

Table name: wp_collaboration - matches. Registered in wpdb as $wpdb->collaboration.

Columns:

  • id (BIGINT, unsigned, auto-increment, Primary key) - cursor for polling. Auto-increment instead of gmt_timestamp because the beta 1-3 post meta implementation uses millisecond timestamps as message IDs. When two editors send updates in the same millisecond, they get the same ID and one is silently dropped. Auto-increment can’t collide, so that class of bug goes away.
  • room (VARCHAR, 255) - room identifier, unhashed since room strings like postType/post:42 are short
  • update_value (LONGTEXT) - JSON-encoded update payload
  • created_at (DATETIME) - timestamp for the 7-day cleanup cron (matches autodrafts). Expanding its use (admin UI, audit queries) fits naturally in 7.1.

Composite index KEY room (room, id) - every hot query (idle poll, catch-up, compaction) filters by room first, then scans by id. Without this index those queries hit the full table. Worth calling out since it’s not in the proposed schema but all three operations depend on it.
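For illustration, the hot paths reduce to queries of this shape (room value and cursor are placeholders), all of which the (room, id) composite index can satisfy without a full scan:

```sql
-- Idle poll / catch-up: fetch updates after the client's last cursor.
SELECT id, update_value
  FROM wp_collaboration
 WHERE room = 'postType/post:42'
   AND id > 1234          -- cursor from the client's previous poll
 ORDER BY id ASC;

-- Compaction: read a room's full history in cursor order.
SELECT id, update_value
  FROM wp_collaboration
 WHERE room = 'postType/post:42'
 ORDER BY id ASC;
```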

WP_Sync_Storage renamed to WP_Collaboration_Storage. Awareness methods are get_awareness_state() and set_awareness_state().

The PR includes a runnable proof of the beta 1-3 data loss bug and a performance benchmark covering idle poll, catch-up, and compaction from 100 to 100,000 rows.

Pull request description has been updated with the latest commands for testing:
https://github.com/WordPress/wordpress-develop/pull/11068

Props @mindctrl for flagging the timestamp collision risk and the importance of append-only writes.

#45 @peterwilsoncc
9 days ago

@JoeFusco There's a race condition in the current awareness implementation that remains with the switch to transients: as the awareness is stored in a single blob (code ref), near-simultaneous requests can drop client IDs from the room.

Chris and I were thinking that putting each client's ID in the new table as a separate row will allow us to avoid that condition. With four requests per second when multiple collaborators are around, we have a lot of near-simultaneous requests.

#46 follow-up: @mindctrl
9 days ago

Putting each client's awareness in its own row makes sense. What I don't understand with the current iteration is why we want client_id and type in the update_value blob. They're both necessary for filtering results, which we currently do by decoding every update_value blob and comparing on type and client_id in PHP.

@westonruter commented on PR #11068:


8 days ago
#47

@josephfusco It looks like the types can be made more specific. For example:

  • src/wp-includes/collaboration/class-wp-collaboration-table-storage.php

    diff --git a/src/wp-includes/collaboration/class-wp-collaboration-table-storage.php b/src/wp-includes/collaboration/class-wp-collaboration-table-storage.php
    index 894777e608..8c62cc1ab0 100644
    --- a/src/wp-includes/collaboration/class-wp-collaboration-table-storage.php
    +++ b/src/wp-includes/collaboration/class-wp-collaboration-table-storage.php
    @@ -15,6 +15,8 @@
      * @since 7.0.0
      *
      * @access private
    + *
    + * @phpstan-import-type AwarenessState from WP_Collaboration_Storage
      */
     class WP_Collaboration_Table_Storage implements WP_Collaboration_Storage {
            /**
    @@ -73,7 +75,8 @@ class WP_Collaboration_Table_Storage implements WP_Collaboration_Storage {
              *
              * @param string $room    Room identifier.
              * @param int    $timeout Seconds before an awareness entry is considered expired.
    -         * @return array<int, array{client_id: int, state: mixed, wp_user_id: int}> Awareness entries.
    +         * @return array<int, array> Awareness entries.
    +         * @phpstan-return array<int, AwarenessState>
              */
             public function get_awareness_state( string $room, int $timeout = 30 ): array {
                     global $wpdb;
    @@ -241,10 +244,10 @@ class WP_Collaboration_Table_Storage implements WP_Collaboration_Storage {
              *
              * @global wpdb $wpdb WordPress database abstraction object.
              *
    -         * @param string $room       Room identifier.
    -         * @param int    $client_id  Client identifier.
    -         * @param array  $state      Serializable awareness state for this client.
    -         * @param int    $wp_user_id WordPress user ID that owns this client.
    +         * @param string               $room       Room identifier.
    +         * @param int                  $client_id  Client identifier.
    +         * @param array<string, mixed> $state      Serializable awareness state for this client.
    +         * @param int                  $wp_user_id WordPress user ID that owns this client.
              * @return bool True on success, false on failure.
              */
             public function set_awareness_state( string $room, int $client_id, array $state, int $wp_user_id ): bool {
  • src/wp-includes/collaboration/interface-wp-collaboration-storage.php

    diff --git a/src/wp-includes/collaboration/interface-wp-collaboration-storage.php b/src/wp-includes/collaboration/interface-wp-collaboration-storage.php
    index 9550384da5..5223e89a63 100644
    --- a/src/wp-includes/collaboration/interface-wp-collaboration-storage.php
    +++ b/src/wp-includes/collaboration/interface-wp-collaboration-storage.php
    @@ -10,6 +10,8 @@
      * Interface for collaboration storage backends used by the collaborative editing server.
      *
      * @since 7.0.0
    + *
    + * @phpstan-type AwarenessState array{client_id: int, state: mixed, wp_user_id: int}
      */
     interface WP_Collaboration_Storage {
            /**
    @@ -32,7 +34,8 @@ interface WP_Collaboration_Storage {
              *
              * @param string $room    Room identifier.
              * @param int    $timeout Seconds before an awareness entry is considered expired.
    -         * @return array<int, array{client_id: int, state: mixed, wp_user_id: int}> Awareness entries.
    +         * @return array<int, array> Awareness entries.
    +         * @phpstan-return array<int, AwarenessState>
              */
             public function get_awareness_state( string $room, int $timeout = 30 ): array;

    @@ -86,10 +89,10 @@ interface WP_Collaboration_Storage {
              *
              * @since 7.0.0
              *
    -         * @param string $room       Room identifier.
    -         * @param int    $client_id  Client identifier.
    -         * @param array  $state      Serializable awareness state for this client.
    -         * @param int    $wp_user_id WordPress user ID that owns this client.
    +         * @param string               $room       Room identifier.
    +         * @param int                  $client_id  Client identifier.
    +         * @param array<string, mixed> $state      Serializable awareness state for this client.
    +         * @param int                  $wp_user_id WordPress user ID that owns this client.
              * @return bool True on success, false on failure.
              */
             public function set_awareness_state( string $room, int $client_id, array $state, int $wp_user_id ): bool;

#48 in reply to: ↑ 46 @peterwilsoncc
5 days ago

Replying to mindctrl:

Putting each client's awareness in its own row makes sense. What I don't understand with the current iteration is why we want client_id and type in the update_value blob. They're both necessary for filtering results, which we currently do by decoding every update_value blob and comparing on type and client_id in PHP.

I agree, this is why I was proposing the type column in comment 43.

@peterwilsoncc commented on PR #11068:


5 days ago
#49

I am running out of ways to ask politely, can we please discuss and agree on the architecture of the new table _on the trac ticket_ before continuing to work on code.

As I said last week, discussing architecture on a pull request leads to a code-first decision, whereas discussing the architecture on the ticket focuses on the architecture alone, without the influence of a first-draft pull request.

To be 100% clear: any work on code without an agreed plan forward is a waste of everyone's time and simply delays the work. I am closing this PR so discussion can continue on the ticket.

#50 @czarate
5 days ago

Putting each client's awareness in its own row makes sense. What I don't understand with the current iteration is why we want client_id and type in the update_value blob. They're both necessary for filtering results, which we currently do by decoding every update_value blob and comparing on type and client_id in PHP.

Both client_id and type are internal implementation details that are specific to Yjs and the polling provider. They could change in the future. Keeping the storage mechanism opaque to these implementation details greatly reduces the risk of breaking changes.

#51 follow-up: @mindctrl
5 days ago

Replying to peterwilsoncc:

I've been thinking about the schema proposal and whether a single table with an event_type column is the right approach, or whether awareness and content updates are different enough to warrant separate storage.

The access patterns and lifecycles are different:

  • Content updates are append-only messages that must be reliably delivered to every client exactly once. They're ordered by cursor (auto-increment ID), compacted periodically, and may persist for minutes to hours during an editing session. Reliable delivery matters, as a missed or duplicated update means a diverged document.
  • Awareness state is per-client, last-write-wins type data (cursor position, selection, user info). It's overwritten on every poll cycle (~250ms–1s), expires after a short period of inactivity, and losing it briefly just means a collaborator's cursor flickers. There's no ordering requirement, we just want the latest state per client. If awareness is expanded beyond the editor screen, as mentioned by Matt elsewhere (I remember reading it but can't find where he said it at the moment), the need to save awareness state will increase, and it will exacerbate any tradeoffs we make.

Combining them in one table means awareness writes (which are the highest-frequency operations involving every active client, every poll cycle) inflate the auto-increment ID space and add rows that the compaction/cleanup logic needs to work around. The indexing needs are also different: content updates need (room, id) for cursor-based polling, while awareness needs (room, client_id) for upsert-by-client lookups.

Separate storage would let each be optimized for its actual workload. Content updates could stay in the wp_collaboration table with auto-increment cursors and compaction, and that would keep it lean and fast with no need to scan past expired or current awareness rows. Awareness could use a second table with an INSERT ... ON DUPLICATE KEY UPDATE keyed on (room, client_id). Each client only ever writes its own row, so concurrent updates can't overwrite each other, and it doesn't add write volume to the content updates table.
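A sketch of that upsert, assuming the UNIQUE KEY on (room, client_id) described above (table name, column names, and values are illustrative):

```sql
-- Each client upserts only its own row, so concurrent writers
-- cannot drop each other's awareness state.
INSERT INTO wp_awareness (room, client_id, wp_user_id, update_value, created_at)
VALUES ('postType/post:42', 'client2', 7, '{"cursor":[12,3]}', NOW())
ON DUPLICATE KEY UPDATE
    update_value = VALUES(update_value),
    created_at   = VALUES(created_at);
```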

Replying to czarate:

Both client_id and type are internal implementation details that are specific to Yjs and the polling provider. They could change in the future. Keeping the storage mechanism opaque to these implementation details greatly reduces the risk of breaking changes.

Regardless of the backing implementation (Yjs or otherwise), any collaborative editing system needs to identify its clients and distinguish between types of sync data. If the table is holding multiple types of data, we'll always need to know what type we have, and preferably query for only the data needed at any given time.

  • client_id — Any collaboration protocol needs to identify which participant originated an update. Without a client identifier, we can't filter a client's own updates out of poll responses, we can't do per-client awareness state, and we can't attribute changes. The concept of a client identity is inherent to multiuser editing.
  • type — The distinction between document updates and awareness/presence information exists in every collaboration system. These are conceptually different operations with different lifecycles. Whether we call it type, event_type, channel, or whatever, we need to distinguish them.

Optimizing tables for the use case we have, and know we will have soon, seems like a good win and doesn't saddle us with schemas that are less than ideal. There seem to be several tradeoffs associated with trying to put both in a single table. Is there an actual mandate to limit this to one table? Is there a technical/deployment concern with having two new tables instead of one? Do we think there would be a higher rate of failure or some other kind of ecosystem fallout if we had two new tables?

#52 in reply to: ↑ 51 @czarate
4 days ago

Replying to mindctrl:

Content updates are append-only messages that must be reliably delivered to every client exactly once. ... Reliable delivery matters, as a missed or duplicated update means a diverged document.

Small point of clarity: At-least-once delivery is OK, if not ideal. Merging an already-applied update is effectively a no-op (assuming the update has not been modified in any way). A missed update, on the other hand, is destructive.

Otherwise I agree with your case for a second table.

Regardless of the backing implementation (Yjs or otherwise), any collaborative editing system needs to identify its clients and distinguish between types of sync data. If the table is holding multiple types of data, we'll always need to know what type we have, and preferably query for only the data needed at any given time.

While I don't think the overhead of filtering results from an "over-query" is that bad, I agree with this in general—as long as we represent both client_id and type flexibly in the schema. Both should probably be VARCHARs with reasonably future-proof sizes?

#53 @JoeFusco
4 days ago

Building on the discussion here and @mindctrl's analysis in comment 51 - sync updates and awareness are as different as posts and sessions. One is append-only content ordered by cursor, the other is ephemeral per-user state that gets overwritten every second. Storing both in one table with an event_type column means the UNIQUE KEY (room, client_id) that prevents simultaneous awareness writes from overwriting each other can't coexist with sync update rows, where a single client writes many rows per room.

This is what led us to two tables, each a general-purpose primitive:

wp_collaboration - append-only message passing

  • id (BIGINT, unsigned, auto-increment, Primary Key) - cursor for polling
  • room (VARCHAR(191)) - unhashed room identifier ($max_index_length for utf8mb4 index compatibility)
  • type (VARCHAR(32)) - update, sync_step1, sync_step2, compaction
  • client_id (VARCHAR(32)) - originating client
  • update_value (LONGTEXT) - opaque payload
  • created_at (DATETIME) - for cron cleanup (7-day TTL)

Indexes: KEY room (room, id), KEY created_at (created_at)

wp_awareness - per-client presence

  • id (BIGINT, unsigned, auto-increment, Primary Key)
  • room (VARCHAR(191)) - unhashed room identifier ($max_index_length for utf8mb4 index compatibility)
  • client_id (VARCHAR(32)) - one row per client per room
  • wp_user_id (BIGINT, unsigned) - WordPress user ID
  • update_value (LONGTEXT) - awareness payload
  • created_at (DATETIME) - updated on every write, used for expiry

Indexes: UNIQUE KEY room_client (room, client_id), KEY room_created_at (room, created_at), KEY created_at (created_at)
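For concreteness, the two proposed tables expressed as DDL (a sketch mirroring the columns and indexes above; charset/collate clauses and dbDelta formatting omitted, not final code):

```sql
CREATE TABLE wp_collaboration (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    room VARCHAR(191) NOT NULL,
    type VARCHAR(32) NOT NULL,
    client_id VARCHAR(32) NOT NULL,
    update_value LONGTEXT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id),
    KEY room (room, id),
    KEY created_at (created_at)
);

CREATE TABLE wp_awareness (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    room VARCHAR(191) NOT NULL,
    client_id VARCHAR(32) NOT NULL,
    wp_user_id BIGINT UNSIGNED NOT NULL,
    update_value LONGTEXT NOT NULL,
    created_at DATETIME NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY room_client (room, client_id),
    KEY room_created_at (room, created_at),
    KEY created_at (created_at)
);
```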

The multisite upgrade path is gated - wp_is_collaboration_enabled() checks both the option and db_version, so the tables are never touched before the upgrade runs.

Cleanup is a single cron: collaboration rows older than 7 days, awareness rows older than 60 seconds.

Last edited 4 days ago by westonruter

This ticket was mentioned in Slack in #core by joefusco. View the logs.


4 days ago

This ticket was mentioned in Slack in #hosting by amykamala. View the logs.


3 days ago

#56 follow-up: @amykamala
3 days ago

Hosts have provided some brief feedback in this week's hosting meeting on RTC performance and defaults (as they stand in Beta 4) https://wordpress.slack.com/archives/C3D6T7F8Q/p1773252483953569

#57 follow-ups: @peterwilsoncc
3 days ago

@JoeFusco My reading of Matt's comment above is that the addition of one table has been authorized for the current and future RTC features. It's a very rare occurrence, so we're not going to be able to slide in two when it's possible to use one. The last time the table structure changed was to introduce term meta, and the links manager was deprecated to prepare for the eventual removal of wp_links.

No matter where awareness is stored, it's going to generate a lot of rows. If we use transients, they get added to the options table, if we use the RTC table, they get added there.

I think it better to put awareness in the RTC table but maybe we can lean on the transients pattern and use a persistent cache if it's available.

Re: Over-queries, anything that allows us to limit the queries is an improvement over the current post meta implementation. My understanding is the query will be something along the lines of SELECT * FROM table WHERE cursor > /* cursor position on last request, 0.25 seconds ago */ AND event = sync. In most cases that means the over-query is one row, and PHP is probably more efficient for filtering than adding a column and an index for said column.

#58 in reply to: ↑ 56 @peterwilsoncc
3 days ago

Replying to amykamala:

Hosts have provided some brief feedback in this week's hosting meeting on RTC performance and defaults (as they stand in Beta 4) https://wordpress.slack.com/archives/C3D6T7F8Q/p1773252483953569

#64845 has been opened to discuss defaults so we can keep this ticket focused on the tables/persistent cache issue.

#59 in reply to: ↑ 57 ; follow-ups: @JoeFusco
3 days ago

Replying to peterwilsoncc:

@JoeFusco My reading of Matt's comment above is that the addition of one table has been authorized for the current and future RTC features. It's a very rare occurrence so we're not going to be able to slide in two when it's possible to use one.

Agreed - one table for sync, no argument there.

No matter where awareness is stored, it's going to generate a lot of rows. If we use transients, they get added to the options table, if we use the RTC table, they get added there.

This is the key question. Awareness generates rows no matter where it lives - the difference is what guarantees that storage provides and what it costs the host table.

I think it better to put awareness in the RTC table but maybe we can lean on the transients pattern and use a persistent cache if it's available.

Persistent object cache as the primary store for awareness - fully agree. It's fast, atomic, and ephemeral by nature. No DB writes for awareness when cache is available.

The open question is the fallback when persistent object cache isn't available. Transients fall back to wp_options, and for awareness that means:

  • One write per client per room every poll interval (~1 second)
  • Short TTL (30-60 seconds), so rows expire and accumulate until lazy garbage collection runs
  • On multisite, these writes hit per-site wp_X_options tables - already the most bloated table on most installs

At 10 sub-sites with 2 editors each, that's 20 transient writes/sec across 10 wp_options tables. At 50 sub-sites, it's 100+. These fall directly out of the polling interval.

Putting awareness in the single collaboration table has a different cost. Sync needs multiple rows per (room, client_id), so a UNIQUE KEY on those columns can't exist in a shared table. Without it, concurrent awareness writes produce duplicate rows, requiring application-level deduplication rather than prevention at the schema level.

Re: Over queries, anything that allows us to limit the queries is an improvement over the current post meta implementation. My understanding is the query will be something along the lines of SELECT * FROM table WHERE cursor > /* cursor position on last request, 0.25 seconds ago */ AND event = sync. In most cases, that means that over query is one row and PHP is probably more efficient for filtering than adding a column and an index for said column.

Agreed - PHP-side filtering makes sense given the row counts from a cursor query. No need for an index on event type.

So for no-cache sites, the options are:

  • Transients → wp_options - high-frequency ephemeral writes to a table designed for low-frequency config data
  • Single collaboration table - no UNIQUE KEY, so concurrent awareness writes produce duplicates instead of being prevented
  • Require persistent object cache - excludes a significant portion of the install base
  • Small dedicated awareness table - correct by constraint, bounded, trivial cron cleanup

This isn't a preference for two tables — it's that every fallback path for awareness without cache makes something else worse. If there's a fifth option I'm not seeing, I'd welcome it.

Does the original one table guidance apply to the no-cache fallback path specifically? The scope is narrow: bounded rows (one per client per room), 60-second TTL, @access private, feature-gated.

#60 in reply to: ↑ 59 @westonruter
2 days ago

Replying to JoeFusco:

Require persistent object cache: Excludes a significant portion of the install base

Note that in #56040 for WP 6.1, a Persistent Object Cache test was added to Site Health. It appears in the production environment. It has certain thresholds for whether the test fails:

  • alloptions_count - 500
  • alloptions_bytes - 100,000
  • comments_count - 1,000
  • options_count - 1,000
  • posts_count - 1,000
  • terms_count - 1,000
  • users_count - 1,000

If a persistent object cache becomes a requirement for RTC, the thresholds could be eliminated in favor of it always being recommended if RTC is enabled and the polling transport is being used.

Nevertheless, shared hosting environments likely wouldn't provide Redis/Memcached to be able to enable a persistent object cache, so site owners would be stuck unless they upgrade. And for hosts that do provide Redis/Memcached, they may make available a WebSocket transport anyway.

#61 in reply to: ↑ 57 @czarate
2 days ago

There is another potential use case for this table that is distinct from the default HTTP polling provider: CRDT document persistence. Persisting the CRDT doc alongside the entity record (e.g., post) is essential for preventing the "initialization problem" where two peers have different initial state for their respective CRDT documents and therefore cannot apply updates consistently.

Currently, we only persist CRDT documents for post entities, but that will change in the future as we sync additional entity types. CRDT documents for post entities are stored in post meta, but other entity types may not have an equivalent storage target (e.g., taxonomies, post types, etc.) Being able to persist CRDT documents in this new table (even when another sync provider is used) is appealing. It would simply be a row with a distinct type (persisted_crdt_doc).

#62 @dd32
2 days ago

Require persistent object cache

I was asked if WordPress.org stats collects data about this, and as far as I know we do not. We do not collect mu-plugin data either as far as I'm aware.

However, we do collect data about PHP extension availability: redis ~42%, memcached ~28%, apcu ~20%.
I don't think it's relevant, as we'd use WPDB for tables, but: sqlite3 ~94%

#63 in reply to: ↑ 59 @peterwilsoncc
2 days ago

Replying to JoeFusco:

Replying to peterwilsoncc:

@JoeFusco My reading of Matt's comment above is that the addition of one table has been authorized for the current and future RTC features. It's a very rare occurrence so we're not going to be able to slide in two when it's possible to use one.

Agreed - one table for sync, no argument there.

I really don't like dealing with disingenuous comments. My comment is perfectly clear that my understanding is there is approval to add one (1) table in total to WordPress 7.0, bringing the total number of tables in WordPress to thirteen (13) in a single site install. To pretend to interpret my comment as anything else is very unhelpful.

It took a week of delaying to even get this discussion started, let's not waste further time with deliberate misinterpretations.

For the purposes of awareness updates, I don't see that a unique index is absolutely needed. Helpful, sure, but not something that warrants another table.

For most requests the client's row will already exist, so we can run something along the lines of

UPDATE `wp_collaboration` 
   SET event_time = NOW()
   WHERE room = 'room1' AND event = 'awareness' AND client_id = 'client2'

If no rows are affected, that implies that the client is signing on and an insert is required.
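A rough $wpdb sketch of that update-then-insert pattern (the table handle and the event, event_time, and update_value columns are illustrative, following the single-table proposal):

```php
// Illustrative only: assumes a $wpdb->collaboration table with
// room, event, client_id, event_time, and update_value columns.
$data = array(
	'event_time'   => current_time( 'mysql', true ),
	'update_value' => wp_json_encode( $state ),
);
$where = array(
	'room'      => $room,
	'event'     => 'awareness',
	'client_id' => $client_id,
);

// wpdb::update() returns the number of rows affected, or false on error.
$updated = $wpdb->update( $wpdb->collaboration, $data, $where );

if ( ! $updated ) {
	// Zero rows affected: the client is signing on, so insert a fresh row.
	$wpdb->insert( $wpdb->collaboration, array_merge( $where, $data ) );
}
```

One caveat with this sketch: MySQL reports zero affected rows when an UPDATE sets values identical to the existing ones, so the insert path would also need to tolerate an existing row (or the payload must always change, which event_time effectively guarantees at one-second resolution).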

Yes, if two concurrent requests occur there is a slim chance two rows will be created, but that should be an exceptionally rare circumstance: the polling client waits for the apiFetch promise to complete before scheduling the next polling event. You can see this in effect by throttling network requests.

As @czarate mentions above, there is a chance that other items will need to be considered as part of the real time collaboration feature. Rather than targeted, endpoint-specific tables, we need to design a table that meets the requirements of what is known today and the unknowns of tomorrow.


Re: requiring a persistent cache: I don't think this is a good idea, I just see it as a handy place to shift awareness if it's available. Similar to transients but without the options table.
