Opened 2 years ago

Last modified 2 years ago

#19225 assigned enhancement

WordPress-Importer: Perform fewer database queries when backfilling attachment URLs

Reported by: tott
Owned by: duck_
Milestone: WordPress.org
Priority: normal
Severity: normal
Version:
Component: Import
Keywords: has-patch needs-testing
Focuses:
Cc:

Description

The current backfill_attachment_urls() approach issues a REPLACE query for each attachment URL. This can invalidate a lot of data and, in some setups, cause MySQL replication lag.

The attached patch bundles these requests and performs at most one database update per post.
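
For reference, the current behaviour is roughly this (an illustrative sketch, not the exact importer code; $this->url_remap is the importer's map of old to new attachment URLs):

global $wpdb;

foreach ( $this->url_remap as $from_url => $to_url ) {
	// Rewrites post_content in every matching post, once per remapped URL.
	$wpdb->query( $wpdb->prepare(
		"UPDATE {$wpdb->posts} SET post_content = REPLACE(post_content, %s, %s)",
		$from_url, $to_url
	) );
	// Same pattern for enclosure meta.
	$wpdb->query( $wpdb->prepare(
		"UPDATE {$wpdb->postmeta} SET meta_value = REPLACE(meta_value, %s, %s) WHERE meta_key = 'enclosure'",
		$from_url, $to_url
	) );
}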

Attachments (3)

better-attachment-url-backfilling.diff (1.5 KB) - added by tott 2 years ago.
better-attachment-url-backfilling.2.diff (1.6 KB) - added by tott 2 years ago.
better-attachment-url-backfilling.3.diff (1.7 KB) - added by tott 2 years ago.


Change History (10)

comment:1 westi 2 years ago

  • Owner set to duck_
  • Status changed from new to assigned

comment:2 SergeyBiryukov 2 years ago

  • Milestone changed from Awaiting Review to WordPress.org

comment:3 duck_ 2 years ago

The idea sounds like a good one. However, I notice right away that the current patch removes updating of enclosure meta.

comment:4 duck_ 2 years ago

A couple of other thoughts:

  • $from_url should be run through like_escape() before being put in the query
  • Can we retrieve post_content in the first query as well, rather than using get_post()? Would this be beneficial?
  • wpdb::update() can be used instead of the query() method (a rough sketch of all three follows below)
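
Roughly along these lines (variable names are assumed, not the actual patch):

$like = '%' . like_escape( $from_url ) . '%';

// Fetch post_content together with the ID instead of calling get_post() later.
$posts = $wpdb->get_results( $wpdb->prepare(
	"SELECT ID, post_content FROM {$wpdb->posts} WHERE post_content LIKE %s",
	$like
) );

foreach ( $posts as $post ) {
	$new_content = str_replace( $from_url, $to_url, $post->post_content );
	// wpdb::update() builds the prepared UPDATE instead of a hand-written query().
	$wpdb->update(
		$wpdb->posts,
		array( 'post_content' => $new_content ),
		array( 'ID' => $post->ID )
	);
}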

comment:5 tott 2 years ago

All valid points. I changed this accordingly. I left the REPLACE call for the postmeta data in place, since the amount of data being invalidated there is rather low and it is unlikely to cause any issues.

comment:6 duck_ 2 years ago

better-attachment-url-backfilling.2.diff includes:

count( $post_ids ) > 0

but $post_ids isn't declared in this version of the patch. Should it be $posts?

Could you explain what you mean by "invalidation of a lot of data" and "data that's being invalidated"? Just want to make sure I fully understand the why :)

Another minor thing, but I think the uksort() before the first loop is unnecessary now.

comment:7 tott 2 years ago

Third time's a charm. Note to self: don't code when ill and in a hurry to go to bed.

I implemented the suggestions. Let me explain a bit more about the background and why the original approach is sub-optimal.

When you run a replicated MySQL setup, for example with a master and a slave server, each change on the master has to be transferred to the slave after it happens. Now imagine a REPLACE query that changes something in the post content for each URL.

For each replaced URL this has to find all matching posts, read them, change them, and write them back to disk. So not only can a single query rewrite the content of multiple posts at once, you can also end up writing the same post several times, depending on how many URLs it contains.

That's why I changed this to a per-post approach: first figure out which URLs appear in each post, then update that post just once. This way the worst case is one update per post, which should have a much lower disk I/O impact than the original approach.
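
A rough sketch of that per-post strategy (variable names are assumed, not the actual patch):

// Pass 1: find the affected posts and remember which URLs occur in each of them.
$need_update = array();
foreach ( $this->url_remap as $from_url => $to_url ) {
	$like = '%' . like_escape( $from_url ) . '%';
	$rows = $wpdb->get_results( $wpdb->prepare(
		"SELECT ID, post_content FROM {$wpdb->posts} WHERE post_content LIKE %s",
		$like
	) );
	foreach ( $rows as $row ) {
		$need_update[ $row->ID ]['content'] = $row->post_content;
		$need_update[ $row->ID ]['urls'][ $from_url ] = $to_url;
	}
}

// Pass 2: at most one UPDATE per affected post, no matter how many URLs it contains.
foreach ( $need_update as $post_id => $data ) {
	$new_content = str_replace( array_keys( $data['urls'] ), array_values( $data['urls'] ), $data['content'] );
	$wpdb->update( $wpdb->posts, array( 'post_content' => $new_content ), array( 'ID' => $post_id ) );
}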
