WordPress.org

Make WordPress Core

Opened 7 days ago

Last modified 7 days ago

#44349 new defect (bug)

Posts show up multiple times in backend, when imported with Import WordPress tool

Reported by: wzshop Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 4.9.6
Component: Query Keywords: reporter-feedback 2nd-opinion
Focuses: Cc:

Description

When I import for example 200 posts with the Import WordPress tool and then go to wp-admin>posts, the posts show up in the query multiple times (for instance on page 1 and page 2). Seems like others are left out.

Attachments (1)

post export demo.xml (194.4 KB) - added by wzshop 7 days ago.

Download all attachments as: .zip

Change History (11)

#1 follow-up: @pbiron
7 days ago

  • Component changed from General to Import
  • Keywords reporter-feedback added

Welcome to trac!

Can you please let us know which version of the WordPress Importer plugin you are using?

#2 in reply to: ↑ 1 @wzshop
7 days ago

Replying to pbiron:

Welcome to trac!

Can you please let us know which version of the WordPress Importer plugin you are using?

Version 0.6.4

#3 follow-up: @pbiron
7 days ago

Thank you for the version # and for attaching a sample export to test with.

I'm not seeing the behavior you report when I import post export demo.xml into a fresh install (no plugins other than WordPress Importer v 0.6.4 and twentyseventeen theme).

Also, post export demo.xml only has 149 posts (confirmed by evaluating the xpath expression //input on it, as well as the number of posts that get imported in my test site) and not the 200 you mention in this ticket's description. Could that be why "Seems like others are left out"?

#4 in reply to: ↑ 3 @wzshop
7 days ago

Sorry, meant 149 posts. See the vid here: https://youtu.be/IK84U3ERCi4

#5 @pbiron
7 days ago

  • Component changed from Import to Query
  • Keywords 2nd-opinion added

Thanx for the video! Now that I understand what you are reporting, I can confirm that I'm seeing what you are seeing. At first sight this appears to be a bug in the importer, but it is not. The explanation is a little complicated, but I'll do my best.

First, you can confirm that "Post 129" was only imported once by searching for "Post 129" and seeing that it only appears once in the search results.

The reason it appears on page 1 & page 2 is the result of 2 factors:

  1. more than one post in the WXR file being imported have the same value in the <wp:post_date> element, thus more than one of the imported posts have the same value in the post_date field.
  2. the pagination on /wp-admin/edit.php is accomplished via LIMIT and ORDER BY clauses in the SQL request. MySQL performs certain optimizations on queries that include a LIMIT and those optimizations are what is producing the seemingly buggy behavior you're seeing.

As explained in MySQL's LIMIT Query Optimization,

If you combine LIMIT row_count with ORDER BY, MySQL stops sorting as soon as it has found the first row_count rows of the sorted result, rather than sorting the entire result. If ordering is done by using an index, this is very fast. If a filesort must be done, all rows that match the query without the LIMIT clause are selected, and most or all of them are sorted, before the first row_count are found. After the initial rows have been found, MySQL does not sort any remainder of the result set.

One manifestation of this behavior is that an ORDER BY query with and without LIMIT may return rows in different order, as described later in this section.

and more importantly

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns.

So, with the posts imported from post export demo.xml, the 1st page of posts when Number of items per page = 20 (in Screen Options) contains some posts with post_date = 2018-06-11 15:51:54 and some with post_date = 2018-06-11 15:51:53 (the later including "Post 129") with the order of posts within each "group" being "random". The 2nd page contains some posts with post_date = 2018-06-11 15:51:53 (including "Post 129") and some with post_date = 2018-06-11 15:51:52.

If you change Number of items per page = 30 all of the posts with post_date = 2018-06-11 15:51:53 will appear on the 1st page (again in a "random" order) and thus "Post 129" will only appear on that 1st page.

Alternatively, if the posts in the site you exported (to produce post export demo.xml) all had unique post_date's, then when you import them into another site you will not see this behavior.

Yes, this is behavior is confusing, but it is a result of MySQL's "Limit Query Optimization" and is not a bug in the Importer nor in WP_Query.

Thus, I believe this ticket should be closed as "invalid", but I will leave it to one of the SQL gurus on the team to confirm my analysis and close it accordingly.

This ticket was mentioned in Slack in #core by pbiron. View the logs.


7 days ago

#7 @wzshop
7 days ago

Thank you very much for the broad explanation. Is there any way to 'fix' this seemingly buggy behavior or a workaround, since it also has its effect on the frontend, when browsing the posts per category for instance. This is not something I would like to present to my visitors.

I can however comfirm that this problem won't occur when setting different dates. Thanks again.

#8 @pbiron
7 days ago

Isn't non-determinisum fun :-)

Is there any way to 'fix' this seemingly buggy behavior

As trac is not the place for support questions, I'd suggest you ask that that question on https://wordpress.stackexchange.com/ (feel free to reference this ticket when you do).

#9 @wzshop
7 days ago

Ok, 1 more question to understand the MySQL's LIMIT Query Optimization; It seems like posts are showing up multiple times, because of the posts are randomly ordered when browsing the pagination, but they don't actually show up multiple times?

Thanks

#10 @pbiron
7 days ago

In pure-SQL terms, any query of the form:

SELECT column FROM table ORDER BY order_by_column LIMIT x,y

where

  1. order_by_columnis indexed, and
  2. there are more than y rows with the same value in order_by_column that span the "pages" x-1 and x

could result in some rows from "page" x-1 also appearing on "page" x.

In WP_Query terms, the above translates into any query of the form:

<?php
$args = array(
    // the following are the indexed columns in the wp_posts table
    'orderby' => any of 'type', 'date', 'id', 'parent', 'author' or 'name'
    // each of the below "pagination" args results in a LIMIT clause
    'posts_per_page|showposts|posts_per_archive_page' => any value other than -1
);
$query = new WP_Query( $args );// including get_posts( $args )

There are other WP_Query corner cases that will exhibit the non-deterministic behavior as well, including but probably not limited to, hooking into any of posts_orderby, posts_clauses, posts_orderby_request, posts_clauses_request, posts_request to add any of the above (plus wp_posts.post_status, which is not directly legal in the args to new WP_Query()) to the "ORDER BY" clause of the SQL query that WP_Query ultimately executes.

Note: See TracTickets for help on using tickets.