WordPress.org

Make WordPress Core

Opened 5 weeks ago

Closed 3 weeks ago

#50910 closed defect (bug) (fixed)

5.5 Sitemap URLs are incorrectly paginated

Reported by: jonathanstegall Owned by: peterwilsoncc
Milestone: 5.5.1 Priority: normal
Severity: normal Version: 5.5
Component: Sitemaps Keywords: has-patch has-unit-tests commit
Focuses: Cc:

Description

I've created a sitemap with enough posts that it needs to be paginated. The sitemap index file - /wp-sitemap.xml - has three post URLs.

In the index file, they display as:

  • wp-sitemap-posts-post-1.xml
  • wp-sitemap-posts-post-2.xml
  • wp-sitemap-posts-post-3.xml

When I visit those URLs, they are:

  • wp-sitemap-posts-post-1.xml
  • wp-sitemap-posts-post-2.xml/page/2
  • wp-sitemap-posts-post-3.xml/page/3

The first URL works correctly, but the last two are 404s.

This happens with no plugins activated and using the twentytwenty theme. My setup does have "Day and name" permalinks enabled. I'm using PHP 7.3.19 and Nginx 1.19.0 in a local development environment.

Attachments (3)

50910.diff (9.1 KB) - added by pbiron 5 weeks ago.
50910.2.diff (9.4 KB) - added by pbiron 3 weeks ago.
50910.3.diff (9.7 KB) - added by peterwilsoncc 3 weeks ago.

Download all attachments as: .zip

Change History (32)

#1 follow-up: @pbiron
5 weeks ago

If I understand what you are describing, then:

  • wp-sitemap-posts-post-2.xml IS the 2nd page
  • wp-sitemap-posts-post-3.xml IS the 3rd page

I would expect /page/2, etc to result in 404's.

#2 in reply to: ↑ 1 @jonathanstegall
5 weeks ago

Replying to pbiron:

If I understand what you are describing, then:

  • wp-sitemap-posts-post-2.xml IS the 2nd page
  • wp-sitemap-posts-post-3.xml IS the 3rd page

I would expect /page/2, etc to result in 404's.

Yes, that's correct. But the /page/2 or /page/3 is incorrectly being added to the URLs that should be the 2nd and 3rd pages.

#3 follow-up: @pbiron
5 weeks ago

Oh, you mean that when you click on wp-sitemap-posts-post-2.xml you actually get redirected to wp-sitemap-posts-post-2.xml/page/2?

#4 in reply to: ↑ 3 @jonathanstegall
5 weeks ago

Replying to pbiron:

Oh, you mean that when you click on wp-sitemap-posts-post-2.xml you actually get redirected to wp-sitemap-posts-post-2.xml/page/2?

Yes, exactly. And this happens with no active plugins, using the default theme, and the other setup notes.

#5 follow-up: @pbiron
5 weeks ago

Now I understand.

I am not able to reproduce that behavior with Apache and I don't have Nginx to test with.

Can you enter https://YOURDOMAIN/?sitemap=posts&sitemap-subtype=post&paged=2 in your browser and report back here what result you see?

p.s. thank you for the ticket and welcome to Trac!

Last edited 5 weeks ago by pbiron (previous) (diff)

#6 in reply to: ↑ 5 @jonathanstegall
5 weeks ago

Replying to pbiron:

Now I understand.

I am not able to reproduce that behavior with Apache and I don't have Nginx to test with.

Can you enter https://YOURDOMAIN/?sitemap=posts&sitemap-subtype=post&paged=2 in your browser and report back here what result you see?

p.s. thank you for the ticket and welcome to Trac!

Thanks for the welcome! That URL does not do any further redirects, but it shows this message in Firefox:

Error loading stylesheet: An unknown error has occurred (805303f4)
https://MYDOMAIN/wp-sitemap.xsl

In Chromium, I can view source to see the XML and it looks normal, but it only has the individual post URLs; not the URLs that get displayed on the sitemap index.

I was originally checking PHP error logs and I haven't seen any entries, which makes sense if it's an XSLT parsing issue.

Edit:

Sorry about the Firefox thing; that's not part of the issue here. I'm able to get the error message to go away by forcing the HTTPs, and then it displays a styled sitemap with individual post URLs, as it does in Chromium.

Last edited 5 weeks ago by jonathanstegall (previous) (diff)

#7 @GamerZ
5 weeks ago

#50926 was marked as a duplicate.

#8 @GamerZ
5 weeks ago

I am having the same issue. I found the cause which is Line 475 to Line 477 of wp-includes/canonical.php
As after removing the lines it works

if ( $paged > 1 ) {
    $addl_path .= user_trailingslashit( "$wp_rewrite->pagination_base/$paged", 'paged' );
}

OR you can solve it with filter (but I think it should be fix in core)

<?php
function lesterchan_redirect_canonical( string $redirect_url, string $requested_url ) {
        if ( strpos( $requested_url, 'wp-sitemap' ) !== false ) {
                return $requested_url;
        }

        return $redirect_url;
}
add_filter( 'redirect_canonical', 'lesterchan_redirect_canonical', 10, 2 );
Last edited 5 weeks ago by GamerZ (previous) (diff)

#9 @johnbillion
5 weeks ago

  • Keywords needs-patch added
  • Milestone changed from Awaiting Review to 5.5.1

Thanks for the report @jonathanstegall and welcome.

Moving to the 5.5.1 milestone for investigation.

#10 @SergeyBiryukov
5 weeks ago

  • Owner set to SergeyBiryukov
  • Status changed from new to accepted

This ticket was mentioned in PR #482 on WordPress/wordpress-develop by peterwilsoncc.


5 weeks ago

  • Keywords has-patch has-unit-tests added; needs-patch removed

#12 follow-up: @peterwilsoncc
5 weeks ago

I've put together a pull request for this modifying redirect_canonical() further to account for sitemap pagination.

I did investigate adding is_sitemap() and the required changes to WP and WP_Query() but either:

  • have no idea what I am doing
  • sitemaps don't use the main query

I suspect the former, tbh, they're pretty deep internals.

@swissspidy or @pbiron are you able to take a look at the PR to see if I am missing anything?

#13 @salvoaranzulla
5 weeks ago

I have the same problem. If you open https://www.aranzulla.it/wp-sitemap.xml, you can see that https://www.aranzulla.it/wp-sitemap-posts-post-2.xml (page > 1) generates a redirect.

@pbiron
5 weeks ago

#14 @pbiron
5 weeks ago

I'm not sure why, but I'm now seeing the behavior described...and I wasn't before.

50910.diff solves the problem is essentially the same was as the PR from @peterwilsoncc (by ensuring that redirect_canonical() doesn't incorrectly add page/2), but does so in a simpler way.

The patch introduces a new function, get_sitemap_url() that is similar in spirit to get_permalink() but for sitemap URLs and uses that in redirect_canonical().

Not only does that make redirect_canonical() easier to follow when dealing with sitemap canonicals compared to Peter's PR, but it "future proofs" things as well. Peter's PR in essence bakes the structure of sitemap canonical URLs into redirect_canonical(). Since WP_Sitemap_Provider has a get_sitemap_url() method, having that structure baked into redirect_canonical() could cause problems down the road should WP_Sitemaps_Provider::get_sitemap_url() change the structure of the URLs (however unlikely such a future change may be). The new get_sitemap_url() function uses WP_Sitemaps_Provider::get_sitemap_url().

The patch also moves Peter's unit test into the file with the existing sitemaps canonical tests: tests/canonical/sitemaps.php (and uses a data provider to make it easier, IMO, to add additional test cases).

Additionally, the patch adds unit tests for the new get_sitemaps_url() function.

#15 @pbiron
5 weeks ago

Regarding the comment on the PR:

Fun this passes with phpunit --filter test_sitemaps_canonical_redirects but fails with the full run or phpunit --group canonical. My suspicion is something is hanging around $wp from a previous test.

The reason the unit tests in the PR fail when run as phpunit --group canonical has to do with an interaction between the way sitemaps adds rewrite rules/tags and WP_UnitTestCase_Base::go_to(), which is called by WP_Canonical_UnitTestCase::assertCanonical().

Sitemaps adds rewrite tags (e.g., sitemap, sitemap-subtype and sitemap-stylesheet) on init and those tags are added to WP::public_query_vars. WP_UnitTestCase_Base::go_to() creates a new instance of the WP class, but since sitemaps aren't "reinitialized", $wp->public_query_vars doesn't contain the sitemaps rewrite tags in subsequent tests. I hope that makes sense!

Moving the unit test to Tests_Canonical_Sitemaps gets around that problem since Tests_Canonical_Sitemaps::setUp() "manually" initializes sitemaps before each test and thus the rewrite tags are correctly added.

That might be good enough, but I want to float an idea. The sitemaps feature plugin had to call add_rewrite_tag() when adding it's rewrite rules. But maybe, now that sitemaps are in core, it would be better to add those directly to WP::$public_query_vars, as in:

<?php
class WP {
        /**
         * Public query variables.
         *
         * Long list of public query variables.
         *
         * @since 2.0.0
         * @since 5.5.1 Added 'sitemap', 'sitemap-subtype' and 'sitemap-stylesheet'
         * @var string[]
         */
        public $public_query_vars = array( 'm', 'p', 'posts', 'w', 'cat', 'withcomments', 'withoutcomments', 's', 'search', 'exact', 'sentence', 'calendar', 'page', 'paged', 'more', 'tb', 'pb', 'author', 'order', 'orderby', 'year', 'monthnum', 'day', 'hour', 'minute', 'second', 'name', 'category_name', 'tag', 'feed', 'author_name', 'pagename', 'page_id', 'error', 'attachment', 'attachment_id', 'subpost', 'subpost_id', 'preview', 'robots', 'favicon', 'taxonomy', 'term', 'cpage', 'post_type', 'embed', 'sitemap', 'sitemap-subtype', 'sitemap-stylesheet' );
}
...
}

and not call add_rewrite_tag() in WP_Sitemaps::add_rewrite_rules().

@swissspidy what do you think about that idea?

#16 in reply to: ↑ 12 @pbiron
5 weeks ago

Replying to peterwilsoncc:

I did investigate adding is_sitemap() and the required changes to WP and WP_Query() but either:

  • have no idea what I am doing
  • sitemaps don't use the main query

I suspect the former, tbh, they're pretty deep internals.

The idea of adding is_sitemap() did come up in https://github.com/GoogleChromeLabs/wp-sitemaps/issues/134, but there was an easier way to address the problem that introducing that would have solved.

Maybe it would still be good to add that, but I think that should be it's own ticket.

#17 @swissspidy
4 weeks ago

what do you think about that idea?

Makes sense to me

This ticket was mentioned in Slack in #core by audrasjb. View the logs.


4 weeks ago

This ticket was mentioned in Slack in #core by pbiron. View the logs.


4 weeks ago

@pbiron
3 weeks ago

#20 @pbiron
3 weeks ago

  • Keywords needs-testing added

50910.2.diff is just like 50910.diff, with the following minor differences:

  1. in redirect_canonical() it checks for the sitemaps case first, before checking for uses of paged. Functionally it's the same but it simplifies the conditional
  2. removes a "debugging" @group bar tag I had added to some of the unit tests...that allowed me to just run those tests while developing them

#21 follow-up: @pbiron
3 weeks ago

I investigated the suggestion in comment 15, but decided against making that change, at least for now, for a few reasons:

  1. that change is not needed the fix this problem
  2. the way the unit tests are now structured, that change isn't need when running the unit tests

It might be a worthwhile change to make on it's own, but I'll open another ticket where the pros/cons can be discussed.

While investigating that, however, I came across something odd. There is an existing unit test that fails for me locally and as far as I can tell, should be failing on Travis since sitemaps were merged into core.

It's the test in tests/phpunit/tests/query/vars.php, which was added in [36045] and changed in [36048]. That test can be run with:

phpunit --group 35115

It checks that $wp->public_query_vars is correct. But since sitemaps were merged into core, that "expected" array should contains sitemap, sitemap-subtype, sitemap-stylesheet which it doesn't and the test fails for me locally.

When the tests are run on travis, how has it been passing? Any ideas @SergeyBiryukov ?

This ticket was mentioned in Slack in #core-sitemaps by pbiron. View the logs.


3 weeks ago

#23 in reply to: ↑ 21 @peterwilsoncc
3 weeks ago

  1. that change is not needed the fix this problem
  2. the way the unit tests are now structured, that change isn't need when running the unit tests

Makes perfect sense on both accounts.

I've replaced my PR with your code, so I can get a full test run going.

While reviewing your code:

  • I found some repeated code in the redirect_canonical() block, you had if sitemaps, elseif paged or other stuff, elseif sitemaps. I've removed the repetition in the pull request.
  • I also added some canonical tests for ugly permalinks to ensure the sitemaps don't attempt to redirect when pretty permalinks are enabled.

The full pull request in #482 with your code, my tests. The changes I've made are in this commit.

I am still waiting for the unit tests to complete on the PR, once they're done I'll upload a new patch and I think it's good to go.

#24 @pbiron
3 weeks ago

  • Keywords commit added; needs-testing removed

Thanx for testing Peter!

Marking for commit.

This ticket was mentioned in Slack in #core by pbiron. View the logs.


3 weeks ago

#26 @peterwilsoncc
3 weeks ago

  • Owner changed from SergeyBiryukov to peterwilsoncc
  • Status changed from accepted to assigned

#27 @peterwilsoncc
3 weeks ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 48872:

Sitemaps: Prevent incorrect redirection of paged sitemap requests.

Update redirect_canonical() to account for custom pagination and URL format used by sitemaps in order to follow standard practices.

Introduce the function get_sitemap_url() to simplify getting the index and provider URLs as needed.

Props jonathanstegall, pbiron, GamerZ, salvoaranzulla, peterwilsoncc.
Fixes #50910.

#28 @peterwilsoncc
3 weeks ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening for backporting to 5.5 branch.

#29 @peterwilsoncc
3 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 48873:

Sitemaps: Prevent incorrect redirection of paged sitemap requests.

Update redirect_canonical() to account for custom pagination and URL format used by sitemaps in order to follow standard practices.

Introduce the function get_sitemap_url() to simplify getting the index and provider URLs as needed.

Props jonathanstegall, pbiron, GamerZ, salvoaranzulla, peterwilsoncc.
Merges [48872] to the 5.5 branch.
Fixes #50910.

Note: See TracTickets for help on using tickets.