WordPress.org

Make WordPress Core

Opened 3 months ago

Last modified 3 months ago

#51860 new enhancement

wp-sitemap includes empty post entries returned by filters

Reported by: jsmoriss Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 5.5
Component: Sitemaps Keywords:
Focuses: Cc:

Description

The ./wp-includes/sitemaps/providers/class-wp-sitemaps-posts.php class applies the 'wp_sitemaps_posts_show_on_front_entry' and 'wp_sitemaps_posts_entry' filter, which may enrich the $sitemap_entry, or return false, or return an empty array (as another way to exclude a post from the sitemap, for example). This is preferable if the post needs to be excluded based on some post meta, etc. Currently, the empty $sitemap_entry is added to the wp-sitemap.xml, which is incorrect. The empty $sitemap_entry should be skipped.

Attachments (2)

class-wp-sitemaps-posts.php.diff (991 bytes) - added by jsmoriss 3 months ago.
Diff to test $sitemap_entry returned by filters and skip empty values (false and empty array).
class-wp-sitemaps-posts.php.2.diff (986 bytes) - added by jsmoriss 3 months ago.
Diff against 5.5.3.

Download all attachments as: .zip

Change History (7)

@jsmoriss
3 months ago

Diff to test $sitemap_entry returned by filters and skip empty values (false and empty array).

@jsmoriss
3 months ago

Diff against 5.5.3.

#1 follow-up: @swissspidy
3 months ago

  • Type changed from defect (bug) to enhancement

return an empty array (as another way to exclude a post from the sitemap, for example)

Just for reference, the filter was actually deliberately implemented so that this is _not_ possible as it would mess with pagination. So it was never recommended to return an empty array. Instead, the WP_Query args should be filtered to exclude posts.

#2 @jsmoriss
3 months ago

To filter posts by complex meta, you would have to run a query to get all posts, loop through each and figure out which ones need to be excluded, and then add the excluded post IDs to the 'wp_sitemaps_posts_query_args' query. Running double queries does not sound optimal to me.

#3 in reply to: ↑ 1 @jsmoriss
3 months ago

Replying to swissspidy:

return an empty array (as another way to exclude a post from the sitemap, for example)

Just for reference, the filter was actually deliberately implemented so that this is _not_ possible as it would mess with pagination. So it was never recommended to return an empty array. Instead, the WP_Query args should be filtered to exclude posts.

The 'wp_sitemaps_posts_query_args' filter is applied when counting the maximum number of pages, and for each page, so unless a filter hook is coded properly, it has the potential to affect performance (having to be applied for the counter and each page).

Using a static cache, I was able to create a filter that, I believe, will avoid affecting performance too much. I'll leave this example here, in case anyone else needs to exclude many posts, terms, or users from the query. Note that the excluded post IDs are merged from the cache, which allows other filters to exclude additional post IDs before/after this filter.

public function wp_sitemaps_posts_query_args( $args, $post_type ) {

        static $local_cache = array();

        if ( ! isset( $local_cache[ $post_type ] ) ) {

                $local_cache[ $post_type ] = array();

                $query = new WP_Query( array_merge( $args, array(
                        'fields'        => 'ids',       // Return an array of post ids.
                        'no_found_rows' => true,        // Skip counting total rows found.
                ) ) );

                foreach ( $query->posts as $post_id ) {

                        if ( is_post_excluded_example_function( $post_id ) ) {

                                $local_cache[ $post_type ][] = $post_id;
                        }
                }
        }

        if ( ! empty( $local_cache[ $post_type ] ) ) {

                $args[ 'post__not_in' ] = empty( $args[ 'post__not_in' ] ) ? $local_cache[ $post_type ] :
                        array_merge( $args[ 'post__not_in' ], $local_cache[ $post_type ] );
        }

        return $args;
}

#4 follow-up: @Cybr
3 months ago

I faced the same issue; discussed at Slack https://wordpress.slack.com/archives/CTKTGNJJW/p1604995479019700.

A proper workaround to this would be to maintain an active cache with page/term/author IDs you wish to exclude. Then, inject those into the query as exclusions. However, this is not an easy feat since WordPress is quite dynamic.

I believe Yoast SEO provides such a system with "Indexables," something that's challenging to incorporate and maintain.

#5 in reply to: ↑ 4 @jsmoriss
3 months ago

Replying to Cybr:

I faced the same issue; discussed at Slack https://wordpress.slack.com/archives/CTKTGNJJW/p1604995479019700.

A proper workaround to this would be to maintain an active cache with page/term/author IDs you wish to exclude. Then, inject those into the query as exclusions. However, this is not an easy feat since WordPress is quite dynamic.

I believe Yoast SEO provides such a system with "Indexables," something that's challenging to incorporate and maintain.

I see you faced the same issue I did. I believe I found a working solution by using a static cache variable to create the exclusion list once per page load. Since XML sitemaps are not retrieved as often as a PHP webpage, this is probably fine, but for maximum performance, I would consider storing the exclusion list in transient cache and hooking a 'clear cache' action to clear the exclusion list transient when posts, terms, or users are updated.

https://github.com/SurniaUlula/wpsso/blob/master/lib/wp-sitemaps.php

js.

Note: See TracTickets for help on using tickets.