Make WordPress Core

Opened 14 months ago

Closed 6 months ago

Last modified 6 months ago

#56294 closed enhancement (maybelater)

WordPress search finds block name in comment

Reported by: zodiac1978's profile zodiac1978 Owned by:
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 5.0
Component: Database Keywords: needs-patch dev-feedback
Focuses: performance Cc:

Description

There is a known issue with the WP search, that it is a full text search over
post_content which also finds HTML tags like table. So searching for the word "table" also finds every post/page with table markup in it.

This problem is very limited, so it wasn't necessary to fix it, although there is a plugin to fix it:
https://wordpress.org/plugins/wp-search-ignore-html-tags/

Now with the block editor (aka Gutenberg) this has changed. Every block is using the block name in a HTML comment. For example:

<!-- wp:syntaxhighlighter/code {"language":"php"} -->

If I now search in a tech blog about the term "syntaxhighlighter" I get every post/page with a code block and not only if the post/page is really containing this word in the text.

And with every new block the chance is higher to get more false positive search results.

Even the core blocks have problems, as "paragraph" (instead of just "p") or image (instead of "img") have a much higher chance for false positive search results, because of the ambiguity.

There is a Github issue for the block editor about it:
https://github.com/WordPress/gutenberg/issues/3739

But it was closed from @pento due to the fact, that it is a known WordPress issue and not necessarily a problem of the block editor and its type of data.

@danielbachhuber was asking at https://github.com/WordPress/gutenberg/issues/10307#issuecomment-426995580

However, I don't have any great ideas for how to resolve this with MySQL. I'd love to hear of a solution if someone has one. Barring that, this probably won't be a priority to fix with WP 5.0

After looking at the plugin linked above, I created a solution (with the support from @kau-boy):

/**
 * Modify search query to ignore the search term in HTML comments.
 *
 * @param string   $where The WHERE clause of the query.
 * @param WP_Query $query The WP_Query instance (passed by reference).
 *
 * @return string The modified WHERE clause.
 */
function tl_update_search_query( $where, $query ) {
    if ( ! is_search() || ! $query->is_main_query() ) {
        return $where;
    }
 
    global $wpdb;
    $search_query = get_search_query();
    $search_query = $wpdb->esc_like( $search_query );
 
    $where .= " AND {$wpdb->posts}.post_content NOT REGEXP '<!--.*$search_query.*-->' ";
 
    return $where;
}
add_filter( 'posts_where', 'tl_update_search_query', 10, 2 );

Before I try to create a PR for it. Would this be a possible way to solve this or is a NOT REGEXP too slow if many posts exist? I am running this solution on my blog but it has no high traffic and not very much posts - so my finding may not show the big picture here.

Feedback about possible problems (and hopefully how to solve them) are much appreciated! Thanks in advance.

Change History (17)

#1 @costdev
14 months ago

  • Focuses performance added

#2 @ravishaheshan
8 months ago

This is also problematic when there are attributes in the blocks like classes

<!-- wp:image {"id":157,"sizeSlug":"full","linkDestination":"none","className":"primary rounded"} -->

A search query for "primary" will yield all posts with blocks that has the class "primary"

#3 @zodiac1978
8 months ago

I've put the code from above in a plugin:
https://wordpress.org/plugins/ignore-block-name-in-search/

For everyone looking for a fast solution to the problem.

Still hoping to get some feedback if fiddling around with the posts_where filter is appropriate for WordPress core.

#4 @ravishaheshan
8 months ago

@zodiac1978 thanks. I'm thinking about adding a post meta field with sanitized post_content(w/o block tags and HTML tags) and modifying the search query to use that instead of the actual post_content field

Last edited 8 months ago by ravishaheshan (previous) (diff)

#5 @ravishaheshan
7 months ago

@zodiac1978 Your solution doesn't work when the search term is in both tags and content. ex: <!-- wp:core/paragraph -->this is a paragraph<!-- /wp:core/paragraph --> the post with this content doesn't get captured when searching for "paragraph" because of the AND clause in your query

#6 @zodiac1978
6 months ago

  • Resolution set to maybelater
  • Status changed from new to closed

Your solution doesn't work when the search term is in both tags and content.

@ravishaheshan That's true. Unfortunately I did not find a simple way to achieve this with MySQL below 8.

Since MySQL 8 there is REPLACE_REGEXP which would solve this:
https://stackoverflow.com/questions/986826/how-to-do-a-regular-expression-replace-in-mysql

I'm thinking about adding a post meta field with sanitized post_content(w/o block tags and HTML tags) and modifying the search query to use that instead of the actual post_content field

One way of solving this issue is using Relevanssi. It is already doing this indexing and is therefore solving this issue too:
https://wordpress.org/plugins/relevanssi/

So unless someone finds a way to do this with MySQL below 8 this can be closed as "maybelater".

#7 @ravishaheshan
6 months ago

@zodiac1978 thanks for your response. We do use "Relevanssi" in our projects. But by default, it doesn't remove Gutenberg annotations from the content so we do see invalid results in search results. Could you please explain how to do that if it isn't too much to ask? Thank you

Last edited 6 months ago by ravishaheshan (previous) (diff)

#8 follow-up: @zodiac1978
6 months ago

But by default, it doesn't remove Gutenberg annotations from the content

IIRC Relevanssi is building an index and this does not contain those comments. The content itself is untouched, and the search is not using the content anymore, but the index.

In my tests I couldn't find block names with Relevanssi used.

At the moment I try to get this at least working with MySQL 8.0.4 or above (where REGEXP_REPLACE was introduced), but couldn't get this to work. My latest try:

$where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";

This does not throw any error but is still not working ... :(

#9 @ravishaheshan
6 months ago

@zodiac1978 thanks. With MariaDB 10 I was able to get it working with following

add_filter('posts_search', function ($search, $query) {
    if ($query->is_search()
    && !is_admin()
    && (!isset($_GET['context']) || (isset($_GET['context']) && $_GET['context'] != 'edit'))
    ) {
        global $wpdb;
        $search_terms = $query->get('search_terms', []);
        foreach ($search_terms as $search_term) {
            $search_term = $wpdb->esc_like($search_term);
            $post_content_clause = $wpdb->prepare("wp_posts.post_content LIKE '%%%s%%'", $search_term);
            if (strpos($search, $post_content_clause) !== false) {
                $modified_post_content_clause = $wpdb->prepare(
                    "REGEXP_REPLACE(post_content, %s, '') LIKE '%%%s%%'",
                    "\<[^<>]*{$search_term}[^<>]*\>",
                    $search_term
                );
                // Replace search clause for post_content with the one that search
                // in tag sanitized content
                $search = str_replace(
                    $post_content_clause,
                    $modified_post_content_clause,
                    $search
                );
            }
        }
    }
    return $search;
}, 10, 2);

Unfortunately, we can't use it since most of our projects get hosted in WPEngine and they don't support MariaDB or MySQL 8. Right now we are trying to create a plugin that creates a separate table with sanitized content and MySQL FTS and modify the search query to use that. But Relevanssi works as you said that's much better for us.

NOTE: Not sure whether this will work with MySQL since it doesn't support lookaheads

Last edited 6 months ago by ravishaheshan (previous) (diff)

#10 in reply to: ↑ 8 @espiat
6 months ago

Replying to zodiac1978:

But by default, it doesn't remove Gutenberg annotations from the content

IIRC Relevanssi is building an index and this does not contain those comments. The content itself is untouched, and the search is not using the content anymore, but the index.

In my tests I couldn't find block names with Relevanssi used.

At the moment I try to get this at least working with MySQL 8.0.4 or above (where REGEXP_REPLACE was introduced), but couldn't get this to work. My latest try:

$where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";

This does not throw any error but is still not working ... :(

Hi.

With this

> $where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";

i find class names, too.

So i changed it. Now class names arent in the search results.

function tl_custom_search_query($search, $query)
{
    if (!is_search() || !$query->is_main_query()) {
        return $search;
    }
    global $wpdb;
    $search_query = get_search_query();
    $search_query = $wpdb->esc_like($search_query);
    // Remove the original search condition
    $search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search);
    // Add a custom search condition
    $search .=  " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '\"[^\"]*\"', ''), '<!-.*?->', '') LIKE '%{$search_query}%'";
    return $search;
}
add_filter('posts_search', 'tl_custom_search_query', 10, 2);

local test system:
mysql 8.0.16
nginx
php 8.0.0

#11 @zodiac1978
6 months ago

My code is working, but the problem is, that WordPress is adding more markup with classes, like <pre class="wp-block-syntaxhighlighter-code"> or <figure class="wp-block-gallery has-nested-images columns-default is-cropped"> why the search is still finding those posts/pages.

Thanks to @l1nuxjedi for helping me with some testing in MariaDB.

So the approach from @espiat to remove the class names is a good idea, but I think it needs some more refinement. At the moment, everything in quotes from the content is filtered out.

Maybe we can limit this to just in "<" and ">"?

#12 follow-up: @l1nuxjedi
6 months ago

In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/

#13 in reply to: ↑ 12 ; follow-ups: @zodiac1978
6 months ago

Replying to l1nuxjedi:

In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/

The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...

See: https://stackoverflow.com/a/1732454

And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).

#14 @espiat
6 months ago

Changed the syntax from excluding strings in double quotes, now excluding strings between < and >:

<?php
function tl_custom_search_query($search, $query)
{
    if (!is_search() || !$query->is_main_query()) {
        return $search;
    }

    global $wpdb;
    $search_query = get_search_query();
    $search_query = $wpdb->esc_like($search_query);

    // Remove the original search condition
    $search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search);

    // Add a custom search condition
    $search .=  " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '<.+?>', ''), '<!-.*?->', '') LIKE '%{$search_query}%'";

    return $search;
}
add_filter('posts_search', 'tl_custom_search_query', 10, 2);

PS: The search is one of the fundamental functions in a website and should be running without false results.

#15 in reply to: ↑ 13 @espiat
6 months ago

Replying to zodiac1978:

Replying to l1nuxjedi:

In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/

The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...

See: https://stackoverflow.com/a/1732454

And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).

Okay. Then we can change the code back to:

<?php
function tl_custom_search_query($search, $query)
{
    if (!is_search() || !$query->is_main_query()) {
        return $search;
    }
    global $wpdb;
    $search_query = get_search_query();
    $search_query = $wpdb->esc_like($search_query);
    // Remove the original search condition
    $search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search);
    // Add a custom search condition
    $search .=  " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '\"[^\"]*\"', ''), '<!-.*?->', '') LIKE '%{$search_query}%'";
    return $search;
}
add_filter('posts_search', 'tl_custom_search_query', 10, 2);

I was making a search in a default wp install. (no plugins)

and i was a bit surprised that the search in the core is currently very imprecise. If I search for "alt" then I get an article in the results that has ONLY an alt tag (the html img attribute) in the code:

<!-- wp:paragraph -->
<p>i am a para </p>
<!-- /wp:paragraph -->

<!-- wp:paragraph -->
<p></p>
<!-- /wp:paragraph -->

<!-- wp:image {"id":321,"sizeSlug":"large","linkDestination":"none"} -->
<figure class="wp-block-image size-large paragraph"><img src="https://olliedemo.local/wp-content/uploads/2023/03/about-792x1024.png" alt="" class="wp-image-321"/></figure>
<!-- /wp:image -->

That shouldn't happen either.
Is the search currently in such a bad state?

#16 in reply to: ↑ 13 ; follow-up: @l1nuxjedi
6 months ago

Replying to zodiac1978:

Replying to l1nuxjedi:

In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/

The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...

See: https://stackoverflow.com/a/1732454

And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).

I would hope in that case that you have literal < and > encoded using HTML entities. Otherwise the only real solution to avoid edge cases would be a proper HTML parser to retrieve the raw text from and parse that into some index generating code to use for the search.

Another suggestion would be an engine such as Sphinx as the backend instead, which can intelligently filter out meta tags. Not necessarily that one though. It has been over 15 years since I've built something similar to what you are trying to achieve here and I'm sure the correct technology has moved on since then.

#17 in reply to: ↑ 16 @zodiac1978
6 months ago

Replying to l1nuxjedi:
And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).

I would hope in that case that you have literal < and > encoded using HTML entities.

Of course you are right. Content is encoded in WordPress database posts table, so in the database there is only &lt;.

Otherwise the only real solution to avoid edge cases would be a proper HTML parser to retrieve the raw text from and parse that into some index generating code to use for the search.

Yes, but this would be out of scope for a core ticket, I think. ;)

Another suggestion would be an engine such as Sphinx as the backend instead, which can intelligently filter out meta tags. Not necessarily that one though. [...]

Yes, another one would be Elastic Search, but using an own search engine would be overkill for most users. My main goal with this ticket was examining the possibilities to fix the search without too much overhead (like duplicating the posts table with a cleaned content version as search index, like Relevanssi is doing it).

As WordPress is not requiring the versions which support REGEXP_REPLACE this will still be closed as maybelater, but I will fix my plugin, and we could re-evaluate this ticket if the requirements change.

Thank you all for helping!

Note: See TracTickets for help on using tickets.