#56294 closed enhancement (maybelater)
WordPress search finds block name in comment
Reported by: |
|
Owned by: | |
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 5.0 |
Component: | Database | Keywords: | needs-patch dev-feedback |
Focuses: | performance | Cc: |
Description
There is a known issue with the WP search, that it is a full text search over
post_content
which also finds HTML tags like table
. So searching for the word "table" also finds every post/page with table markup in it.
This problem is very limited, so it wasn't necessary to fix it, although there is a plugin to fix it:
https://wordpress.org/plugins/wp-search-ignore-html-tags/
Now with the block editor (aka Gutenberg) this has changed. Every block is using the block name in a HTML comment. For example:
<!-- wp:syntaxhighlighter/code {"language":"php"} -->
If I now search in a tech blog about the term "syntaxhighlighter" I get every post/page with a code block and not only if the post/page is really containing this word in the text.
And with every new block the chance is higher to get more false positive search results.
Even the core blocks have problems, as "paragraph" (instead of just "p") or image (instead of "img") have a much higher chance for false positive search results, because of the ambiguity.
There is a Github issue for the block editor about it:
https://github.com/WordPress/gutenberg/issues/3739
But it was closed from @pento due to the fact, that it is a known WordPress issue and not necessarily a problem of the block editor and its type of data.
@danielbachhuber was asking at https://github.com/WordPress/gutenberg/issues/10307#issuecomment-426995580
However, I don't have any great ideas for how to resolve this with MySQL. I'd love to hear of a solution if someone has one. Barring that, this probably won't be a priority to fix with WP 5.0
After looking at the plugin linked above, I created a solution (with the support from @kau-boy):
/** * Modify search query to ignore the search term in HTML comments. * * @param string $where The WHERE clause of the query. * @param WP_Query $query The WP_Query instance (passed by reference). * * @return string The modified WHERE clause. */ function tl_update_search_query( $where, $query ) { if ( ! is_search() || ! $query->is_main_query() ) { return $where; } global $wpdb; $search_query = get_search_query(); $search_query = $wpdb->esc_like( $search_query ); $where .= " AND {$wpdb->posts}.post_content NOT REGEXP '<!--.*$search_query.*-->' "; return $where; } add_filter( 'posts_where', 'tl_update_search_query', 10, 2 );
Before I try to create a PR for it. Would this be a possible way to solve this or is a NOT REGEXP
too slow if many posts exist? I am running this solution on my blog but it has no high traffic and not very much posts - so my finding may not show the big picture here.
Feedback about possible problems (and hopefully how to solve them) are much appreciated! Thanks in advance.
Change History (22)
#3
@
2 years ago
I've put the code from above in a plugin:
https://wordpress.org/plugins/ignore-block-name-in-search/
For everyone looking for a fast solution to the problem.
Still hoping to get some feedback if fiddling around with the posts_where
filter is appropriate for WordPress core.
#4
@
2 years ago
@zodiac1978 thanks. I'm thinking about adding a post meta field with sanitized post_content(w/o block tags and HTML tags) and modifying the search query to use that instead of the actual post_content field
#5
@
2 years ago
@zodiac1978 Your solution doesn't work when the search term is in both tags and content. ex: <!-- wp:core/paragraph -->this is a paragraph<!-- /wp:core/paragraph -->
the post with this content doesn't get captured when searching for "paragraph" because of the AND
clause in your query
#6
@
2 years ago
- Resolution set to maybelater
- Status changed from new to closed
Your solution doesn't work when the search term is in both tags and content.
@ravishaheshan That's true. Unfortunately I did not find a simple way to achieve this with MySQL below 8.
Since MySQL 8 there is REPLACE_REGEXP which would solve this:
https://stackoverflow.com/questions/986826/how-to-do-a-regular-expression-replace-in-mysql
I'm thinking about adding a post meta field with sanitized post_content(w/o block tags and HTML tags) and modifying the search query to use that instead of the actual post_content field
One way of solving this issue is using Relevanssi. It is already doing this indexing and is therefore solving this issue too:
https://wordpress.org/plugins/relevanssi/
So unless someone finds a way to do this with MySQL below 8 this can be closed as "maybelater".
#7
@
2 years ago
@zodiac1978 thanks for your response. We do use "Relevanssi" in our projects. But by default, it doesn't remove Gutenberg annotations from the content so we do see invalid results in search results. Could you please explain how to do that if it isn't too much to ask? Thank you
#8
follow-up:
↓ 10
@
2 years ago
But by default, it doesn't remove Gutenberg annotations from the content
IIRC Relevanssi is building an index and this does not contain those comments. The content itself is untouched, and the search is not using the content anymore, but the index.
In my tests I couldn't find block names with Relevanssi used.
At the moment I try to get this at least working with MySQL 8.0.4 or above (where REGEXP_REPLACE was introduced), but couldn't get this to work. My latest try:
$where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";
This does not throw any error but is still not working ... :(
#9
@
2 years ago
@zodiac1978 thanks. With MariaDB 10 I was able to get it working with following
add_filter('posts_search', function ($search, $query) { if ($query->is_search() && !is_admin() && (!isset($_GET['context']) || (isset($_GET['context']) && $_GET['context'] != 'edit')) ) { global $wpdb; $search_terms = $query->get('search_terms', []); foreach ($search_terms as $search_term) { $search_term = $wpdb->esc_like($search_term); $post_content_clause = $wpdb->prepare("wp_posts.post_content LIKE '%%%s%%'", $search_term); if (strpos($search, $post_content_clause) !== false) { $modified_post_content_clause = $wpdb->prepare( "REGEXP_REPLACE(post_content, %s, '') LIKE '%%%s%%'", "\<[^<>]*{$search_term}[^<>]*\>", $search_term ); // Replace search clause for post_content with the one that search // in tag sanitized content $search = str_replace( $post_content_clause, $modified_post_content_clause, $search ); } } } return $search; }, 10, 2);
Unfortunately, we can't use it since most of our projects get hosted in WPEngine and they don't support MariaDB or MySQL 8. Right now we are trying to create a plugin that creates a separate table with sanitized content and MySQL FTS and modify the search query to use that. But Relevanssi works as you said that's much better for us.
NOTE: Not sure whether this will work with MySQL since it doesn't support lookaheads
#10
in reply to:
↑ 8
@
2 years ago
Replying to zodiac1978:
But by default, it doesn't remove Gutenberg annotations from the content
IIRC Relevanssi is building an index and this does not contain those comments. The content itself is untouched, and the search is not using the content anymore, but the index.
In my tests I couldn't find block names with Relevanssi used.
At the moment I try to get this at least working with MySQL 8.0.4 or above (where REGEXP_REPLACE was introduced), but couldn't get this to work. My latest try:
$where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";This does not throw any error but is still not working ... :(
Hi.
With this
> $where .= " AND REGEXP_REPLACE({$wpdb->posts}.post_content, '\<\!\-\-.*?\-\-\>', '') LIKE '%{$search_query}%'";
i find class names, too.
So i changed it. Now class names arent in the search results.
function tl_custom_search_query($search, $query)
{
if (!is_search() || !$query->is_main_query()) {
return $search;
}
global $wpdb;
$search_query = get_search_query();
$search_query = $wpdb->esc_like($search_query);
// Remove the original search condition
$search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search);
// Add a custom search condition
$search .= " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '\"[^\"]*\"', ''), '<!-.*?->', '') LIKE '%{$search_query}%'";
return $search;
}
add_filter('posts_search', 'tl_custom_search_query', 10, 2);
local test system:
mysql 8.0.16
nginx
php 8.0.0
#11
@
2 years ago
My code is working, but the problem is, that WordPress is adding more markup with classes, like <pre class="wp-block-syntaxhighlighter-code">
or <figure class="wp-block-gallery has-nested-images columns-default is-cropped">
why the search is still finding those posts/pages.
Thanks to @l1nuxjedi for helping me with some testing in MariaDB.
So the approach from @espiat to remove the class names is a good idea, but I think it needs some more refinement. At the moment, everything in quotes from the content is filtered out.
Maybe we can limit this to just in "<" and ">"?
#12
follow-up:
↓ 13
@
2 years ago
In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
#13
in reply to:
↑ 12
;
follow-ups:
↓ 15
↓ 16
@
2 years ago
Replying to l1nuxjedi:
In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...
See: https://stackoverflow.com/a/1732454
And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
#14
@
2 years ago
Changed the syntax from excluding strings in double quotes, now excluding strings between < and >:
<?php function tl_custom_search_query($search, $query) { if (!is_search() || !$query->is_main_query()) { return $search; } global $wpdb; $search_query = get_search_query(); $search_query = $wpdb->esc_like($search_query); // Remove the original search condition $search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search); // Add a custom search condition $search .= " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '<.+?>', ''), '<!-.*?->', '') LIKE '%{$search_query}%'"; return $search; } add_filter('posts_search', 'tl_custom_search_query', 10, 2);
PS: The search is one of the fundamental functions in a website and should be running without false results.
#15
in reply to:
↑ 13
@
2 years ago
Replying to zodiac1978:
Replying to l1nuxjedi:
In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...
See: https://stackoverflow.com/a/1732454
And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
Okay. Then we can change the code back to:
<?php function tl_custom_search_query($search, $query) { if (!is_search() || !$query->is_main_query()) { return $search; } global $wpdb; $search_query = get_search_query(); $search_query = $wpdb->esc_like($search_query); // Remove the original search condition $search = preg_replace("/\({$wpdb->posts}.post_content LIKE '%[^%]+%'\)/", "", $search); // Add a custom search condition $search .= " AND REGEXP_REPLACE(REGEXP_REPLACE({$wpdb->posts}.post_content, '\"[^\"]*\"', ''), '<!-.*?->', '') LIKE '%{$search_query}%'"; return $search; } add_filter('posts_search', 'tl_custom_search_query', 10, 2);
I was making a search in a default wp install. (no plugins)
and i was a bit surprised that the search in the core is currently very imprecise. If I search for "alt" then I get an article in the results that has ONLY an alt tag (the html img attribute) in the code:
<!-- wp:paragraph --> <p>i am a para </p> <!-- /wp:paragraph --> <!-- wp:paragraph --> <p></p> <!-- /wp:paragraph --> <!-- wp:image {"id":321,"sizeSlug":"large","linkDestination":"none"} --> <figure class="wp-block-image size-large paragraph"><img src="https://olliedemo.local/wp-content/uploads/2023/03/about-792x1024.png" alt="" class="wp-image-321"/></figure> <!-- /wp:image -->
That shouldn't happen either.
Is the search currently in such a bad state?
#16
in reply to:
↑ 13
;
follow-up:
↓ 17
@
2 years ago
Replying to zodiac1978:
Replying to l1nuxjedi:
In fact the first example here shows how to modify that DB query to filter all meta tags: https://mariadb.com/kb/en/regexp_replace/
The problem with parsing HTML with RegEx is, that there are so many edge cases that will break it. One "<" in the content is filtering out everything until the next closing ">" for example ...
See: https://stackoverflow.com/a/1732454
And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
I would hope in that case that you have literal < and > encoded using HTML entities. Otherwise the only real solution to avoid edge cases would be a proper HTML parser to retrieve the raw text from and parse that into some index generating code to use for the search.
Another suggestion would be an engine such as Sphinx as the backend instead, which can intelligently filter out meta tags. Not necessarily that one though. It has been over 15 years since I've built something similar to what you are trying to achieve here and I'm sure the correct technology has moved on since then.
#17
in reply to:
↑ 16
@
2 years ago
Replying to l1nuxjedi:
And with the example from the MariaDB knowledge base: https://regex101.com/r/CY0zuJ/1 (just added "5 < 1" in the content).
I would hope in that case that you have literal < and > encoded using HTML entities.
Of course you are right. Content is encoded in WordPress database posts table, so in the database there is only <
.
Otherwise the only real solution to avoid edge cases would be a proper HTML parser to retrieve the raw text from and parse that into some index generating code to use for the search.
Yes, but this would be out of scope for a core ticket, I think. ;)
Another suggestion would be an engine such as Sphinx as the backend instead, which can intelligently filter out meta tags. Not necessarily that one though. [...]
Yes, another one would be Elastic Search, but using an own search engine would be overkill for most users. My main goal with this ticket was examining the possibilities to fix the search without too much overhead (like duplicating the posts table with a cleaned content version as search index, like Relevanssi is doing it).
As WordPress is not requiring the versions which support REGEXP_REPLACE this will still be closed as maybelater
, but I will fix my plugin, and we could re-evaluate this ticket if the requirements change.
Thank you all for helping!
#19
@
10 months ago
I would hope in that case that you have literal < and > encoded using HTML entities.
Unfortunately this is not the case every time:
See: #45387
Also: https://github.com/WordPress/gutenberg/issues/15636
#20
follow-up:
↓ 21
@
9 months ago
What an interesting and exciting challenge to solve. I'll share some of my own thoughts, having worked on search indexing in different platforms and having worked on the serialized block HTML at all levels.
Concerning the use of functions like REPLACE_REGEX
I really caution folks to consider what those are implying on the database when performing a search. They end up parsing and modifying every row on every search. For small test sites this probably never amounts to much, but perhaps on a site with thousands of posts and thousands of daily visitors, this could rapidly overwhelm the database. I don't see computing the search index on every search query as a super viable option.
That being said, the discussions about storing a kind of transformed post in another location would make searching easier with the existing toolsets and performance characteristics. Post meta is a convenient approach, but may not be the most ideal for similar performance reasons. It could be similarly computed as an additional column on the post row in the database, or in a separate database table just for post indexes.
The HTML API finally provides powerful and reliable tools for searching the rendered or text content of a post. For example, WordPress could store a plaintext view over a post every time it updates a post, and searches can be performed against that. This not only would work around the challenges posed by the block content, but also the very same challenges which have always existed within WordPress' search. For example, it's always been the case that if you search for form
or code
or template
and a post contains those tags, that the search will return those false results.
Generating the text content has never been easier, and because of the HTML API interface it also gets around unexpected constructions involving character rerferences, as it always decodes them before presenting the string values to calling PHP code.
<?php function get_text_content( $html ) { $text_content = ''; $processor = new WP_HTML_Tag_Processor( $html ); while ( $processor->next_token() ) { if ( '#text' === $processor->get_token_name() ) { $text_content .= $processor->get_modifiable_text(); } } return $text_content; }
Upon this foundation all sorts of stronger search indices can be built and then searched.
#21
in reply to:
↑ 20
@
9 months ago
Replying to dmsnell:
What an interesting and exciting challenge to solve.
Thanks for chiming in!
[...] I really caution folks to consider what those are implying on the database when performing a search. [...] I don't see computing the search index on every search query as a super viable option.
Yes, that was my thought too. I just wasn't sure how much the impact is in real world examples.
[...] storing a kind of transformed post in another location would make searching easier with the existing toolsets and performance characteristics.
Do you think this would be possible for WordPress core to add such a new table column for searching? My thought was, that this would be plugin territory.
Upon this foundation all sorts of stronger search indices can be built and then searched.
I would be happy if this could be finally tackled in this way. I dismissed this idea, because I never thought WordPress core would do it. But if you think this could come to core: Let's go for it! I'm happily throw in my thoughts, ideas and will test this feature on the way.
#22
@
9 months ago
I just wasn't sure how much the impact is in real world examples.
It would be good to benchmark, but I have a strong suspicion it's squarely in the untenable category. Frankly, the full-table scans involved purely in checking if a literal byte sequence are in a post is heavy enough.
One of the perks of creating a new search index is the collapse of the data space required to search. The search results performance should scale with the diversity of content on a site, not on the number of posts or the lengths of those posts.
For instance, one of the most basic things to consider is storing a lookup from search words to posts containing those search words. If we search for "apple" we shouldn't have to scan 1,000 posts looking for apple
and running into all of the related problems reported in this ticket.
Instead, we could have a lookup table containing every "word" (and I'm overlooking a lot of nuance when I say "word") in every post, and for each word, it lists the posts containing that word. We look up "apple" very rapidly and that may return a much smaller set of posts which we can then scan to perform the second and more refined search pass.
Do you think this would be possible for WordPress core to add such a new table column for searching? My thought was, that this would be plugin territory.
If I were to start today I would create a separate feature plugin and explore how to do it in a Core way. It's likely that it will take a lot of work and refinement based on the feedback people provide. For instance, is it appropriate simply to build a better default search algorithm, or would it be more valuable to create a pluggable search indexing system that someone could plug Elasticsearch into, or their custom search backend?
Maybe WordPress can develop the internal mechanisms needed for issuing content to the indexer and updating indexed objects while the actual indexing and retrieval becomes secondary.
Based on a number of chats I've had with @zieladam about synchronizing multiple WordPress instances, I think the "vector clock" state-tracking table we've explored could be a bit fortuitous for search indexing. Search indexes are a system based on the need to keep external data in sync with the reference data.
https://core.trac.wordpress.org/ticket/60375#comment:24
I dismissed this idea, because I never thought WordPress core would do it.
Who is Core if not those of us using it and reporting issues and striving to make WordPress grow stronger?
This is also problematic when there are attributes in the blocks like classes
A search query for "primary" will yield all posts with blocks that has the class "primary"