Make WordPress Core

Opened 7 weeks ago

Last modified 5 weeks ago

#64944 accepted defect (bug)

Generated Excerpts - Missing white space when stripping <br>s generated in paragraph block, verse block, etc.

Reported by: addiestavlo Owned by: audrasjb
Milestone: 7.1 Priority: normal
Severity: normal Version: 5.0
Component: Formatting Keywords: has-patch has-unit-tests
Focuses: Cc:

Description (last modified by sabernhardt)

When using the Verse block (or the Paragraph block with shift + enter line breaks), <br>s are added to the block's content between lines. When WordPress generates an excerpt (when no custom excerpt is set), these <br>s are stripped along with other HTML tags. This often creates excerpts with missing spaces between words.

Consider the common poetry formatting where multiple lines exist in a paragraph block to represent a stanza.

On the code side of the editor this looks like:

<!-- wp:verse -->
<pre class="wp-block-verse">this is<br>a verse block<br>it has<br>the same issues</pre>
<!-- /wp:verse -->

or

<!-- wp:paragraph -->
<p>This is a poem<br>using shift+space<br>Inside a paragraph block<br>for good stanza formatting</p>
<!-- /wp:paragraph -->

When WP generates an excerpt from the post content, this ends up as:
"this isa verse blockit hasthe same issues"
or
"This is a poemusing shift+spaceInside a paragraph blockfor good stanza formatting"

This shows up often in excerpts generated from content corresponding to poetry, song lyrics, or other similar formats. When excerpts are used in any context (post previews, email subject descriptions, etc.) these missing white spaces obviously look horrible.
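
A rough sketch of the mechanism, written in Python rather than PHP so it can be run outside WordPress. It is not core code: `strip_all_tags` and `split_words` are hypothetical stand-ins that only model "strip every tag, then split on whitespace", using the Verse block markup from above.

```python
import re

def strip_all_tags(text):
    # Simplified stand-in for tag stripping: drop anything shaped like <...>.
    return re.sub(r"<[^>]*>", "", text).strip()

def split_words(text):
    # Same whitespace class the ticket quotes for the preg_split in wp_trim_words.
    return [w for w in re.split(r"[\n\r\t ]+", text) if w]

verse = '<pre class="wp-block-verse">this is<br>a verse block<br>it has<br>the same issues</pre>'
stripped = strip_all_tags(verse)
words = split_words(stripped)
print(" ".join(words))  # this isa verse blockit hasthe same issues
```

Because the <br> is removed with nothing left in its place, there is no whitespace for the later split to find, so "is" and "a" are already one word by the time words are counted.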

To Reproduce (recently tested in WP Playground on 6.9):

  • Create a new post using a Paragraph or Verse block. In the Verse block, pressing Enter to add new lines reproduces the issue. In the Paragraph block, use shift + enter to create new lines within the block.
  • Do not create a custom excerpt.
  • Publish the post.
  • Run get_the_excerpt for the post.
  • Verify that there are no spaces between the last words of one line and first words of the next.

How to fix?

I am uncertain of the best approach to resolve this. Some notes:

wp_trim_excerpt - calls get_the_content when no excerpt text is passed to it. Later calls wp_trim_words

wp_trim_words - calls wp_strip_all_tags and later creates a $words_array using preg_split on the "/[\n\r\t ]+/" pattern.

wp_strip_all_tags - strips all the tags in a preg_replace. Later, if $remove_breaks is true, replaces '/[\r\n\t ]+/' patterns with spaces. In the current chain in this context $remove_breaks is false so this doesn't happen here, and the preg_split noted above in wp_trim_words will find these.

One thought: if wp_strip_all_tags similarly considered <br>s in the $remove_breaks block AND moved this handling before the preg_replace that strips tags, that seems like potentially a general improvement. If the goal is to replace breaks with spaces, then <br>s should be considered there. However, we don't pass $remove_breaks as true in our context coming from wp_trim_words, and it may not make sense to add that there.
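
To make this first idea concrete, here is a Python model of it (not the PHP in core). The function name mirrors wp_strip_all_tags only for readability; the `BR_TAG` pattern and the simplified tag-stripping regex are my assumptions.

```python
import re

# Assumed pattern for <br>, <br/>, <br /> and <BR>; not taken from core or the patch.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

def strip_all_tags(text, remove_breaks=False):
    if remove_breaks:
        # Proposed change: treat <br> as a break *before* the tags are stripped.
        text = BR_TAG.sub(" ", text)
    text = re.sub(r"<[^>]*>", "", text)  # simplified tag stripping
    if remove_breaks:
        # Existing behaviour described above: collapse whitespace runs to a space.
        text = re.sub(r"[\r\n\t ]+", " ", text)
    return text.strip()

paragraph = "<p>Line one<br>line two</p>"
with_breaks_removed = strip_all_tags(paragraph, remove_breaks=True)
default_call = strip_all_tags(paragraph)
print(with_breaks_removed)  # Line one line two
print(default_call)         # Line oneline two
```

The second call shows the limitation: the excerpt path uses the default, so this change alone would not fix generated excerpts.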

Another thought, would it make sense for wp_trim_words to replace <br>s with spaces before calling wp_strip_all_tags ? Those spaces would then be caught by the pattern in the preg_split creating the $words_array.

I am attaching a diff for the latter. <br> tags are stripped without preserving spacing, causing words to concatenate (e.g., ‘thisexample’). This replaces <br> with a space before tag stripping to preserve word boundaries.
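
The idea behind the diff can be sketched as follows. This is a Python model, not the attached patch: `trim_words` is a hypothetical stand-in for wp_trim_words, the `BR_TAG` pattern is an assumption, and the defaults (55 words, "&hellip;") reflect my understanding of core and should be checked against the source.

```python
import re

# Assumed pattern; the regex in the actual diff may differ.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

def trim_words(text, num_words=55, more="&hellip;"):
    # Step added by the fix: turn each <br> into a space before stripping tags.
    text = BR_TAG.sub(" ", text)
    # Simplified stand-in for wp_strip_all_tags.
    text = re.sub(r"<[^>]*>", "", text).strip()
    # Split on the whitespace class quoted above; the new spaces are found here.
    words = [w for w in re.split(r"[\n\r\t ]+", text) if w]
    if len(words) > num_words:
        return " ".join(words[:num_words]) + more
    return " ".join(words)

poem = ("<p>This is a poem<br>using shift+space<br>"
        "Inside a paragraph block<br>for good stanza formatting</p>")
excerpt = trim_words(poem)
print(excerpt)
```

Because the words are rejoined with a single space, any doubled whitespace introduced by the replacement (for example a <br> followed by a newline) does not reach the output.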

Attachments (1)

br-spacing-fix.diff (460 bytes) - added by addiestavlo 7 weeks ago.
<br> tags are stripped without preserving spacing, causing words to concatenate (e.g., ‘thisexample’). This replaces <br> with a space before tag stripping to preserve word boundaries. Note that tags are already stripped just after this, and newlines, returns, spaces, etc. are all later used to create the $words_array here. This helps retain expected generated excerpt behavior when using Verse and Paragraph (with shift+enter) blocks.


Change History (8)

This ticket was mentioned in PR #11352 on WordPress/wordpress-develop by Addison-Stavlo.


7 weeks ago
#1

  • Keywords has-unit-tests added

Ensures words around br tags are not concatenated together during wp_trim_words by replacing br tags with a space. This is done just before all tags are stripped and will ensure the words are actually separated when the $words_array is generated.

These br tags are common in the core block editor, as they can appear in paragraph blocks (shift+enter for line breaks), verse blocks, and likely more. For content written in forms similar to poetry or song lyrics, excerpts generated from the content stick words together, e.g. "Line one<br>line two" becomes "Line oneline two". This PR aims to resolve the problem at the source, in wp_trim_words.
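
The kinds of inputs worth covering in tests can be illustrated with a small Python model. This is not the PR's test suite: `words_after_fix` is a hypothetical helper, and the `BR_TAG` pattern is an assumption about which <br> spellings should count.

```python
import re

# Assumed pattern covering <br>, <br/>, <br />, uppercase, and attributes.
BR_TAG = re.compile(r"<br\b[^>]*>", re.IGNORECASE)

def words_after_fix(html):
    text = BR_TAG.sub(" ", html)         # br tags become word boundaries
    text = re.sub(r"<[^>]*>", "", text)  # all other tags are simply removed
    return [w for w in re.split(r"[\n\r\t ]+", text) if w]

br_variants = [
    "Line one<br>line two",
    "Line one<br/>line two",
    "Line one<br />line two",
    "Line one<BR>line two",
    'Line one<br class="x">line two',
    "Line one<br>\nline two",
]
results = [words_after_fix(html) for html in br_variants]
inline_tag = words_after_fix("<b>bold</b>text")

print(results[0])   # ['Line', 'one', 'line', 'two']
print(inline_tag)   # ['boldtext']
```

The last case is a guard against over-correcting: inline tags such as <b> do not imply a word boundary, so only br tags are replaced with a space.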

Trac ticket: https://core.trac.wordpress.org/ticket/64944

## Use of AI Tools

AI assistance: Yes
Tool(s): Cursor
Model(s): Composer 2
Used for: assistance with initial code investigation, assistance with generating regex pattern, initial test suggestions, and general writing directed by me. Placement of and suggested fix made by me, test reviewed and edited by me.

#2 @sabernhardt
7 weeks ago

  • Component changed from General to Formatting
  • Description modified (diff)
  • Version 6.9 deleted

(This can happen with blocks since WordPress 5.0, and it was possible earlier when adding br tags manually within the Code/Text view of the classic editor.)

Last edited 7 weeks ago by sabernhardt (previous) (diff)

#3 @audrasjb
6 weeks ago

  • Keywords needs-testing added
  • Milestone changed from Awaiting Review to 7.1
  • Version set to 5.0

Moving to 7.1 as we have a patch ready to be tested.

#4 @audrasjb
6 weeks ago

  • Owner set to audrasjb
  • Status changed from new to accepted

#5 @yashyadav247
5 weeks ago

Reproduction Report and Patch Testing

Description

This report validates whether the issue of missing white space when stripping <br>s generated in the Paragraph block, Verse block, etc. can be reproduced.

Environment

  • WordPress: 7.1-alpha-62161-src
  • PHP: 8.2.28
  • Server: nginx/1.29.0
  • Database: mysqli (Server: 8.4.5 / Client: mysqlnd 8.2.28)
  • Browser: Chrome 145.0.0.0
  • OS: Windows 10/11
  • Theme: Twenty Twenty-Five 1.4
  • MU Plugins: None activated
  • Plugins:
    • Test Reports 1.2.1

Actual Results

Created a new post using Verse block, and published the post. Ran get_the_excerpt for the post.
No spaces between the last words of one line and first words of the next. ✅

Supplemental Artifacts

BEFORE

Created verse block

https://i.postimg.cc/fb3zr2w9/image.png

Calling get_the_excerpt for the post shows no spaces in between.

https://i.postimg.cc/XJCvp2mv/bug-reproduction.png

AFTER Patch:

Spaces visible between the last words of one line and first words of the next ✅

https://i.postimg.cc/ZY36Bjz1/image.png

This ticket was mentioned in Slack in #core-test by gaisma22.


5 weeks ago

#7 @gaisma22
5 weeks ago

  • Keywords needs-testing removed

Patch Testing Report

Patch Tested: https://github.com/WordPress/wordpress-develop/pull/11352

Environment

  • WordPress: 7.0-beta6-62085-src
  • PHP: 8.3.30
  • Server: nginx/1.29.7
  • Database: MySQL 8.4.8
  • Browser: Brave
  • OS: Ubuntu 24.04
  • Theme: Twenty Twenty-Five 1.4
  • MU Plugins: None
  • Plugins: None

Steps Taken

  1. Created a new post using a Verse block with three lines separated by Enter.
  2. Published without a custom excerpt.
  3. Checked the generated excerpt via the REST API at /wp-json/wp/v2/posts/6. Before patch: Words from adjacent lines were stuck together with no spaces. e.g. "this is line onethis is line two"
  4. Created a new post with the same steps after applying PR #11352. Checked /wp-json/wp/v2/posts/9. After patch: Words are correctly separated by spaces. e.g. "this is line one this is line two this is line three"

✅ Patch is solving the problem

Expected Result

When WordPress generates an automatic excerpt from a post using a Verse block or Paragraph block with shift+enter line breaks, words from adjacent lines should be separated by spaces.

Additional Notes

  1. Bug confirmed on WordPress 7.0-beta6. The br tags in Verse block content were stripped without preserving word boundaries, causing words from adjacent lines to concatenate in the generated excerpt.
  2. Removing needs-testing as patch resolves the issue on WordPress 7.0-beta6-62085-src.

Screenshots/Screencast with results

Before Patch:
https://i.ibb.co/60Q5zz13/before-excerpt.png

After Patch:
https://i.ibb.co/j9jNbKWZ/after-excerpt.png
