Ticket #10376 (closed enhancement: fixed)

Opened 3 years ago

Last modified 23 months ago

White space should be collapsed after tags are stripped in wp_trim_excerpt

Reported by: sojweb Owned by:
Priority: normal Milestone: 3.0
Component: Formatting Version: 2.8.1
Severity: normal Keywords: has-patch wp_trim_excerpt
Cc: scribu@…, ryanmccue@…

Description

I've a minor edit I'd like to propose: When wp_trim_excerpt strips tags from the post content, it leaves the white space surrounding those tags, and then pulls the first 55 characters, which could be blank. So, for a lot of my posts that had a big set of HTML tags at the beginning, the excerpt would be empty. The work around is easy: collapse the white space (which the browser would do anyway). I hope it seems logical. Diff file attached.

Attachments

formatting.php.diff Download (580 bytes) - added by sojweb 3 years ago.
formatting-preg-split.php.diff Download (668 bytes) - added by sojweb 3 years ago.
10376.2.diff Download (613 bytes) - added by Denis-de-Bernardy 2 years ago.
untested

Change History

  • Keywords has-patch, the_excerpt, wp_trim_excerpt added; has-patch removed

sojweb3 years ago

*That should read: "first 55 words" in the description

  • Cc scribu@… added
  • Milestone changed from Future Release to 2.8.2

Wouldn't a simple trim() do the job?

I agree, trim() would be much simpler and probably more efficient. Make sure it's after the strip_tags(), otherwise there could be whitespace in the middle being left there.

  • Cc ryanmccue@… added

comment:6   dd323 years ago

No. a trim would not work.

the issue here, is that

$in = "A<test> <test> <test> B";
$in = strip_tags($in);
echo $in; //is now 3 spaces between a and b

when you explode by spaced, you end up with spaces being words, ie.

$in = "   ";
$in = explode(' ', $in);
var_dump($in);
---
array
  0 => string '' (length=0)
  1 => string '' (length=0)
  2 => string '' (length=0)
  3 => string '' (length=0)

This is what i'd use myself to get the words:

$a = "          This is a test";
preg_match_all('|\b([^\W]+?)\b|', $a, $words);
$words = $words[1];

result:

array
  0 => string 'This' (length=4)
  1 => string 'is' (length=2)
  2 => string 'a' (length=1)
  3 => string 'test' (length=4)

Considering that strip_tags is run after wpautop() and the other filters on 'the_content', the pseudo-excerpt will always be one line text. If we want to avoid normalizing white space for the whole post, perhaps can use preg_split:

$words = preg_split('/\s+/', $text, $excerpt_length + 1, PREG_SPLIT_NO_EMPTY);

Probably shouldn't count on wpautotop() running, as a lot of people (myself included) have it disabled, but preg_split would be just as good, just don't forget to strip tags from $text; otherwise you will probably end up with broken HTML. I've attached a diff file that uses it.

  • Milestone changed from 2.8.2 to 2.9

@sojweb: why the move of the strip_tags call in the patch?

comment:10 follow-up: ↓ 11   sojweb3 years ago

@Denis-de-Bernardy: You mean why remove it from its own line? It makes for one less variable assignment... I suppose I did it more out of habit than anything else. It doesn't make any real difference.

comment:11 in reply to: ↑ 10   Denis-de-Bernardy3 years ago

Replying to sojweb:

@Denis-de-Bernardy: You mean why remove it from its own line? It makes for one less variable assignment... I suppose I did it more out of habit than anything else. It doesn't make any real difference.

It makes things less readable? :-)

D.

Readable? l33t h4x0r5 do not need readable code ;-).

Yayaya, and "If it was hard for me to write it, it should be hard for them to understand it!" Still... :-)

comment:14 in reply to: ↑ 13   sojweb3 years ago

Replying to Denis-de-Bernardy:

Yayaya, and "If it was hard for me to write it, it should be hard for them to understand it!" Still... :-)

Ha, exactly!

No, I'm totally with you... readability is worth the extra line.

  • Milestone changed from 2.9 to 3.0
  • Keywords has-patch added; has-patch, the_excerpt, removed
  • Keywords needs-patch added; has-patch removed

btw, the current patch is not utf8-safe.

  • Milestone changed from 3.0 to 3.1

untested

  • Milestone changed from 3.1 to 3.0

I've added a utf8-safe version of the suggested patch, in case anyone's willing to test this.

  • Keywords has-patch added; needs-patch removed
  • Status changed from new to closed
  • Resolution set to fixed

(In [13942]) A better default except, Remove multiple white spaces from the except as well as splitting safely on UTF8 strings. Props Denis-de-Bernardy for the UTF8 split. Fixes #10376

I know it doesn't really matter, but in regex, isn't [\n\r\t ]+ equivalent to \s+ ?

Nevermind, I saw the comment above.

Note: See TracTickets for help on using tickets.