Opened 7 years ago

Closed 4 years ago

#6077 closed defect (bug) (fixed)

UTF-8 strings are sometimes cut in the middle of a character

Reported by: nbachiyski
Owned by:
Milestone: 2.5
Priority: normal
Severity: normal
Version:
Component: General
Keywords: unicode utf-8 excerpt has-patch
Focuses:
Cc:

Description

Using substr on UTF-8 strings can cut some characters in the middle, because substr counts bytes, while a UTF-8 character can occupy more than one byte.
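To make the failure concrete, here is a minimal sketch (not taken from the patch) of a byte-based cut landing inside a two-byte character. The sample string and the variable names are illustrative assumptions; the validity check relies on preg_match with the /u modifier failing on malformed UTF-8, so it does not need the mbstring extension.

```php
<?php
// "naive" with a diaeresis on the i, written with explicit byte escapes so
// the example does not depend on the encoding of this source file.
// In UTF-8 that letter is the two bytes C3 AF.
$s = "na\xC3\xAFve";

// substr() counts bytes, so a 3-byte cut keeps only the first half
// of the two-byte character.
$cut = substr($s, 0, 3);

// preg_match() with the /u modifier returns false on invalid UTF-8,
// which makes it usable as a validity check.
$isValidUtf8 = (preg_match('//u', $cut) === 1);

echo strlen($s), "\n";     // byte length of the original string
echo bin2hex($cut), "\n";  // hex dump of the cut, ending in a stray lead byte
var_dump($isValidUtf8);    // whether the cut is still well-formed UTF-8
```

The hex dump ends in c3, the lead byte of the split character, and the validity check reports the cut string as malformed.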

Here is a patch, which:

  • Defines mb_strcut in compat.php for users who don't have the mbstring extension.
  • Introduces a new wp_html_excerpt function, which uses mb_strcut and works well with HTML strings: it strips tags and counts each entity as one character (&amp; counts as one character, not five).
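The two ideas in the list above can be sketched as follows. This is not the code from the attached patch: utf8_safe_cut and html_excerpt_sketch are hypothetical names, the sketch assumes PHP 7 or later and well-formed UTF-8 input, and it measures length in bytes without the entity handling the patch describes.

```php
<?php
// Hypothetical helper (not from the patch): cut to at most $maxBytes bytes
// without splitting a multi-byte UTF-8 character.
function utf8_safe_cut(string $str, int $maxBytes): string
{
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str;
    }
    $end = $maxBytes;
    // UTF-8 continuation bytes look like 10xxxxxx. If the first byte that
    // would be dropped is a continuation byte, the cut is inside a
    // character, so move the cut point back to that character's start.
    while ($end > 0 && (ord($str[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($str, 0, $end);
}

// Hypothetical excerpt helper: strip tags first, then cut safely.
function html_excerpt_sketch(string $html, int $maxBytes): string
{
    return utf8_safe_cut(strip_tags($html), $maxBytes);
}

// The third byte falls inside a two-byte character, so the cut backs off.
echo html_excerpt_sketch("<p>na\xC3\xAFve <b>text</b></p>", 3), "\n";
```

Backing off to the previous character boundary means the result can be slightly shorter than requested, but it is never malformed.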

There are some tests for the two functions:

Attachments (3)

safe-excerpts.diff (5.3 KB) - added by nbachiyski 7 years ago.
html_entity_decode.diff (560 bytes) - added by tenpura 7 years ago.
safe-excerpts-no-decode.diff (1.2 KB) - added by nbachiyski 7 years ago.


Change History (14)


comment:1 @nbachiyski 7 years ago

  • Keywords has-patch added

comment:2 @ryan 7 years ago

  • Resolution set to fixed
  • Status changed from new to closed

(In [7140]) Multi-byte character safe excerpting from nbachiyski. fixes #6077

comment:3 @tenpura 7 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:4 @nbachiyski 7 years ago

Oh, I was misled by the html_entity_decode manual, which says:

Version   Description
5.0.0     Support for multi-byte character sets was added.

I didn't notice that, just above this entry, the manual says most of the encodings we need have been supported since 4.3.0. So let's add the encoding, then.


comment:5 @tenpura 7 years ago

The manual says:

Any other character sets are not recognized and ISO-8859-1 will be used instead.

This fallback is a good thing, but PHP emits a warning in that case, so I updated the patch to add "@" anyway.
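As a rough illustration of the "@" approach described above, the sketch below wraps the call in a hypothetical helper. The function name and the made-up charset are assumptions, and the attached patch may apply the operator differently. Which default charset PHP falls back to depends on the PHP version.

```php
<?php
// Hypothetical wrapper; the name is an assumption, not part of the patch.
function decode_entities_quietly(string $text, string $charset): string
{
    // "@" suppresses the warning PHP raises for an unrecognized charset.
    // PHP then falls back to a default charset on its own.
    return @html_entity_decode($text, ENT_QUOTES, $charset);
}

// 'X-NOT-A-CHARSET' is a deliberately bogus charset name.
echo decode_entities_quietly('Fish &amp; chips', 'X-NOT-A-CHARSET'), "\n";
```

The trade-off is that "@" hides every warning from that call, not just the charset one, so a genuine problem with the input would also go unreported.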

comment:6 @tenpura 7 years ago

According to the PHP user notes, html_entity_decode() has a bug with UTF-8.
Maybe we should create a substitute function?

The bug should be reproducible with this code on PHP versions before 5.0.1:

echo html_entity_decode('€', ENT_QUOTES, 'UTF-8');

comment:7 @nbachiyski 7 years ago

We can just drop the entity decoding part. Yes -- the excerpt could be a couple of characters shorter than the specified length, but that's how it has worked up to now and nobody complained.
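A small sketch of why the excerpt can come out shorter once entities are no longer decoded before cutting. The sample text and the 12-byte limit are illustrative assumptions, and plain substr stands in for the real excerpt function. html_entity_decode is used here only to measure what a reader would see.

```php
<?php
// "Fish & chips" is 12 visible characters, but its HTML form is 16 bytes
// because the ampersand is written as the 5-byte entity "&amp;".
$raw = 'Fish &amp; chips';

// Cutting the raw HTML at 12 bytes spends 5 of them on one entity.
$excerpt = substr($raw, 0, 12);

// Decode only to count what the reader actually sees.
$visible = html_entity_decode($excerpt, ENT_QUOTES, 'UTF-8');

echo $excerpt, "\n";          // the raw excerpt, entity still encoded
echo strlen($visible), "\n";  // number of visible characters in it
```

Here the reader sees 8 characters instead of 12, which is the shortfall the comment accepts. A cut can also land inside an entity (for example after "&am"), which this sketch does not address.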

comment:8 @nbachiyski 7 years ago

Here is a patch that removes the entity decoding. The documentation and tests are updated as well.

comment:9 @westi 7 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [7190]) Remove the entity decoding and recoding from wp_html_excerpt. Fixes #6077 props nbachiyski.

comment:10 follow-up: @kurtmckee 4 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

It looks like this bug has resurfaced in the RSS2 comments feed at:

http://www.arnaudmontebourg.fr/?feed=rss2

I don't run that site, but this was reported as a bug in a software project I maintain:

https://code.google.com/p/feedparser/issues/detail?id=306

comment:11 in reply to: ↑ 10 @SergeyBiryukov 4 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to kurtmckee:

This ticket was closed on a completed milestone. Please open a new one.
