Opened 5 years ago
Closed 18 months ago
#6077 closed defect (bug) (fixed)
UTF-8 strings are sometimes cut in the middle of a character
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | 2.5 |
| Component: | General | Version: | |
| Severity: | normal | Keywords: | unicode utf-8 excerpt has-patch |
| Cc: | | | |
Description
Using substr on UTF-8 strings can cut some characters in the middle, because substr counts bytes, while in UTF-8 a single character can occupy more than one byte.
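A minimal sketch of the problem, for illustration only. The sample string and the helper name sketch_is_valid_utf8 are made up for this example and are not part of the patch; the validity check assumes PCRE was built with UTF-8 support.

```php
<?php
// Hypothetical helper: a /u-mode regex refuses to run on invalid UTF-8,
// so preg_match() returns 1 only when the whole string is well-formed.
function sketch_is_valid_utf8($s) {
    return preg_match('//u', $s) === 1;
}

// "café" written with explicit bytes: "é" is the two-byte sequence 0xC3 0xA9.
$text = "caf\xC3\xA9";

// substr() counts bytes, so asking for 4 "characters" keeps only the
// first half of "é" and leaves a dangling lead byte at the end.
$cut = substr($text, 0, 4);

var_dump(strlen($text));              // byte length, not character count
var_dump(sketch_is_valid_utf8($text)); // the original string is well-formed
var_dump(sketch_is_valid_utf8($cut));  // the truncated string is not
```

Here the string has 4 characters but 5 bytes, so a cut at byte 4 falls inside "é".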
Here is a patch, which:
- Defines mb_strcut in compat.php for users who don't have the mbstring extension.
- Introduces a new wp_html_excerpt function, which uses mb_strcut and works well with HTML strings: it strips tags and counts each entity as one character (&amp; counts as one character, not five).
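The idea behind a byte-safe cut can be sketched as follows. This is not the code from the patch: sketch_utf8_cut is a hypothetical name, it only cuts from the start of the string, and it assumes the input is valid UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx.

```php
<?php
// Hypothetical fallback for cutting a UTF-8 string to at most $maxBytes
// bytes without splitting a character. Assumes $str is valid UTF-8.
function sketch_utf8_cut($str, $maxBytes) {
    if ($maxBytes <= 0) {
        return '';
    }
    if (strlen($str) <= $maxBytes) {
        return $str; // nothing to cut
    }
    $end = $maxBytes;
    // If the first byte after the cut is a continuation byte (10xxxxxx),
    // the cut falls inside a character: back up to that character's start.
    while ($end > 0 && (ord($str[$end]) & 0xC0) === 0x80) {
        $end--;
    }
    return substr($str, 0, $end);
}

// "héllo" with "é" as 0xC3 0xA9: a 2-byte limit would split "é",
// so the sketch backs up and keeps only "h".
var_dump(sketch_utf8_cut("h\xC3\xA9llo", 2));
var_dump(sketch_utf8_cut("h\xC3\xA9llo", 3));
```

The result may be shorter than the requested byte count, but it never ends in a partial character.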
There are some tests for the two functions:
- _mb_strcut
- wp_html_excerpt (at the end of the file)
Attachments (3)
Change History (14)
nbachiyski — 5 years ago
comment:1
nbachiyski — 5 years ago
- Keywords has-patch added
comment:4
nbachiyski — 5 years ago
Oh, I was misled by the html_entity_decode manual, which says:
Version Description 5.0.0 Support for multi-byte character sets was added.
I didn't notice that the text above this note says most of the encodings we need have been supported since 4.3.0. So, let's add the encoding then.
The manual says:
Any other character sets are not recognized and ISO-8859-1 will be used instead.
This is a good thing, but it outputs warnings in that case, so I updated the patch to add "@" anyway.
According to the PHP user notes, html_entity_decode() has a bug with UTF-8.
Maybe we should create a substitute function?
The bug should be reproducible with this code on PHP versions before 5.0.1:
echo html_entity_decode('€', ENT_QUOTES, 'UTF-8');
comment:7
nbachiyski — 5 years ago
We can just drop the entity decoding part. Yes -- the excerpt could be a couple of characters shorter than the specified length, but that's how it has worked up to now and nobody complained.
comment:8
nbachiyski — 5 years ago
Here is a patch that removes entity decoding. The documentation and tests are also updated.
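An excerpt function without entity decoding might look roughly like the sketch below. It is an illustration under stated assumptions, not the attached patch: sketch_html_excerpt is a hypothetical name, the limit is measured in bytes, the input is assumed to be valid UTF-8, and the final step that drops an entity cut off at the end is an assumption of this sketch.

```php
<?php
// Hypothetical excerpt helper: strips tags, cuts to at most $maxBytes
// bytes, and never decodes entities (so "&amp;" counts as 5 bytes).
function sketch_html_excerpt($html, $maxBytes) {
    if ($maxBytes <= 0) {
        return '';
    }
    $plain = strip_tags($html);
    $cut = (string) substr($plain, 0, $maxBytes);
    // A /u-mode regex fails on invalid UTF-8, so trim trailing bytes
    // until the string is well-formed again (at most 3 bytes for valid input).
    while ($cut !== '' && preg_match('//u', $cut) !== 1) {
        $cut = substr($cut, 0, -1);
    }
    // Assumption of this sketch: drop an entity that was cut off at the
    // end, such as "&am". This can also remove a literal trailing "&...".
    return preg_replace('/&[^;\s]{0,6}$/', '', $cut);
}

$html = "<p>caf\xC3\xA9 &amp; tea</p>";
var_dump(sketch_html_excerpt($html, 4));  // cut would split "é"
var_dump(sketch_html_excerpt($html, 8));  // cut would split "&amp;"
var_dump(sketch_html_excerpt($html, 11)); // entity fits completely
```

Because entities are counted at their full encoded length, the visible excerpt can come out a few characters shorter than the requested limit, which matches the trade-off described above.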
nbachiyski — 5 years ago
- Resolution set to fixed
- Status changed from reopened to closed
comment:10
kurtmckee — 18 months ago
- Resolution fixed deleted
- Status changed from closed to reopened
It looks like this bug has resurfaced in the RSS2 comments feed at:
http://www.arnaudmontebourg.fr/?feed=rss2
I don't run that site, but this was reported as a bug in a software project I maintain:
comment:11
SergeyBiryukov — 18 months ago
- Resolution set to fixed
- Status changed from reopened to closed
Replying to kurtmckee:
This ticket was closed on a completed milestone. Please open a new one.

(In [7140]) Multi-byte character safe excerpting from nbachiyski. fixes #6077