Context Navigation

← Previous Ticket
Next Ticket →

#11175 closed defect (bug) (duplicate)

wp_check_invalid_utf8() should drop invalid utf-8 chars only instead of truncating string

Reported by:	sirzooro	Owned by:	hakre
Milestone:		Priority:	normal
Severity:	normal	Version:	2.9
Component:	Charset	Keywords:	has-patch tested needs-unit-tests dev-feedback
Focuses:		Cc:

Description

When you call wp_check_invalid_utf8() with 2nd param set to true, it tries to strip invalid utf-8. Now it removes 1st invalid utf-8 char and all chars after it, no matter if they are correct or not. Additionally it can print following notice:

Notice: iconv() [function.iconv]: Detected an illegal character in input string in .../wp-includes/formatting.php on line 437

Attached patch changes this, so function removes invalid chars only. Additionally it is less configuration-dependent, because it can use either mb_convert_encoding() or iconv().

Attachments (1)

formatting.php.diff (557 bytes) - added by sirzooro 16 years ago.

Download all attachments as: .zip

Change History (21)

@sirzooro
16 years ago

Attachment formatting.php.diff added

#1 @miqrogroove
16 years ago

When testing, please ensure this function never removes byte values less than 128. A situation where the function sees a 2-byte marker and then underflows, deletes two bytes and returns the results, would lead to code injection problems.

#2 @miqrogroove
16 years ago

Keywords tested added; needs-testing removed

iconv and mb_convert_encoding behave identically. Tested with various valid and invalid byte sequences. These functions do not check for code point validity, but they do a good job of stripping individual bytes that fail to decode.

Trivially noted disagreement with preg_match using invalid code points 0xD800 and 0x140000.

#3 @westi
16 years ago

Keywords needs-unit-tests added

Moved to 3.0 for now.

Need to add some unit tests for this.

#4 @nacin
16 years ago

Milestone changed from 2.9 to 3.0

Moving.

#5 @hakre
16 years ago

Component changed from General to Charset
Owner set to hakre

Reviewed the patch, looks good to me. Thanks for it, that really is an improvement. iconv rocks.

#6 @hakre
16 years ago

I think it's noteworthy to remind that wp_check_invalid_utf8() should not be used to check wether or not a string is a valid utf8 string. The function can lead to false results because it's broken.

Use seems_utf8() to verify a string beign properly UTF8 encoded or not.

#7 @hakre
15 years ago

seems_utf8() is broken as well, reference: #5998

#8 @hakre
15 years ago

Related: #10543

#9 @hakre
15 years ago

Since nor mb_string not iconv are non-default, I think there should be a fallback regex as well (I should be able to provide one w/o the /u modifier). So the function savely behaves on all installs.

To select which function to prefer, a shootout between all options should be run to checkout which one of those performs fastest.

#10 @hakre
15 years ago

Related: #11619

#11 @westi
15 years ago

Milestone changed from 3.0 to Future Release

This is too risky to do without alot of testing.

Truncating is usually a better response to invalid utf8 because it is very hard to determine what the real cause of invalidity is and where it is safe to start parsing the bytestream again.

Moving to future release until we have some detail tests for this.

#12 @miqrogroove
15 years ago

It is already tested as noted above.

#13 @SergeyBiryukov
14 years ago

Related: #15827

#14 @hakre
14 years ago

Related: #14618

#15 follow-up: ↓ 16 @jakub.tyrcha
14 years ago

See http://gsoc.svn.wordpress.org/2011/jakub.tyrcha for unit tests

#16 in reply to: ↑ 15 @westi
14 years ago

Replying to jakub.tyrcha:

See http://gsoc.svn.wordpress.org/2011/jakub.tyrcha for unit tests

Please could you create a ticket in the unit-tests trac and attach the patch there too.

#17 @chriscct7
11 years ago

Keywords dev-feedback added

@westi and @nacin what are your thoughts about continuing with this ticket? Should it be maybelatered?

#18 @askapache
11 years ago

Resolution set to duplicate
Status changed from new to closed

Duplicate of #29717.

#19 @DrewAPicture
11 years ago

Milestone Future Release deleted

#20 @SergeyBiryukov
5 years ago

#52131 was marked as a duplicate.

Note: See TracTickets for help on using tickets.

Trac UI Preferences

Download in other formats:

Make WordPress Core

Context Navigation

#11175 closed defect (bug) (duplicate)

wp_check_invalid_utf8() should drop invalid utf-8 chars only instead of truncating string

Description

Attachments (1)

Change History (21)

@sirzooro 16 years ago

#1 @miqrogroove 16 years ago

#2 @miqrogroove 16 years ago

#3 @westi 16 years ago

#4 @nacin 16 years ago

#5 @hakre 16 years ago

#6 @hakre 16 years ago

#7 @hakre 15 years ago

#8 @hakre 15 years ago

#9 @hakre 15 years ago

#10 @hakre 15 years ago

#11 @westi 15 years ago

#12 @miqrogroove 15 years ago

#13 @SergeyBiryukov 14 years ago

#14 @hakre 14 years ago

#15 follow-up: ↓ 16 @jakub.tyrcha 14 years ago

#16 in reply to: ↑ 15 @westi 14 years ago

#17 @chriscct7 11 years ago

#18 @askapache 11 years ago

#19 @DrewAPicture 11 years ago

#20 @SergeyBiryukov 5 years ago