Make WordPress Core

Opened 16 years ago

Closed 8 years ago

#5998 closed defect (bug) (wontfix)

Invalid Unicode characters

Reported by: shelleyp's profile shelleyp Owned by: hakre's profile hakre
Milestone: Priority: normal
Severity: normal Version: 2.3.3
Component: Charset Keywords:
Focuses: Cc:


Wordpress does not check for invalid Unicode characters, such as the following:


When the pages are served up as XHTML, allowing these characters through generates an XML error.

WordPress should filter out illegal Unicode code points.

Please see

Also, the regex
here is
incorrect, see [
this page].

Attachments (6)

bug5998.patch (2.1 KB) - added by schiller 16 years ago.
Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)
comment.php.diff (1.1 KB) - added by dwright 15 years ago.
wp-includes/comment.php.diff try php xml parser load, bails on non valid utf8
5998.patch (4.6 KB) - added by hakre 14 years ago.
correct working utf8 validation functions
5998-comment-php.3.patch (1.2 KB) - added by hakre 14 years ago.
refreshed against trunk
5998.2.patch (7.5 KB) - added by hakre 14 years ago.
5998-interrim.patch (1.9 KB) - added by hakre 14 years ago.
Formattings and wrong description in docblock (r2)

Download all attachments as: .zip

Change History (38)

#1 @shelleyp
16 years ago

This impacts on comments, ping backs, as well as search.

#2 @rubys
16 years ago

  • Cc rubys added

#3 follow-up: @schiller
16 years ago

  • Cc schiller added

16 years ago

Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)

#4 @bertilow
16 years ago

I think a case can be made for this not being WordPress's job at all. The checking should be done in MySQL. If the database has been correctly set to UTF-8, no invalid characters should ever be stored in the database in the first place. If they're not stored, then they will not show up or do damage.

It would still be wise the check for them though, just in case. But perhaps this bug should be forwarded to MySQL.

#5 @Denis-de-Bernardy
15 years ago

  • Component changed from General to Warnings/Notices
  • Owner anonymous deleted

#6 @peaceablewhale
15 years ago

  • Keywords early added; unicode invalid xhtml removed

#7 @Denis-de-Bernardy
15 years ago

  • Component changed from Warnings/Notices to Charset
  • Keywords needs-patch added; early removed
  • Owner set to hakre
  • Status changed from new to assigned

there are a could charset-related functions that should be used if this is still current, and enhanced if needed.

#8 @hakre
15 years ago

  • Keywords reporter-feedback added

The question is where the invalid bytes have their origin in. Illigeal code-points should be checked against in the application inputs stream (e.g. request data or data from the database) before touching the output logic. having a filter for the output would be too late.

it would be good to know more about the originals reporter case. more details and maybe steps to reproduce.

#9 @codedread
15 years ago

The reporter's case is the following:

Turn on true XHTML serving in WordPress (i.e. serving the application/xhtml+xml MIME type). Malicious commenters come in and inject invalid code points into a comment on the blog. WordPress does not handle some invalid code points (U+FFFE and U+FFFF specifically) so the result is that the blog now has a "Yellow Screen Of Death".

#10 @hakre
15 years ago

  • Component changed from Charset to Security
  • Severity changed from normal to critical

Looks like missing input validation for the comments. This leaves WordPress open to encoding related attacks btw. The "Yellow Screen Of Death" is a minor problem then.

Thanks for reporting and providing the information.

#11 @hakre
15 years ago

  • Keywords reporter-feedback removed

#12 @codedread
15 years ago

Note that the patch I provided does protect against the malicious characters in question, but I do not believe it does it at the right place - it only checks comments via the web form and not all possible means (not trackbacks, search queries).

#13 @hakre
15 years ago

Technically, the input (this time a post request) data's encoding must be checked against the form encoding. I must check the default theme what is fitting here, I would assume this is the blog charset option.

So checking for valid UTF8 is only fitting for UTF8 blogs (naturally). I think it is a good Idea to have at least UTF8 input validation as well as latin1. I dunno wehter other charset are officially supported by wordpress or not.

I do not know as well wether or not other input data is properly validated in terms of the encoding.

#14 @dwright
15 years ago

  • Cc david_v_wright@… added

In regards to the original reporter's case:

It could be strongly argued ( that it is not a good idea to turn on "true XHTML serving in WordPress" but if this is a valid request, it makes sense to resolve it.

This case is specific to XHTML/XML and UTF-8 and the fact that Firefox (for example) will 'choke' on non valid code points. (example  )

It makes sense to me, that since this related to XHTML/XML, handle it as such.

(although, is it acceptable to assume that the php xml parser will be avail in standard Wordpress/php setups?)

I am including a patch. (for comment's only, as that is what is specified in the original report)

btw: text/html mode is not effected by this issue. (current input filtering handles it)

15 years ago

wp-includes/comment.php.diff try php xml parser load, bails on non valid utf8

#15 @ryan
15 years ago

  • Milestone changed from 2.9 to Future Release

#16 in reply to: ↑ 3 @hakre
14 years ago

  • Milestone changed from Future Release to 3.0

Replying to schiller:
Can you please add more information where you got that corrected regex from? Have you written it on your own? Has it been reviewed? I would love to see more info about it. Thanks for the other references!

UTF-8 Regex documentation page (for the log)

#17 @hakre
14 years ago

  • Component changed from Security to Charset

While reviewing the ticket it came to my attentation that not both existing patches do apply clean any longer (not a big deal here).

I suggest to merge them both, offer correct working routines (first patch is UTF8 centric, I have plans to extend that for the XML wellformatness as well).

From those routines other parts of the software can benefit as well ( wp_check_invalid_utf8() should get some care an benefit ) so I strongly suggest to have them in. We do not have a solid and safe working utf-8 validation function right now.

#18 @hakre
14 years ago

wp_check_invalid_utf8() related: #11175

#19 @hakre
14 years ago

Related: #10543

14 years ago

correct working utf8 validation functions

14 years ago

refreshed against trunk

#20 @hakre
14 years ago

Merged patch. Refactored the xml-wellformed-validation into is_wellformed_xml() (file: formatting.php). Fixed another documentation issue in seems_utf8() and a bug in is_valid_utf8_statemachine().

14 years ago

#21 @hakre
14 years ago

wp_die should provide better status code on user errors: #11286

#22 @Denis-de-Bernardy
14 years ago

per remark in a separate ticket, is_valid_utf8_preg5998 has a problem with its regex. it's missing a delimiter and it could use (?: instead of ( in order to reduce backtracing.

#23 @hakre
14 years ago

Thanks for pointing to that, it would not be needed anyway and can be kicked out of the patch.

#24 @rubys
14 years ago

  • Cc rubys removed

#25 @nacin
14 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1
  • Severity changed from critical to major

#26 @hakre
14 years ago

Last patch wasn't 100% strict / double checked at all if I remember correctly. This needs a certain level of strictness in coding.

But I would suggest some minor code-changes (formatting) directly. I'll add another patch.

14 years ago

Formattings and wrong description in docblock (r2)

#27 @hakre
14 years ago

Related: #13418

#28 @nacin
14 years ago

  • Keywords early removed
  • Milestone changed from Awaiting Triage to Future Release

#29 @hakre
13 years ago

hmm. I thought the fix is easy.

@Nacin: You have an idea to which PHP version the fix should go back to?

#30 @chriscct7
10 years ago

  • Keywords needs-refresh added; needs-patch removed

#31 @chriscct7
8 years ago

  • Severity changed from major to normal

#32 @pento
8 years ago

  • Keywords needs-refresh removed
  • Milestone Future Release deleted
  • Resolution set to wontfix
  • Status changed from assigned to closed

Invalid UTF-8 is handled by the commits attached to #21212.

This was more of an issue when latin1 was more common, but the vast majority of WordPress sites use utf8 or utf8mb4 in MySQL now, so invalid characters won't be stored in the DB.

Note: See TracTickets for help on using tickets.