WordPress.org

Make WordPress Core

Opened 7 years ago

Last modified 10 months ago

#5998 assigned defect (bug)

Invalid Unicode characters

Reported by: shelleyp Owned by: hakre
Milestone: Future Release Priority: normal
Severity: major Version: 2.3.3
Component: Charset Keywords: needs-refresh
Focuses: Cc:

Description

Wordpress does not check for invalid Unicode characters, such as the following:

U+FFFE
U+FFFF

When the pages are served up as XHTML, allowing these characters through generates an XML error.

WordPress should filter out illegal Unicode code points.

Please see http://www.w3.org/TR/REC-xml/#NT-Char

Also, the regex
here is
incorrect, see [http://intertwingly.net/blog/2008/01/02/Keeping-On-Your-Toes
this page].

Attachments (6)

bug5998.patch (2.1 KB) - added by schiller 7 years ago.
Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)
comment.php.diff (1.1 KB) - added by dwright 6 years ago.
wp-includes/comment.php.diff try php xml parser load, bails on non valid utf8
5998.patch (4.6 KB) - added by hakre 6 years ago.
correct working utf8 validation functions
5998-comment-php.3.patch (1.2 KB) - added by hakre 6 years ago.
refreshed against trunk
5998.2.patch (7.5 KB) - added by hakre 6 years ago.
5998-interrim.patch (1.9 KB) - added by hakre 5 years ago.
Formattings and wrong description in docblock (r2)

Download all attachments as: .zip

Change History (36)

comment:1 @shelleyp7 years ago

This impacts on comments, ping backs, as well as search.

comment:2 @rubys7 years ago

  • Cc rubys added

comment:3 follow-up: @schiller7 years ago

  • Cc schiller added

@schiller7 years ago

Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)

comment:4 @bertilow7 years ago

I think a case can be made for this not being WordPress's job at all. The checking should be done in MySQL. If the database has been correctly set to UTF-8, no invalid characters should ever be stored in the database in the first place. If they're not stored, then they will not show up or do damage.

It would still be wise the check for them though, just in case. But perhaps this bug should be forwarded to MySQL.

comment:5 @Denis-de-Bernardy6 years ago

  • Component changed from General to Warnings/Notices
  • Owner anonymous deleted

comment:6 @peaceablewhale6 years ago

  • Keywords early added; unicode invalid xhtml removed

comment:7 @Denis-de-Bernardy6 years ago

  • Component changed from Warnings/Notices to Charset
  • Keywords needs-patch added; early removed
  • Owner set to hakre
  • Status changed from new to assigned

there are a could charset-related functions that should be used if this is still current, and enhanced if needed.

comment:8 @hakre6 years ago

  • Keywords reporter-feedback added

The question is where the invalid bytes have their origin in. Illigeal code-points should be checked against in the application inputs stream (e.g. request data or data from the database) before touching the output logic. having a filter for the output would be too late.

it would be good to know more about the originals reporter case. more details and maybe steps to reproduce.

comment:9 @codedread6 years ago

The reporter's case is the following:

Turn on true XHTML serving in WordPress (i.e. serving the application/xhtml+xml MIME type). Malicious commenters come in and inject invalid code points into a comment on the blog. WordPress does not handle some invalid code points (U+FFFE and U+FFFF specifically) so the result is that the blog now has a "Yellow Screen Of Death".

comment:10 @hakre6 years ago

  • Component changed from Charset to Security
  • Severity changed from normal to critical

Looks like missing input validation for the comments. This leaves WordPress open to encoding related attacks btw. The "Yellow Screen Of Death" is a minor problem then.

Thanks for reporting and providing the information.

comment:11 @hakre6 years ago

  • Keywords reporter-feedback removed

comment:12 @codedread6 years ago

Note that the patch I provided does protect against the malicious characters in question, but I do not believe it does it at the right place - it only checks comments via the web form and not all possible means (not trackbacks, search queries).

comment:13 @hakre6 years ago

Technically, the input (this time a post request) data's encoding must be checked against the form encoding. I must check the default theme what is fitting here, I would assume this is the blog charset option.

So checking for valid UTF8 is only fitting for UTF8 blogs (naturally). I think it is a good Idea to have at least UTF8 input validation as well as latin1. I dunno wehter other charset are officially supported by wordpress or not.

I do not know as well wether or not other input data is properly validated in terms of the encoding.

comment:14 @dwright6 years ago

  • Cc david_v_wright@… added

In regards to the original reporter's case: http://core.trac.wordpress.org/ticket/5998#comment:9

It could be strongly argued (http://www.webdevout.net/articles/beware-of-xhtml) that it is not a good idea to turn on "true XHTML serving in WordPress" but if this is a valid request, it makes sense to resolve it.

This case is specific to XHTML/XML and UTF-8 and the fact that Firefox (for example) will 'choke' on non valid code points. (example  )

It makes sense to me, that since this related to XHTML/XML, handle it as such.

(although, is it acceptable to assume that the php xml parser will be avail in standard Wordpress/php setups?)

I am including a patch. (for comment's only, as that is what is specified in the original report)

btw: text/html mode is not effected by this issue. (current input filtering handles it)

@dwright6 years ago

wp-includes/comment.php.diff try php xml parser load, bails on non valid utf8

comment:15 @ryan6 years ago

  • Milestone changed from 2.9 to Future Release

comment:16 in reply to: ↑ 3 @hakre6 years ago

  • Milestone changed from Future Release to 3.0

Replying to schiller:
Can you please add more information where you got that corrected regex from? Have you written it on your own? Has it been reviewed? I would love to see more info about it. Thanks for the other references!


UTF-8 Regex documentation page (for the log)

comment:17 @hakre6 years ago

  • Component changed from Security to Charset

While reviewing the ticket it came to my attentation that not both existing patches do apply clean any longer (not a big deal here).

I suggest to merge them both, offer correct working routines (first patch is UTF8 centric, I have plans to extend that for the XML wellformatness as well).

From those routines other parts of the software can benefit as well ( wp_check_invalid_utf8() should get some care an benefit ) so I strongly suggest to have them in. We do not have a solid and safe working utf-8 validation function right now.

comment:18 @hakre6 years ago

wp_check_invalid_utf8() related: #11175

comment:19 @hakre6 years ago

Related: #10543

@hakre6 years ago

correct working utf8 validation functions

@hakre6 years ago

refreshed against trunk

comment:20 @hakre6 years ago

Merged patch. Refactored the xml-wellformed-validation into is_wellformed_xml() (file: formatting.php). Fixed another documentation issue in seems_utf8() and a bug in is_valid_utf8_statemachine().

@hakre6 years ago

comment:21 @hakre6 years ago

wp_die should provide better status code on user errors: #11286

comment:22 @Denis-de-Bernardy6 years ago

per remark in a separate ticket, is_valid_utf8_preg5998 has a problem with its regex. it's missing a delimiter and it could use (?: instead of ( in order to reduce backtracing.

comment:23 @hakre6 years ago

Thanks for pointing to that, it would not be needed anyway and can be kicked out of the patch.

comment:24 @rubys6 years ago

  • Cc rubys removed

comment:25 @nacin5 years ago

  • Keywords early added
  • Milestone changed from 3.0 to 3.1
  • Severity changed from critical to major

comment:26 @hakre5 years ago

Last patch wasn't 100% strict / double checked at all if I remember correctly. This needs a certain level of strictness in coding.

But I would suggest some minor code-changes (formatting) directly. I'll add another patch.

@hakre5 years ago

Formattings and wrong description in docblock (r2)

comment:27 @hakre5 years ago

Related: #13418

comment:28 @nacin5 years ago

  • Keywords early removed
  • Milestone changed from Awaiting Triage to Future Release

comment:29 @hakre4 years ago

hmm. I thought the fix is easy.

@Nacin: You have an idea to which PHP version the fix should go back to?

comment:30 @chriscct710 months ago

  • Keywords needs-refresh added; needs-patch removed
Note: See TracTickets for help on using tickets.