Opened 5 years ago
Last modified 2 years ago
#5998 assigned defect (bug)
Invalid Unicode characters
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Priority: | normal | Milestone: | Future Release |
| Component: | Charset | Version: | 2.3.3 |
| Severity: | major | Keywords: | needs-patch |
| Cc: | schiller, david_v_wright@… |
Description
Wordpress does not check for invalid Unicode characters, such as the following:
U+FFFE
U+FFFF
When the pages are served up as XHTML, allowing these characters through generates an XML error.
WordPress should filter out illegal Unicode code points.
Please see http://www.w3.org/TR/REC-xml/#NT-Char
Also, the regex
here is
incorrect, see [http://intertwingly.net/blog/2008/01/02/Keeping-On-Your-Toes
this page].
Attachments (6)
Change History (35)
Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)
I think a case can be made for this not being WordPress's job at all. The checking should be done in MySQL. If the database has been correctly set to UTF-8, no invalid characters should ever be stored in the database in the first place. If they're not stored, then they will not show up or do damage.
It would still be wise the check for them though, just in case. But perhaps this bug should be forwarded to MySQL.
- Component changed from General to Warnings/Notices
- Owner anonymous deleted
comment:6
peaceablewhale — 4 years ago
- Keywords early added; unicode invalid xhtml removed
- Component changed from Warnings/Notices to Charset
- Keywords needs-patch added; early removed
- Owner set to hakre
- Status changed from new to assigned
there are a could charset-related functions that should be used if this is still current, and enhanced if needed.
- Keywords reporter-feedback added
The question is where the invalid bytes have their origin in. Illigeal code-points should be checked against in the application inputs stream (e.g. request data or data from the database) before touching the output logic. having a filter for the output would be too late.
it would be good to know more about the originals reporter case. more details and maybe steps to reproduce.
The reporter's case is the following:
Turn on true XHTML serving in WordPress (i.e. serving the application/xhtml+xml MIME type). Malicious commenters come in and inject invalid code points into a comment on the blog. WordPress does not handle some invalid code points (U+FFFE and U+FFFF specifically) so the result is that the blog now has a "Yellow Screen Of Death".
comment:10
hakre — 4 years ago
- Component changed from Charset to Security
- Severity changed from normal to critical
Looks like missing input validation for the comments. This leaves WordPress open to encoding related attacks btw. The "Yellow Screen Of Death" is a minor problem then.
Thanks for reporting and providing the information.
comment:11
hakre — 4 years ago
- Keywords reporter-feedback removed
comment:12
codedread — 4 years ago
Note that the patch I provided does protect against the malicious characters in question, but I do not believe it does it at the right place - it only checks comments via the web form and not all possible means (not trackbacks, search queries).
comment:13
hakre — 4 years ago
Technically, the input (this time a post request) data's encoding must be checked against the form encoding. I must check the default theme what is fitting here, I would assume this is the blog charset option.
So checking for valid UTF8 is only fitting for UTF8 blogs (naturally). I think it is a good Idea to have at least UTF8 input validation as well as latin1. I dunno wehter other charset are officially supported by wordpress or not.
I do not know as well wether or not other input data is properly validated in terms of the encoding.
comment:14
dwright — 4 years ago
- Cc david_v_wright@… added
In regards to the original reporter's case: http://core.trac.wordpress.org/ticket/5998#comment:9
It could be strongly argued (http://www.webdevout.net/articles/beware-of-xhtml) that it is not a good idea to turn on "true XHTML serving in WordPress" but if this is a valid request, it makes sense to resolve it.
This case is specific to XHTML/XML and UTF-8 and the fact that Firefox (for example) will 'choke' on non valid code points. (example  )
It makes sense to me, that since this related to XHTML/XML, handle it as such.
(although, is it acceptable to assume that the php xml parser will be avail in standard Wordpress/php setups?)
I am including a patch. (for comment's only, as that is what is specified in the original report)
btw: text/html mode is not effected by this issue. (current input filtering handles it)
comment:15
ryan — 4 years ago
- Milestone changed from 2.9 to Future Release
comment:16
in reply to:
↑ 3
hakre — 3 years ago
- Milestone changed from Future Release to 3.0
Replying to schiller:
Can you please add more information where you got that corrected regex from? Have you written it on your own? Has it been reviewed? I would love to see more info about it. Thanks for the other references!
UTF-8 Regex documentation page (for the log)
comment:17
hakre — 3 years ago
- Component changed from Security to Charset
While reviewing the ticket it came to my attentation that not both existing patches do apply clean any longer (not a big deal here).
I suggest to merge them both, offer correct working routines (first patch is UTF8 centric, I have plans to extend that for the XML wellformatness as well).
From those routines other parts of the software can benefit as well ( wp_check_invalid_utf8() should get some care an benefit ) so I strongly suggest to have them in. We do not have a solid and safe working utf-8 validation function right now.
comment:18
hakre — 3 years ago
wp_check_invalid_utf8() related: #11175
comment:19
hakre — 3 years ago
Related: #10543
comment:20
hakre — 3 years ago
Merged patch. Refactored the xml-wellformed-validation into is_wellformed_xml() (file: formatting.php). Fixed another documentation issue in seems_utf8() and a bug in is_valid_utf8_statemachine().
comment:21
hakre — 3 years ago
wp_die should provide better status code on user errors: #11286
per remark in a separate ticket, is_valid_utf8_preg5998 has a problem with its regex. it's missing a delimiter and it could use (?: instead of ( in order to reduce backtracing.
comment:23
hakre — 3 years ago
Thanks for pointing to that, it would not be needed anyway and can be kicked out of the patch.
comment:24
rubys — 3 years ago
- Cc rubys removed
comment:25
nacin — 3 years ago
- Keywords early added
- Milestone changed from 3.0 to 3.1
- Severity changed from critical to major
comment:26
hakre — 3 years ago
Last patch wasn't 100% strict / double checked at all if I remember correctly. This needs a certain level of strictness in coding.
But I would suggest some minor code-changes (formatting) directly. I'll add another patch.
comment:27
hakre — 3 years ago
Related: #13418
comment:28
nacin — 3 years ago
- Keywords early removed
- Milestone changed from Awaiting Triage to Future Release
comment:29
hakre — 2 years ago
hmm. I thought the fix is easy.
@Nacin: You have an idea to which PHP version the fix should go back to?

This impacts on comments, ping backs, as well as search.