Opened 17 years ago
Closed 9 years ago
#5998 closed defect (bug) (wontfix)
Invalid Unicode characters
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 2.3.3 |
Component: | Charset | Keywords: | |
Focuses: | Cc: |
Description
Wordpress does not check for invalid Unicode characters, such as the following:
U+FFFE
U+FFFF
When the pages are served up as XHTML, allowing these characters through generates an XML error.
WordPress should filter out illegal Unicode code points.
Please see http://www.w3.org/TR/REC-xml/#NT-Char
Also, the regex
here is
incorrect, see [http://intertwingly.net/blog/2008/01/02/Keeping-On-Your-Toes
this page].
Attachments (6)
Change History (38)
@
17 years ago
Patch. Assumes UTF-8. Only handles comment submission (not trackbacks, search queries, etc)
#4
@
17 years ago
I think a case can be made for this not being WordPress's job at all. The checking should be done in MySQL. If the database has been correctly set to UTF-8, no invalid characters should ever be stored in the database in the first place. If they're not stored, then they will not show up or do damage.
It would still be wise the check for them though, just in case. But perhaps this bug should be forwarded to MySQL.
#7
@
16 years ago
- Component changed from Warnings/Notices to Charset
- Keywords needs-patch added; early removed
- Owner set to hakre
- Status changed from new to assigned
there are a could charset-related functions that should be used if this is still current, and enhanced if needed.
#8
@
16 years ago
- Keywords reporter-feedback added
The question is where the invalid bytes have their origin in. Illigeal code-points should be checked against in the application inputs stream (e.g. request data or data from the database) before touching the output logic. having a filter for the output would be too late.
it would be good to know more about the originals reporter case. more details and maybe steps to reproduce.
#9
@
16 years ago
The reporter's case is the following:
Turn on true XHTML serving in WordPress (i.e. serving the application/xhtml+xml MIME type). Malicious commenters come in and inject invalid code points into a comment on the blog. WordPress does not handle some invalid code points (U+FFFE and U+FFFF specifically) so the result is that the blog now has a "Yellow Screen Of Death".
#10
@
16 years ago
- Component changed from Charset to Security
- Severity changed from normal to critical
Looks like missing input validation for the comments. This leaves WordPress open to encoding related attacks btw. The "Yellow Screen Of Death" is a minor problem then.
Thanks for reporting and providing the information.
#12
@
16 years ago
Note that the patch I provided does protect against the malicious characters in question, but I do not believe it does it at the right place - it only checks comments via the web form and not all possible means (not trackbacks, search queries).
#13
@
16 years ago
Technically, the input (this time a post request) data's encoding must be checked against the form encoding. I must check the default theme what is fitting here, I would assume this is the blog charset option.
So checking for valid UTF8 is only fitting for UTF8 blogs (naturally). I think it is a good Idea to have at least UTF8 input validation as well as latin1. I dunno wehter other charset are officially supported by wordpress or not.
I do not know as well wether or not other input data is properly validated in terms of the encoding.
#14
@
15 years ago
- Cc david_v_wright@… added
In regards to the original reporter's case: http://core.trac.wordpress.org/ticket/5998#comment:9
It could be strongly argued (http://www.webdevout.net/articles/beware-of-xhtml) that it is not a good idea to turn on "true XHTML serving in WordPress" but if this is a valid request, it makes sense to resolve it.
This case is specific to XHTML/XML and UTF-8 and the fact that Firefox (for example) will 'choke' on non valid code points. (example  )
It makes sense to me, that since this related to XHTML/XML, handle it as such.
(although, is it acceptable to assume that the php xml parser will be avail in standard Wordpress/php setups?)
I am including a patch. (for comment's only, as that is what is specified in the original report)
btw: text/html mode is not effected by this issue. (current input filtering handles it)
#16
in reply to:
↑ 3
@
15 years ago
- Milestone changed from Future Release to 3.0
Replying to schiller:
Can you please add more information where you got that corrected regex from? Have you written it on your own? Has it been reviewed? I would love to see more info about it. Thanks for the other references!
UTF-8 Regex documentation page (for the log)
#17
@
15 years ago
- Component changed from Security to Charset
While reviewing the ticket it came to my attentation that not both existing patches do apply clean any longer (not a big deal here).
I suggest to merge them both, offer correct working routines (first patch is UTF8 centric, I have plans to extend that for the XML wellformatness as well).
From those routines other parts of the software can benefit as well ( wp_check_invalid_utf8() should get some care an benefit ) so I strongly suggest to have them in. We do not have a solid and safe working utf-8 validation function right now.
#20
@
15 years ago
Merged patch. Refactored the xml-wellformed-validation into is_wellformed_xml() (file: formatting.php). Fixed another documentation issue in seems_utf8() and a bug in is_valid_utf8_statemachine().
#22
@
15 years ago
per remark in a separate ticket, is_valid_utf8_preg5998 has a problem with its regex. it's missing a delimiter and it could use (?: instead of ( in order to reduce backtracing.
#23
@
15 years ago
Thanks for pointing to that, it would not be needed anyway and can be kicked out of the patch.
#25
@
15 years ago
- Keywords early added
- Milestone changed from 3.0 to 3.1
- Severity changed from critical to major
#26
@
15 years ago
Last patch wasn't 100% strict / double checked at all if I remember correctly. This needs a certain level of strictness in coding.
But I would suggest some minor code-changes (formatting) directly. I'll add another patch.
#29
@
14 years ago
hmm. I thought the fix is easy.
@Nacin: You have an idea to which PHP version the fix should go back to?
#32
@
9 years ago
- Keywords needs-refresh removed
- Milestone Future Release deleted
- Resolution set to wontfix
- Status changed from assigned to closed
Invalid UTF-8 is handled by the commits attached to #21212.
This was more of an issue when latin1
was more common, but the vast majority of WordPress sites use utf8
or utf8mb4
in MySQL now, so invalid characters won't be stored in the DB.
This impacts on comments, ping backs, as well as search.