Opened 14 months ago
Last modified 3 months ago
#20368 new defect (bug)
htmlspecialchars() returns empty string for non-UTF-8 input in PHP 5.4
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | normal | Milestone: | Awaiting Review |
| Component: | General | Version: | |
| Severity: | major | Keywords: | |
| Cc: | info@…, kpayne@…, jkudish |
Description
The default value of the input $encoding parameter for htmlspecialchars() changed to UTF-8 in PHP 5.4. The prior default was ISO-8859-1. The function's UTF-8 handler checks the input, returning an empty string if the input isn't valid UTF-8.
WordPress will trigger the UTF-8 validation because most of its htmlspecialchars() calls don't pass the $encoding parameter. This will cause major problems for sites that have a DB_CHARSET other than utf8.
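A minimal sketch of the failure, assuming a string that was stored as ISO-8859-1. The encoding and flags are passed explicitly here so the result does not depend on the PHP version's defaults; passing 'UTF-8' stands in for the PHP 5.4 default. The sample string is made up for illustration.

```php
<?php
// "café <b>" encoded as ISO-8859-1. The byte 0xE9 on its own is not valid UTF-8.
$latin1 = "caf\xE9 <b>";

// Explicit 'UTF-8' mirrors what PHP 5.4 does when $encoding is omitted.
$as_utf8 = htmlspecialchars( $latin1, ENT_COMPAT, 'UTF-8' );

// Explicit 'ISO-8859-1' mirrors the pre-5.4 default.
$as_latin1 = htmlspecialchars( $latin1, ENT_COMPAT, 'ISO-8859-1' );

var_dump( $as_utf8 === '' );                      // bool(true): the whole string is lost
var_dump( $as_latin1 === "caf\xE9 &lt;b&gt;" );   // bool(true): escaped as expected
```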
Posting 58859 to php-internals by Rasmus gives a clear example of the problem. (Here is a link to view the whole thread, starting with posting 58853.)
Creating two centralized functions is one approach to resolving this problem. This route is simpler and easier to maintain than adding the parameters to each htmlspecialchars() call throughout the code base.
- wp_hsc_db() for safely displaying database results. Uses DB_CHARSET to calculate the appropriate $encoding parameter. MySQL's character set names are not equivalent to the values PHP is looking for in the $encoding parameter. Please see the hsc_db() method in the Login Security Solution plugin for a mapping of the valid options.
- wp_hsc_utf8() for safely displaying strings known to be saved as UTF-8, such as error messages written in core. Uses UTF-8 as the $encoding parameter.
Some calls in core use the $flags parameter, so these new functions will need the parameter too. The default should be ENT_COMPAT, which works under PHP 5.2, 5.3 and 5.4.
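The two proposed functions could look roughly like the sketch below. This is not a patch: the helper wp_hsc_encoding() is a hypothetical name, the charset map is a partial illustration rather than the mapping from the Login Security Solution plugin, and falling back to UTF-8 for unmapped charsets is an assumption. DB_CHARSET is defined here only so the snippet runs outside of WordPress.

```php
<?php
// Stand-in for the constant normally set in wp-config.php (demo only).
if ( ! defined( 'DB_CHARSET' ) ) {
	define( 'DB_CHARSET', 'latin1' );
}

/**
 * Hypothetical helper: translate a MySQL character set name into a value
 * htmlspecialchars() accepts for $encoding. Partial, illustrative map.
 */
function wp_hsc_encoding( $mysql_charset ) {
	$map = array(
		'utf8'    => 'UTF-8',
		'utf8mb4' => 'UTF-8',
		'latin1'  => 'cp1252', // Assumption: MySQL's latin1 is closest to Windows-1252.
		'cp1251'  => 'cp1251',
		'koi8r'   => 'KOI8-R',
		'big5'    => 'BIG5',
		'gb2312'  => 'GB2312',
		'sjis'    => 'Shift_JIS',
		'ujis'    => 'EUC-JP',
	);
	$key = strtolower( $mysql_charset );
	// Assumption: unmapped charsets fall back to UTF-8.
	return isset( $map[ $key ] ) ? $map[ $key ] : 'UTF-8';
}

// For strings that come out of the database.
function wp_hsc_db( $string, $flags = ENT_COMPAT ) {
	return htmlspecialchars( $string, $flags, wp_hsc_encoding( DB_CHARSET ) );
}

// For strings known to be UTF-8, such as messages written in core.
function wp_hsc_utf8( $string, $flags = ENT_COMPAT ) {
	return htmlspecialchars( $string, $flags, 'UTF-8' );
}

echo wp_hsc_utf8( 'Error: <unknown> & "unexpected"' ), "\n";
// Error: &lt;unknown&gt; &amp; &quot;unexpected&quot;
```

Callers that currently pass $flags would keep doing so, for example wp_hsc_db( $title, ENT_QUOTES ).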
It may be suggested that WP use htmlspecialchars()'s auto-detection option (by passing an empty string to the $encoding parameter). This is not advisable because it can produce inconsistent behavior. Even the PHP manual says this route is not recommended.

@convissor, thinking about another solution:
It seems that this only affects data coming from the database. What if the data was converted to UTF-8 when it was fetched from the database?
http://core.trac.wordpress.org/browser/tags/3.3.2/wp-includes/wp-db.php#L1115
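A rough sketch of that idea, done on the PHP side with iconv(). The function name wp_row_to_utf8() and the sample row are hypothetical, and the source encoding is passed in by hand here; a real version would have to derive it from DB_CHARSET.

```php
<?php
/**
 * Hypothetical helper: convert every string field of a fetched row to UTF-8.
 */
function wp_row_to_utf8( array $row, $from_encoding ) {
	foreach ( $row as $key => $value ) {
		if ( is_string( $value ) ) {
			$converted = iconv( $from_encoding, 'UTF-8', $value );
			// iconv() returns false on failure; keep the original value then.
			if ( false !== $converted ) {
				$row[ $key ] = $converted;
			}
		}
	}
	return $row;
}

// Made-up row with an ISO-8859-1 encoded title.
$row = array( 'ID' => 7, 'post_title' => "caf\xE9 <b>" );
$row = wp_row_to_utf8( $row, 'ISO-8859-1' );

// The title is now valid UTF-8, so the UTF-8 handler no longer rejects it.
$escaped = htmlspecialchars( $row['post_title'], ENT_COMPAT, 'UTF-8' );
var_dump( $escaped === "caf\xC3\xA9 &lt;b&gt;" ); // bool(true)
```

With this approach the existing htmlspecialchars() calls could stay as they are under PHP 5.4, at the cost of converting every row on fetch.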