Opened 13 years ago
Last modified 6 years ago
#20368 new defect (bug)
htmlspecialchars() returns empty string for non-UTF-8 input in PHP 5.4
| Reported by: | convissor | Owned by: | |
|---|---|---|---|
| Milestone: | | Priority: | normal |
| Severity: | major | Version: | |
| Component: | Formatting | Keywords: | needs-patch needs-unit-tests |
| Focuses: | | Cc: | |
Description
The default value of the $encoding parameter for htmlspecialchars() changed to UTF-8 in PHP 5.4. The prior default was ISO-8859-1. The function's UTF-8 handler validates the input and returns an empty string if the input isn't valid UTF-8.
WordPress will see the UTF-8 validator kicking in because most of the htmlspecialchars() calls don't pass the $encoding parameter. This will cause major problems for sites that have a DB_CHARSET other than utf8.
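A minimal sketch of the failure, assuming a string stored as ISO-8859-1 (the variable names are illustrative only). Passing UTF-8 explicitly reproduces what the PHP 5.4 default does when $encoding is omitted:

```php
<?php
// "café" encoded as ISO-8859-1: the byte 0xE9 alone is not valid UTF-8.
$latin1 = "caf\xE9";

// Same effect as the PHP 5.4 default: the UTF-8 validator rejects the input.
$as_utf8 = htmlspecialchars($latin1, ENT_COMPAT, 'UTF-8');

// Naming the real encoding lets the string through unchanged.
$as_latin1 = htmlspecialchars($latin1, ENT_COMPAT, 'ISO-8859-1');

var_dump($as_utf8 === '');       // bool(true)
var_dump($as_latin1 === $latin1); // bool(true)
```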
Posting 58859 to php-internals by Rasmus gives a clear example of the problem. Here is a link to view the whole thread (starting with posting 58853).
One approach for resolving this problem is to create two centralized functions. This route is simpler and easier to maintain than adding the parameters to each htmlspecialchars() call throughout the code base.
wp_hsc_db() for safely displaying database results. Uses DB_CHARSET to calculate the appropriate $encoding parameter. MySQL's character set names are not equivalent to the values PHP is looking for in the $encoding parameter. Please see the hsc_db() method in the Login Security Solution plugin for a mapping of the valid options.
wp_hsc_utf8() for safely displaying strings known to be saved as UTF-8, such as error messages written in core. Uses UTF-8 as the $encoding parameter.
Some calls in core use the $flags parameter, so these new functions will need the parameter too. The default should be ENT_COMPAT, which works under PHP 5.2, 5.3 and 5.4.
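The two proposed helpers and their $flags default could be sketched roughly as follows. This is not a patch: example_php_encoding_for() is a hypothetical name, and its mapping is a small illustrative subset chosen for this sketch, not the table from the Login Security Solution plugin. Falling back to UTF-8 for unknown charsets is also an assumption.

```php
<?php
// Hypothetical helper: translate a MySQL charset name into a name that
// htmlspecialchars() accepts. Illustrative subset only (an assumption).
function example_php_encoding_for($db_charset) {
    $map = array(
        'utf8'    => 'UTF-8',
        'utf8mb4' => 'UTF-8',
        'latin1'  => 'ISO-8859-1', // MySQL's latin1 is close to cp1252; simplified here
        'cp1251'  => 'cp1251',
        'koi8r'   => 'KOI8-R',
        'sjis'    => 'Shift_JIS',
        'ujis'    => 'EUC-JP',
    );
    $key = strtolower($db_charset);
    // Assumed fallback for charsets missing from the map.
    return isset($map[$key]) ? $map[$key] : 'UTF-8';
}

// Escape a value that came from the database, using DB_CHARSET.
function wp_hsc_db($string, $flags = ENT_COMPAT) {
    $charset = defined('DB_CHARSET') ? DB_CHARSET : 'utf8';
    return htmlspecialchars($string, $flags, example_php_encoding_for($charset));
}

// Escape a string known to be UTF-8, such as a message written in core.
function wp_hsc_utf8($string, $flags = ENT_COMPAT) {
    return htmlspecialchars($string, $flags, 'UTF-8');
}
```

With DB_CHARSET set to latin1, wp_hsc_db("caf\xE9 <b>") would escape the angle brackets and keep the 0xE9 byte instead of returning an empty string.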
It may be suggested that WP use htmlspecialchars()'s auto-detection option (by passing an empty string to the $encoding parameter). This is not advisable because it can produce inconsistent behavior. Even the PHP manual says this route is not recommended.
Attachments (1)
Change History (12)
#6
@
11 years ago
I think the scope of the problem is bigger than described above. Non-ASCII data arriving at htmlspecialchars() is usually supplied by the user in the character encoding specified in the HTML headers. This means, for example, you could set the site encoding to iso-8859-1, submit a copyright character through a form, and that's all it takes to break the system.
#7
@
11 years ago
This is fixed in PHP 5.6 according to http://us3.php.net/htmlspecialchars
@convissor, thinking about another solution:
It seems that this only affects data coming from the database. What if the data was converted to UTF-8 when it was fetched from the database?
http://core.trac.wordpress.org/browser/tags/3.3.2/wp-includes/wp-db.php#L1115
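A rough sketch of the conversion idea raised in this comment, assuming the source encoding is known and that either mbstring or iconv is available. example_to_utf8() is a hypothetical name, not an existing wpdb method:

```php
<?php
// Hypothetical helper: convert a fetched value to UTF-8 before escaping it.
function example_to_utf8($value, $from_encoding) {
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($value, 'UTF-8', $from_encoding);
    }
    // Fallback when mbstring is not loaded.
    return iconv($from_encoding, 'UTF-8', $value);
}

// Pretend this row value came from a latin1 database.
$fetched = "caf\xE9 <b>";

// 0xE9 becomes the two-byte UTF-8 sequence 0xC3 0xA9.
$utf8 = example_to_utf8($fetched, 'ISO-8859-1');

// The UTF-8 validator now accepts the input.
$escaped = htmlspecialchars($utf8, ENT_COMPAT, 'UTF-8');
```

The trade-off is that the page itself would then have to be served as UTF-8, or the data converted back before output.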