Make WordPress Core

Opened 12 years ago

Last modified 5 years ago

#20368 new defect (bug)

htmlspecialchars() returns empty string for non-UTF-8 input in PHP 5.4

Reported by: convissor Owned by:
Milestone: Priority: normal
Severity: major Version:
Component: Formatting Keywords: needs-patch needs-unit-tests
Focuses: Cc:

Description

The default value of the input $encoding parameter for htmlspecialchars() changed to UTF-8 in PHP 5.4. The prior default was ISO-8859-1. The function's UTF-8 handler checks the input, returning an empty string if the input isn't valid UTF-8.
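A minimal sketch of the behavior described above, assuming PHP 5.4 or later. The sample string is made up for illustration, and the flags and encoding are passed explicitly so the result does not depend on which defaults the running PHP version applies:

```php
<?php
// "\xA9" is the copyright sign in ISO-8859-1. As a lone byte it is not valid UTF-8.
$latin1 = "\xA9 2012 <Example>";

// Same arguments as the PHP 5.4 defaults: ENT_COMPAT and UTF-8.
$as_utf8 = htmlspecialchars( $latin1, ENT_COMPAT, 'UTF-8' );

// Naming the encoding the bytes are really in.
$as_latin1 = htmlspecialchars( $latin1, ENT_COMPAT, 'ISO-8859-1' );

var_dump( $as_utf8 === '' );                            // bool(true): the whole string is dropped
var_dump( $as_latin1 === "\xA9 2012 &lt;Example&gt;" ); // bool(true): escaped as expected
```

Note that the first call loses the entire string, not just the offending byte.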

WordPress will see the UTF-8 validator kicking in because most of the htmlspecialchars() calls don't pass the $encoding parameter. This will cause major problems for sites that have a DB_CHARSET other than utf8.

Posting 58859 to php-internals by Rasmus gives a clear example of the problem. (The whole thread starts with posting 58853.)

One approach to resolving this problem is to create two centralized functions. This route is simpler and easier to maintain than adding the parameters to each htmlspecialchars() call throughout the code base.

  1. wp_hsc_db() for safely displaying database results. Uses DB_CHARSET to determine the appropriate $encoding parameter. MySQL's character set names are not the same as the values PHP expects in the $encoding parameter, so a mapping is needed. Please see the hsc_db() method in the Login Security Solution plugin for a mapping of the valid options.
  2. wp_hsc_utf8() for safely displaying strings known to be saved as UTF-8, such as error messages written in core. Uses UTF-8 as the $encoding parameter.

Some calls in core use the $flags parameter, so these new functions will need the parameter too. The default should be ENT_COMPAT, which works under PHP 5.2, 5.3 and 5.4.
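The two proposed functions could look roughly like the sketch below. This is only an illustration under stated assumptions: wp_hsc_encoding_for() is a hypothetical helper name, the charset map is partial and is not the one from the Login Security Solution plugin, the fallback to UTF-8 for unknown charsets is a guess at a sensible default, and the optional third argument to wp_hsc_db() exists only so the function can be tried outside WordPress.

```php
<?php
/**
 * Hypothetical, partial map from MySQL character set names to encoding
 * names that htmlspecialchars() accepts. Illustrative only.
 */
function wp_hsc_encoding_for( $db_charset ) {
	$map = array(
		'utf8'    => 'UTF-8',
		'utf8mb4' => 'UTF-8',
		'latin1'  => 'ISO-8859-1', // Assumption: treat MySQL latin1 as ISO-8859-1.
		'cp1251'  => 'cp1251',
		'cp866'   => 'cp866',
		'koi8r'   => 'KOI8-R',
		'big5'    => 'BIG5',
		'gb2312'  => 'GB2312',
		'sjis'    => 'Shift_JIS',
		'ujis'    => 'EUC-JP',
	);
	$key = strtolower( $db_charset );

	// Assumption: fall back to UTF-8 when the charset is not in the map.
	return isset( $map[ $key ] ) ? $map[ $key ] : 'UTF-8';
}

/** Escape a string that came from the database. */
function wp_hsc_db( $string, $flags = ENT_COMPAT, $db_charset = null ) {
	if ( null === $db_charset ) {
		// DB_CHARSET is defined in wp-config.php on a real install.
		$db_charset = defined( 'DB_CHARSET' ) ? DB_CHARSET : 'utf8';
	}
	return htmlspecialchars( $string, $flags, wp_hsc_encoding_for( $db_charset ) );
}

/** Escape a string known to be UTF-8, such as a message written in core. */
function wp_hsc_utf8( $string, $flags = ENT_COMPAT ) {
	return htmlspecialchars( $string, $flags, 'UTF-8' );
}
```

With this in place, existing calls would be swapped for whichever wrapper matches the origin of the string, for example wp_hsc_db( $comment_text ) for a database value and wp_hsc_utf8( $message, ENT_QUOTES ) for a core string that needs single quotes escaped too.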

It may be suggested that WP use htmlspecialchars()'s auto-detection option (by passing an empty string to the $encoding parameter). This is not advisable because it can produce inconsistent behavior. Even the PHP manual says this route is not recommended.

Attachments (1)

20368.patch (2.0 KB) - added by kurtpayne 12 years ago.
Proof of concept - convert strings from database to UTF-8 on access


Change History (12)

#1 @toscho
12 years ago

  • Cc info@… added

#2 @kurtpayne
12 years ago

  • Cc kpayne@… added

@convissor, thinking about another solution:

It seems that this only affects data coming from the database. What if the data was converted to UTF-8 when it was fetched from the database?

http://core.trac.wordpress.org/browser/tags/3.3.2/wp-includes/wp-db.php#L1115
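A rough sketch of that idea, separate from the attached patch and not taken from it. The helper name hsc_row_to_utf8() is hypothetical, the row is assumed to be an associative array, and the source encoding is passed in by the caller rather than derived from DB_CHARSET. It relies on the iconv extension:

```php
<?php
/**
 * Hypothetical helper: convert every string field of a fetched row to UTF-8.
 * Non-string fields are left untouched.
 */
function hsc_row_to_utf8( array $row, $from_encoding ) {
	foreach ( $row as $key => $value ) {
		if ( ! is_string( $value ) ) {
			continue;
		}
		$converted = iconv( $from_encoding, 'UTF-8', $value );

		// iconv() returns false on failure; keep the original bytes in that case.
		if ( false !== $converted ) {
			$row[ $key ] = $converted;
		}
	}
	return $row;
}

$row = hsc_row_to_utf8(
	array( 'id' => 7, 'title' => "\xA9 2012" ), // ISO-8859-1 bytes, made-up sample row.
	'ISO-8859-1'
);

// The title is now valid UTF-8, so the UTF-8 handler accepts it.
var_dump( htmlspecialchars( $row['title'], ENT_COMPAT, 'UTF-8' ) === "\xC2\xA9 2012" ); // bool(true)
```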

@kurtpayne
12 years ago

Proof of concept - convert strings from database to UTF-8 on access

#3 @jkudish
12 years ago

  • Cc jkudish added

#5 @nacin
11 years ago

  • Component changed from General to Formatting

#6 @miqrogroove
10 years ago

I think the scope of the problem is bigger than described above. Non-ASCII data arriving at htmlspecialchars() is usually supplied by the user in the character encoding specified in the HTML headers. This means, for example, you could set the site encoding to iso-8859-1, submit a copyright char through a form, and that's all it takes to break the system.

#7 @miqrogroove
10 years ago

This is fixed in PHP 5.6 according to http://us3.php.net/htmlspecialchars
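For context, the manual's changelog says that as of PHP 5.6 the default $encoding is taken from the default_charset ini setting. A small sketch of what that allows, assuming PHP 5.6 or later with internal_encoding left unset. Since default_charset itself defaults to UTF-8, a non-UTF-8 site would still need to set it:

```php
<?php
// Assumption: PHP 5.6+ and no internal_encoding configured.
ini_set( 'default_charset', 'ISO-8859-1' );

// No $encoding argument: the default now follows default_charset.
$escaped = htmlspecialchars( "\xA9 <b>" );

var_dump( $escaped === "\xA9 &lt;b&gt;" ); // bool(true)
```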

#8 @miqrogroove
10 years ago

#28725 was marked as a duplicate.

#9 @miqrogroove
10 years ago

Likely duplicate #28725

#11 @chriscct7
9 years ago

  • Keywords needs-patch needs-unit-tests added