WordPress.org

Make WordPress Core

Opened 14 months ago

Closed 10 months ago

Last modified 10 months ago

#23688 closed defect (bug) (fixed)

esc_textarea, wp_richedit_pre and wp_htmledit_pre eat post content under PHP 5.4

Reported by: westi Owned by: westi
Milestone: 3.6 Priority: high
Severity: blocker Version: 3.6
Component: Formatting Keywords: needs-patch
Focuses: Cc:

Description

Because of a change in default behaviour in htmlspecialchars in PHP5.4 it is possible for these three functions to eat perfectly valid post content and make it impossible to edit existing posts.

Scenario:

  • blog_charset is ISO-8859-1
  • Post contains some 8bit characters

You try and edit the post and instead of the post content you are presented with a blank editor :(

On the front end the posts display fine.

The underlying cause it this change in htmlspecialchars

"5.4.0 The default value for the encoding parameter was changed to UTF-8."

Because the string is not a valid UTF-8 sequence an empty string is returned :(

Related to #20368

Attachments (5)

htmlspecialchars-with-charset.diff (1.1 KB) - added by westi 14 months ago.
Pass the blog_charset to htmlspecialchars
23688.diff (1.6 KB) - added by ryan 10 months ago.
23688-ut.diff (880 bytes) - added by ryan 10 months ago.
23688.2.diff (1.8 KB) - added by ryan 10 months ago.
23688-ut.2.diff (1.5 KB) - added by ryan 10 months ago.

Download all attachments as: .zip

Change History (20)

westi14 months ago

Pass the blog_charset to htmlspecialchars

comment:1 jkudish14 months ago

  • Cc jkudish added

comment:2 dllh13 months ago

  • Cc daryl@… added

comment:3 westi13 months ago

In 1243/tests:

Formatting: New test cases for a few core formatting functions which call htmlspecialchars without providing any charset info.

As of PHP5.4 the default charset for htmlspecialchars is changed to UTF-8 which means that iso-8859-1 data gets eaten.

See: #23688

comment:4 westi13 months ago

  • Owner set to westi
  • Resolution set to fixed
  • Status changed from new to closed

In 23685:

Formatting: Pass the blog charset to htmlspecialchars so that we don't eat non-UTF8 strings in PHP 5.4

Fixes #23688

comment:5 nacin13 months ago

  • Resolution fixed deleted
  • Severity changed from major to blocker
  • Status changed from closed to reopened

utf-8, UTF8 and utf8 are all semi-valid, according to core, but may crash and burn here. Another charset like UTF-16 could also presumably generate warnings and such.

comment:6 follow-up: azaozz13 months ago

Looks like we will need to normalize get_option( 'blog_charset' ) before using it here and limit it to the supported values.

Another case that fails is when the user pastes content with different encoding into the editor (Visual or Text). Then it can contain invalid code sequences and htmlspecialchars() will return an empty string. Perhaps will need to use ENT_SUBSTITUTE or ENT_DISALLOWED in PHP 5.4 to work around this (pending better documentation of how exactly these flags work, see http://php.net/manual/en/function.htmlspecialchars.php).

comment:7 in reply to: ↑ 6 trevHCS12 months ago

Can I just point out the ENT_DISALLOWED does not appear to work properly in this situation as it still gives a blank string and thus empty editing box.

The ENT_SUBSTITUTE does work albeit with odd output, eg: "You’ll" gets converted to "You�ll" when the text encoding is ISO-8859-1 which is better as an intermediate fix.

Last edited 12 months ago by trevHCS (previous) (diff)

comment:9 SergeyBiryukov12 months ago

  • Keywords needs-patch added; has-patch removed

comment:10 follow-up: azaozz12 months ago

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

comment:11 in reply to: ↑ 10 trevHCS12 months ago

Replying to azaozz:

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

I am wondering whether normalising the blog_charset would maybe be better done within get_option() rather than a specific new function?

Reasoning being there are 4x PHP 5.4 functions where this change to UTF-8 was made:

htmlspecialchars()
htmlentities()
html_entity_decode()
get_html_translation_table() [not sure this is used in WP]

There are already scripts within WP core which use the proposed get_option('blog_charset') fix as documented below, so fixing it in a new function wouldn't actually help those if the charset was incorrect.

Scripts found to already use the fix:

  • default-widgets.php -> function widget() -> html_entity_decode()
  • feed.php -> get_the_category_rss() -> html_entity_decode()

This doesn't remove the need to update the code to use the get_option('blog_charset') within any of the above function calls, but it would seem to me at least that fixing it once would be easier than fixing it lots of times?

Fixing it in get_options() would also fix any errors in HTML headers, albeit that is outside the remit of this ticket.

Last edited 12 months ago by trevHCS (previous) (diff)

comment:12 WraithKenny10 months ago

  • Cc Ken@… added

ryan10 months ago

ryan10 months ago

comment:13 aaroncampbell10 months ago

32688.diff seems reasonable to me. If there are others that need to be fixed int he future, we can just add them there.

ryan10 months ago

ryan10 months ago

comment:14 ryan10 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 24510:

Normalize the UTF-8 and ISO-8859-1 charset strings stored in blog_charset to make them friendlier with PHP functions that accept a charset such as htmlspecialchars().

fixes #23688

comment:15 ryan10 months ago

In 1298/tests:

Tests for charset string normalization. see #23688

Note: See TracTickets for help on using tickets.