Make WordPress Core

Opened 12 years ago

Closed 11 years ago

Last modified 10 years ago

#23688 closed defect (bug) (fixed)

esc_textarea, wp_richedit_pre and wp_htmledit_pre eat post content under PHP 5.4

Reported by: westi's profile westi Owned by: westi's profile westi
Milestone: 3.6 Priority: high
Severity: blocker Version: 3.6
Component: Formatting Keywords: needs-patch
Focuses: Cc:

Description

Because of a change in default behaviour in htmlspecialchars in PHP5.4 it is possible for these three functions to eat perfectly valid post content and make it impossible to edit existing posts.

Scenario:

  • blog_charset is ISO-8859-1
  • Post contains some 8bit characters

You try and edit the post and instead of the post content you are presented with a blank editor :(

On the front end the posts display fine.

The underlying cause it this change in htmlspecialchars

"5.4.0 The default value for the encoding parameter was changed to UTF-8."

Because the string is not a valid UTF-8 sequence an empty string is returned :(

Related to #20368

Attachments (5)

htmlspecialchars-with-charset.diff (1.1 KB) - added by westi 12 years ago.
Pass the blog_charset to htmlspecialchars
23688.diff (1.6 KB) - added by ryan 11 years ago.
23688-ut.diff (880 bytes) - added by ryan 11 years ago.
23688.2.diff (1.8 KB) - added by ryan 11 years ago.
23688-ut.2.diff (1.5 KB) - added by ryan 11 years ago.

Download all attachments as: .zip

Change History (21)

@westi
12 years ago

Pass the blog_charset to htmlspecialchars

#1 @jkudish
12 years ago

  • Cc jkudish added

#2 @dllh
12 years ago

  • Cc daryl@… added

#3 @westi
12 years ago

In 1243/tests:

Formatting: New test cases for a few core formatting functions which call htmlspecialchars without providing any charset info.

As of PHP5.4 the default charset for htmlspecialchars is changed to UTF-8 which means that iso-8859-1 data gets eaten.

See: #23688

#4 @westi
12 years ago

  • Owner set to westi
  • Resolution set to fixed
  • Status changed from new to closed

In 23685:

Formatting: Pass the blog charset to htmlspecialchars so that we don't eat non-UTF8 strings in PHP 5.4

Fixes #23688

#5 @nacin
12 years ago

  • Resolution fixed deleted
  • Severity changed from major to blocker
  • Status changed from closed to reopened

utf-8, UTF8 and utf8 are all semi-valid, according to core, but may crash and burn here. Another charset like UTF-16 could also presumably generate warnings and such.

#6 follow-up: @azaozz
12 years ago

Looks like we will need to normalize get_option( 'blog_charset' ) before using it here and limit it to the supported values.

Another case that fails is when the user pastes content with different encoding into the editor (Visual or Text). Then it can contain invalid code sequences and htmlspecialchars() will return an empty string. Perhaps will need to use ENT_SUBSTITUTE or ENT_DISALLOWED in PHP 5.4 to work around this (pending better documentation of how exactly these flags work, see http://php.net/manual/en/function.htmlspecialchars.php).

#7 in reply to: ↑ 6 @trevHCS
12 years ago

Can I just point out the ENT_DISALLOWED does not appear to work properly in this situation as it still gives a blank string and thus empty editing box.

The ENT_SUBSTITUTE does work albeit with odd output, eg: "You’ll" gets converted to "You�ll" when the text encoding is ISO-8859-1 which is better as an intermediate fix.

Last edited 12 years ago by trevHCS (previous) (diff)

#9 @SergeyBiryukov
12 years ago

  • Keywords needs-patch added; has-patch removed

#10 follow-up: @azaozz
12 years ago

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

#11 in reply to: ↑ 10 @trevHCS
12 years ago

Replying to azaozz:

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

I am wondering whether normalising the blog_charset would maybe be better done within get_option() rather than a specific new function?

Reasoning being there are 4x PHP 5.4 functions where this change to UTF-8 was made:

htmlspecialchars()
htmlentities()
html_entity_decode()
get_html_translation_table() [not sure this is used in WP]

There are already scripts within WP core which use the proposed get_option('blog_charset') fix as documented below, so fixing it in a new function wouldn't actually help those if the charset was incorrect.

Scripts found to already use the fix:

  • default-widgets.php -> function widget() -> html_entity_decode()
  • feed.php -> get_the_category_rss() -> html_entity_decode()

This doesn't remove the need to update the code to use the get_option('blog_charset') within any of the above function calls, but it would seem to me at least that fixing it once would be easier than fixing it lots of times?

Fixing it in get_options() would also fix any errors in HTML headers, albeit that is outside the remit of this ticket.

Last edited 12 years ago by trevHCS (previous) (diff)

#12 @WraithKenny
11 years ago

  • Cc Ken@… added

@ryan
11 years ago

@ryan
11 years ago

#13 @aaroncampbell
11 years ago

32688.diff seems reasonable to me. If there are others that need to be fixed int he future, we can just add them there.

@ryan
11 years ago

@ryan
11 years ago

#14 @ryan
11 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 24510:

Normalize the UTF-8 and ISO-8859-1 charset strings stored in blog_charset to make them friendlier with PHP functions that accept a charset such as htmlspecialchars().

fixes #23688

#15 @ryan
11 years ago

In 1298/tests:

Tests for charset string normalization. see #23688

This ticket was mentioned in Slack in #core by johnbillion. View the logs.


10 years ago

Note: See TracTickets for help on using tickets.