WordPress.org

Make WordPress Core

Opened 2 years ago

Closed 22 months ago

Last modified 6 months ago

#23688 closed defect (bug) (fixed)

esc_textarea, wp_richedit_pre and wp_htmledit_pre eat post content under PHP 5.4

Reported by: westi Owned by: westi
Milestone: 3.6 Priority: high
Severity: blocker Version: 3.6
Component: Formatting Keywords: needs-patch
Focuses: Cc:

Description

Because of a change in default behaviour in htmlspecialchars in PHP5.4 it is possible for these three functions to eat perfectly valid post content and make it impossible to edit existing posts.

Scenario:

  • blog_charset is ISO-8859-1
  • Post contains some 8bit characters

You try and edit the post and instead of the post content you are presented with a blank editor :(

On the front end the posts display fine.

The underlying cause it this change in htmlspecialchars

"5.4.0 The default value for the encoding parameter was changed to UTF-8."

Because the string is not a valid UTF-8 sequence an empty string is returned :(

Related to #20368

Attachments (5)

htmlspecialchars-with-charset.diff (1.1 KB) - added by westi 2 years ago.
Pass the blog_charset to htmlspecialchars
23688.diff (1.6 KB) - added by ryan 22 months ago.
23688-ut.diff (880 bytes) - added by ryan 22 months ago.
23688.2.diff (1.8 KB) - added by ryan 22 months ago.
23688-ut.2.diff (1.5 KB) - added by ryan 22 months ago.

Download all attachments as: .zip

Change History (21)

@westi2 years ago

Pass the blog_charset to htmlspecialchars

comment:1 @jkudish2 years ago

  • Cc jkudish added

comment:2 @dllh2 years ago

  • Cc daryl@… added

comment:3 @westi2 years ago

In 1243/tests:

Formatting: New test cases for a few core formatting functions which call htmlspecialchars without providing any charset info.

As of PHP5.4 the default charset for htmlspecialchars is changed to UTF-8 which means that iso-8859-1 data gets eaten.

See: #23688

comment:4 @westi2 years ago

  • Owner set to westi
  • Resolution set to fixed
  • Status changed from new to closed

In 23685:

Formatting: Pass the blog charset to htmlspecialchars so that we don't eat non-UTF8 strings in PHP 5.4

Fixes #23688

comment:5 @nacin2 years ago

  • Resolution fixed deleted
  • Severity changed from major to blocker
  • Status changed from closed to reopened

utf-8, UTF8 and utf8 are all semi-valid, according to core, but may crash and burn here. Another charset like UTF-16 could also presumably generate warnings and such.

comment:6 follow-up: @azaozz2 years ago

Looks like we will need to normalize get_option( 'blog_charset' ) before using it here and limit it to the supported values.

Another case that fails is when the user pastes content with different encoding into the editor (Visual or Text). Then it can contain invalid code sequences and htmlspecialchars() will return an empty string. Perhaps will need to use ENT_SUBSTITUTE or ENT_DISALLOWED in PHP 5.4 to work around this (pending better documentation of how exactly these flags work, see http://php.net/manual/en/function.htmlspecialchars.php).

comment:7 in reply to: ↑ 6 @trevHCS2 years ago

Can I just point out the ENT_DISALLOWED does not appear to work properly in this situation as it still gives a blank string and thus empty editing box.

The ENT_SUBSTITUTE does work albeit with odd output, eg: "You’ll" gets converted to "You�ll" when the text encoding is ISO-8859-1 which is better as an intermediate fix.

Last edited 2 years ago by trevHCS (previous) (diff)

comment:9 @SergeyBiryukov2 years ago

  • Keywords needs-patch added; has-patch removed

comment:10 follow-up: @azaozz2 years ago

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

comment:11 in reply to: ↑ 10 @trevHCS2 years ago

Replying to azaozz:

Thinking we need wp_htmlspecialchars() where blog_charset can be normalized, etc.

I am wondering whether normalising the blog_charset would maybe be better done within get_option() rather than a specific new function?

Reasoning being there are 4x PHP 5.4 functions where this change to UTF-8 was made:

htmlspecialchars()
htmlentities()
html_entity_decode()
get_html_translation_table() [not sure this is used in WP]

There are already scripts within WP core which use the proposed get_option('blog_charset') fix as documented below, so fixing it in a new function wouldn't actually help those if the charset was incorrect.

Scripts found to already use the fix:

  • default-widgets.php -> function widget() -> html_entity_decode()
  • feed.php -> get_the_category_rss() -> html_entity_decode()

This doesn't remove the need to update the code to use the get_option('blog_charset') within any of the above function calls, but it would seem to me at least that fixing it once would be easier than fixing it lots of times?

Fixing it in get_options() would also fix any errors in HTML headers, albeit that is outside the remit of this ticket.

Last edited 2 years ago by trevHCS (previous) (diff)

comment:12 @WraithKenny22 months ago

  • Cc Ken@… added

@ryan22 months ago

@ryan22 months ago

comment:13 @aaroncampbell22 months ago

32688.diff seems reasonable to me. If there are others that need to be fixed int he future, we can just add them there.

@ryan22 months ago

@ryan22 months ago

comment:14 @ryan22 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 24510:

Normalize the UTF-8 and ISO-8859-1 charset strings stored in blog_charset to make them friendlier with PHP functions that accept a charset such as htmlspecialchars().

fixes #23688

comment:15 @ryan22 months ago

In 1298/tests:

Tests for charset string normalization. see #23688

comment:16 @slackbot6 months ago

This ticket was mentioned in Slack in #core by johnbillion. View the logs.

Note: See TracTickets for help on using tickets.