Make WordPress Core

Opened 12 months ago

Closed 12 months ago

Last modified 12 months ago

#60295 closed defect (bug) (invalid)

esc_html() function returns an empty string when the last character of the input string variable is ASCII 145 or 146

Reported by: jani20
Owned by:
Milestone:
Priority: normal
Severity: normal
Version:
Component: Formatting
Keywords: reporter-feedback
Focuses:
Cc:

Description

When the last character of a string variable used as an argument by esc_html() is a left or right single quotation mark (ASCII 145 or 146), the esc_html() function returns an empty string.

Change History (5)

#1 @TobiasBg
12 months ago

  • Keywords reporter-feedback added
  • Version 6.4.2 deleted

Hi @jani20,
thanks for your report!

However, I can't seem to reproduce this...
When I add the code

echo esc_html( "test’" );

in an admin page, I correctly get test’ as the output.

Can you maybe provide a more detailed code example that is not working for you?

#2 @dmsnell
12 months ago

@TobiasBg I think your example is using U+2019 RIGHT SINGLE QUOTATION MARK. It's hard to see because PHP is probably using UTF-8 by default, so your string is the byte sequence "test\xe2\x80\x99".
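
To make the difference between the two inputs visible, here is a minimal sketch in plain PHP (no WordPress required, PHP 7 or later for the "\u{...}" escape). The variable names are only illustrative. It dumps the bytes of a string containing U+2019 next to a string containing the single byte 0x92:

```php
<?php
// U+2019 written as a code point escape, so the encoding of this source file does not matter.
$curly = "test\u{2019}";

// The single CP-1252 byte for a right single quotation mark.
$cp1252 = "test\x92";

$curly_hex  = bin2hex( $curly );  // 74657374e28099 (three bytes for the quote)
$cp1252_hex = bin2hex( $cp1252 ); // 7465737492     (one byte for the quote)

echo $curly_hex, "\n", $cp1252_hex, "\n";
```

The two strings can look identical on screen, but only the first one is valid UTF-8.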

I'm able to reproduce using this.

<?php
// Expected to print bool(true) twice on a site whose charset is UTF-8.
var_dump( '' === esc_html( "test\x91" ) );
var_dump( '' === esc_html( "test\x92" ) );

@jani20, these single quotation marks are not actually ASCII, which only covers 0 to 127. They are CP-1252, the default character encoding Microsoft used for its products for a long time. I'm guessing that your blog's charset is set to UTF-8, where these bytes form an invalid string.

php > iconv( 'utf-8', 'utf-8', "test\x91" );
PHP Notice:  iconv(): Detected an illegal character in input string in php shell code on line 1

Notice: iconv(): Detected an illegal character in input string in php shell code on line 1
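
The same invalidity can be checked without triggering a notice. This is a rough sketch in plain PHP, assuming the mbstring extension is loaded; it is not a claim about how esc_html() is implemented internally. It also shows that PHP's own htmlspecialchars() returns an empty string for this input when neither ENT_SUBSTITUTE nor ENT_IGNORE is passed, which matches the symptom in this ticket:

```php
<?php
$input = "test\x91";

// mbstring: false, because 0x91 cannot start a UTF-8 sequence.
$is_valid_utf8 = mb_check_encoding( $input, 'UTF-8' );

// PCRE in UTF-8 mode refuses the subject: preg_match() returns false
// and preg_last_error() reports PREG_BAD_UTF8_ERROR.
$pcre_result = preg_match( '/^./us', $input );
$pcre_error  = preg_last_error();

// With explicit flags that omit ENT_SUBSTITUTE and ENT_IGNORE,
// invalid input makes htmlspecialchars() return an empty string.
$escaped = htmlspecialchars( $input, ENT_QUOTES, 'UTF-8' );

var_dump( $is_valid_utf8, $pcre_result, $pcre_error === PREG_BAD_UTF8_ERROR, $escaped );
```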

Things you might want to check:

  • the database character encoding.
  • your browser might have a character encoding selection in the Edit menu, or elsewhere. UTF-8 is what it likely should be. I've seen "Default" fail for some sites that don't indicate their charset.
  • ensure your theme is generating a META element with the right character encoding, or "charset"

These characters may legitimately appear in HTML; when they do, WordPress should be treating them as CP-1252 treats them. It does this right now if they appear through character references like &#146; but not if they come through directly as normal text.
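
If the pasted content is known to come from a CP-1252 source, one possible workaround is to convert it to UTF-8 before escaping. The sketch below is plain PHP and assumes the mbstring extension; the helper name cp1252_to_utf8_if_needed() is hypothetical and not a WordPress function. The check is only a heuristic: any input that already happens to be valid UTF-8 is passed through unchanged.

```php
<?php
/**
 * Hypothetical helper: treat input that is not valid UTF-8 as CP-1252
 * and convert it. Input that is already valid UTF-8 is returned as-is.
 */
function cp1252_to_utf8_if_needed( string $text ): string {
	if ( mb_check_encoding( $text, 'UTF-8' ) ) {
		return $text;
	}
	return mb_convert_encoding( $text, 'UTF-8', 'Windows-1252' );
}

$left  = cp1252_to_utf8_if_needed( "test\x91" ); // 0x91 becomes U+2018
$right = cp1252_to_utf8_if_needed( "test\x92" ); // 0x92 becomes U+2019

// Escaping now has valid UTF-8 to work with, so nothing is lost.
$escaped = htmlspecialchars( $right, ENT_QUOTES, 'UTF-8' );

echo bin2hex( $left ), "\n", bin2hex( $right ), "\n", $escaped, "\n";
```

In a real site it is usually better to fix the data at the source (for example, when importing or pasting the content) than to guess the encoding at output time.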

#3 @jani20
12 months ago

Thank you @TobiasBg, this is it. The characters are in post content pasted from some non-WP source (probably HubSpot). The issue was noticed when get_the_excerpt() cut the content so that U+0092 was the last character shown in the excerpt; after esc_html(), the excerpt disappears.

#4 follow-up: @dmsnell
12 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to invalid
  • Status changed from new to closed

Glad to hear it, @jani20. I'm going to close this ticket as invalid because I think that's the way to say "this isn't a bug with WordPress directly". It sounds like the culprit is telling WordPress that some text is UTF-8 when it's invalid.

#5 in reply to: ↑ 4 @jani20
12 months ago

Thank you @dmsnell for your thorough explanation. Have a great weekend!

