Make WordPress Core

Opened 6 weeks ago

Closed 4 weeks ago

Last modified 2 weeks ago

#64054 closed defect (bug) (fixed)

HTML API: Attribute escaping should escape all HTML entities

Reported by: jonsurrell's profile jonsurrell Owned by: dmsnell's profile dmsnell
Milestone: 6.9 Priority: normal
Severity: normal Version: 6.2
Component: HTML API Keywords: has-patch
Focuses: Cc:

Description

Attribute values set with the HTML API method set_attribute() are escaped with esc_attr(). That function avoids "double encoding" things that look like HTML character references.

The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.

By contrast, set_modifiable_text() does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.

The text & appears to be an encoded character reference:

<?php
$amp_text = '&amp;';
$p = WP_HTML_Processor::create_fragment( '<p>x</p>' );
$p->next_tag();
$p->set_attribute( 'data-attr', $amp_text );
$p->next_token();
$p->set_modifiable_text( $amp_text );
echo $p->get_updated_html();

This prints the following HTML:

<p data-attr="&amp;">&amp;amp;</p>

Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &amp;. The text node in the P element correctly renders &amp; as expected.

Here's a demo of the difference in behavior between setting attributes and modifiable text.

Change History (11)

#1 @dmsnell
5 weeks ago

You’ve stumbled upon something fairly contentious in the design. Originally we wanted zero dependencies on WordPress, especially from functions like esc_attr() and esc_url() and esc_html() which carry so much existing baggage.

For those attributes which aren’t being escaped by esc_url() it’s probably fair.

So far, I have largely collapsed onto adopting the risk profile of esc_attr() with the hopes of improving that function. I did some work on that function a couple years ago and got distracted by other priorities. So perhaps it would have been best to readdress this then.

You can examine wordpress/wordpress-develop#5949 where I was starting to move more attribute processing into the Tag Processor. There’s a tie-in with several kses functions that would be worth coordinating.

This ticket was mentioned in PR #10143 on WordPress/wordpress-develop by @jonsurrell.


5 weeks ago
#2

  • Keywords has-patch added

Attribute values set with the HTML API method set_attribute() are escaped with esc_attr(). That function avoids "double encoding" things that look like HTML character references.

The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.

By contrast, set_modifiable_text() does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.

The text &amp; appears to be an encoded character reference:

$amp_text = '&amp;';
$p = WP_HTML_Processor::create_fragment( '<p>x</p>' );
$p->next_tag();
$p->set_attribute( 'data-attr', $amp_text );
$p->next_token();
$p->set_modifiable_text( $amp_text );
echo $p->get_updated_html();

This prints the following HTML:

<p data-attr="&amp;">&amp;amp;</p>

Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &amp;. The text node in the P element correctly renders &amp; as expected.

Here's a demo of the difference in behavior between setting attributes and modifiable text.

Trac ticket: https://core.trac.wordpress.org/ticket/64054

#3 @jonsurrell
5 weeks ago

Here's a demo of how this behavior is not consistent. A round trip set_attribute() + get_attribute() decodes anything that looked like entities:

<?php
$p = new WP_HTML_Tag_Processor( '<p>' );
$p->next_tag();
$p->set_attribute( 'data-lt-gt', '&lt; &gt;' );
var_dump( $p->get_attribute('data-lt-gt') );
// string(3) "< >"
// expected: "&lt; &gt;"

@jonsurrell commented on PR #10143:


5 weeks ago
#4

The "double encode" behavior of esc_attr() (which really comes from htmlspecialchars()) is very strange when considering named character references that appear with and without a trailing ;. They're both character references and will be decoded by the browser (it's slightly more complex) but the double-encoded behavior does not treat them the same. Here's a good demo and explanation of that.

@jonsurrell commented on PR #10143:


4 weeks ago
#6

&amp; seemed like the most common, expected, and legible encoding. r60616 addressed the KSES character reference issues I was aware aware of. Do you think we should use the numeric form?

@dmsnell commented on PR #10143:


4 weeks ago
#7

Do you think we should use the numeric form?

Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now.

@dmsnell commented on PR #10143:


4 weeks ago
#8

I’ve expanded the examples even further to try and make it even more obvious what’s going on when character references are sent into the function. Additionally I have copied this update over to the HTML Processor, as the same changes apply there.

@dmsnell commented on PR #10143:


4 weeks ago
#9

Also I’ve added a similar note in the docblock for set_modifiable_test(). This method only exists in the Tag Processor so there’s no corresponding update in the HTML Processor.

#10 @dmsnell
4 weeks ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 60919:

HTML API: Escape all submitted HTML character references.

The HTML API has relied on esc_attr() and esc_html() when setting string attribute values or the contents of modifiable text. This leads to unexpected behavior when those functions attempt to prevent double-escaping of existing character references, and it can make certain contents impossible to represent.

After this change, the HTML API will reliably escape all submitted plaintext such that it appears in the browser the way it was submitted to the HTML API, with all character references escaped. This does not change the behavior of how URL attributes are escaped.

Developed in https://github.com/WordPress/wordpress-develop/pull/10143
Discussed in https://core.trac.wordpress.org/ticket/64054

Props dmsnell, jonsurrell, westonruter.
Fixes #64054.

#11 @sabernhardt
2 weeks ago

  • Milestone changed from Awaiting Review to 6.9
Note: See TracTickets for help on using tickets.