#64054 closed defect (bug) (fixed)
HTML API: Attribute escaping should escape all HTML entities
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 6.9 | Priority: | normal |
| Severity: | normal | Version: | 6.2 |
| Component: | HTML API | Keywords: | has-patch |
| Focuses: | Cc: |
Description
Attribute values set with the HTML API method set_attribute() are escaped with esc_attr(). That function avoids "double encoding" things that look like HTML character references.
The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.
By contrast, set_modifiable_text() does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.
The text & appears to be an encoded character reference:
<?php $amp_text = '&'; $p = WP_HTML_Processor::create_fragment( '<p>x</p>' ); $p->next_tag(); $p->set_attribute( 'data-attr', $amp_text ); $p->next_token(); $p->set_modifiable_text( $amp_text ); echo $p->get_updated_html();
This prints the following HTML:
<p data-attr="&">&amp;</p>
Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &. The text node in the P element correctly renders & as expected.
Here's a demo of the difference in behavior between setting attributes and modifiable text.
Change History (11)
This ticket was mentioned in PR #10143 on WordPress/wordpress-develop by @jonsurrell.
5 weeks ago
#2
- Keywords has-patch added
Attribute values set with the HTML API method
set_attribute()are escaped withesc_attr(). That function avoids "double encoding" things that look like HTML character references.
The HTML API should encode whatever it receives, and apply "double encoding." The HTML API expects to receive plain string inputs and manage any necessary encoding itself. The fact that "double encoding" is disabled violates this expectation and makes it difficult correctly to set attribute values that contain sequences that appear to be HTML character references.
By contrast,
set_modifiable_text()does not rely on esc_html() and uses htmlspecialchars() directly. It will encode HTML character references as expected.
The text
&appears to be an encoded character reference:
$amp_text = '&'; $p = WP_HTML_Processor::create_fragment( '<p>x</p>' ); $p->next_tag(); $p->set_attribute( 'data-attr', $amp_text ); $p->next_token(); $p->set_modifiable_text( $amp_text ); echo $p->get_updated_html();
This prints the following HTML:
<p data-attr="&">&amp;</p>
Notice how the input text is treated differently between an attribute and text. The HTML encoding of the & character is the same in both contexts. The attribute has the value & instead of the expected &. The text node in the P element correctly renders & as expected.
Here's a demo of the difference in behavior between setting attributes and modifiable text.
Trac ticket: https://core.trac.wordpress.org/ticket/64054
#3
@
5 weeks ago
Here's a demo of how this behavior is not consistent. A round trip set_attribute() + get_attribute() decodes anything that looked like entities:
<?php $p = new WP_HTML_Tag_Processor( '<p>' ); $p->next_tag(); $p->set_attribute( 'data-lt-gt', '< >' ); var_dump( $p->get_attribute('data-lt-gt') ); // string(3) "< >" // expected: "< >"
@jonsurrell commented on PR #10143:
5 weeks ago
#4
The "double encode" behavior of esc_attr() (which really comes from htmlspecialchars()) is very strange when considering named character references that appear with and without a trailing ;. They're both character references and will be decoded by the browser (it's slightly more complex) but the double-encoded behavior does not treat them the same. Here's a good demo and explanation of that.
@jonsurrell commented on PR #10143:
5 weeks ago
#5
@jonsurrell commented on PR #10143:
4 weeks ago
#6
& seemed like the most common, expected, and legible encoding. r60616 addressed the KSES character reference issues I was aware aware of. Do you think we should use the numeric form?
@dmsnell commented on PR #10143:
4 weeks ago
#7
Do you think we should use the numeric form?
Not particularly; I just wanted your input on whether it would likely be more durable if we made this change here and now.
@dmsnell commented on PR #10143:
4 weeks ago
#8
I’ve expanded the examples even further to try and make it even more obvious what’s going on when character references are sent into the function. Additionally I have copied this update over to the HTML Processor, as the same changes apply there.
@dmsnell commented on PR #10143:
4 weeks ago
#9
Also I’ve added a similar note in the docblock for set_modifiable_test(). This method only exists in the Tag Processor so there’s no corresponding update in the HTML Processor.
You’ve stumbled upon something fairly contentious in the design. Originally we wanted zero dependencies on WordPress, especially from functions like
esc_attr()andesc_url()andesc_html()which carry so much existing baggage.For those attributes which aren’t being escaped by
esc_url()it’s probably fair.So far, I have largely collapsed onto adopting the risk profile of
esc_attr()with the hopes of improving that function. I did some work on that function a couple years ago and got distracted by other priorities. So perhaps it would have been best to readdress this then.You can examine wordpress/wordpress-develop#5949 where I was starting to move more attribute processing into the Tag Processor. There’s a tie-in with several
ksesfunctions that would be worth coordinating.