Opened 6 months ago
Last modified 5 days ago
#63724 reopened enhancement
HTML API: Reliably parse HTML attributes in `wp_kses_hair()`
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | 6.9 |
| Component: | HTML API | Keywords: | has-patch has-unit-tests needs-refresh |
| Focuses: | Cc: |
Description
wp_kses_hair() attempts to parse HTML attributes given the span of text inside an HTML tag, but excluding the tag name, opening <, and closing >. For example:
<?php $attrs = wp_kses_hair( ' class="description"', wp_allowed_protocols() ); $attrs === array( 'class' => array( 'name' => 'class', 'value' => 'description', 'whole' => 'class="description"', 'vless' => 'n', ) );
While this has served WordPress for years, there are fundamental issues in the parsing model; namely, that it isn’t spec-compliant with the HTML5 living standard. Categories of legitimate attribute values are rejected and overlooked, while certain values are misinterpreted as something other than what they are.
Ideally, there would be no need for this function:
- Passing in the string of text inside the HTML tags covered by the attributes is awkward and carries forward parsing errors in determining what that string is.
- The output format does not decode attribute values, which passes along confusion about whether content is escaped or not.
- The HTML API provides more efficient and reliable tools to work directly with HTML and not have to split it and pass around sub-spans of whole HTML tokens.
However, use of the function is pervasive in the plugin space and so the function must remain.
WordPress should lean on the reliability that the HTML API affords to properly parse attributes inside of wp_kses_hair() as part of a push away from ad-hoc HTML parsing and towards reliance on the HTML API.
Change History (16)
This ticket was mentioned in PR #9248 on WordPress/wordpress-develop by @dmsnell.
6 months ago
#1
- Keywords has-patch added
#2
@
6 months ago
- Keywords needs-unit-tests added
With the noted pervasiveness of this function by plugins, I think that this needs some unit tests that can show the behavior isn't changing, or if it does change it is minimally done in documented and understandable ways which can be messaged.
#3
@
6 months ago
Thanks @jorbin — I’m slow, but I plan on adding tests that assert the mapping between HTML attributes and the kinds of outputs that wp_kses_hair() produces.
This ticket will probably be a good proving ground for these kinds of changes, as there is a long list of core functionality that I would like to see us update over the coming years. Almost every one of them will involve behavioral changes, but I believe that leaving the kinds of breakages in place is not the best way forward.
At least in the ideal cases the behaviors are preserved. That is, the intent matches. I don’t know how much belongs in a test to show those changes, but I know some of those tests will lose relevance with time. For example, if wp_kses_hair() misparses an HTML attribute but WordPress now properly understands that attribute, do you have suggestions on how you would like to see that?
At the extreme end I know we don’t document previous bugs, but as you say, this has been long established. Still, in some cases, logging notable changes is also broadcasting some sensitive vulnerabilities that remain in other parts of Core.
This ticket was mentioned in Slack in #core by welcher. View the logs.
4 months ago
#5
@
4 months ago
@dmsnell this was mentioned in the scrub today. Do you think you'll be able to pick this up for 6.9?
This ticket was mentioned in Slack in #core by benjamin_zekavica. View the logs.
4 months ago
This ticket was mentioned in PR #10142 on WordPress/wordpress-develop by @jignesh.nakrani.
3 months ago
#7
- Keywords has-unit-tests added; needs-unit-tests removed
Trac ticket: https://core.trac.wordpress.org/ticket/63724
#9
@
3 months ago
- Keywords needs-refresh added
- Milestone changed from 6.9 to 7.0
The 6.9 Beta1 release is coming soon, so I'd like to punt this ticket to 7.0. Also, both PRs have unit test failures that require updates.
#10
@
3 months ago
@wildworks @welcher hard to tell. I can give an update within the next few hours hopefully
@jonsurrell commented on PR #9248:
6 days ago
#11
This function had quirks that change with this PR and I want to understand them.
I created a test suite for wp_kses_hair(), then I merged this branch and updated to get a diff of test changes. I also looked at several of the most popular results from WP Directory to understand usage.
My review of the most common usages on suggest that _this change is safe to make and would not negatively impact plugin authors_.
- Historically the value and whole properties of the returned array indicate the raw parsed bytes from the HTML (with some exceptions). This means that HTML character references are not decoded. This represents an abstraction leak between the HTML and structural return value.
- - Should this refactor leave the messy return values in place or should it decode the attribute values to enforce the view of the world developers are imagining when calling it? (that all values are normal PHP strings and not HTML text node strings)?
This is a tricky question. It doesn't _seem_ like folks rely on specifics of the input representation being present in the output, however it's certainly possible.
In one of the examples from plugins, esc_attr() is called on the attribute value to construct a new HTML string. This should be perfectly fine because the original HTML was re-encoded in this PR and esc_attr() will avoid double-encoding. They also statically wrap with ", which made the esc_attr() necessary because the attribute value could have contained "!
After some reflection, I believe the behavior you've implemented here _is a good decision_. Consider that the input is HTML and the output (value and whole) have always been some form of HTML. The difference here is a _normalization_ of the HTML in the output.
---
<details>
<summary>behavior diff</summary>
-
tests/phpunit/tests/kses/wpKsesHair.php
diff --git a/tests/phpunit/tests/kses/wpKsesHair.php b/tests/phpunit/tests/kses/wpKsesHair.php index 2ed83679f2e3d..05d573bc070bc 100644
a b public function data_attribute_parsing() { 57 57 'title' => array( 58 58 'name' => 'title', 59 59 'value' => 'My Title', 60 'whole' => "title='My Title'",60 'whole' => 'title="My Title"', 61 61 'vless' => 'n', 62 62 ), 63 63 ), … … public function data_attribute_parsing() { 188 188 array( 189 189 'title' => array( 190 190 'name' => 'title', 191 'value' => '& #60;test>',192 'whole' => 'title="& #60;test>"',191 'value' => '<test>', 192 'whole' => 'title="<test>"', 193 193 'vless' => 'n', 194 194 ), 195 195 ), … … public function data_attribute_parsing() { 200 200 array( 201 201 'title' => array( 202 202 'name' => 'title', 203 'value' => '& #x3C;hex>',204 'whole' => 'title="& #x3C;hex>"',203 'value' => '<hex>', 204 'whole' => 'title="<hex>"', 205 205 'vless' => 'n', 206 206 ), 207 207 ), … … public function data_attribute_parsing() { 212 212 array( 213 213 'title' => array( 214 214 'name' => 'title', 215 'value' => '& #X3C;HEX>',216 'whole' => 'title="& #X3C;HEX>"',215 'value' => '<HEX>', 216 'whole' => 'title="<HEX>"', 217 217 'vless' => 'n', 218 218 ), 219 219 ), … … public function data_attribute_parsing() { 224 224 array( 225 225 'title' => array( 226 226 'name' => 'title', 227 'value' => '& invalid; &#; &#x;',228 'whole' => 'title="& invalid; &#; &#x;"',227 'value' => '&invalid; &#; &#x;', 228 'whole' => 'title="&invalid; &#; &#x;"', 229 229 'vless' => 'n', 230 230 ), 231 231 ), … … public function data_attribute_parsing() { 249 249 'data-text' => array( 250 250 'name' => 'data-text', 251 251 'value' => 'Single quoted value', 252 'whole' => "data-text='Single quoted value'",252 'whole' => 'data-text="Single quoted value"', 253 253 'vless' => 'n', 254 254 ), 255 255 ), … … public function data_attribute_parsing() { 267 267 'alt' => array( 268 268 'name' => 'alt', 269 269 'value' => 'single', 270 'whole' => "alt='single'",270 'whole' => 'alt="single"', 271 271 'vless' => 'n', 272 272 ), 273 273 'id' => array( … … public function data_attribute_parsing() { 284 284 array( 285 285 'title' => array( 286 286 'name' => 'title', 287 'value' => "It's working",288 'whole' => 'title="It \'s working"',287 'value' => 'It's working', 288 'whole' => 'title="It's working"', 289 289 'vless' => 'n', 290 290 ), 291 291 ), … … public function data_attribute_parsing() { 296 296 array( 297 297 'title' => array( 298 298 'name' => 'title', 299 'value' => 'He said "hello"',300 'whole' => 'title= \'He said "hello"\'',299 'value' => 'He said "hello"', 300 'whole' => 'title="He said "hello""', 301 301 'vless' => 'n', 302 302 ), 303 303 ), … … public function data_attribute_parsing() { 327 327 328 328 yield 'invalid attribute name starting with number' => array( 329 329 '1invalid="value"', 330 array(), 330 array( 331 '1invalid' => array( 332 'name' => '1invalid', 333 'value' => 'value', 334 'whole' => '1invalid="value"', 335 'vless' => 'n', 336 ), 337 ), 331 338 ); 332 339 333 340 yield 'invalid attribute name special chars' => array( 334 341 '@invalid="value" $bad="value"', 335 array(), 342 array( 343 '@invalid' => array( 344 'name' => '@invalid', 345 'value' => 'value', 346 'whole' => '@invalid="value"', 347 'vless' => 'n', 348 ), 349 '$bad' => array( 350 'name' => '$bad', 351 'value' => 'value', 352 'whole' => '$bad="value"', 353 'vless' => 'n', 354 ), 355 ), 336 356 ); 337 357 338 358 yield 'duplicate attributes first wins' => array( … … public function data_attribute_parsing() { 355 375 356 376 yield 'malformed unclosed double quote' => array( 357 377 'title="unclosed class="test"', 358 array(), 378 array( 379 'title' => array( 380 'name' => 'title', 381 'value' => 'unclosed class=', 382 'whole' => 'title="unclosed class="', 383 'vless' => 'n', 384 ), 385 'test"' => array( 386 'name' => 'test"', 387 'value' => '', 388 'whole' => 'test"', 389 'vless' => 'y', 390 ), 391 ), 359 392 ); 360 393 361 394 yield 'very long attribute value' => array( … … public function data_attribute_parsing() { 610 643 'alt' => array( 611 644 'name' => 'alt', 612 645 'value' => '', 613 'whole' => "alt=''",646 'whole' => 'alt=""', 614 647 'vless' => 'n', 615 648 ), 616 649 'class' => array( … … public function data_attribute_parsing() { 625 658 yield 'forward slashes between attributes' => array( 626 659 'att / att2=2 /// att3="3"', 627 660 array( 628 'att' => array(661 'att' => array( 629 662 'name' => 'att', 630 663 'value' => '', 631 664 'whole' => 'att', … … public function data_attribute_parsing() { 652 685 'att' => array( 653 686 'name' => 'att', 654 687 'value' => 'val', 655 'whole' => "att='val'",688 'whole' => 'att="val"', 656 689 'vless' => 'n', 657 690 ), 658 691 'att2' => array( 659 692 'name' => 'att2', 660 693 'value' => 'val2', 661 'whole' => "att2='val2'",694 'whole' => 'att2="val2"', 662 695 'vless' => 'n', 663 696 ), 664 697 ), … … public function data_attribute_parsing() { 670 703 'att' => array( 671 704 'name' => 'att', 672 705 'value' => 'val', 673 'whole' => "att='val'",706 'whole' => 'att="val"', 674 707 'vless' => 'n', 675 708 ), 676 709 'att2' => array( 677 710 'name' => 'att2', 678 711 'value' => 'val2', 679 'whole' => "att2='val2'",712 'whole' => 'att2="val2"', 680 713 'vless' => 'n', 681 714 ), 682 715 ), … … public function data_attribute_parsing() { 688 721 'att' => array( 689 722 'name' => 'att', 690 723 'value' => 'val', 691 'whole' => "att='val'",724 'whole' => 'att="val"', 692 725 'vless' => 'n', 693 726 ), 694 727 'att2' => array( 695 728 'name' => 'att2', 696 729 'value' => 'val2', 697 'whole' => "att2='val2'",730 'whole' => 'att2="val2"', 698 731 'vless' => 'n', 699 732 ), 700 733 ), … … public function data_attribute_parsing() { 706 739 'att' => array( 707 740 'name' => 'att', 708 741 'value' => 'val', 709 'whole' => "att='val'",742 'whole' => 'att="val"', 710 743 'vless' => 'n', 711 744 ), 712 745 'att2' => array( 713 746 'name' => 'att2', 714 747 'value' => 'val2', 715 'whole' => "att2='val2'",748 'whole' => 'att2="val2"', 716 749 'vless' => 'n', 717 750 ), 718 751 ), … … public function data_attribute_parsing() { 739 772 // Malformed Equals Patterns. 740 773 yield 'multiple equals signs' => array( 741 774 'att=="val"', 742 array(), 775 array( 776 'att' => array( 777 'name' => 'att', 778 'value' => '="val"', 779 'whole' => 'att="="val""', 780 'vless' => 'n', 781 ), 782 ), 743 783 ); 744 784 745 785 yield 'equals with strange spacing' => array( 746 786 'att= ="val"', 747 array(), 787 array( 788 'att' => array( 789 'name' => 'att', 790 'value' => '="val"', 791 'whole' => 'att="="val""', 792 'vless' => 'n', 793 ), 794 ), 748 795 ); 749 796 750 797 yield 'triple equals signs' => array( 751 798 'att==="val"', 752 array(), 799 array( 800 'att' => array( 801 'name' => 'att', 802 'value' => '=="val"', 803 'whole' => 'att="=="val""', 804 'vless' => 'n', 805 ), 806 ), 753 807 ); 754 808 755 809 yield 'equals echo pattern' => array( 756 810 "att==echo 'something'", 757 811 array( 758 'att' => array(812 'att' => array( 759 813 'name' => 'att', 760 814 'value' => '=echo', 761 815 'whole' => 'att="=echo"', 762 816 'vless' => 'n', 763 817 ), 818 "'something'" => array( 819 'name' => "'something'", 820 'value' => '', 821 'whole' => "'something'", 822 'vless' => 'y', 823 ), 764 824 ), 765 825 ); 766 826 767 827 yield 'attribute starting with equals' => array( 768 828 '= bool k=v', 769 829 array( 830 '=' => array( 831 'name' => '=', 832 'value' => '', 833 'whole' => '=', 834 'vless' => 'y', 835 ), 770 836 'bool' => array( 771 837 'name' => 'bool', 772 838 'value' => '', … … public function data_attribute_parsing() { 785 851 yield 'mixed quotes and equals chaos' => array( 786 852 'k=v ="' . "' j=w", 787 853 array( 788 'k' => array(854 'k' => array( 789 855 'name' => 'k', 790 856 'value' => 'v', 791 857 'whole' => 'k="v"', 792 858 'vless' => 'n', 793 859 ), 860 '="' . "'" => array( 861 'name' => '="' . "'", 862 'value' => '', 863 'whole' => '="' . "'", 864 'vless' => 'y', 865 ), 866 'j' => array( 867 'name' => 'j', 868 'value' => 'w', 869 'whole' => 'j="w"', 870 'vless' => 'n', 871 ), 794 872 ), 795 873 ); 796 874 797 875 yield 'triple equals quoted whitespace' => array( 798 876 '===" "', 799 array(), 877 array( 878 '=' => array( 879 'name' => '=', 880 'value' => '="', 881 'whole' => '=="=""', 882 'vless' => 'n', 883 ), 884 '"' => array( 885 'name' => '"', 886 'value' => '', 887 'whole' => '"', 888 'vless' => 'y', 889 ), 890 ), 800 891 ); 801 892 802 893 yield 'boolean with contradictory value' => array( … … public function data_attribute_parsing() { 820 911 yield 'empty attribute name with value' => array( 821 912 '="value" class="test"', 822 913 array( 823 'class' => array( 914 '="value"' => array( 915 'name' => '="value"', 916 'value' => '', 917 'whole' => '="value"', 918 'vless' => 'y', 919 ), 920 'class' => array( 824 921 'name' => 'class', 825 922 'value' => 'test', 826 923 'whole' => 'class="test"', … … public function data_protocol_filtering() { 890 987 'href' => array( 891 988 'name' => 'href', 892 989 'value' => 'alert(1)', 893 'whole' => "href='alert(1)'",990 'whole' => 'href="alert(1)"', 894 991 'vless' => 'n', 895 992 ), 896 993 ), … … public function data_protocol_filtering() { 925 1022 array( 926 1023 'src' => array( 927 1024 'name' => 'src', 928 'value' => 'text/html, <script>alert(1)</script>',929 'whole' => 'src="text/html, <script>alert(1)</script>"',1025 'value' => 'text/html,<script>alert(1)</script>', 1026 'whole' => 'src="text/html,<script>alert(1)</script>"', 930 1027 'vless' => 'n', 931 1028 ), 932 1029 ),
</details>
Here are two examples from the most most popular plugins in the WP Directory search:
From YITH (this appears to be part of the yith library used in many of their plugins):
/**
* Transform attributes array to HTML attributes string.
* If using a string, the attributes will be escaped.
* Prefer using arrays.
*
* @param array|string $attributes The attributes.
* @param bool $echo Set to true to print it directly; false otherwise.
*
* @return string
* @since 3.7.0
* @since 3.8.0 Escaping attributes when using strings; allow value-less attributes by setting value to null.
*/
function yith_plugin_fw_html_attributes_to_string( $attributes = array(), $echo = false ) {
$html_attributes = '';
if ( ! ! $attributes ) {
if ( is_string( $attributes ) ) {
$parsed_attrs = wp_kses_hair( $attributes, wp_allowed_protocols() );
$attributes = array();
foreach ( $parsed_attrs as $attr ) {
$attributes[ $attr['name'] ] = 'n' === $attr['vless'] ? $attr['value'] : null;
}
}
if ( is_array( $attributes ) ) {
$html_attributes = array();
foreach ( $attributes as $key => $value ) {
if ( ! is_null( $value ) ) {
$html_attributes[] = esc_attr( $key ) . '="' . esc_attr( $value ) . '"';
} else {
$html_attributes[] = esc_attr( $key );
}
}
$html_attributes = implode( ' ', $html_attributes );
}
}
if ( $echo ) {
// Already escaped above.
echo $html_attributes; // phpcs:ignore WordPress.Security.EscapeOutput.OutputNotEscaped
}
return $html_attributes;
}
$params = wp_kses_hair( $params, array( 'http' ) );
$width = isset( $params['width'] ) ? (int) $params['width']['value'] : 0;
$height = isset( $params['height'] ) ? (int) $params['height']['value'] : 0;
$wh = '';
if ( $width && $height ) {
$wh = "&w=$width&h=$height";
}
$url = esc_url_raw( "https://www.youtube.com/watch?v={$match[3]}{$wh}" );
#12
@
6 days ago
The implementation in PR 9248 is great. It's a big simplification and it brings clarity and rigor to this function. Frankly, I like it.
Ideally, there would be no need for this function.
The proposed change is good. But should we bother touching this function at all? What about just deprecating it and suggesting folks use the HTML API?
#13
@
6 days ago
What about just deprecating it and suggesting folks use the HTML API?
I think it will take more time to eradicate its use from within Core. at that point I’m all for deprecating it. Until then, my take is that it’s worth improving the reliability. Some of the examples you shared highlight critical categories of inputs where Core is currently confused.
#14
@
6 days ago
@jorbin thanks to @jonsurrell’s work we have a test suite, and it demonstrates the behaviors as well as where they different with this change.
I’m going to merge this, expecting to watch things and revert if necessary, but I think it will be stable. None of the cases that were previously broken were part of the function contract, and additionally, calling code had to already expect proper results.
This is hard for me to verbalize, but here is an example. Suppose we had id=< as our input. In any situation that code previously wanted to detect something about this pattern, it already had to also accept id="<", id="<", id='<', and a few other variants.
So I think this change is not presenting any meaningful differences in expectations but rather normalizing inputs so that only a subset of the pre-existing expectations are necessary. It’s shrinking the domain of required support.
The test suite shows some great examples of updates that definitely change behavior but which are also definitely wanted: in many cases Core is entirely unaware of the presence of existing attributes and can lead calling code to duplicate attribute or defeat valuable checks because of a presumed absence or misparse.
The original description on the function suggests it will perform normalization, but now it will be done comprehensively.
Trac ticket: Core-63724
Replaces #7407, dmsnell#5
Coordination in #9256
wp_kses_hair()is built around an impressive state machine for parsing the$attrof an HTML tag, that is, the span of text after the tag name and before the closing>. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing.This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the
$attrstring and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag.## Dependencies