Make WordPress Core

Opened 5 months ago

Closed 3 months ago

Last modified 3 months ago

#61052 closed enhancement (fixed)

WP_KSES data attributes: Allow double dash

Reported by: cbravobernal's profile cbravobernal Owned by: dmsnell's profile dmsnell
Milestone: 6.6 Priority: normal
Severity: normal Version: 6.5
Component: Security Keywords: has-patch has-unit-tests needs-testing dev-feedback
Focuses: Cc:

Description

Right now, WordPress does not allow double hyphens -- inside data attributes.

https://github.com/WordPress/wordpress-develop/blob/16237a11586b022861933fa738acd957eef6653e/src/wp-includes/kses.php#L1267

That's a problem for Interactivity API directives like data-wp-on--event or data-wp-bind--value.

Right now, all PHP with those directives parsed with this function, will not be rendered, stopping all kind of interactions.

Because of that, static blocks are not supported yet(as they are saved in DB, and pass through a wp_kses before sending them to the browser).

Some plugins are starting to notice the issue, like WooCommerce:
https://github.com/woocommerce/woocommerce/issues/46722

Can this function be updated to allow them?

Is there any concern or edge case I could miss before creating a PR?

Change History (37)

#1 @cbravobernal
5 months ago

  • Summary changed from WP_KSES: Allow double dash to WP_KSES data attributes: Allow double dash

This ticket was mentioned in PR #6429 on WordPress/wordpress-develop by @dmsnell.


5 months ago
#2

  • Keywords has-patch has-unit-tests added

Trac ticket: Core-61052

Allow spec-compliant data-attributes in wp_kses_attr_check()

#3 @dmsnell
5 months ago

I've proposed a patch which is permissive here. Core code may be used to more restrictions, but all of these are allowed in the browser and appear after requesting the dataset for an element. Is there a good reason to reject such valid data attributes, even if they are somewhat obscure?

If there's one imaginable objection it would be to prevent storing data attributes named things like data-no-<img-"tag"-in-here. This is because such code might trip up weaker parsers down the line. However, this would or should be caught higher up than in wp_kses_attr_check() and so I think it's fine here not having that additional restriction.

#4 @gziolo
5 months ago

In this case, it would be easier to start with a very specific expectation that covers only the syntax allowed for directives:

  • attribute name prefixed with data-wp-
  • attribute name contains only a-z, 0-9 characters, single dash -, and double dash --
  • value should contain one of state, context, actions, callbacks
  • value can start with !
  • value can be prefixed with a namespace, not sure what is allowed
  • value can only contain a-Z, 0-9, :, .

#5 @gziolo
5 months ago

  • Component changed from General to Security
  • Milestone changed from Awaiting Review to 6.6
  • Version set to 6.5

#6 @oglekler
4 months ago

  • Keywords needs-testing added

#7 @jonsurrell
4 months ago

I checked into some history about why double dashes (or other characters) are not allowed in data attributes. The change landed in r43981 and was discussed in #33121, especially starting after this comment:

This (two hyphens or end hyphen) is true but it does some strange things to the element.dataset property available in JavaScript

Good point. Lets not allow it :)

The reasoning does not seem to be related to any security issues, but more around the potential for strange behavior when accessed via dataset thanks to its automatic dash-style to camelCase conversion.

Given the immediate need to allow double-dashes, the history, and the fact that more restrictive data attribute handling does not seem to have been an issue, I'd try to move ahead with a minimal PR that just allows leading, trailing, or double-dashes.

Ping @azaozz and @peterwilsoncc as the folks involved in the original data attributes with -- decision.

This ticket was mentioned in PR #6598 on WordPress/wordpress-develop by @jonsurrell.


4 months ago
#8

Allow data attrbitues with leading, trailing, and double - in the attribute name.

This relaxes the hyphen restriction introduced in https://core.trac.wordpress.org/changeset/43981. It is intended to be a simple change to allow data attributes used frequently by the Interactivity API like data-wp-on--click="…".

It is _not_ a substitute for https://github.com/WordPress/wordpress-develop/pull/6429. I'd like to see that move ahead the data attribute restrictions further relaxed to align with what's allowed by the specification. This _is_ intended to be a smaller and easier change to solve the immediate need that can be landed without too much difficulty.

Trac ticket: https://core.trac.wordpress.org/ticket/61052

@dmsnell commented on PR #6429:


4 months ago
#9

I dug into the relevant section of the spec, but I think the interesting point here is that data attributes in the browser are handled much more loosely (as already mentioned in the description). The spec seems to explicitly disallow ASCII uppercase and : in data attributes, but Chrome doesn't seem to care at all.

this is going to be another case where it's evident that the HTML5 specification is a family of interrelated specifications for different purposes. there are different rules for what we allow _producing_ than there are for what we demand for _reading_.

for example, attributes aren't allowed to be created with brackets in their name, like [if]="var.hasAccent" and yet every parser _must_ recognize this _as_ an attribute. so it's more or less a blur and everything that's allowed is spec, but then we "shouldn't" do things that are against the rules, but they are benign.

on data attributes what I was most concerned about is matching browser behavior. we should have a direct correspondence between the attributes on the server that appear in a dataset for an Element in the browser. if we find cases that all three browsers reject, let's add those in the test suite and note the omission.

@dmsnell commented on PR #6429:


4 months ago
#10

I've rearranged the code somewhat, doing what I originally wanted to avoid, in creating a new function. Still, the desire to add a test suite overcame my hesitation.

Now, I've added wp_kses_transform_custom_data_attribute_name, which is a mouthful, but does exactly what I believe Core needs to do. That is, it needs to start by recognizing what _is_ and what _isn't_ a data attribute.

Beyond this, we can talk about arbitrarily rejecting some of them, but we can add that on top of this work.

That is, if we only want to allow a subset of data attributes, we can start inside the if where we've detected one, and then reject it, knowing for sure we didn't miss any.

By the way, I ran a test in Safari, Chrome, and Firefox to confirm the test cases I added. They all operate uniformly. Surprisingly, data- (with nothing after the one -) is a valid data attribute whose name is the empty string. This test confirms the relationship between attribute name and appearing in an element's dataset property.

<details><summary>Test Code</summary>

<?php

$variations = [
        'data-',
        'post-id',
        'data-post-id',
        'data-Post-ID',
        "data-\u{2003}",
        'data-🐄-pasture',
        'data-wp-bind--enabled',
        'data--before',
        'data-after-',
        'data---one---two---',
        'data-[wish:granted]',
];

foreach ( $variations as $name ) {
        echo "<div><code>{$name}</code>: <span {$name}='{$name}'></span></div>\n";
}

?>
<script>
document.querySelectorAll( 'span' ).forEach( node => {
        node.innerText = JSON.stringify( node.dataset );
} );
</script>

</details>

https://github.com/WordPress/wordpress-develop/assets/5431237/2e81c244-2645-4324-8839-81ab0dca314b

#11 @peterwilsoncc
4 months ago

Thanks for the callout, Jon, I hadn't seen this ticket.

Given a core API is using double hyphens, then I think we need to allow it. I'll follow up with a review on the linked pull request.

@dmsnell commented on PR #6429:


4 months ago
#12

Interestingly, before this change data--before would have been rejected while afterwards it's allowed. The name of the dataset value doesn't have any dashes though, it's Before

@dmsnell commented on PR #6429:


4 months ago
#13

@peterwilsoncc I've refactored the test in 5294a86

Also I've updated the PR description to reflect how this PR is no longer attempting to allow all legitimate custom data attributes, but still only those WordPress special-cases, plus the ones with a dash in their dataset name.

Of note, this is slightly different than the approach in #6598. Instead of deciding to accept or reject based on the attribute name, in this patch WordPress is basing the decision off of the translated name. This subtlety means that the test is more directly related to the name of the custom data attribute.

For example, if we only test for the presence of the double dash --, we are comparing that against three different outputs: one with no dashes, one with one dash, and one with two dashes. This is evidenced in the screenshot of the table at the start of this post with the custom data attribute data--one--two-- whose name in JavaScript becomes One-Two--. Each double-dash in that name corresponds to a different translation.

Now, the rule to allow a - in the translated name implicitly allows double-dashes in the attribute name because that's the only way to produce one in the translated name. For example, wpBind-Enabled.

This ticket was mentioned in Slack in #core by joemcgill. View the logs.


4 months ago

@dmsnell commented on PR #6429:


4 months ago
#15

@sirreal what are your thoughts on this vs. #6598? and also about @peterwilsoncc's thoughts.

I'm a little unclear as to the need for the new function as it's the attribute name that KSES is concerned about, the validation regex within the existing function should be for convertible names.

Personally I feel like I could be convinced both ways, but there's this part of me, especially after working with so many issues in WordPress' HTML handling and kses issues, that starting with the specification and then adding constraints is going to pay off over starting with a mix of special constraints and ad-hoc parsing.

That is, for example, maybe today kses rejects anything that looks like a custom data attribute, but tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?" (By the way, @peterwilsoncc, this is another phrasing of my run-around reason for including the function - to define a new semantic which can answer the question "is this a custom data attribute?")

@jonsurrell commented on PR #6429:


4 months ago
#16

what are your thoughts on this vs. https://github.com/WordPress/wordpress-develop/pull/6598?

I'm happy to see this PR move forward if we're happy with it, it seems more complete. My only intention with #6598 was to have a PR that was small and easy to land. I'm happy to close that PR.

@peterwilsoncc commented on PR #6429:


4 months ago
#17

@dmsnell I can see it leading to confusing if the new function doesn't return null for a value that will end up being stripped:

wp> wp_kses_transform_custom_data_attribute_name( "data-love-a-little-🐮-love-a-little-tippin'" )
=> string(34) "loveALittle-🐮LoveALittleTippin'"
wp> wp_kses_post( "<div data-love-a-little-🐮-love-a-little-tippin'>Yes, a musical theatre reference</div>" )
=> string(43) "<div>Yes, a musical theatre reference</div>"

As it's a kses function it suggests support that isn't available. Generally KSES sub-functions aren't intended to be called directly but that can change over time, as we've seen with safecss_filter_attr()

@dmsnell commented on PR #6429:


4 months ago
#18

@peterwilsoncc 🤔 many of the kses sub-functions _are_ there just for parsing, like wp_kses_hair and wp_kses_hair_parse and wp_kses_attr_parse

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

@peterwilsoncc commented on PR #6429:


4 months ago
#19

as a reminder, my purpose is to fully vet and adapt this solution for the feedback, not to push to adopt it or choose it in the end.

Yeah, I've got ya. We're working to the same end & it's a complex question :)

would it make more sense to add this into the HTML API and then have kses call it? it would have a clear home there as being a semantic of the HTML layer.

That works as the HTML API seems more about parsing without regard to kses. As the maintainer of the HTML API is that something you are happy with? If it's not something you had planned and don't think is a valid use of the feature then don't feel the need to wedge it in there on my account.

tomorrow some other code tries to reuse the logic and misses some valid ones, or lets through one that it shouldn't, because its test wasn't "is this a custom data attribute" but rather "does this attribute follow this regex?

Are you able to clarify this a little more?

I'm still trying to get clear on the purpose of the new function in my head and how it differs from the regex validation.

@dmsnell commented on PR #6429:


3 months ago
#20

thanks again @peterwilsoncc, and sorry for my late response, which is due to some transit that threw me off.

That works as the HTML API seems more about parsing without regard to kses.

I'm happy to put this in there since it relates to the spec issues. Originally we had a WP_HTML_Spec class in the HTML API proposal but we removed it. It was basically an all-static class with knowledge like this from the HTML spec, and no other stateful or complicated logic.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

Are you able to clarify this a little more?

Thank you for asking. I don't mean to be unclear, but I don't always get things out very well.

I could change the question here a bit. This entire patch and need to open up kses is complicated by the fact that it's unclear what Core wants or expects. Here's was the existing code claims…

/**
 * Allow `data-*` attributes.
 *
 * …
 * 
 * Note: the attribute name should only contain `A-Za-z0-9_-` chars,
 * double hyphens `--` are not accepted by WordPress.
 */

The existing code makes a claim for the purpose of the regex but then immediately contradicts that claim by specifying new rules. It's there to allow data-* attributes, but then also it's there to only allow _some_ of them. It doesn't explain why these new restrictions are in place or who decided. It's evident that there are additional rules being imposed, but there's no explanation for a process to determine how and if it's safe to change those rules, or know if there's a bug in the original implementation.

For example, should Core allow data-________? It's allowed by the existing regex, but it looks wrong. What makes it wrong? Looks?

Getting back to your question, by separating the question into two steps I find it easier to disambiguate situations like these: is this a custom data attribute? is this one Core allows?

We have a clear goalpost for proper parsing, so we don't have to ask ourselves if the regex was intentionally overlooking other custom data attributes vs. whether it was accidentally overlooking them. Since this entire patch is about changing what WordPress allows, we can focus on WordPress additional rules and not conflate that with how to properly detect legitimate ones. We can change our own sanitization without ever effecting the underlying parsing code.

Anyway I hope I haven't made this more confusing. This is all about making it easier to talk about and verify that code we write is correct and about separating the act of detecting things from allowing some of those things we detect.

It's similar to using the HTML API to find IMG tags with a given class first by finding all of the legitimate IMG tags and then applying further rules, vs. attempting to find the subset directly with some pattern like ~[[Image(https://[^)]]]>~ - it invites conflation of spec. vs. custom behaviors which open up opportunity to miss things.

#21 @oglekler
3 months ago

  • Milestone changed from 6.6 to 6.7

We have 2 days before Beta 1, so I am moving it into the next milestone. 

#22 @dmsnell
3 months ago

@oglekler I'm not sure this should be punted. it's something that I think several folks would still like to put into 6.6

#23 @oglekler
3 months ago

  • Keywords dev-feedback added

@dmsnell There is no time for testing, and from my point of view patch has changes that are not strictly related to the issue mentioned in the ticket. So, it is not clear why this simple problem is getting such a complex patch. Tests are also much more complex than I could have been expected, for example, why WP_HTML_Tag_Processor is brought to the kses tests. There is another patch that is purely addressing the issue in question. Possibly, it is worth to proceed with simpler one make all other changes with another ticket.

@whyisjake and @audrasjb, please, take a look.

@peterwilsoncc commented on PR #6429:


3 months ago
#24

Thanks for clarifying.

I've been considering this over the weekend and my strong inclination is to go with the source code changes in PR #6598 to allow for the lightest touch changes to land in core.

Doing some quick testing, I can't see it leading to problems but it's worth coming up with additional edge cases. (Also doing quick tests, I can see that some of the browser interpretations are interesting https://jsbin.com/devefix/edit?html,js,console )

That said, if you can think of problematic cases it would lead to, I am happy to listen. It's best to know about these things pre- rather than post-commit.

The only challenge I see in moving into the HTML API right now is that it's not clear where to put it, and that would probably take some time to see how everyone feels.

As an alternative, we could inline the code for now and move it over when we have a stronger idea where it belongs. Or I guess we could still leave it as its own function so that we can leave in the tests for it, but communicate that it's not to be used or called from the outside, and that it will probably move. As in, maybe we can keep it here for now for practical reasons and defer the bigger design question until we can answer it.

If we're to validate the dataset, I'd be inclined to inline it for the time being if the intent is to deprecate in the shortish term.

In the past Core has removed private delegations because it didn't reflect reality (List Tables being the poster-child for this) so we'd need to keep it around. Inlining also allows a more considered approach to inclusion in the HTML API while you figure out where best to place it.

@dmsnell commented on PR #6429:


3 months ago
#25

If you feel happy with #6598 then let's merge that in and leave this open for further consideration. The two aren't contradictory, but this one encompasses more choice. I'll approve that one, and if nobody has merged it by later today I'll try and do that. The most important thing to me is that we allow these where WordPress currently removes them for the Interactivity API's purposes.

In aa63e2b0756f21043b74a8cc1fe45bc4ceddf552 I added your additional test cases. Cases like these are exactly why I approached the problem the way I did, because intuition of the mapping from HTML to JavaScript can be fuzzy, meaning the rules we apply in PHP are in high risk for being different than we think they are unless we write the rules to the output of that transformation.

Thanks @peterwilsoncc!

@jonsurrell commented on PR #6598:


3 months ago
#26

I agree, I'd still prefer to see https://github.com/WordPress/wordpress-develop/pull/6429. Landing this is compatible https://github.com/WordPress/wordpress-develop/pull/6429 as a future enhancement.

@cbravobernal commented on PR #6598:


3 months ago
#27

I tested writing custom HTML in a post with a contributor role and:

https://github.com/WordPress/wordpress-develop/assets/37012961/e0a63d78-4fce-491b-b078-07e0017eb9f5

  • In trunk, the data-wp-on--click is not saved on the database.

https://github.com/WordPress/wordpress-develop/assets/37012961/afac21b8-afbf-4dcf-b037-7a80b36dbf8d

  • Within this branch, data-wp--on--click is saved on the database.

https://github.com/WordPress/wordpress-develop/assets/37012961/43436ef1-b3e2-4bb2-a397-4063a0a02f90

This ticket was mentioned in PR #6598 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#28

Trac ticket: Core-61052

Allow data attrbitues with leading, trailing, and double - in the attribute name.

This relaxes the hyphen restriction introduced in https://core.trac.wordpress.org/changeset/43981. It is intended to be a simple change to allow data attributes used frequently by the Interactivity API like data-wp-on--click="…".

It is _not_ a substitute for https://github.com/WordPress/wordpress-develop/pull/6429. I'd like to see that move ahead the data attribute restrictions further relaxed to align with what's allowed by the specification. This _is_ intended to be a smaller and easier change to solve the immediate need that can be landed without too much difficulty.

#29 @dmsnell
3 months ago

  • Milestone changed from 6.7 to 6.6

#30 @dmsnell
3 months ago

  • Owner set to dmsnell
  • Resolution set to fixed
  • Status changed from new to closed

In 58294:

KSES: Allow leading trailing double hyphen in data attributes

Expand allowable set of custom data attribute names to include those containing
leading, trailing, and double - characters. Previously, WordPress was
removing data attributes that are used in the Interactivity API. By allowing
these additional custom data attributes, the related Interactivity API
directives will preserve through kses.

For example, the Interactivity API frequently relies on custom data attributes
such as data-wp-on--click="...". The change in [43981] would strip these out
of the processed HTML, however.

Developed in https://github.com/WordPress/wordpress-develop/pull/6598
Discussed in https://core.trac.wordpress.org/ticket/61052

Props cbravobernal, dmsnell, gziolo, jonsurrell.
Follow-up to [43981].
Fixes #61052.

This ticket was mentioned in PR #6598 on WordPress/wordpress-develop by @jonsurrell.


3 months ago
#32

Trac ticket: Core-61052

Allow data attrbitues with leading, trailing, and double - in the attribute name.

This relaxes the hyphen restriction introduced in https://core.trac.wordpress.org/changeset/43981. It is intended to be a simple change to allow data attributes used frequently by the Interactivity API like data-wp-on--click="…".

It is _not_ a substitute for https://github.com/WordPress/wordpress-develop/pull/6429. I'd like to see that move ahead the data attribute restrictions further relaxed to align with what's allowed by the specification. This _is_ intended to be a smaller and easier change to solve the immediate need that can be landed without too much difficulty.

#33 @jonsurrell
3 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

This ticket was mentioned in Slack in #core by nhrrob. View the logs.


3 months ago

#35 @hellofromTonya
3 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

The release is currently in commit freeze with 6.6 RC1 happening at the top of the hour.

The light weight implementation for this enhancement was committed in [58294] using PR 6598. As @peterwilsoncc noted:

I've been considering this over the weekend and my strong inclination is to go with the source code changes in PR 6598 to allow for the lightest touch changes to land in core.

It's too late to consider the follow-up work in PR 6429 in the 6.6 cycle. As discussed today in the bug scrub, this enhancement needs its own ticket to continue in the 6.7 cycle.

@jonsurrell @dmsnell would you mind opening a new ticket please?

I'm closing this ticket as resolved. If bugs are reported during 6.6 RC, those could be fixed in their bug report tickets.

Also @dmsnell @peterwilsoncc @jonsurrell is [58294] okay to stay in 6.6 given it's being referred to as a "partial fix"? Or should it be reverted with this ticket repurposed for 6.7?

#36 @jonsurrell
3 months ago

I've created #61501 as a follow-up ticket to consider PR#6429.

Is [58294] okay to stay in 6.6 given it's being referred to as a "partial fix"? Or should it be reverted with this ticket repurposed for 6.7?

Yes, that is an improvement on its own and should stay in 6.6.

@jonsurrell commented on PR #6429:


3 months ago
#37

We should update the trac ticket for this to https://core.trac.wordpress.org/ticket/61501. This would be an enhancement targeting the 6.7 release.

Note: See TracTickets for help on using tickets.