Make WordPress Core

Opened 4 months ago

Last modified 3 weeks ago

#63724 assigned enhancement

HTML API: Reliably parse HTML attributes in `wp_kses_hair()`

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 7.0 Priority: normal
Severity: normal Version: trunk
Component: HTML API Keywords: has-patch has-unit-tests needs-refresh
Focuses: Cc:

Description

wp_kses_hair() attempts to parse HTML attributes given the span of text inside an HTML tag, but excluding the tag name, opening <, and closing >. For example:

<?php
$attrs = wp_kses_hair( ' class="description"', wp_allowed_protocols() );
$attrs === array(
        'class' => array(
                'name'  => 'class',
                'value' => 'description',
                'whole' => 'class="description"',
                'vless' => 'n',
        )
);

While this has served WordPress for years, there are fundamental issues in the parsing model; namely, that it isn’t spec-compliant with the HTML5 living standard. Categories of legitimate attribute values are rejected and overlooked, while certain values are misinterpreted as something other than what they are.

Ideally, there would be no need for this function:

  • Passing in the string of text inside the HTML tags covered by the attributes is awkward and carries forward parsing errors in determining what that string is.
  • The output format does not decode attribute values, which passes along confusion about whether content is escaped or not.
  • The HTML API provides more efficient and reliable tools to work directly with HTML and not have to split it and pass around sub-spans of whole HTML tokens.

However, use of the function is pervasive in the plugin space and so the function must remain.

WordPress should lean on the reliability that the HTML API affords to properly parse attributes inside of wp_kses_hair() as part of a push away from ad-hoc HTML parsing and towards reliance on the HTML API.

Change History (10)

This ticket was mentioned in PR #9248 on WordPress/wordpress-develop by @dmsnell.


4 months ago
#1

  • Keywords has-patch added

Trac ticket: Core-63724

Replaces #7407, dmsnell#5
Coordination in #9256

wp_kses_hair() is built around an impressive state machine for parsing the $attr of an HTML tag, that is, the span of text after the tag name and before the closing >. Unfortunately, that parsing code doesn’t fully-implement the HTML specification and may be prone to mis-parsing.

This patch replaces the existing state machine with a straight-forward use of the HTML API to parse the attributes for us, constructing a shell take for the $attr string and reading the attributes structurally. This shell is necessary because a previous stage of the pipeline has already separated what it thinks is the so-called “attribute list” from a tag.

## Dependencies

#2 @jorbin
4 months ago

  • Keywords needs-unit-tests added

With the noted pervasiveness of this function by plugins, I think that this needs some unit tests that can show the behavior isn't changing, or if it does change it is minimally done in documented and understandable ways which can be messaged.

#3 @dmsnell
4 months ago

Thanks @jorbin — I’m slow, but I plan on adding tests that assert the mapping between HTML attributes and the kinds of outputs that wp_kses_hair() produces.

This ticket will probably be a good proving ground for these kinds of changes, as there is a long list of core functionality that I would like to see us update over the coming years. Almost every one of them will involve behavioral changes, but I believe that leaving the kinds of breakages in place is not the best way forward.

At least in the ideal cases the behaviors are preserved. That is, the intent matches. I don’t know how much belongs in a test to show those changes, but I know some of those tests will lose relevance with time. For example, if wp_kses_hair() misparses an HTML attribute but WordPress now properly understands that attribute, do you have suggestions on how you would like to see that?

At the extreme end I know we don’t document previous bugs, but as you say, this has been long established. Still, in some cases, logging notable changes is also broadcasting some sensitive vulnerabilities that remain in other parts of Core.

This ticket was mentioned in Slack in #core by welcher. View the logs.


6 weeks ago

#5 @welcher
6 weeks ago

@dmsnell this was mentioned in the scrub today. Do you think you'll be able to pick this up for 6.9?

This ticket was mentioned in Slack in #core by benjamin_zekavica. View the logs.


5 weeks ago

This ticket was mentioned in PR #10142 on WordPress/wordpress-develop by @jignesh.nakrani.


5 weeks ago
#7

  • Keywords has-unit-tests added; needs-unit-tests removed

#8 @westonruter
3 weeks ago

  • Owner set to dmsnell
  • Status changed from new to assigned

#9 @wildworks
3 weeks ago

  • Keywords needs-refresh added
  • Milestone changed from 6.9 to 7.0

The 6.9 Beta1 release is coming soon, so I'd like to punt this ticket to 7.0. Also, both PRs have unit test failures that require updates.

#10 @dmsnell
3 weeks ago

@wildworks @welcher hard to tell. I can give an update within the next few hours hopefully

Note: See TracTickets for help on using tickets.