Make WordPress Core

Opened 23 months ago

Closed 23 months ago

Last modified 22 months ago

#56504 closed defect (bug) (duplicate)

`sanitize_html_class()` is both too restrictive, and too permissive so it may return an invalid class name

Reported by: anrghg's profile anrghg Owned by:
Milestone: Priority: normal
Severity: normal Version:
Component: General Keywords: close needs-dev-note
Focuses: Cc:

Description

sanitize_html_class() returns invalid class when arguments start with a digit 0-9, or a hyphen followed by a digit 0-9, per CSS spec (https://www.w3.org/TR/CSS21/syndata.html#characters). Brave/Chrome does not support these invalid classes, they do not work, so they are not “sane”, the return value is not sanitized.

At the other end, sanitize_html_class() needlessly degrades class names containing or made of non-ASCII Unicode, from no-break space and above, accented letters and emoji are allowed in class names (and IDs) and work perfectly, provided of course they are not URL-encoded, but they may be backslash escaped, e.g. UTF-8. Best is to have them in plain Unicode, per https://www.w3.org/International/questions/qa-escapes

A sanitizing function conforming to the spec and providing a better user experience could be coded for example like so:

function anrghg_sanitize_html_id_class( $p_s_string, $p_s_prefix = '_', $p_b_decode = true ) {
	if ( preg_match( '/[0-9]/', $p_s_string[0] )
		||
		( preg_match( '/-/', $p_s_string[0] ) && preg_match( '/[-0-9]/', $p_s_string[1] ) )
	) {
		$p_s_string = $p_s_prefix . $p_s_string;
	}
	if ( $p_b_decode ) {
		$p_s_string = urldecode( $p_s_string );
	} else {
		$p_s_string = preg_replace( '/%[0-9A-Fa-f]{2}/', '', $p_s_string );
	}
	$p_s_string = preg_replace( '/((?<!\\\\[0-9A-Fa-f]{2})\s|(?<!\\\\)[%^{}~@`\'"&#$()+[\]|\/*<>=?;:!,.])/', '', $p_s_string );
	return $p_s_string;
}

Prepending an underscore probably provides a better UX than escaping the first digit.

The apply_filters() is skipped for brevity. We can use this filter to override default processing, but the issue is not so much about customization, rather about conformance to the CSS specification.

Change History (7)

#1 follow-up: @joyously
23 months ago

I discovered the leading digit part of this in my testing of my theme, which adds body classes using page slugs.
Since my theme has options that are class names, changing the sanitizing function can cause user input to not match, so any change to this function needs a prominent Dev Note or some way to be backward compatible.

#2 in reply to: ↑ 1 @anrghg
23 months ago

Replying to joyously:

I discovered the leading digit part of this in my testing of my theme, which adds body classes using page slugs.
Since my theme has options that are class names, changing the sanitizing function can cause user input to not match, so any change to this function needs a prominent Dev Note or some way to be backward compatible.

Backcompat could be ensured by changing the function to:

function sanitize_html_class(
	string $class,
	string $fallback = '',
	bool $css_compliant = false,
	string $prefix = '_',
	bool $decode = true
) {
	if ( $css_compliant ) {

		// Do the right thing.

	} else {

		// Legacy behavior.

	}
	return apply_filters( 'sanitize_html_class', $sanitized, $class, $fallback );
}

However, with your themes fixing the leading digit issue, users inputting classes with a leading digit or hyphen-digit only need to know the string that their classes will be prefixed with.

The proposed function really supports user input. Where unpredictability comes in awkwardly is when copy-pasting a spec-conformant slug (supporting all Unicode in the page title) after selecting the slug in the URL bar so it is plain Unicode.

The legacy sanitize_html_class() leaves only bare-bone ASCII, and if there is no ASCII in the slug e.g. due to being written in a Non-Latin script, it falls back on an empty string.

Yet another pitfall in the legacy sanitize_html_class() is that it may return uppercase ASCII instead of converting it to lowercase while it is on it.

Uppercase in URL slugs may lead to 404 errors depending on [whether the server supports it or not](https://stackoverflow.com/questions/7996919/should-url-be-case-sensitive). Generally it’s so unsafe that A-Z should be banned from slugs. The Block Editor does so. I came across the issue while using my plugin and [hinted it in a comment](https://github.com/ampproject/amp-wp/issues/7238#issuecomment-1236226182).

Indeed a prominent Dev Note should be part of the solution. It should not only document that sanitize_html_class() has been fixed to *optionally* comply to the CSS spec.

The Dev Note should also promote best practice of adding not only a single class but three classes to the body element:

  1. id-1234 to complete the default postid-1234 OR page-id-1234 (with an extra hyphen for readability…);
  2. The legacy plain ASCII class but sanitized so in non-Latin scripts there will be no “-1”, “-2”, “-3” class names any longer, either;
  3. The CSS conformant plain Unicode class, with only ASCII symbols and punctuation removed except the hyphen and underscore and the backslash escaping numeric references plus the space delimiting them! Its really very tricky.

As a bonus, the legacy not-really-sanitized class could also be added, but I do not recommend invalid CSS any longer.

It is not up to the developers to complete sanitization after using a function designed and supposed to do the job.

Nor should CSS be pulled down to bare-bone ASCII, at the expense of Non-Latin scripts, emoji, and UX.

#3 follow-up: @peterwilsoncc
23 months ago

  • Keywords close added

The purpose of the sanitization and escaping functions in WordPress is to make code safe rather than validate it. In this case, sanitize_html_class()'s goal to to ensure the class name doesn't include characters that will allow a maliciously crafted class name to run JavaScript or alter the HTML:

<?php
$class_name = '300"; onload="/* some JavaScript */';
?>
<div class="<?php echo sanitize_html_class( $class_name ); ?>"></div>

Without sanitization the above code would execute the malicious code, with sanitization it is safe. See the example at https://3v4l.org/RRqr1

It is possible to use a number within CSS by properly escaping it, for example class="300" can be styled by using .\33 00 {}.

As classes are case sensitive, it would be problematic to convert them to lower case in this function. To cover the side note about title, sanitize_title() is applied to slugs and converts characters to lowercase:

wp> sanitize_title( 'This Is My Title Case Title' );
string(27) "this-is-my-title-case-title"

My inclination is to close this issue as sanitize_html_class() is achieving it's goal of making the class names safe.

#4 in reply to: ↑ 3 @anrghg
23 months ago

  • Keywords changes-requested needs-testing needs-dev-note needs-I18N-review added

Replying to peterwilsoncc:

It is possible to use a number within CSS by properly escaping it, for example class="300" can be styled by using .\33 00 {}.

Good point, thank you. .\00003300 {} would be an alternative working better in code highlighters. But both have the downside of getting class names look different in HTML than in CSS.

As classes are case sensitive, it would be problematic to convert them to lower case in this function.

Indeed I missed this point. The argument is likely to be already lowercase anyway.

To cover the side note about title, sanitize_title() is applied to slugs and converts characters to lowercase:

I know at least one website (Quora) that uses titlecase in slugs. It looks much better. But it seems to require support from the server; I got randomly 404 errors (no error for ‘Titlecase’, but errors for other slugs). As I’m filtering sanitize_title() to allow titlecase in fragment identifiers, I’ll need to add an opt-in to apply it to slugs.

The purpose of the sanitization and escaping functions in WordPress is to make code safe rather than validate it.

WordPress is preferring a non-conformant way, throwing the baby with the bathwater.

Since page slugs are used as class names, all scripts should be equal: Latin, Greek, Cyrillic, all 160 (number growing) Non-Latin scripts already supported by Unicode.

My inclination is to close this issue as sanitize_html_class() is achieving it's goal of making the class names safe.

At the expense of usability, equity, internationalization and localization.

Please do not close this issue without fixing it, e.g. by positive deletion like so ([Per spec](https://www.w3.org/International/questions/qa-escapes#css_other) we can even use all these symbols and punctuation provided they are backslash-escaped. This too prevents malicious code from running.)

$class = preg_replace( '/(?<!\\\\)[%^{}~@`\'"&#$()+[\]|\/*<>=?;:!,.]/', '', $class );

Don’t you think that this achieves sanitize_html_class()’s goal in a better way?

Last edited 23 months ago by anrghg (previous) (diff)

#5 follow-up: @peterwilsoncc
23 months ago

Since page slugs are used as class names, all scripts should be equal: Latin, Greek, Cyrillic, all 160 (number growing) Non-Latin scripts already supported by Unicode.

I do agree that the function ought to be more permissive for valid characters, there's an existing ticket for that #33924 which I've commented on. There are some backward compatibility concerns that never got resolved.

It's the validation side of this ticket that I wish to avoid. In part because CSS is more permissive than it once was; in part because spec changes could lead to further tickets like this in the future.

we can even use all these symbols and punctuation provided they are backslash-escaped. This too prevents malicious code from running.)

Are you happy to continue this discussion on #33924 and close this ticket as a duplicate?

Raising the issue of non-latin alphabets is an excellent point. If you post it to the original ticket, it will ensure you get props for contributing to the discussion.

#6 in reply to: ↑ 5 @anrghg
23 months ago

  • Keywords changes-requested needs-testing needs-I18N-review removed
  • Resolution set to duplicate
  • Status changed from new to closed

Replying to peterwilsoncc:

all scripts should be equal

I do agree that the function ought to be more permissive for valid characters, there's an existing ticket for that #33924

Thank you! I wasn’t aware.

CSS is more permissive than it once was;

Well done.

Are you happy to continue this discussion on #33924 and close this ticket as a duplicate?

Indeed as a duplicate it needs to be closed. I’ll replicate the potentially useful stuff on #33924.

Raising the issue of non-latin alphabets is an excellent point. If you post it to the original ticket, it will ensure you get props for contributing to the discussion.

Thank you very much. I’m happy that the issue of Non-Latin is worked on, and I’ll try to contribute.

Last edited 22 months ago by anrghg (previous) (diff)

#7 @peterwilsoncc
23 months ago

  • Milestone Awaiting Review deleted
Note: See TracTickets for help on using tickets.