Make WordPress Core

Opened 9 years ago

Last modified 2 years ago

#33924 new defect (bug)

sanitize_html_class valid characters

Reported by: m-e-h's profile m-e-h Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.4
Component: Formatting Keywords: has-patch 2nd-opinion has-unit-tests
Focuses: Cc:

Description

sanitize_html_class excludes some increasingly common valid html characters.
In particular the @ character.
The use of @ may not be extremely common for class names but it is being encouraged by some pretty renowned folks in the area of class naming conventions. http://csswizardry.com/2015/08/bemit-taking-the-bem-naming-convention-a-step-further/#responsive-suffixes

Actually pretty much anything is now valid for html classes except for spaces or tabs. I also use the / quite a bit in my classes but I thought I'd start with the @ .

Attachments (4)

33924.patch (480 bytes) - added by juliobox 9 years ago.
33924.diff (719 bytes) - added by peterwilsoncc 9 years ago.
33924.2.patch (1.6 KB) - added by juliobox 9 years ago.
33924.3.diff (2.6 KB) - added by peterwilsoncc 9 years ago.

Download all attachments as: .zip

Change History (22)

#1 @SergeyBiryukov
9 years ago

  • Keywords needs-patch added

#2 @wonderboymusic
9 years ago

  • Milestone changed from Awaiting Review to Future Release

@juliobox
9 years ago

#3 @juliobox
9 years ago

  • Keywords has-patch needs-testing added; needs-patch removed

#4 @peterwilsoncc
9 years ago

If the sanitize_html_class regex is to be altered, I'd like to see it expanded beyond just adding an @. My approach would be to accept anything that:

  • is a one-byte character
  • does not require escaping in CSS or document.querySelector,
  • has wide browser support

Some background from Mathias Bynens:

The following characters have a special meaning in CSS: !, ", #, $, %, &, ', (, ), *, +, ,, -, ., /, :, ;, <, =, >, ?, @, [, \, ], ^, `, {, |, }, and ~.

Any characters matching [\t\n\v\f\r] need to be escaped based on their Unicode code points. The space character ( ) can simply be backslashed (\ ). Other whitespace characters don’t need to be escaped.

Other than that, characters that can’t possibly convey any meaning in CSS (e.g. ) can and should just be used unescaped.

I'll put together some reduced test cases and upload.

@peterwilsoncc
9 years ago

#5 @peterwilsoncc
9 years ago

The more I consider my comment above, the more I think over-sanitization is unnecessary and possibly counter productive. Both HTML and CSS are happy with an emoji class name, preventing this will make the function less usable.

Removing the CSS special characters seems more than enough.

$pattern = '/[\\\\#%&\',-\/:;<=>@`~\^\$\.\!\[\]\|\{\}\(\)\?\*\+"\s]/';
$sanitized_html_class = preg_replace( $pattern, "", $unsanitized_html_class );

Per 33924.diff.

#6 follow-up: @dd32
9 years ago

'/[\\\\#%&\',-\/:;<=>@`~\^\$\.\!\[\]\|\{\}\(\)\?\*\+"\s]/'

FYI this is over-escaped, it's not causing an issue, but it makes it harder to read.
This is more readable and should match the same:
'/[\\\\#%&\',:;<=>@`~^$.!|{}()?*+"\s\[\]\/-]/'

Most characters don't need to be escaped within character classes, the exception being the character class brackets [] and the regular expression delimiter in use /, and personally I always put - at the end simply so it can't confuse the PCRE library (,-\/ is fine, but for example, :-; isn't, unless escaped)

#7 in reply to: ↑ 6 @peterwilsoncc
9 years ago

Replying to dd32:

FYI this is over-escaped, it's not causing an issue, but it makes it harder to read.
This is more readable and should match the same:
'/[\\\\#%&\',:;<=>@`~^$.!|{}()?*+"\s\[\]\/-]/'

Thanks Dion, you also highlight that I blindly copied over the - character from the source. It does not require escaping so 33924.diff does not work. I'll do some further research and upload another file for the accept all valid characters option.

This ticket was mentioned in Slack in #core by peterwilsoncc. View the logs.


9 years ago

#9 @juliobox
9 years ago

  • Keywords 2nd-opinion added

Read https://wordpress.slack.com/archives/core/p1447803525000771
and https://wordpress.slack.com/archives/core/p1447803777000775

The new patch adds a 3rd parameter named "$strict = false", when set to "true", the new code from peter will be used, else, the older one (including the new @) is used.

This is made to have a better backcompat when scripts/plugins are using "#foo:Bar" to obtain "fooBar".

@juliobox
9 years ago

#10 @peterwilsoncc
9 years ago

  • Keywords needs-refresh added; needs-testing 2nd-opinion removed

As 33924.2.patch includes the - error I introduced in the first patch, labelling needs refresh.

I do like the $strict parameter as a technique to allow for developers intending the function to as described in comment:9.

#11 @peterwilsoncc
9 years ago

  • Keywords 2nd-opinion has-unit-tests added; needs-refresh removed

Refreshed in 33924.3.diff:

  • adds unit tests for sanitize_html_class
  • add strict/permissive mode for the function (defaults to strict)
  • in strict mode the output is unchanged
  • in permissive mode the function removes white space and defers to esc_attr

Permissive mode is compatible with the HTML5 spec. It allows for characters that need escaping in CSS.

#12 follow-up: @jamesrandom
9 years ago

I think permissive mode should allow spaces. This enables multiple classes to be applied. (I have just run into this as a problem with a plugin.)

#13 in reply to: ↑ 12 @peterwilsoncc
9 years ago

Replying to jamesrandom:

I think permissive mode should allow spaces. This enables multiple classes to be applied.

That would turn the function into an esc_attr alias with a fallback. As the function name uses the singular - html class - and the way it is applied in post_class and body_class, I'm -1 on allowing multiple classes.

#14 @SergeyBiryukov
8 years ago

Stumbled upon this while working on #39493.

CSS classes support Unicode characters, so sanitize_html_class() should support them as well in order for multibyte post slugs to be used as body classes.

#15 @peterwilsoncc
4 years ago

#44988 was marked as a duplicate.

#16 @peterwilsoncc
4 years ago

From the discussion on #44988:

@ayeshrajans pointed out the W3 spec does say that class names cannot start with a digit, single or double hyphens. https://www.w3.org/TR/CSS21/syndata.html#characters

@ChiefAlchemist also suggested an additional parameter for strict and permissive mode

@davidwebca mentioned with libraries such as tailwind using characters outside of the permitted set, this has potential to become a bigger issue.

#17 @anrghg
2 years ago

#56504 was marked as a duplicate.

#18 @anrghg
2 years ago

Thank you for supporting Non-Latin scripts so everybody gets the same opportunity of using the slug as a class.

As @peterwilsoncc wrote in #56504 — my apologies for opening a duplicate — today:

Raising the issue of non-latin alphabets is an excellent point.
I do agree that the function ought to be more permissive for valid characters

Since page slugs are used as class names, all scripts should be equal: Latin, Greek, Cyrillic, all 160 (number growing) Non-Latin scripts already supported by Unicode. (Plus non-ASCII Latin, since for slugs, Latin-script users can choose between simplified Latin (remove accents) and real Latin.)

Currently, sanitize_html_class() provides security at the expense of usability, equity, internationalization and localization. By deleting all non-ASCII characters along with the non-alphanumeric ASCII (except hyphen, underscore), WordPress is throwing the baby with the bathwater.

That behavior surely breaks WordPress’ internationalization and inclusion policies.

Test example of added body classes based on a page slug /χαιρε-εν-αμπ/ (slashes for clarity, Greek transliteration intentional, quoted from https://anrghg.sunsite.fr/test-amp-compat/64-characters-%e2%96%b6-css-allows-all-non-ascii-%f0%9f%98%91/#1129-id-1164-%cf%87%ce%b1%ce%b9%cf%81%ce%b5-%ce%b5%ce%bd-%ce%b1%ce%bc%cf%80):

  • CSS spec conformant or permissive (simplified markup):
<body class="id-1164 χαιρε-εν-αμπ">
  • Legacy aka strict:
<body class="id-1164 --">

These examples are made up to demonstrate levels of usability. In real life, all three classes are added together, like in the [view-source:https://anrghg.sunsite.fr/test-amp-compat/%cf%87%ce%b1%ce%b9%cf%81%ce%b5-%ce%b5%ce%bd-%ce%b1%ce%bc%cf%80/ full source]:

<body class="page-template-default page page-id-1164 logged-in wp-embed-responsive id-1164 _-- χαιρε-εν-αμπ">

On a side note: The double-hyphen class is invalid CSS so it has a (configurable) underscore prepended as a more intuitive alternative to escaping the second hyphen in CSS: .-\2D. The goal is maximum intuitivity and usability for users adding Custom CSS and to avoid screwing things up.

In Ukrainian, the equivalent CSS spec conformant or permissive class would be (courtesy Google Translate):

<body class="id-1164 ласкаво-просимо-до-amp">

The derived legacy aka strict class would not be very specific:

<body class="id-1164 _---amp">

Using the built-in (and screwed-up — it requires picking the correct prefix among postid- and page-id-) post ID selector is currently still the only option for Non-Latin users. Using the convenient slug selector is currently still a privilege of Latin script users.

Thank you to everyone for striving to lift that limitation.

Last edited 2 years ago by anrghg (previous) (diff)
Note: See TracTickets for help on using tickets.