Opened 9 years ago
Last modified 2 years ago
#33924 new defect (bug)
sanitize_html_class valid characters
Reported by: | m-e-h | Owned by: | |
---|---|---|---|
Milestone: | Future Release | Priority: | normal |
Severity: | normal | Version: | 4.4 |
Component: | Formatting | Keywords: | has-patch 2nd-opinion has-unit-tests |
Focuses: | Cc: |
Description
sanitize_html_class
excludes some increasingly common valid html characters.
In particular the @
character.
The use of @
may not be extremely common for class names but it is being encouraged by some pretty renowned folks in the area of class naming conventions. http://csswizardry.com/2015/08/bemit-taking-the-bem-naming-convention-a-step-further/#responsive-suffixes
Actually pretty much anything is now valid for html classes except for spaces or tabs. I also use the /
quite a bit in my classes but I thought I'd start with the @
.
Attachments (4)
Change History (22)
#5
@
9 years ago
The more I consider my comment above, the more I think over-sanitization is unnecessary and possibly counter productive. Both HTML and CSS are happy with an emoji class name, preventing this will make the function less usable.
Removing the CSS special characters seems more than enough.
$pattern = '/[\\\\#%&\',-\/:;<=>@`~\^\$\.\!\[\]\|\{\}\(\)\?\*\+"\s]/';
$sanitized_html_class = preg_replace( $pattern, "", $unsanitized_html_class );
Per 33924.diff.
#6
follow-up:
↓ 7
@
9 years ago
'/[\\\\#%&\',-\/:;<=>@`~\^\$\.\!\[\]\|\{\}\(\)\?\*\+"\s]/'
FYI this is over-escaped, it's not causing an issue, but it makes it harder to read.
This is more readable and should match the same:
'/[\\\\#%&\',:;<=>@`~^$.!|{}()?*+"\s\[\]\/-]/'
Most characters don't need to be escaped within character classes, the exception being the character class brackets []
and the regular expression delimiter in use /
, and personally I always put -
at the end simply so it can't confuse the PCRE library (,-\/
is fine, but for example, :-;
isn't, unless escaped)
#7
in reply to:
↑ 6
@
9 years ago
Replying to dd32:
FYI this is over-escaped, it's not causing an issue, but it makes it harder to read.
This is more readable and should match the same:
'/[\\\\#%&\',:;<=>@`~^$.!|{}()?*+"\s\[\]\/-]/'
Thanks Dion, you also highlight that I blindly copied over the -
character from the source. It does not require escaping so 33924.diff does not work. I'll do some further research and upload another file for the accept all valid characters option.
This ticket was mentioned in Slack in #core by peterwilsoncc. View the logs.
9 years ago
#9
@
9 years ago
- Keywords 2nd-opinion added
Read https://wordpress.slack.com/archives/core/p1447803525000771
and https://wordpress.slack.com/archives/core/p1447803777000775
The new patch adds a 3rd parameter named "$strict = false", when set to "true", the new code from peter will be used, else, the older one (including the new @) is used.
This is made to have a better backcompat when scripts/plugins are using "#foo:Bar" to obtain "fooBar".
#10
@
9 years ago
- Keywords needs-refresh added; needs-testing 2nd-opinion removed
As 33924.2.patch includes the -
error I introduced in the first patch, labelling needs refresh.
I do like the $strict
parameter as a technique to allow for developers intending the function to as described in comment:9.
#11
@
9 years ago
- Keywords 2nd-opinion has-unit-tests added; needs-refresh removed
Refreshed in 33924.3.diff:
- adds unit tests for
sanitize_html_class
- add strict/permissive mode for the function (defaults to strict)
- in strict mode the output is unchanged
- in permissive mode the function removes white space and defers to
esc_attr
Permissive mode is compatible with the HTML5 spec. It allows for characters that need escaping in CSS.
#12
follow-up:
↓ 13
@
9 years ago
I think permissive mode should allow spaces. This enables multiple classes to be applied. (I have just run into this as a problem with a plugin.)
#13
in reply to:
↑ 12
@
9 years ago
Replying to jamesrandom:
I think permissive mode should allow spaces. This enables multiple classes to be applied.
That would turn the function into an esc_attr
alias with a fallback. As the function name uses the singular - html class - and the way it is applied in post_class
and body_class
, I'm -1 on allowing multiple classes.
#14
@
8 years ago
Stumbled upon this while working on #39493.
CSS classes support Unicode characters, so sanitize_html_class()
should support them as well in order for multibyte post slugs to be used as body classes.
#16
@
4 years ago
From the discussion on #44988:
@ayeshrajans pointed out the W3 spec does say that class names cannot start with a digit, single or double hyphens. https://www.w3.org/TR/CSS21/syndata.html#characters
@ChiefAlchemist also suggested an additional parameter for strict and permissive mode
@davidwebca mentioned with libraries such as tailwind using characters outside of the permitted set, this has potential to become a bigger issue.
#18
@
2 years ago
Thank you for supporting Non-Latin scripts so everybody gets the same opportunity of using the slug as a class.
As @peterwilsoncc wrote in #56504 — my apologies for opening a duplicate — today:
Raising the issue of non-latin alphabets is an excellent point.
I do agree that the function ought to be more permissive for valid characters
Since page slugs are used as class names, all scripts should be equal: Latin, Greek, Cyrillic, all 160 (number growing) Non-Latin scripts already supported by Unicode. (Plus non-ASCII Latin, since for slugs, Latin-script users can choose between simplified Latin (remove accents) and real Latin.)
Currently, sanitize_html_class()
provides security at the expense of usability, equity, internationalization and localization. By deleting all non-ASCII characters along with the non-alphanumeric ASCII (except hyphen, underscore), WordPress is throwing the baby with the bathwater.
That behavior surely breaks WordPress’ internationalization and inclusion policies.
Test example of added body classes based on a page slug /χαιρε-εν-αμπ/
(slashes for clarity, Greek transliteration intentional, quoted from https://anrghg.sunsite.fr/test-amp-compat/64-characters-%e2%96%b6-css-allows-all-non-ascii-%f0%9f%98%91/#1129-id-1164-%cf%87%ce%b1%ce%b9%cf%81%ce%b5-%ce%b5%ce%bd-%ce%b1%ce%bc%cf%80):
- CSS spec conformant or permissive (simplified markup):
<body class="id-1164 χαιρε-εν-αμπ">
- Legacy aka strict:
<body class="id-1164 --">
These examples are made up to demonstrate levels of usability. In real life, all three classes are added together, like in the [view-source:https://anrghg.sunsite.fr/test-amp-compat/%cf%87%ce%b1%ce%b9%cf%81%ce%b5-%ce%b5%ce%bd-%ce%b1%ce%bc%cf%80/ full source]:
<body class="page-template-default page page-id-1164 logged-in wp-embed-responsive id-1164 _-- χαιρε-εν-αμπ">
On a side note: The double-hyphen class is invalid CSS so it has a (configurable) underscore prepended as a more intuitive alternative to escaping the second hyphen in CSS: .-\2D
. The goal is maximum intuitivity and usability for users adding Custom CSS and to avoid screwing things up.
In Ukrainian, the equivalent CSS spec conformant or permissive class would be (courtesy Google Translate):
<body class="id-1164 ласкаво-просимо-до-amp">
The derived legacy aka strict class would not be very specific:
<body class="id-1164 _---amp">
Using the built-in (and screwed-up — it requires picking the correct prefix among postid-
and page-id-
) post ID selector is currently still the only option for Non-Latin users. Using the convenient slug selector is currently still a privilege of Latin script users.
Thank you to everyone for striving to lift that limitation.
If the
sanitize_html_class
regex is to be altered, I'd like to see it expanded beyond just adding an@
. My approach would be to accept anything that:document.querySelector
,Some background from Mathias Bynens:
I'll put together some reduced test cases and upload.