Make WordPress Core

Opened 8 months ago

Closed 5 months ago

Last modified 4 months ago

#61531 closed enhancement (fixed)

HTML API: Audit class name methods for consistency and correctness

Reported by: jonsurrell's profile jonsurrell Owned by:
Milestone: 6.7 Priority: normal
Severity: normal Version: 6.5
Component: HTML API Keywords: has-patch has-unit-tests
Focuses: Cc:

Description (last modified by jonsurrell)

The Tag Processor CSS class name methods class_list, add_class, remove_class, and has_class. These should behave consistently regarding case-sensistivity.

These methods are intended to align with CSS class selectors. CSS class selector matching behavior is complicated and may depend on the document. This makes it difficult to determine the correct behavior for these methods.

At the moment, class_list yields ASCII lower-cased class names, but remove_class and add_class match case sensitive class names.

Attachments (2)

quirks-mode.png (232.4 KB) - added by dmsnell 8 months ago.
Illustration of CSS selector behavior in quirks mode.
no-quirks-mode.png (222.7 KB) - added by dmsnell 8 months ago.
Illustration of CSS selector behavior in no-quirks mode.

Download all attachments as: .zip

Change History (26)

This ticket was mentioned in PR #6931 on WordPress/wordpress-develop by @jonsurrell.


8 months ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket: https://core.trac.wordpress.org/ticket/61531

The Tag Processor CSS class name methods class_list, add_class, and remove_class should be consistent regarding case sensitivity.

These methods are intended to align with CSS class names, meaning that matching should be done ASCII case-insensitive. class_list already yields lower case unique class names, but remove_class and add_class do not have similar behavior of treating case-insensitive matching classes as equal.

  • add_class should only add classes that are not already present (compared ASCII case-insensitive).
  • remove_class should remove all matching classes (compared ASCII case-insensitive).

This was discussed with @dmsnell on Slack here: https://wordpress.slack.com/archives/C05NFB818PQ/p1719403633636769

Related to #61520 which documents the lower-casing behavior of class_list.

#2 @jonsurrell
8 months ago

  • Description modified (diff)
  • Milestone changed from 6.7 to Future Release
  • Owner jonsurrell deleted
  • Summary changed from HTML API: Tag processor class name methods should behave consistently with case sensitivity to HTML API: Audit class name methods for consistency and correctness
  • Type changed from defect (bug) to enhancement

After further review, it's difficult to determine exactly the correct behavior of these class methods should be. I've updated the summary to talk about auditing the behavior and I've changed the milestone to "future release."

@jonsurrell commented on PR #6931:


8 months ago
#3

This doesn't seem like the right approach, it may not align with browser behavior.

@dmsnell
8 months ago

Illustration of CSS selector behavior in quirks mode.

@dmsnell
8 months ago

Illustration of CSS selector behavior in no-quirks mode.

#4 @dmsnell
8 months ago

After a lengthy and extremely valuable chat with @jonsurrell and some independent testing I think we can say comfortably that the behavior in the HTML API is wrong, but that making it right is a bit more complicated than a simple bug fix would be.

In the HTML API we have to make ourselves aware of the quirks mode of the document being parsed. For normative posts it's probably safe to assume no quirks mode, as most WordPress themes output an HTML doctype that enters no-quirks mode. But, we may encounter quirks-mode documents, and the HTML Processor will likely need to communicate eventually which one a given document is.

When matching against a document which is in quirks mode, class names must be matched ASCII case-insensitively; class selectors are otherwise case-sensitive, only matching class names they are identical to.

CSS Selectors Level 4

I created the following document to illustrate the relevant semantics: in no-quirks mode, CSS class selectors match byte-for-byte with what's in the decoded class attribute; in quirks mode, they match in an ASCII-case-insensitive manner.

Remove the <!DOCTYPE html> line to test in quirks mode.

<!DOCTYPE html>
<style>
.one { border-left: 3px solid red; }
.ONE { border-top: 3px solid red; }
.right { border-right: 3px solid blue; }
.green { color: green; }
.italic { font-style: italic; }
.underline { text-decoration: underline; }
</style>

<div>
	<p class="one">One</p>
	<p class="ONE">ONE</p>
	<p class="&#x6f;ne">&amp;#x64;ne</p>
	<p class="&#79;NE">&amp;#79;NE</p>
	<p class="oNe one ONE oNe &#x6f;ne oNe &#79;NE">All of them</p>
</div>

<script>
const by = s => document.querySelectorAll( s );

by( '.one' ).forEach( node => node.classList.add('grEen') );
by( '.ONE' ).forEach( node => node.classList.add('right') );
by( '[class~="one"]' ).forEach( node => node.classList.add( 'italic' ) );
by( '[class~="ONE"]' ).forEach( node => node.classList.add( 'underline' ) );
</script>
No-Quirks Mode Quirks Mode
Illustration of CSS selector behavior in no-quirks mode. Illustration of CSS selector behavior in quirks mode.

#5 @jonsurrell
7 months ago

#61655 was marked as a duplicate.

This ticket was mentioned in PR #6933 on WordPress/wordpress-develop by @jonsurrell.


7 months ago
#6

The HTML API remove_class regarding white space opts to maintain "inter-class" whitespace.

https://github.com/WordPress/wordpress-develop/blob/5effffccd005ca3dae6631ee73b1096b8f54a861/src/wp-includes/html-api/class-wp-html-tag-processor.php#L2223-L2230

This seems reasonable and justified, but normalization seems inconsistent. Leading space is maintained, but trailing space is not.

It seems more consistent to trim leading and trailing space, and actually maintain _only_ inter-class whitespace.

Trac ticket: https://core.trac.wordpress.org/ticket/61531 ~https://core.trac.wordpress.org/ticket/61655~

#7 @dmsnell
7 months ago

In 58740:

HTML API: Remove leading whitespace after removing class names.

In part of a larger review of CSS semantics and behaviors, this patch
takes the opportunity to remove leading whitespace in an updated class
attribute after the first class in the attribute has been removed.

Previously, if the first class name had been removed, the whitespace
that formerly followed it would remain in the class attribute. This
stood in contrast to removing other class names, which removed their
associated whitespace.

There should be no semantic or functional changes in this patch, only
a slightly-large diff for modified HTML documents that looks prettier
when removing the first class name in a class attribute.

Developed in https://github.com/WordPress/wordpress-develop/pull/6933
Discussed in https://core.trac.wordpress.org/ticket/61531

Props dmsnell, jonsurrell.
See #61531.

This ticket was mentioned in PR #7169 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#9

Update that HTML Processor and Tag Processor to handle CSS classes in a case-sensitive way by default.
This aligns with "no-quirks" or "standards" mode behavior for class name handling.

Remove forced lowercasing in ::class_list.

When the HTML Processor is in no-quirks mode, add_class, remove_class and has_class operate in a _case-sensitive_ way:

$processor = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><div class="FOO">' );
$processor->next_tag( 'DIV' );
var_dump( $processor->has_class('foo') );
// bool(false)
var_dump( $processor->has_class('FOO') );
// bool(true)

When the HTML Processor is in quirks mode (only available via the fragment parser at the moment), add_class, remove_class and has_class operate in a _case-sensitive_ way:

$processor = WP_HTML_Processor::create_fragment('<div class="FOO">', '<body>', 'UTF-8', 'quirks-mode');
$processor->next_tag( 'DIV' );
var_dump( $processor->has_class('foo') );
// bool(true)
var_dump( $processor->has_class('FOO') );
// bool(true)

add_class and remove_class get similar treatment. Case-insensitive classes matching remove_class( $class_name ) will be removed. Case insensitive duplicate classes will not be added when calling add_class( $class_name ).

class_list _does_ produce case-insensitive duplicates in all cases. This deduplication would be easy to perform in quirks mode, however it's unclear what form of casing should be yielded.

---

The Tag Processor and HTML Processor classes have several methods for dealing with CSS class names: class_list, add_class, remove_class and has_class.

These methods are intended to provide a CSS class selector-like interface with the class attribute.

Class name matching (CSS class selectors `.className {}` or `getElementsByClassName( "className" )`) is case sensitive in no-quirks mode and case-insensitive in quirks-mode.

The list of elements with class names classNames for a node root is the HTMLCollection returned by the following algorithm:

Let classes be the result of running the ordered set parser on classNames.
If classes is the empty set, return an empty HTMLCollection.
Return an HTMLCollection rooted at root, whose filter matches descendant elements that have all their classes in classes.

The comparisons for the classes must be done in an ASCII case-insensitive manner if root’s node document’s mode is "quirks"; otherwise in an identical to manner.

When matching against a document which is in quirks mode, class names must be matched ASCII case-insensitively; class selectors are otherwise case-sensitive, only matching class names they are identical to.

Trac ticket: https://core.trac.wordpress.org/ticket/61531

This ticket was mentioned in PR #7187 on WordPress/wordpress-develop by @jonsurrell.


6 months ago
#10

Replace null bytes with the � replacement character in ::class_list.

The browser will replace null bytes with � in attribute values:-state)

U+0000 NULL
This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's value.

If we consider that class_list is a processed version of the class attribute, this null-byte replacement is appropriate. At this time, get_attribute is closer to "raw" access to the attribute value as authored.

Related to https://github.com/WordPress/wordpress-develop/pull/7184.

Trac ticket: https://core.trac.wordpress.org/ticket/61531

@jonsurrell commented on PR #7169:


6 months ago
#12

I plan to revisit this when https://github.com/WordPress/wordpress-develop/pull/7195 is complete. That will allow us to test with quirks mode using the full parser and not need to change the fragment factory method signature.

@jonsurrell commented on PR #7169:


6 months ago
#13

I've moved quirks mode into the Tag Processor, it impacts CSS class name handling and it makes sense to have that there.

Now that #7195 / [58925] has landed, changes in quirks mode handling are tested by using the full HTML parser in quirks mode and no-quirks mode.

Quirks mode is not exposed publicly at this time (no change from current trunk). In a previous commit, an argument was added to ::create_fragment to adjust the compat mode. This is no longer necessary and has been reverted.

@jonsurrell commented on PR #7169:


6 months ago
#14

I think the concerns have been addressed and this is ready for another review.

#15 @dmsnell
6 months ago

In 58969:

HTML API: Replace null-bytes in class_list class names

As part of an audit of HTML API CSS behaviors, this patch resolves an issue with how the HTML API reports class names containing the NULL byte. NULL bytes should be replaced by the Unicode replacement character, U+FFFD, but previously weren't. This patch performs that replacement.

Developed in https://github.com/WordPress/wordpress-develop/pull/7187
Discussed in https://core.trac.wordpress.org/ticket/61531

Follow-up to [56703].

Props dmsnell, jonsurrell.
See #61531.

@jonsurrell commented on PR #7187:


6 months ago
#17

I'm pretty confident that with @coversDefaultClass correctly set, then @covers ::methodName works fine. There are existing examples of that in the codebase.

@dmsnell commented on PR #7187:


6 months ago
#18

Absolutely @sirreal - but I'm speaking more about WPCS conventions than working code. I know there are examples in the codebase already demonstrating :: without the class and that it works fine. I just want to make sure we don't introduce code that is going to upset anyone.

@dmsnell commented on PR #7169:


6 months ago
#19

@sirreal I've updated some comments and changed the behavior of add_class() and remove_class() so that they preserve the casing of the first-provided class name for all lexical variations. I'll entertain ideas on this, but it's how we handle attributes, and I think it's the most respectful we can be to developers who are setting values when casing doesn't matter.

it's really quite a surprise, and I hope quirks mode is almost never used, given the conflict between CSS selectors matching the class attribute value, and those given as class name selectors.

@dmsnell commented on PR #7169:


6 months ago
#20

@sirreal I've also moved the Trac ticket reference to the top of the description to make it easier to find, and used the shorthand notation instead of the full link.

#21 @dmsnell
6 months ago

In 58985:

HTML API: Respect document compat mode when handling CSS class names.

The HTML API has been behaving as if CSS class name selectors matched class names in an ASCII case-insensitive manner. This is only true if the document in question is set to quirks mode. Unfortunately most documents processed will be set to no-quirks mode, meaning that some CSS behaviors have been matching incorrectly when provided with case variants of class names.

In this patch, the CSS methods have been audited and updated to adhere to the rules governing ASCII case sensitivity when matching classes. This includes add_class(), remove_class(), has_class(), and class_list(). Now, it is assumed that a document is in no-quirks mode unless a full HTML parser infers quirks mode, and these methods will treat class names in a byte-for-byte manner. Otherwise, when a document is in quirks mode, the methods will compare the provided class names against existing class names for the tag in an ASCII case insensitive way, while class_list() will return a lower-cased version of the existing class names.

The lower-casing in class_list() is performed for consistency, since it's possible that multiple case variants of the same comparable class name exists on a tag in the input HTML.

Developed in https://github.com/WordPress/wordpress-develop/pull/7169
Discussed in https://core.trac.wordpress.org/ticket/61531

Props dmsnell, jonsurrell.
See #61531.

#23 @jonsurrell
5 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

This as fixed by [58985].

#24 @desrosj
4 months ago

  • Milestone changed from Future Release to 6.7
Note: See TracTickets for help on using tickets.