Make WordPress Core

Opened 7 weeks ago

Closed 7 weeks ago

Last modified 7 weeks ago

#64340 closed defect (bug) (fixed)

HTML API may double-escape class names when adding repeatedly

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 6.9.1 Priority: normal
Severity: normal Version: 6.9
Component: HTML API Keywords: has-patch has-unit-tests dev-reviewed
Focuses: Cc:

Description (last modified by jonsurrell)

WP_HTML_Tag_Processor and WP_HTML_Processor may incorrectly encode class names containing the characters &, <, >, ", or ' when modifying them via class methods like ::add_class() and calling ::get_updated_html().

For example:

<?php
$p = new WP_HTML_Tag_Processor('<div></div>');
$p->next_tag();
$p->add_class('&');
echo $p->get_updated_html() . "\n";
$p->add_class('OK');
echo $p->get_updated_html() . "\n";

Will print:

<div class="&amp;"></div>
<div class="&amp;amp; OK"></div>

Notice that the first pass is correct, & has been correctly encoded in the class attribute as &amp;. However, after calling ::add_class() and ::get_updated_html() again, the & has incorrectly been double-encoded as &amp;amp;.

The same code in WordPress 6.8 would print:

<div class="&amp;"></div>
<div class="&amp; OK"></div>

This is related to [60919] that was released in WordPress 6.9. The double-encoding behavior was present before, but it was "corrected" in this case by the use of esc_attr() that avoids any double-encoding. When esc_attr() usage was removed in [60919], the double-escaping behavior manifests causing this issue.


This was originally reported by GitHub user ktmn in Gutenberg issue 73713.

Change History (10)

This ticket was mentioned in PR #10591 on WordPress/wordpress-develop by @jonsurrell.


7 weeks ago
#1

  • Keywords has-patch has-unit-tests added

Trac ticket:

#2 @jonsurrell
7 weeks ago

  • Description modified (diff)

#4 @jonsurrell
7 weeks ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 61346:

HTML API: Ensure correct encoding of modified class names.

Some class names with HTML character references could be mishandled, for example:

  • Failure to remove an existing class like & with ::remove_class( '&' )
  • Double-encoding of an existing class like & after a modification, becoming &amp;

The second case manifested after double-encoding prevention was removed from ::set_attribute() in [60919].

Developed in https://github.com/WordPress/wordpress-develop/pull/10591.

Props jonsurrell, dmsnell.
Fixes #64340.

#5 @jonsurrell
7 weeks ago

  • Keywords dev-feedback added
  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening for backport of [61346] to the 6.9 branch for inclusion in 6.9.1.

#6 @dmsnell
7 weeks ago

  • Keywords dev-reviewed added; dev-feedback removed

Looks good for the backport.

#7 @Mamaduka
7 weeks ago

There's another report - https://github.com/WordPress/gutenberg/issues/73733. Agreed that this is a good candidate for the minor release.

#8 @jonsurrell
7 weeks ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 61350:

HTML API: Ensure correct encoding of modified class names.

Some class names with HTML character references could be mishandled, for example:

  • Failure to remove an existing class like & with ::remove_class( '&' )
  • Double-encoding of an existing class like & after a modification, becoming &amp;

The second case manifested after double-encoding prevention was removed from ::set_attribute() in [60919].

Developed in https://github.com/WordPress/wordpress-develop/pull/10591.

Reviewed by dmsnell, mamaduka.
Merges [61346] to the 6.9 branch.

Props jonsurrell, dmsnell.
Fixes #64340.

ktmn commented on PR #10591:


7 weeks ago
#9

The fix works well for me.

Couple things I tested (working as intended I think):

$html = '<div class="& &amp;"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
var_dump($p->has_class('&')); // true
var_dump($p->has_class('&amp;')); // false
var_dump($p->has_class('&amp;amp;')); // false
$p->add_class('test');
$html = $p->get_updated_html();
echo $html; // <div class="&amp; test"></div>
$html = '<div class="&"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->add_class('&');
$html = $p->get_updated_html();
echo $html; // <div class="&"></div>
$html = '<div class="&"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->remove_class('&');
$html = $p->get_updated_html();
echo $html; // <div ></div>
$html = '<div class="&"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->remove_class('&amp;');
$html = $p->get_updated_html();
echo $html; // <div class="&"></div>
$html = '<div class="&amp;"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->remove_class('&');
$html = $p->get_updated_html();
echo $html; // <div ></div>
$html = '<div class="&amp;"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->remove_class('&amp;');
$html = $p->get_updated_html();
echo $html; // <div class="&amp;"></div>
$html = '<div class="&"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->add_class('&amp;');
$html = $p->get_updated_html();
echo $html; // <div class="&amp; &amp;amp;"></div>
$html = '<div class="&"></div>';
$p = new WP_HTML_Tag_Processor($html);
$p->next_tag();
$p->add_class('&amp;amp;');
$html = $p->get_updated_html();
echo $html; // <div class="&amp; &amp;amp;amp;"></div>

Any & is always converted to &amp;, for CSS selectors both work OK:

echo <<<HTML
<style>.\[\&\]\:text-red { color: red; }</style>
<div class="[&]:text-red">Text is red</div>
<div class="[&amp;]:text-red">Text is still red</div>
HTML;

(not encoding the chars when not needed would look prettier (and shorter) though, if it's possible to do it safely)

While the class names here get encoded, CSS stays working:

echo '<style>.\[\&_\.class\\_\\_name\]\:font-bold .class__name { font-weight: bold; }</style>';

$html = <<<HTML
<div class="[&_.class\_\_name]:font-bold"><p class="class__name">Text</p></div>
<div class="[&amp;_.class\_\_name]:font-bold"><p class="class__name">Text</p></div>
HTML;

$p = new WP_HTML_Tag_Processor($html);
while($p->next_tag()) {
        $p->add_class('test');
}
$html = $p->get_updated_html();

echo $html;
// <div class="[&amp;_.class\_\_name]:font-bold test"><p class="class__name test">Text</p></div>
// <div class="[&amp;_.class\_\_name]:font-bold test"><p class="class__name test">Text</p></div>
echo <<<HTML
<style>
        .\[\&\:\:before\]\:content-\[\'\\2193\'\]:before,
        .\[\&\:\:before\]\:content-\[\"\\2193\"\]:before,
        .\[\&\:\:before\]\:content-\[\'↓\'\]:before,
        .\[\&\:\:before\]\:content-\[\"↓\"\]:before {
                content: "↓";
        }
</style>
HTML;

$html = <<<HTML
<p class="[&::before]:content-['↓']">Test</p>
<p class="[&::before]:content-[&quot;↓&quot;]">Test</p>
<p class="[&amp;::before]:content-['↓']">Test</p>
<p class="[&amp;::before]:content-[&quot;↓&quot;]">Test</p>
<p class='[&::before]:content-["↓"]'>Test</p>
<p class="[&::before]:content-['\\2193']">Test</p>
<p class="[&::before]:content-[&quot;\\2193&quot;]">Test</p>
<p class="[&amp;::before]:content-['\\2193']">Test</p>
<p class="[&amp;::before]:content-[&quot;\\2193&quot;]">Test</p>
HTML;

$p = new WP_HTML_Tag_Processor($html);
while($p->next_tag('p')) {
        $p->add_class('test');
}
$html = $p->get_updated_html();
echo $html;

// <p class="[&amp;::before]:content-[&apos;↓&apos;] test">Test</p>
// <p class="[&amp;::before]:content-[&quot;↓&quot;] test">Test</p>
// <p class="[&amp;::before]:content-[&apos;↓&apos;] test">Test</p>
// <p class="[&amp;::before]:content-[&quot;↓&quot;] test">Test</p>
// <p class="[&amp;::before]:content-[&quot;↓&quot;] test">Test</p>
// <p class="[&amp;::before]:content-[&apos;\2193&apos;] test">Test</p>
// <p class="[&amp;::before]:content-[&quot;\2193&quot;] test">Test</p>
// <p class="[&amp;::before]:content-[&apos;\2193&apos;] test">Test</p>
// <p class="[&amp;::before]:content-[&quot;\2193&quot;] test">Test</p>

Encoding stayed stable after running processor output through the processor again.

@jonsurrell commented on PR #10591:


7 weeks ago
#10

Awesome testing, thanks for following up @ktmn. I agree, the HTML processing appears to be correct in the examples you shared.

---

Any & is always converted to &amp;, for CSS selectors both work OK

This is correct whenever the class attribute is modified. There are a few cases where the class attribute did not require modification and the tag processor leaves the plain & class attribute value as-is. We can see it in the examples that add a class that is already present or remove a class that is not present.

---

(not encoding the chars when not needed would look prettier (and shorter) though, if it's possible to do it safely)

This is technically possible and allowed, but generally not worth the extra complexity. &amp; _will decode as_ &, but & may be interpreted in different ways depending on what follows, so &amp; is just less error-prone. It's possible to encode fewer things, but encoding all of &<>"' aligns with what WordPress generally does and is safe.

---

The last two (more complex) cases don't render exactly the same for me. I think it's more related to how the style is being printed and may be related to quoting, escaping, PHP versions, and possibly even GitHub's code fence escaping. It's hard to say.

Here are the working style elements for me the do have the correct behavior with the example output from the tag processor:

<style>.\[\&_\.class\\_\\_name\]\:font-bold .class__name { font-weight: bold; }</style>

<style>
        .\[\&\:\:before\]\:content-\[\'\\2193\'\]:before,
        .\[\&\:\:before\]\:content-\[\"\\2193\"\]:before,
        .\[\&\:\:before\]\:content-\[\'↓\'\]:before,
        .\[\&\:\:before\]\:content-\[\"↓\"\]:before {
                content: "↓";
        }
</style>

Here's the example in the Live DOM Viewer. If you don't know it, it's a very handy tool for inspecting HTML processing.

Note: See TracTickets for help on using tickets.