Make WordPress Core

Opened 7 months ago

Closed 7 months ago

Last modified 7 months ago

#61170 closed enhancement (fixed)

Interactivity API: Improve JSON store serialization

Reported by: jonsurrell's profile jonsurrell Owned by: jonsurrell's profile jonsurrell
Milestone: 6.6 Priority: normal
Severity: normal Version: 6.5
Component: Interactivity API Keywords: has-patch
Focuses: javascript, performance Cc:

Description

The Interactivity API serializes JSON data into a <script type="application/json"> tag.

@bjorsch noticed some potential improvements to the serialization in ​this PR review.

Improving the JSON serialization may make the result more readable and decrease the serialized size.

Change History (30)

This ticket was mentioned in ​PR #6520 on ​WordPress/wordpress-develop by ​@jonsurrell.


7 months ago
#1

  • Keywords has-patch added

​@jonsurrell commented on ​PR #6520:


7 months ago
#2

@westonruter brought up a point about JSON_UNESCAPED_SLASHES (​https://github.com/WordPress/wordpress-develop/pull/6433#discussion_r1594762631). I'll reply here.

Thanks for flagging that! This is interesting feedback, do you have more information on what an attack might look like?

I decided to investigate more deeply, so I reviewed ​reviewing the standard and as far as I can tell to break out of the script you need a literal sequence like </script in order to break out of a script tag. Either JSON unicode escaping the < or escaping the / (as \/) seems sufficient based on my interpretation, but scrutiny would be very welcome.

To close a script tag, we must find:

Switch to the ​script data less-than sign state.

  • U+002F SOLIDUS (/)

Set the ​temporary buffer to the empty string. Switch to the ​script data end tag open state.

Create a new end tag token, set its tag name to the empty string. ​Reconsume in the ​script data end tag name state (anything else adds </ to the script data and continues)

My interpretation: ASCII characters are lowercased and added to the "end tag" name. When the end tag name ends with a non-ascii character the following step checks if it matches the opening tag (script).

  • 13.2.5.17 Script data end tag name state
    • ASCII upper and lower alpha are added to the tag name (lowercased)
    • Space, linefeed, formfeed, tab, / and > do what they do in normal tags _if_ the tag name was matched (script in any upper/lowercase combination)
    • Anything else (including the spacing, /, > characters if the tag name wasn't matched) treats the characters as script data and continues to parse in script data state.

Based on the above, my interpretation is that we need to make sure that </script{{ non-ascii alpha }} cannot be produced. Minimally, ensuring </ cannot be produced seems sufficient.

Here are the relevant JSON encoding flags:

If my analysis so far is correct, either _omitting_ JSON_UNESCAPED_SLASHES (so / is escaped as \/) *OR* includeing JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous</script as <\/script or \u003c/script respectively, both of which should be harmless.

That is to say, it seems that neither flag (0) _OR_ both flags JSON_HEX_TAG | JSON_UNESCAPED_SLASHES seems sufficient to ensure the dangerous </script is not produced.

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than that < and > characters, so / can remain unencoded and < and > are encoded.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter ​Script data less-than sign state.

---

JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

JSON_UNESCAPED_UNICODE seems good to include if we're confident the page encoding is appropriate, so we can include it if blog_charset is "UTF-8".

JSON_UNESCAPED_LINE_TERMINATORS also seems good to include. \u2028 (LINE SEPARATOR) and \u2029 (PARAGRAPH SEPARATOR) do not need to be escaped. ​It's tricky to find documented, but it seems that these are the only characters that are handled.

#3 @jonsurrell
7 months ago

@whyisjake As the security component maintainer, would you (or whoever you deem appropriate) please review my analysis and the proposed change from a security perspective?

​@bjorsch commented on ​PR #6520:


7 months ago
#4

If my analysis so far is correct, either _omitting_ JSON_UNESCAPED_SLASHES (so / is escaped as \/) _OR_ includeing JSON_HEX_TAG (so < is escaped as \u003c) is sufficient. These conditions should print the dangerous</script as <\/script or \u003c/script respectively, both of which should be harmless.

Exactly this.

My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES. I suspect / is more common in encoded data like URLs than that < and > characters, so / can remain unencoded and < and > are encoded.

My intuition as well.

One thing that concerns me is a sequence of states in a script tag that begins <!--, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script and close the script tag at that point. Nonetheless, escaping < alone seems sufficient to prevent entering this state because without < we can never enter ​Script data less-than sign state.

I've not looked too much into this either, but I've also had that in the back of my head when considering JSON_HEX_TAG | JSON_UNESCAPED_SLASHES over neither.

JSON_HEX_AMP doesn't seem to have any value, character references like &amp; are not decoded in a script tag. We can remove it. &lt;/script&gt; is not dangerous in a script tag.

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

​@jonsurrell commented on ​PR #6520:


7 months ago
#5

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm been unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

WordPress still officially supports HTML4 and XHTML, but the browsers it serves and the broader web effectively don't.

​@westonruter commented on ​PR #6520:


7 months ago
#6

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:

@sirreal Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: ​https://xhtml-test-page.glitch.me/

​@bjorsch commented on ​PR #6520:


7 months ago
#7

Did you try sending Content-Type: application/xhtml+xml? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: ​https://xhtml-test-page.glitch.me/

Based on that, here's a PHP file that illustrates a script injection that would be prevented by JSON_HEX_AMP:

<?php

header('Content-Type: application/xhtml+xml; charset=UTF-8');

$data = "&quot;;alert('eek');//";

?><?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US">
  <head>
   <script>
    var data = <?php echo json_encode( $data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ); ?>;
   </script>
  </head>
  <body>
  </body>
</html>

Attacking the <script type="application/json"> used here would be harder, of course, as you can't directly inject code there. But you might be able to override the data structure in some manner that exposes a problem in whatever uses the data. Or at least you could break the page.

​@westonruter commented on ​PR #6520:


7 months ago
#8

as you can't directly inject code there

But you could still if your malicious code looked like </script>[[Image(bad)]] or even have it open another non-JSON script tag.

​@bjorsch commented on ​PR #6520:


7 months ago
#9

If you can figure out a way to get that </script> past JSON_HEX_TAG there, please give an example.

​@jonsurrell commented on ​PR #6520:


7 months ago
#10

Fascinating! This was helpful. I'll add JSON_HEX_AMP for XHTML, detailed reasoning below.

---

In XHTML the entities will be decoded in script tags, although this seems to be a source of data corruption more than anything else. I don't think they can be abused to break out of the script tag - that's their purpose: &lt; is not the start of a tag like < is πŸ™‚

However, it can be dangerous _not_ to escape & because:

  • it's easy to break the xhtml
  • html entities will be transformed

Here's the demo script I was playing with:

<?php
$xhtml = isset($_GET['xhtml']);
header( 'Content-Type: '
        . ( $xhtml ? 'application/xhtml+xml' : 'text/html' )
        . '; charset=UTF-8'
);

// PLAY WITH THE VALUE HERE
$s = "&lt;";

$flags = 0
        // always these
        | JSON_HEX_TAG
        | JSON_UNESCAPED_SLASHES

        // these with UTF-8
        | JSON_UNESCAPED_UNICODE
        | JSON_UNESCAPED_LINE_TERMINATORS
        | 0 ;

if ( isset( $_GET['amp'] ) ) {
        $flags |= JSON_HEX_AMP;
}

if ( $xhtml ) {
        echo '<!DOCTYPE html>'
        . PHP_EOL
        . '<html xmlns="http://www.w3.org/1999/xhtml">';
} else {
        echo '<!DOCTYPE html>'
        . PHP_EOL
        . '<html>';
}
?>
  <body>
        <script id='1' type="application/json"><?php echo json_encode($s,$flags); ?></script>
        <script>
                const j = JSON.parse(document.getElementById('1').textContent);
                console.log("%o", j);
        </script>
  </body>
</html>
  • HTML seems to work fine with these flags: JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
  • XHTML will print < to the JavaScript console, we expected &lt;!
  • XHTML breaks if we JSON encode and print &lt - it doesn't like that character reference without the ; termination!
  • Using JSON_HEX_AMP in XHTML seems to fix both of these issues.

​@jonsurrell commented on ​PR #6520:


7 months ago
#11

Does anyone know of themes to test that don't support html5 πŸ˜…

I'm wondering if a check like this is sufficient:

if ( ! current_theme_supports( 'html5' ) ) {
        $json_encode_flags |= JSON_HEX_AMP;
}

Of if this need to check whether in WP Admin as well πŸ€”

​@westonruter commented on ​PR #6520:


7 months ago
#12

Given that pages written XHTML is almost _never_ served as actual XHTML (see ​https://github.com/GoogleChromeLabs/wpp-research/pull/74), I think we should always assume HTML5.

​@westonruter commented on ​PR #6520:


7 months ago
#13

The WP Admin is considered HTML5, at least ​according to `wp_get_inline_script_tag()`.

​@dmsnell commented on ​PR #6520:


7 months ago
#14

This is a great discussion, and I think I'd need far more time to digest what's going on, but here are a few initial thoughts.

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from. It's hard to succinctly say other than to say we cannot produce a closing SCRIPT tag. </script{{ non-ascii alpha }} is safe to produce because it is not going to be a closing tag unless what follows is whitespace or / or >. What this rule is saying, is that once we start parsing a tag, if the tag name is SCRIPT _and then we continue to complete the tag_ then it closes out of the SCRIPT state. However, if it's not a tag name match (e.g. SCRIPT😈 or DIV or SCRIPT:EVIL) then the entire tag parsing concludes and the consumed characters are flushed out as plaintext within the script.

You can see this with the Tag Processor, and note that it mirrors what you see in a browser.

$p = new WP_HTML_Tag_Processor( '<script>This is </scriptπŸ˜„> and this is </script>' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is </scriptπŸ˜„> and this is "

One final note is that once we match the SCRIPT tag name and enter the tag parsing, attributes may exist on the closing tag, even though they are ignored.

$p = new WP_HTML_Tag_Processor( '<script>This is a </script closing tag="</script>">' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is a "

---

The double-escaped script state can be very confusing, largely because of the wording around it. It's specifically there to allow printing SCRIPT elements from JavaScript. You not only need to open an HTML comment with ` </script>' );
$p->next_token(); echo $p->get_modifiable_text();
Output: "This "

$p = new WP_HTML_Tag_Processor( '<script>This </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
Output: "This "
`

You can see here how either a closing --> or a </script> tag exists the double-escaped state. The important thing that I see is that if we can prevent </script then we are guaranteed to prevent any escape regardless. It can be very challenging to try and reliably and safely preserve the ability to enter the literal </script>, though we should do that if we can. It's not a matter of entering the double-escaped state when we that to come through, because double-escaped states do not nest.

$p = new WP_HTML_Tag_Processor( '<script>This  </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This  "

---

I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).

Personally I'd rather see us move forward and focus on HTML5 because it's so extremely rare in practice to find XHTML. Yes it's possible to send the content type header, and yes if you serve as a page as .xml the browser will follow, but that effectively doesn't happen in practice and for good reason. Even if WordPress claims to send XHTML the rest of the page will be broken for a variety of other reasons.

We can keep in mind that _HTML cannot be fully expressed through XML_ and this is one of the failures of XHTML.

Why this is important to me is that once we start escaping content within a SCRIPT element we are knowingly corrupting that script and entering uncertain territory, possibly _introducing_ vulnerabilities that wouldn't have been there through inaction.

This is one of the reasons I discourage use of DOMDocument, because even the act of loading and re-saving a document can introduce corruption and vulnerabilities.

$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><meta charset=utf8><body><script>This  </script> alert(1);</body></html>" );
echo $dom->saveHTML();
// Output:
// <!DOCTYPE html>
// <html><head><meta charset="utf8"></head><body><script>This <!-- <script> </script> --&gt;  alert(1);</body></html>

In this case we instructed DOMDocument to make no changes to the document, and yet it has broken out of the SCRIPT. This causes the alert(1); to be executed, when properly it should have appeared as plaintext on the page.

When HTML5\DOMDocument appears it should not make these mistakes.

​@jonsurrell commented on ​PR #6520:


7 months ago
#15

we need to make sure that </script{{ non-ascii alpha }} cannot be produced

I'm not sure where this comes from.

This was perhaps a poor attempt at expressing that we need to find at least </script to close a script tag. This is not sufficient, but it is necessary. It implies that we must find </scr etc… and finally if we can't find </ or even < then there's no way to escape the script tag.

​@jonsurrell commented on ​PR #6520:


7 months ago
#16

Given all the discussion and just how difficult it is to serve XHTML, is the consensus that JSON_HEX_AMP should always be omitted?

It doesn't seem to be a security issue. There's only an edge case that the data may not be interpreted correctly or the data may cause the xhtml page to be invalid due to malformed cahracter references. And that's all _if_ the page is actually served correctly as XHTML.

​@sabernhardt commented on ​PR #6520:


7 months ago
#17

Does anyone know of themes to test that don't support html5 πŸ˜…

Themes that do not _declare_ html5 support include Twenty Ten, Twenty Eleven and Twenty Twelve.

if ( ! current_theme_supports( 'html5' ) ) {

If you use a current_theme_supports() condition, you may need to check whether the theme supports HTML script specifically:
current_theme_supports( 'html5', 'script' )

​@jonsurrell commented on ​PR #6520:


7 months ago
#18

Thanks @sabernhardt. I did some testing with twentyten and it doesn't declare html5 support, but it certainly doesn't render an xhtml page. Unescaped & in script tags seems harmless and are interpreted correctly.

I'm fairly convinced at this point that we don't need JSON_HEX_AMP. Agree with @westonruter on this (​https://github.com/WordPress/wordpress-develop/pull/6520#issuecomment-2103166719).

I think most discussion is settled and this is ready for review.

​@sabernhardt commented on ​PR #6520:


7 months ago
#19

If you want to test with a theme that uses an XHTML doctype in the header.php template, you could try the old ​"Default" (Kubrick) theme. Directory searches found others too. Note that some of these themes are not available for download anymore, and some of them only use the doctype for a documentation page (not in header.php).
​XHTML
​HTML 4

​@jonsurrell commented on ​PR #6520:


7 months ago
#20

Thanks again @sabernhardt. I tried with Kubrick and Colorway, and those themes still are not rendered as XHTML, no issues without JSON_HEX_AMP encoding.

#21 @cbravobernal
7 months ago

  • Milestone Awaiting Review deleted
  • Resolution set to invalid
  • Status changed from assigned to closed

#22 @cbravobernal
7 months ago

  • Resolution invalid deleted
  • Status changed from closed to reopened

#23 @cbravobernal
7 months ago

  • Milestone set to 6.6

#24 @cbravobernal
7 months ago

Sorry for the mess with an unwanted resolution.

​@jonsurrell commented on ​PR #6520:


7 months ago
#25

Thanks @dmsnell, I'm happy with your changes πŸ‘

#26 @dmsnell
7 months ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 58159:

Improve legibility of JSON-encoded Interactivity API store data.

The Interactivity API has been rendering client data in a SCRIPT element with the
type application/json so that it's not executed as a script, but is available
to one. The data runs through wp_json_encode() and is encoded with some flags
to ensure that potentially-dangerous characters are escaped.

However, this can lead to some challenges. Eagerly escaping when not necessary
can make the data difficult to comprehend when reading the output HTML. For example,
all non-ASCII Unicode characters are escaped with their code point equivalent.
This results in \ud83c\udd70 instead of πŸ…°.

In this patch, the flags for JSON encoding are refined to ensure what's necessary
while relaxing other rules (leaving in those Unicode characters if the blog charset
is UTF-8). This makes for Interactivity API data that's quicker as a human reader
to decipher and diagnose.

In summary:

  • This data is JSON encoded and printed in a <script type="application/json"> tag.
  • If we ensure that < is never printed inside the data, it should be impossible to break out of the script tag and the browser treats everything as the element's textContent.
  • All other escaping becomes unnecessary at that point, including unicode escaping if the page uses the UTF-8 charset (the same encoding as JSON).

See ​https://github.com/WordPress/wordpress-develop/pull/6433#pullrequestreview-2043218338

Developed in ​https://github.com/WordPress/wordpress-develop/pull/6520
Discussed in https://core.trac.wordpress.org/ticket/61170

Fixes: #61170
Follow-up to: [57563].
Props: bjorsch, dmsnell, jonsurrell, sabernhardt, westonruter.

#28 @TobiasBg
7 months ago

@dmsnell:
Could/Should that bitwise-OR operation be written as

$json_encode_flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
if ( is_utf8_charset() ) {
        $json_encode_flags |= JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS;
}

to reduce duplication and for clarity (and for micro-optimization)?

#29 @dmsnell
7 months ago

Thanks @TobiasBg! It was originally like this, but I rearranged it intentionally to streamline the UTF-8 case, and to signify that UTF-8 is and should be normative. (We can observe that if the blog charset is not UTF-8 then there's no computation, but in the normative case when it is UTF-8 we would have to perform additional computation the way you propose).

Note too that it may not be the case that it's micro-optimized faster. the |= adds a data-dependency that cannot be pre-computed by the CPU in the same way that alternate assignment could. I think it's highly likely that there is no measurable or meaningful optimization here, so unless we have evidence otherwise (outside of any synthetic micro-benchmarks) it would be best to focus on the nuance.

I'm not worried about duplication either, because it's just a couple of integer flags. I could see comprehension being better in each case, so the tradeoff is perhaps subjective. I like seeing the full set in each case so I don't have to track the execution in my head.

What do you think?

#30 @TobiasBg
7 months ago

Thanks for the explanation, @dmsnell! Yes, focussing on and emphasizing UTF-8 makes sense.

And the micro-optimization can likely be neglected anyways (which is why I had it parentheses). One would really have to compare the numbers of bitwise-ORs, variable assignments, the negation in the if, etc., and all that with the pre-computations, which I hadn't even thought about.

Note: See TracTickets for help on using tickets.