#61170 closed enhancement (fixed)
Interactivity API: Improve JSON store serialization
Reported by: | jonsurrell | Owned by: | jonsurrell |
---|---|---|---|
Milestone: | 6.6 | Priority: | normal |
Severity: | normal | Version: | 6.5 |
Component: | Interactivity API | Keywords: | has-patch |
Focuses: | javascript, performance | Cc: |
Description
The Interactivity API serializes JSON data into a <script type="application/json">
tag.
@bjorsch noticed some potential improvements to the serialization in βthis PR review.
Improving the JSON serialization may make the result more readable and decrease the serialized size.
Change History (30)
This ticket was mentioned in βPR #6520 on βWordPress/wordpress-develop by β@jonsurrell.
7 months ago
#1
- Keywords has-patch added
β@jonsurrell commented on βPR #6520:
7 months ago
#2
@westonruter brought up a point about JSON_UNESCAPED_SLASHES
(βhttps://github.com/WordPress/wordpress-develop/pull/6433#discussion_r1594762631). I'll reply here.
Thanks for flagging that! This is interesting feedback, do you have more information on what an attack might look like?
I decided to investigate more deeply, so I reviewed βreviewing the standard and as far as I can tell to break out of the script you need a literal sequence like </script
in order to break out of a script tag. Either JSON unicode escaping the <
or escaping the /
(as \/
) seems sufficient based on my interpretation, but scrutiny would be very welcome.
To close a script tag, we must find:
- βstart in script data state (we could start from another of the script states, but we'd need to return here somehow)
- U+003C LESS-THAN SIGN (
<
)
Switch to the βscript data less-than sign state.
- U+002F SOLIDUS (
/
)
Set the βtemporary buffer to the empty string. Switch to the βscript data end tag open state.
Create a new end tag token, set its tag name to the empty string. βReconsume in the βscript data end tag name state (anything else adds </
to the script data and continues)
My interpretation: ASCII characters are lowercased and added to the "end tag" name. When the end tag name ends with a non-ascii character the following step checks if it matches the opening tag (
script
).
- 13.2.5.17 Script data end tag name state
- ASCII upper and lower alpha are added to the tag name (lowercased)
- Space, linefeed, formfeed, tab,
/
and>
do what they do in normal tags _if_ the tag name was matched (script
in any upper/lowercase combination) - Anything else (including the spacing,
/
,>
characters if the tag name wasn't matched) treats the characters as script data and continues to parse in script data state.
Based on the above, my interpretation is that we need to make sure that </script{{ non-ascii alpha }}
cannot be produced. Minimally, ensuring </
cannot be produced seems sufficient.
Here are the relevant JSON encoding flags:
- βJSON_HEX_TAG: All
<
and>
are converted to\u003C
and\u003E
. - βJSON_HEX_AMP: All
&
are converted to\u0026
. - βJSON_UNESCAPED_SLASHES: Don't escape
/
. - βJSON_UNESCAPED_UNICODE: Encode multibyte Unicode characters literally (default is to escape as
\uXXXX
). - βJSON_UNESCAPED_LINE_TERMINATORS: The line terminators are kept unescaped when
JSON_UNESCAPED_UNICODE
is supplied. It uses the same behaviour as it was before PHP 7.1 without this constant. Available as of PHP 7.1.0.
If my analysis so far is correct, either _omitting_ JSON_UNESCAPED_SLASHES
(so /
is escaped as \/
) *OR* includeing JSON_HEX_TAG
(so <
is escaped as \u003c
) is sufficient. These conditions should print the dangerous</script
as <\/script
or \u003c/script
respectively, both of which should be harmless.
That is to say, it seems that neither flag (0
) _OR_ both flags JSON_HEX_TAG | JSON_UNESCAPED_SLASHES
seems sufficient to ensure the dangerous </script
is not produced.
My intuition is to use JSON_HEX_TAG | JSON_UNESCAPED_SLASHES
. I suspect /
is more common in encoded data like URLs than that <
and >
characters, so /
can remain unencoded and <
and >
are encoded.
One thing that concerns me is a sequence of states in a script tag that begins <!--
, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match </script
and close the script tag at that point. Nonetheless, escaping <
alone seems sufficient to prevent entering this state because without <
we can never enter βScript data less-than sign state.
---
JSON_HEX_AMP
doesn't seem to have any value, character references like &
are not decoded in a script tag. We can remove it. </script>
is not dangerous in a script tag.
JSON_UNESCAPED_UNICODE
seems good to include if we're confident the page encoding is appropriate, so we can include it if blog_charset is "UTF-8"
.
JSON_UNESCAPED_LINE_TERMINATORS
also seems good to include. \u2028
(LINE SEPARATOR) and \u2029
(PARAGRAPH SEPARATOR) do not need to be escaped. βIt's tricky to find documented, but it seems that these are the only characters that are handled.
#3
@
7 months ago
@whyisjake As the security component maintainer, would you (or whoever you deem appropriate) please review my analysis and the proposed change from a security perspective?
β@bjorsch commented on βPR #6520:
7 months ago
#4
If my analysis so far is correct, either _omitting_
JSON_UNESCAPED_SLASHES
(so/
is escaped as\/
) _OR_ includeingJSON_HEX_TAG
(so<
is escaped as\u003c
) is sufficient. These conditions should print the dangerous</script
as<\/script
or\u003c/script
respectively, both of which should be harmless.
Exactly this.
My intuition is to use
JSON_HEX_TAG | JSON_UNESCAPED_SLASHES
. I suspect/
is more common in encoded data like URLs than that<
and>
characters, so/
can remain unencoded and<
and>
are encoded.
My intuition as well.
One thing that concerns me is a sequence of states in a script tag that begins
<!--
, this enters an "escaped" flow of states, but I can't find any difference with the normal flow, it still seems to try to match</script
and close the script tag at that point. Nonetheless, escaping<
alone seems sufficient to prevent entering this state because without<
we can never enter βScript data less-than sign state.
I've not looked too much into this either, but I've also had that in the back of my head when considering JSON_HEX_TAG | JSON_UNESCAPED_SLASHES
over neither.
JSON_HEX_AMP
doesn't seem to have any value, character references like&
are not decoded in a script tag. We can remove it.</script>
is not dangerous in a script tag.
I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script>
(unless you do the <![CDATA[
thing).
β@jonsurrell commented on βPR #6520:
7 months ago
#5
I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the
<script>
(unless you do the<![CDATA[
thing).
Good thought. I'm been unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:
WordPress still officially supports HTML4 and XHTML, but the browsers it serves and the broader web effectively don't.
β@westonruter commented on βPR #6520:
7 months ago
#6
I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the
<script>
(unless you do the<![CDATA[
thing).
Good thought. I'm unable to get any modern browser to interpret a page as anything other than HTML5 so this may be a non-issue. There's a related ticket that suggests the same:
@sirreal Did you try sending Content-Type: application/xhtml+xml
? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: βhttps://xhtml-test-page.glitch.me/
β@bjorsch commented on βPR #6520:
7 months ago
#7
Did you try sending
Content-Type: application/xhtml+xml
? This is the only way to trigger it. Here's a test page which is served as actual XHTML with a link to trigger a parse error to prove whether the XML parser is in use: βhttps://xhtml-test-page.glitch.me/
Based on that, here's a PHP file that illustrates a script injection that would be prevented by JSON_HEX_AMP
:
<?php header('Content-Type: application/xhtml+xml; charset=UTF-8'); $data = "";alert('eek');//"; ?><?xml version="1.0" encoding="UTF-8"?> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US"> <head> <script> var data = <?php echo json_encode( $data, JSON_HEX_TAG | JSON_UNESCAPED_SLASHES ); ?>; </script> </head> <body> </body> </html>
Attacking the <script type="application/json">
used here would be harder, of course, as you can't directly inject code there. But you might be able to override the data structure in some manner that exposes a problem in whatever uses the data. Or at least you could break the page.
β@westonruter commented on βPR #6520:
7 months ago
#8
as you can't directly inject code there
But you could still if your malicious code looked like </script>[[Image(bad)]]
or even have it open another non-JSON script tag.
β@bjorsch commented on βPR #6520:
7 months ago
#9
If you can figure out a way to get that </script>
past JSON_HEX_TAG
there, please give an example.
β@jonsurrell commented on βPR #6520:
7 months ago
#10
Fascinating! This was helpful. I'll add JSON_HEX_AMP
for XHTML, detailed reasoning below.
---
In XHTML the entities will be decoded in script tags, although this seems to be a source of data corruption more than anything else. I don't think they can be abused to break out of the script tag - that's their purpose: <
is not the start of a tag like <
is π
However, it can be dangerous _not_ to escape &
because:
- it's easy to break the xhtml
- html entities will be transformed
Here's the demo script I was playing with:
<?php $xhtml = isset($_GET['xhtml']); header( 'Content-Type: ' . ( $xhtml ? 'application/xhtml+xml' : 'text/html' ) . '; charset=UTF-8' ); // PLAY WITH THE VALUE HERE $s = "<"; $flags = 0 // always these | JSON_HEX_TAG | JSON_UNESCAPED_SLASHES // these with UTF-8 | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS | 0 ; if ( isset( $_GET['amp'] ) ) { $flags |= JSON_HEX_AMP; } if ( $xhtml ) { echo '<!DOCTYPE html>' . PHP_EOL . '<html xmlns="http://www.w3.org/1999/xhtml">'; } else { echo '<!DOCTYPE html>' . PHP_EOL . '<html>'; } ?> <body> <script id='1' type="application/json"><?php echo json_encode($s,$flags); ?></script> <script> const j = JSON.parse(document.getElementById('1').textContent); console.log("%o", j); </script> </body> </html>
- HTML seems to work fine with these flags:
JSON_HEX_TAG | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
- XHTML will print
<
to the JavaScript console, we expected<
! - XHTML breaks if we JSON encode and print
<
- it doesn't like that character reference without the;
termination! - Using
JSON_HEX_AMP
in XHTML seems to fix both of these issues.
β@jonsurrell commented on βPR #6520:
7 months ago
#11
Does anyone know of themes to test that don't support html5
π
I'm wondering if a check like this is sufficient:
if ( ! current_theme_supports( 'html5' ) ) {
$json_encode_flags |= JSON_HEX_AMP;
}
Of if this need to check whether in WP Admin as well π€
β@westonruter commented on βPR #6520:
7 months ago
#12
Given that pages written XHTML is almost _never_ served as actual XHTML (see βhttps://github.com/GoogleChromeLabs/wpp-research/pull/74), I think we should always assume HTML5.
β@westonruter commented on βPR #6520:
7 months ago
#13
The WP Admin is considered HTML5, at least βaccording to `wp_get_inline_script_tag()`.
β@dmsnell commented on βPR #6520:
7 months ago
#14
This is a great discussion, and I think I'd need far more time to digest what's going on, but here are a few initial thoughts.
we need to make sure that </script{{ non-ascii alpha }} cannot be produced
I'm not sure where this comes from. It's hard to succinctly say other than to say we cannot produce a closing SCRIPT tag. </script{{ non-ascii alpha }}
is safe to produce because it is not going to be a closing tag unless what follows is whitespace or /
or >
. What this rule is saying, is that once we start parsing a tag, if the tag name is SCRIPT
_and then we continue to complete the tag_ then it closes out of the SCRIPT state. However, if it's not a tag name match (e.g. SCRIPTπ
or DIV
or SCRIPT:EVIL
) then the entire tag parsing concludes and the consumed characters are flushed out as plaintext within the script.
You can see this with the Tag Processor, and note that it mirrors what you see in a browser.
$p = new WP_HTML_Tag_Processor( '<script>This is </scriptπ> and this is </script>' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is </scriptπ> and this is "
One final note is that once we match the SCRIPT
tag name and enter the tag parsing, attributes may exist on the closing tag, even though they are ignored.
$p = new WP_HTML_Tag_Processor( '<script>This is a </script closing tag="</script>">' );
$p->next_token();
echo $p->get_modifiable_text();
// Output: "This is a "
---
The double-escaped script state can be very confusing, largely because of the wording around it. It's specifically there to allow printing SCRIPT elements from JavaScript. You not only need to open an HTML comment with ` </script>' );
$p->next_token(); echo $p->get_modifiable_text();
Output: "This "
$p = new WP_HTML_Tag_Processor( '<script>This </script> --> </script>' );
$p->next_token(); echo $p->get_modifiable_text();
Output: "This "
`
You can see here how either a closing -->
or a </script>
tag exists the double-escaped state. The important thing that I see is that if we can prevent </script
then we are guaranteed to prevent any escape regardless. It can be very challenging to try and reliably and safely preserve the ability to enter the literal </script>
, though we should do that if we can. It's not a matter of entering the double-escaped state when we that to come through, because double-escaped states do not nest.
$p = new WP_HTML_Tag_Processor( '<script>This </script>' );
$p->next_token(); echo $p->get_modifiable_text();
// Output: "This "
---
I've heard that if the page is being interpreted as XHTML4 rather than as HTML5, the entities might get interpreted inside the <script> (unless you do the <![CDATA[ thing).
Personally I'd rather see us move forward and focus on HTML5 because it's so extremely rare in practice to find XHTML. Yes it's possible to send the content type header, and yes if you serve as a page as .xml
the browser will follow, but that effectively doesn't happen in practice and for good reason. Even if WordPress claims to send XHTML the rest of the page will be broken for a variety of other reasons.
We can keep in mind that _HTML cannot be fully expressed through XML_ and this is one of the failures of XHTML.
Why this is important to me is that once we start escaping content within a SCRIPT
element we are knowingly corrupting that script and entering uncertain territory, possibly _introducing_ vulnerabilities that wouldn't have been there through inaction.
This is one of the reasons I discourage use of DOMDocument
, because even the act of loading and re-saving a document can introduce corruption and vulnerabilities.
$dom = new DOMDocument();
$dom->loadHTML( "<!DOCTYPE html><html><meta charset=utf8><body><script>This </script> alert(1);</body></html>" );
echo $dom->saveHTML();
// Output:
// <!DOCTYPE html>
// <html><head><meta charset="utf8"></head><body><script>This <!-- <script> </script> --> alert(1);</body></html>
In this case we instructed DOMDocument
to make no changes to the document, and yet it has broken out of the SCRIPT
. This causes the alert(1);
to be executed, when properly it should have appeared as plaintext on the page.
When HTML5\DOMDocument
appears it should not make these mistakes.
β@jonsurrell commented on βPR #6520:
7 months ago
#15
we need to make sure that </script{{ non-ascii alpha }} cannot be produced
I'm not sure where this comes from.
This was perhaps a poor attempt at expressing that we need to find at least </script
to close a script tag. This is not sufficient, but it is necessary. It implies that we must find </scr
etc⦠and finally if we can't find </
or even <
then there's no way to escape the script tag.
β@jonsurrell commented on βPR #6520:
7 months ago
#16
Given all the discussion and just how difficult it is to serve XHTML, is the consensus that JSON_HEX_AMP
should always be omitted?
It doesn't seem to be a security issue. There's only an edge case that the data may not be interpreted correctly or the data may cause the xhtml page to be invalid due to malformed cahracter references. And that's all _if_ the page is actually served correctly as XHTML.
β@sabernhardt commented on βPR #6520:
7 months ago
#17
Does anyone know of themes to test that don't support
html5
π
Themes that do not _declare_ html5
support include Twenty Ten, Twenty Eleven and Twenty Twelve.
if ( ! current_theme_supports( 'html5' ) ) {
If you use a current_theme_supports()
condition, you may need to check whether the theme supports HTML script
specifically:
current_theme_supports( 'html5', 'script' )
β@jonsurrell commented on βPR #6520:
7 months ago
#18
Thanks @sabernhardt. I did some testing with twentyten and it doesn't declare html5 support, but it certainly doesn't render an xhtml page. Unescaped &
in script tags seems harmless and are interpreted correctly.
I'm fairly convinced at this point that we don't need JSON_HEX_AMP
. Agree with @westonruter on this (βhttps://github.com/WordPress/wordpress-develop/pull/6520#issuecomment-2103166719).
I think most discussion is settled and this is ready for review.
β@sabernhardt commented on βPR #6520:
7 months ago
#19
If you want to test with a theme that uses an XHTML doctype in the header.php
template, you could try the old β"Default" (Kubrick) theme. Directory searches found others too. Note that some of these themes are not available for download anymore, and some of them only use the doctype for a documentation page (not in header.php
).
βXHTML
βHTML 4
β@jonsurrell commented on βPR #6520:
7 months ago
#20
Thanks again @sabernhardt. I tried with Kubrick and Colorway, and those themes still are not rendered as XHTML, no issues without JSON_HEX_AMP encoding.
#21
@
7 months ago
- Milestone Awaiting Review deleted
- Resolution set to invalid
- Status changed from assigned to closed
β@jonsurrell commented on βPR #6520:
7 months ago
#25
Thanks @dmsnell, I'm happy with your changes π
β@dmsnell commented on βPR #6520:
7 months ago
#27
#28
@
7 months ago
@dmsnell:
Could/Should that bitwise-OR operation be written as
$json_encode_flags = JSON_HEX_TAG | JSON_UNESCAPED_SLASHES;
if ( is_utf8_charset() ) {
$json_encode_flags |= JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS;
}
to reduce duplication and for clarity (and for micro-optimization)?
#29
@
7 months ago
Thanks @TobiasBg! It was originally like this, but I rearranged it intentionally to streamline the UTF-8 case, and to signify that UTF-8 is and should be normative. (We can observe that if the blog charset is not UTF-8 then there's no computation, but in the normative case when it is UTF-8 we would have to perform additional computation the way you propose).
Note too that it may not be the case that it's micro-optimized faster. the |=
adds a data-dependency that cannot be pre-computed by the CPU in the same way that alternate assignment could. I think it's highly likely that there is no measurable or meaningful optimization here, so unless we have evidence otherwise (outside of any synthetic micro-benchmarks) it would be best to focus on the nuance.
I'm not worried about duplication either, because it's just a couple of integer flags. I could see comprehension being better in each case, so the tradeoff is perhaps subjective. I like seeing the full set in each case so I don't have to track the execution in my head.
What do you think?
#30
@
7 months ago
Thanks for the explanation, @dmsnell! Yes, focussing on and emphasizing UTF-8 makes sense.
And the micro-optimization can likely be neglected anyways (which is why I had it parentheses). One would really have to compare the numbers of bitwise-ORs, variable assignments, the negation in the if, etc., and all that with the pre-computations, which I hadn't even thought about.
Trac ticket: https://core.trac.wordpress.org/ticket/61170