#64419 closed enhancement (fixed)
HTML API: Escape JavaScript, JSON script tag contents automatically
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | HTML API | Keywords: | has-patch has-unit-tests |
| Focuses: | javascript | Cc: |
Description (last modified by )
The HTML API prevents setting SCRIPT tag that could modify the tree either by closing the SCRIPT element prematurely, or by preventing the SCRIPT element from closing at the expected close tag.
This is handled by rejecting any script tag contents that are potentially dangerous and is safe. There are some improvements that could be made.
If the contents are found to be unsafe and the type of the script tag is JSON or JavaScript (this is well specified in the HTML standard), it should be possible to apply a syntactic transformation to the contents in such a way that the script contents become safe without semantically altering the script.
If the HTML API can safely and automatically escape the majority of SCRIPT tag contents, it can then be used to for SCRIPT tag creation and has the potential to eliminate the class of problem from #40737, #62797, and #63851. It also has the potential to address part of #51159 where SCRIPT tag escaping becomes less of an issue.
JSON
Replacing < with its Unicode escape sequence was originally proposed. However, the same escaping mechanism used for JavaScript can be applied to JSON and produce more legible output with minimal escaping.
In JSON SCRIPT tags, the transformation is a simple replacement of < with its Unicode escape sequence \u003C. This can be applied to the entire contents of the script or specifically in case-insensitive matches for <script and </script.
JavaScript
JavaScript SCRIPT tags are more difficult because the language has vastly more syntax. Fortunately, there is prior art described in this 2022 blog post (external) from React team member Sophie Alpert. It's the same the JavaScript `SCRIPT` tag contents escaping strategy that React continues to employ today. In summary, the problematic text <script and </script syntactically appear in places where Unicode escape sequences can be used in the script part (Strings, Identifiers, and RegExp literals). React takes the approach of replacing the s character, resulting in <\u0073cript or </\u0073cript, completely safe in a Script tag.
There are a few notable exceptions where the transformed JavaScript has observably different runtime behavior. These are the only examples I'm aware of. They're more esoteric parts of the language and the likelihood of them being used in inline JavaScript with the problematic text sequences seems an acceptable tradeoff to me to enable cheap, automatic JavaScript escaping.
String.raw does not process escape sequences.
'<script>' === '<\u0073cript>'; // true String.raw`<script>` === String.raw`<\u0073cript>`; // false
Tagged templates can also access the raw strings, again a form without processing escape sequences.
function taggedCooked( strings ) { return strings[0]; } taggedCooked`<script>` === taggedCooked`<\u0073cript>`; // true function taggedRaw( strings ) { return strings.raw[0]; } taggedRaw`<script>` === taggedRaw`<\u0073cript>`; // false
The source property of RegExp contains a string representation of the pattern. JavaScript RegExp support Unicode escape sequences, but the Unicode escape sequence is not transformed in the source.
const rPlain = /<script>/; const rEscaped = /<\u0073cript>/ rPlain.test('<script>'); // true rEscaped.test('<script>'); // true rPlain.source === rEscaped.source; // false rPlain.source; // '<script>' rEscaped.source; // '<\\u0073cript>'
Any better JavaScript escaping would likely require a complete JavaScript parser and much more invasive changes. It would be much more costly to perform. Even then, I'm not sure that the escaping could be done faithfully.
String.raw() could be split and joined:
String.raw`<script>` === String.raw`<s` + String.raw`cript>`; true ✅
Tagged template raw and RegExp source seem much more challenging.
Change History (11)
This ticket was mentioned in PR #10635 on WordPress/wordpress-develop by @jonsurrell.
8 weeks ago
#1
- Keywords has-patch has-unit-tests added
@jonsurrell commented on PR #10635:
8 weeks ago
#2
The new is_javascript_script_tag() and is_json_script_tag() have been made private and have an ignore annotation (so they should not appear in documentation pages) as well as brief todo comments describing how a more general public method could be useful.
@jonsurrell commented on PR #10635:
8 weeks ago
#3
I've polished this, addressed feedback, and added comments to explain how the escaping is being done.
This is ready for review.
@jonsurrell commented on PR #10635:
6 weeks ago
#4
I've created https://github.com/WordPress/wordpress-develop/pull/10668 to align STYLE tag handling with this approach.
@dmsnell commented on PR #10635:
6 weeks ago
#5
@sirreal after my changes the tests could be merged, and in fact, can/should be refactored a fair amount. I think he same might be true of the long comment in the JavaScript escaping function. for now, I moved the graphviz source code into a separate file in the test directory to get it out of the class. _at least_ I didn’t like it at the bottom since it was so disconnected from the code, and I think you didn’t like it inline because of how noisy it was already. I think in the end we could link to your blog post or to the SPEC and just let it be a little reference instead of a full explanation.
at one point I added an optional parameter which was $script_type and when passed, would be set to things like classic, module, etc… but then I took it out because we didn’t have a clear need for it, plus I figured any determination of it could be done quickly _after_ calling get_script_content_type().
if you look at the commits, you will realize I learned that script is six letters and not five 🤦♂️
@dmsnell commented on PR #10635:
4 weeks ago
#6
@sirreal I updated the .dot file to clarify labels and sizing of the nodes in the default graphical presentation of the file, and to harmonize with the change I made replacing the dagger with a superscript.
in the process I accidentally pushed out the build-change-revert that I’ve been using to test and develop. since I can’t force-push to your branch we’ll just have to remember to git rebase -i --rebase-merges trunk and remove that merge commit before we merge this PR.
| before | after |
|---|---|
| | |
The HTML API currently _rejects_ script tag contents that _may_ be dangerous. This is a proposal detect JavaScript and JSON script tags and automatically escape dangerous contents potentially dangerous when necessary.
<scriptor</script(case-insensitive) is found.## In JSON
<is replaced with\u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.This is proposed as a simple character replacement with
strtr. This should be highly performant. A less invasive replacement could be done to only replace<in<scriptor</scriptwhere it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).## In JavaScript
<scriptand</script(followed by a necessary tag termination character\t\n\r\f/>) thesis replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below).From the ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/64419