Opened 2 weeks ago
Closed 44 hours ago
#64274 closed enhancement (fixed)
wp.sanitize.stripTags could rely on the browser for HTML parsing
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | 4.2 |
| Component: | General | Keywords: | has-patch |
| Focuses: | javascript | Cc: |
Description (last modified by )
This JavaScript function recreates emblematic problems in HTML parsing by relying on over-simplified regular expressions. It runs, however, inside a browser where a reliable HTML parser exists, and so could or should lean on that parser in order to operate.
See discussion. Follow-up to [60907] for #48054.
Change History (12)
#1
@
2 weeks ago
- Description modified (diff)
- Milestone changed from Awaiting Review to Future Release
#4
@
2 weeks ago
@hbhalodia yes exactly! though I believe that the script is trying to remove comments, tags, and SCRIPT or STYLE contents. I believe that .innerText returns this, but I always get confused between that and .textContent.
#5
@
2 weeks ago
Yeah, it's bit confusing. I guess innerText returns the content that is usually seen in frontend, so for example if div has display: none; it would not return innerText from that div, but if we use textContent, it would return and also preserves the spaces and indentation.
So I am not sure as well what to use here, but Ideally if we do not need to preserve indentation, spaces we should go with textContent, instead of innerText but I am open to suggestions.
Cc: @westonruter @dmsnell
#6
@
2 weeks ago
- Focuses javascript added
- Milestone changed from Future Release to 7.0
- Owner set to dmsnell
- Status changed from new to assigned
- Type changed from defect (bug) to enhancement
- Version set to 4.2
sounds good @hbhalodia; looking forward to your PR
looks like this came in through [31534], but that itself might have been influenced by Prototype.js which was added in 2006. nice longstanding opportunity 😉
This ticket was mentioned in PR #10536 on WordPress/wordpress-develop by @hbhalodia.
2 weeks ago
#7
- Keywords has-patch added; needs-patch removed
Trac ticket: https://core.trac.wordpress.org/ticket/64274
- Used HTML parsing instead of regex matching.
#8
@
2 weeks ago
Hi @dmsnell, I do have raised the PR, Can you please review and suggest improvements if any?
Thank You,
@westonruter commented on PR #10536:
6 days ago
#9
@dmsnell Here's the result of the file being minified as wp-includes/js/wp-sanitize.min.js via npm run build:dev:
/*! This file is auto-generated */ window.wp=window.wp||{},wp.sanitize={stripTags:function(t){t=(new DOMParser).parseFromString(t,"text/html");return t.body.innerText=t.body.innerText,t.body.innerHTML},stripTagsAndEncodeText:function(t){let e=wp.sanitize.stripTags(t),n=document.createElement("textarea");try{n.textContent=e,e=wp.sanitize.stripTags(n.value)}catch(t){}return e}};
With prettier formatting:
/*! This file is auto-generated */ ((window.wp = window.wp || {}), (wp.sanitize = { stripTags: function (t) { t = new DOMParser().parseFromString(t, "text/html"); return ((t.body.innerText = t.body.innerText), t.body.innerHTML); }, stripTagsAndEncodeText: function (t) { let e = wp.sanitize.stripTags(t), n = document.createElement("textarea"); try { ((n.textContent = e), (e = wp.sanitize.stripTags(n.value))); } catch (t) {} return e; }, }));
When combined with:
const unsafeText = `Hello: <style style="display:block"><script>document.write('evil');</script></style>[[Image(https://s.w.org/style/images/about/WordPress-logotype-wmark.png)]]`; console.log(wp.sanitize.stripTags(unsafeText));
The output is as expected:
Hello: <script>document.write('evil');</script>
@westonruter commented on PR #10536:
4 days ago
#11
If you don’t beat me to it, I’ll try and merge this today.
@dmsnell Looks like there is a commit freeze until after 6.9 is released.
I've drafted the following commit message:
General: Leverage `DOMParser` to implement `wp.sanitize.stripTags()`. Developed in https://github.com/WordPress/wordpress-develop/pull/10536. Follow-up to [60907]. Props hbhalodia, dmsnell, westonruter. See #48054. Fixes #64274.
Hi @westonruter @dmsnell, If I am collecting it correctly, we should remove the usage of
regexand use something likeDOMParseror a simple HTML tag viacreateElement, add it as an innerHTML and extract the innerText from it and return that?Something like below,
or
Let me know if this is something we need to update?
Thanks,