Make WordPress Core

Opened 2 weeks ago

Closed 44 hours ago

#64274 closed enhancement (fixed)

wp.sanitize.stripTags could rely on the browser for HTML parsing

Reported by: dmsnell's profile dmsnell Owned by: dmsnell's profile dmsnell
Milestone: 7.0 Priority: normal
Severity: normal Version: 4.2
Component: General Keywords: has-patch
Focuses: javascript Cc:

Description (last modified by westonruter)

This JavaScript function recreates emblematic problems in HTML parsing by relying on over-simplified regular expressions. It runs, however, inside a browser where a reliable HTML parser exists, and so could or should lean on that parser in order to operate.

See discussion. Follow-up to [60907] for #48054.

Change History (12)

#1 @westonruter
2 weeks ago

  • Description modified (diff)
  • Milestone changed from Awaiting Review to Future Release

#2 @westonruter
2 weeks ago

  • Keywords needs-patch added

#3 @hbhalodia
2 weeks ago

Hi @westonruter @dmsnell, If I am collecting it correctly, we should remove the usage of regex and use something like DOMParser or a simple HTML tag via createElement, add it as an innerHTML and extract the innerText from it and return that?

Something like below,

const parser = new DOMParser();
const doc = parser.parseFromString( text, 'text/html' );

return doc.body.innerText || '';

or

const element = document.createElement( 'div' );
element.innerHTML = text;

return element.innerText;

Let me know if this is something we need to update?

Thanks,

#4 @dmsnell
2 weeks ago

@hbhalodia yes exactly! though I believe that the script is trying to remove comments, tags, and SCRIPT or STYLE contents. I believe that .innerText returns this, but I always get confused between that and .textContent.

#5 @hbhalodia
2 weeks ago

Yeah, it's bit confusing. I guess innerText returns the content that is usually seen in frontend, so for example if div has display: none; it would not return innerText from that div, but if we use textContent, it would return and also preserves the spaces and indentation.

So I am not sure as well what to use here, but Ideally if we do not need to preserve indentation, spaces we should go with textContent, instead of innerText but I am open to suggestions.

Cc: @westonruter @dmsnell

Last edited 2 weeks ago by hbhalodia (previous) (diff)

#6 @dmsnell
2 weeks ago

  • Focuses javascript added
  • Milestone changed from Future Release to 7.0
  • Owner set to dmsnell
  • Status changed from new to assigned
  • Type changed from defect (bug) to enhancement
  • Version set to 4.2

sounds good @hbhalodia; looking forward to your PR

looks like this came in through [31534], but that itself might have been influenced by Prototype.js which was added in 2006. nice longstanding opportunity 😉

This ticket was mentioned in PR #10536 on WordPress/wordpress-develop by @hbhalodia.


2 weeks ago
#7

  • Keywords has-patch added; needs-patch removed

Trac ticket: https://core.trac.wordpress.org/ticket/64274

  • Used HTML parsing instead of regex matching.

#8 @hbhalodia
2 weeks ago

Hi @dmsnell, I do have raised the PR, Can you please review and suggest improvements if any?

Thank You,

@westonruter commented on PR #10536:


6 days ago
#9

@dmsnell Here's the result of the file being minified as wp-includes/js/wp-sanitize.min.js via npm run build:dev:

/*! This file is auto-generated */
window.wp=window.wp||{},wp.sanitize={stripTags:function(t){t=(new DOMParser).parseFromString(t,"text/html");return t.body.innerText=t.body.innerText,t.body.innerHTML},stripTagsAndEncodeText:function(t){let e=wp.sanitize.stripTags(t),n=document.createElement("textarea");try{n.textContent=e,e=wp.sanitize.stripTags(n.value)}catch(t){}return e}};

With prettier formatting:

/*! This file is auto-generated */
((window.wp = window.wp || {}),
  (wp.sanitize = {
    stripTags: function (t) {
      t = new DOMParser().parseFromString(t, "text/html");
      return ((t.body.innerText = t.body.innerText), t.body.innerHTML);
    },
    stripTagsAndEncodeText: function (t) {
      let e = wp.sanitize.stripTags(t),
        n = document.createElement("textarea");
      try {
        ((n.textContent = e), (e = wp.sanitize.stripTags(n.value)));
      } catch (t) {}
      return e;
    },
  }));

When combined with:

const unsafeText = `Hello: <style style="display:block"><script>document.write('evil');</script></style>[[Image(https://s.w.org/style/images/about/WordPress-logotype-wmark.png)]]`;

console.log(wp.sanitize.stripTags(unsafeText));

The output is as expected:

Hello: &lt;script&gt;document.write('evil');&lt;/script&gt;

#10 @westonruter
6 days ago

  • Status changed from assigned to reviewing

@westonruter commented on PR #10536:


4 days ago
#11

If you don’t beat me to it, I’ll try and merge this today.

@dmsnell Looks like there is a commit freeze until after 6.9 is released.

I've drafted the following commit message:

General: Leverage `DOMParser` to implement `wp.sanitize.stripTags()`.

Developed in https://github.com/WordPress/wordpress-develop/pull/10536.

Follow-up to [60907].

Props hbhalodia, dmsnell, westonruter.
See #48054.
Fixes #64274.

#12 @westonruter
44 hours ago

  • Resolution set to fixed
  • Status changed from reviewing to closed

In 61347:

General: Leverage DOMParser to implement wp.sanitize.stripTags().

Developed in https://github.com/WordPress/wordpress-develop/pull/10536

Follow-up to [60907].

Props hbhalodia, dmsnell, westonruter.
See #48054.
Fixes #64274.

Note: See TracTickets for help on using tickets.