Make WordPress Core

Opened 17 years ago

Closed 17 years ago

Last modified 16 years ago

#3070 closed defect (bug) (wontfix)

use of php's "strip_tags" gives improper/incomplete results

Reported by: _ck_'s profile _ck_ Owned by:
Milestone: Priority: high
Severity: major Version: 2.1
Component: Administration Keywords:
Focuses: Cc:


Wordpress uses the PHP strip_tags function at least 25 times.

The built-in function is known to have serious limitations including faults with SCRIPT and STYLE tags which can cause security issues and/or cause non-validation failures (ie. javacript is inserted into RSS feeds)

Solution: replace with the superior html2txt example as shown in the PHP manual

function html2txt($document){
$search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
               '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
               '@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
               '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA
$text = preg_replace($search, '', $document);
return $text;

Change History (11)

#1 @_ck_
17 years ago

Apologies, this is the correct link to the 25 references

#2 @masquerade
17 years ago

I'd like to see some proof or somesuch on the problems with strip_tags.

#3 @masquerade
17 years ago

echo html2txt("<><script<>>alert('hey');</scr<>ipt>");

I'd say strip_tags is much better.

#4 @_ck_
17 years ago

weird, strip_tags is not removing javascript in my posts on my real-world setup for RSS (it's only php 4.4.1 but still it's not that old)

html2txt seems to grab it all?

I'll have to test and compare more carefully and post examples soon.

#5 @_ck_
17 years ago

Okay part of the problem with html2txt as it is posted on is the array search order is bad. It will pass your test if you move the HTML strip to the first item:

$search = array('@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
		'@<script[^>]*?>.*?</script>@si',  // Strip out javascript               
               '@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
               '@<![\s\S]*?--[ \t\n\r]*>@'        // Strip multi-line comments including CDATA

#6 @_ck_
17 years ago

Okay I've figured out the problem is with desired behavior (and that I am not explaining it enough).

If javascript is used within a post (or possibly a comment if that is allowed) the problem is strip_tags will remove SCRIPT tags ONLY and leave the code inbetween!

So your post via RSS will look like:
blah blah blah document.write("example"); blah blah

html2txt will fix that behavior by stripping the code between SCRIPT first, then processing HTML tags (ignore my suggestion to change the processing order array in the previous comment).

You are correct in that it has a weakness for purposely maligned tags.

There must be a way to harden it, and I am working on that.

Certainly you'd agree that leaving the javascript code behind after removing SCRIPT tags is bad behavior via strip_tags?

#7 @_ck_
17 years ago

Okay this code is somewhat nasty but works.
Perhaps some PHP pros can improve upon it but keep it's functionality.
It now strips script/style/embed/object/iframe which are the most serious not just for attacks but for honest cleanup of your posts to RSS viewing.


echo html2txt("1. <><script< >>alert('hey <br> html2txt <scr<>ipt>');</scr<>ipt>")."<br>";

echo html2txt("2. before <scr<>ipt>document.write('html2txt <scr<>ipt>');</script> after ")."<br>";

echo strip_tags("3. <><script< >>alert('hey <br> strip_tags <scr<>ipt>');</scr<>ipt> ")."<br>";

echo strip_tags("4. before <scr<>ipt>document.write('strip_tags <scr<>ipt>');</script> after")."<br>";

function html2txt($text){
$search = array('@<>@',
		'@<script[^>]*?>.*?</script>@siU',  // Strip out javascript  
               '@<style[^>]*?>.*?</style>@siU',    // Strip style tags properly
               '@<embed[^>]*?>.*?</embed>@siU',    // embed
               '@<object[^>]*?>.*?</object>@siU',    // object
	       '@<iframe[^>]*?>.*?</iframe>@siU',    // iframe	       
               '@<![\s\S]*?--[ \t\n\r]*>@',        // Strip multi-line comments including CDATA               
               '@</?[^>]*>*@' 		  // html tags

while($text != strip_tags($text)) { $text = preg_replace($search, '', $text); }
return $text;


#8 @jcwinnie
17 years ago

I have been having some problems with some neato add-ons and have suspected some such problem.

Possible examples

  1. I use Ultimate Tag Warrior. With 2.0.3 I could search ?tag=Christine+Davis and get all posts tagged Christine AND tagged Davis. With 2.0.4 the results are for posts tagged "Christine Davis"
  1. I use k2 theme and had modified my sidebar using SBM so that my delicious linkroll appeared and Technorati ID. Both are javascript.

Now Sidebar Module seems to remove all javascript from these added modules preventing their function.

#9 @jcwinnie
17 years ago

An update:

Which is consternation about the Ajax used in Sidebar Module breaking other stuff. In other words, reverting to an earlier pre-SBM k2 theme eliminated problem #1 -- Lost Multi-tag Search Functionailty.

#10 @matt
17 years ago

  • Resolution set to wontfix
  • Status changed from new to closed

I don't see what this provides that KSES doesn't. Showing what's in between the tags when the strip things is okay.

#11 @Nazgul
16 years ago

  • Milestone 2.1 deleted
Note: See TracTickets for help on using tickets.