WordPress.org

Make WordPress Core

Opened 17 months ago

Last modified 3 months ago

#22625 new defect (bug)

Lone less than (<) characters foul up balanceTags()

Reported by: markjaquith Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch has-unit-tests
Focuses: Cc:

Description

  1. Turn on tag balancing.
  2. Try this text:
    This is < That.
    
    <blockquote>foo</blockquote>
    
    Text <a href="#">link</a>. Post-link text.
    
  3. On save, the ending blockquote tag is stripped, and the subsequent text is blockquoted.

We should recognize lone < characters and entity encode them like we do for <3.

Attachments (6)

balanceTags-23001.patch (1.9 KB) - added by devesine 17 months ago.
balanceTags-unit-1162.patch (951 bytes) - added by devesine 17 months ago.
balanceTags-unit-1162-2.patch (1.0 KB) - added by devesine 17 months ago.
balanceTags-23001-2.patch (958 bytes) - added by devesine 17 months ago.
balanceTags-unit-1162-3.patch (1.4 KB) - added by devesine 17 months ago.
-3 is a slightly more extensive unit test for dangling < handling.
balanceTags-25012.patch (958 bytes) - added by devesine 8 months ago.

Download all attachments as: .zip

Change History (13)

comment:1 in reply to: ↑ description devesine17 months ago

Here's a unit test for this, as well as the start of a possible patch to provide this functionality. Seriously malformed things like

< blockquote > < foo < blockquote >

still get mangled to the amusing

&lt; &lt; foo < blockquote >

with that patch, which is obviously wrong - in part, because this patch only handles a single lone < leading a tag - but I wanted to put what I had done up.

I'm not sure what the expected behavior might be where floating '>' are concerned, either.

comment:2 follow-ups: azaozz17 months ago

Perhaps extend the <3 regex to search for any < that don't look like html tag start?

comment:3 in reply to: ↑ 2 devesine17 months ago

Replying to azaozz:

Perhaps extend the <3 regex to search for any < that don't look like html tag start?

Hm, that seems like it would be very inefficient, but that gives me an idea - it looks like we could change the tag regex, currently

"/<(\/?[\w:]*)\s*([^>]*)>/"

to require that a tag be present, like

"/<(\/?[\w:]+)\s*([^>]*)>/"

(and encode any < left sitting in the before-the-tag text). That seems to work fine (and is much simpler), demonstrated in the -2 patch and unit test.

comment:4 in reply to: ↑ 2 devesine17 months ago

As discussed in IRC, performance numbers over 100,000 trials (running against the HTML from http://wordpress.org/), using the following code:

$source = file_get_contents('http://wordpress.org/');
$i = 0;
$ts_arr = array();
while ($i < 100000) {
	unset($foo);
	$ts = microtime(true);
	$foo = balanceTags($source);
	$end_ts = microtime(true);
	$ts_arr[] = $end_ts - $ts;
	$i += 1;
}
echo "Min: " . min($ts_arr) . PHP_EOL;
echo "Max: " . max($ts_arr) . PHP_EOL;
echo "Mean: " . (array_sum($ts_arr) / count($ts_arr)) . PHP_EOL;

Original:

Min: 4.9829483032227E-5
Max: 0.0089290142059326
Mean: 5.6982216835022E-5

-2 patch:

Min: 4.9829483032227E-5
Max: 0.0017969608306885
Mean: 5.7010720252991E-5

Looks like the performance is pretty solidly comparable.

devesine17 months ago

-3 is a slightly more extensive unit test for dangling < handling.

comment:5 devesine16 months ago

  • Keywords has-patch added

comment:6 devesine8 months ago

Refreshed against trunk: attachment:balanceTags-25012.patch

attachment:balanceTags-unit-1162-3.patch (to add unit tests) still applies cleanly.

comment:7 nacin3 months ago

  • Component changed from General to Formatting
  • Keywords has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release
Note: See TracTickets for help on using tickets.