WordPress.org

Make WordPress Core

Opened 6 years ago

Last modified 3 years ago

#22625 new defect (bug)

Lone less than (<) characters foul up balanceTags()

Reported by: markjaquith Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch has-unit-tests close
Focuses: Cc:

Description

  1. Turn on tag balancing.
  2. Try this text:
    This is < That.
    
    <blockquote>foo</blockquote>
    
    Text <a href="#">link</a>. Post-link text.
    
  3. On save, the ending blockquote tag is stripped, and the subsequent text is blockquoted.

We should recognize lone < characters and entity encode them like we do for <3.

Attachments (7)

balanceTags-23001.patch (1.9 KB) - added by devesine 6 years ago.
balanceTags-unit-1162.patch (951 bytes) - added by devesine 6 years ago.
balanceTags-unit-1162-2.patch (1.0 KB) - added by devesine 6 years ago.
balanceTags-23001-2.patch (958 bytes) - added by devesine 6 years ago.
balanceTags-unit-1162-3.patch (1.4 KB) - added by devesine 6 years ago.
-3 is a slightly more extensive unit test for dangling < handling.
balanceTags-25012.patch (958 bytes) - added by devesine 5 years ago.
22625.patch (2.4 KB) - added by chriscct7 3 years ago.

Download all attachments as: .zip

Change History (15)

#1 in reply to: ↑ description @devesine
6 years ago

Here's a unit test for this, as well as the start of a possible patch to provide this functionality. Seriously malformed things like

< blockquote > < foo < blockquote >

still get mangled to the amusing

&lt; &lt; foo < blockquote >

with that patch, which is obviously wrong - in part, because this patch only handles a single lone < leading a tag - but I wanted to put what I had done up.

I'm not sure what the expected behavior might be where floating '>' are concerned, either.

#2 follow-ups: @azaozz
6 years ago

Perhaps extend the <3 regex to search for any < that don't look like html tag start?

#3 in reply to: ↑ 2 @devesine
6 years ago

Replying to azaozz:

Perhaps extend the <3 regex to search for any < that don't look like html tag start?

Hm, that seems like it would be very inefficient, but that gives me an idea - it looks like we could change the tag regex, currently

"/<(\/?[\w:]*)\s*([^>]*)>/"

to require that a tag be present, like

"/<(\/?[\w:]+)\s*([^>]*)>/"

(and encode any < left sitting in the before-the-tag text). That seems to work fine (and is much simpler), demonstrated in the -2 patch and unit test.

#4 in reply to: ↑ 2 @devesine
6 years ago

As discussed in IRC, performance numbers over 100,000 trials (running against the HTML from http://wordpress.org/), using the following code:

$source = file_get_contents('http://wordpress.org/');
$i = 0;
$ts_arr = array();
while ($i < 100000) {
	unset($foo);
	$ts = microtime(true);
	$foo = balanceTags($source);
	$end_ts = microtime(true);
	$ts_arr[] = $end_ts - $ts;
	$i += 1;
}
echo "Min: " . min($ts_arr) . PHP_EOL;
echo "Max: " . max($ts_arr) . PHP_EOL;
echo "Mean: " . (array_sum($ts_arr) / count($ts_arr)) . PHP_EOL;

Original:

Min: 4.9829483032227E-5
Max: 0.0089290142059326
Mean: 5.6982216835022E-5

-2 patch:

Min: 4.9829483032227E-5
Max: 0.0017969608306885
Mean: 5.7010720252991E-5

Looks like the performance is pretty solidly comparable.

@devesine
6 years ago

-3 is a slightly more extensive unit test for dangling < handling.

#5 @devesine
6 years ago

  • Keywords has-patch added

#6 @devesine
5 years ago

Refreshed against trunk: attachment:balanceTags-25012.patch

attachment:balanceTags-unit-1162-3.patch (to add unit tests) still applies cleanly.

#7 @nacin
4 years ago

  • Component changed from General to Formatting
  • Keywords has-unit-tests added
  • Milestone changed from Awaiting Review to Future Release

@chriscct7
3 years ago

#8 @chriscct7
3 years ago

  • Keywords close added

balanceTags is now removed from new installs though, so marking for close consideration.

New patch refreshes and combines the two old ones.

Note: See TracTickets for help on using tickets.