Make WordPress Core

Opened 11 years ago

Last modified 5 years ago

#24225 assigned enhancement

Improve regular expressions when matching attributes

Reported by: kovshenin's profile kovshenin Owned by: miqrogroove's profile miqrogroove
Milestone: Priority: normal
Severity: normal Version: 3.6
Component: Formatting Keywords: has-patch needs-unit-tests
Focuses: Cc:

Description

When working with HTML attributes in various functions we use something like [\'"](.+?)[\'"] to match the attribute value. Although not used much in attribute values, such an approach breaks when a single or double quote is intentionally part of the value:

  • attr="val'ue"
  • attr='val"ue'

Some of these are new 3.6 functions, mostly around media.

Attachments (6)

24225.diff (3.0 KB) - added by kovshenin 11 years ago.
24225.2.diff (3.0 KB) - added by kovshenin 11 years ago.
24225.3.diff (2.4 KB) - added by kovshenin 11 years ago.
24225.4.diff (776 bytes) - added by kovshenin 11 years ago.
24225.5.diff (537 bytes) - added by kovshenin 9 years ago.
24225.6.diff (604 bytes) - added by kovshenin 9 years ago.

Download all attachments as: .zip

Change History (27)

@kovshenin
11 years ago

#1 @kovshenin
11 years ago

  • Keywords has-patch added

24225.diff fixes some of these by using a backreference to the first matched single or double quote.

#2 follow-up: @SergeyBiryukov
11 years ago

  • Milestone changed from Awaiting Review to 3.6
  • Version set to trunk

echo $item; appears to be left from debugging.

@kovshenin
11 years ago

#3 in reply to: ↑ 2 @kovshenin
11 years ago

Replying to SergeyBiryukov: Whoops! Thanks for catching that, updated in .2.

@kovshenin
11 years ago

#4 @kovshenin
11 years ago

Refreshed in 24225.3.diff:

  • Remove extra slashes in backreferences - not needed since we're using single quotes
  • Use the original regex in wp-admin/includes/media.php, it's trickier than I thought - can't use backreferences in character classes

#5 @markjaquith
11 years ago

In 24241:

Improve regular expressions by using a backreference to match right quote of quote pair when matching attributes.

props kovshenin. see #24225.

#6 follow-up: @markjaquith
11 years ago

it's trickier than I thought - can't use backreferences in character classes

Looks like a negative lookahead might work:

http://stackoverflow.com/questions/8055727/negating-a-backreference-in-regular-expressions

#7 in reply to: ↑ 6 @kovshenin
11 years ago

Replying to markjaquith: Right, but that could be up to 5 times slower depending on the data :(

@kovshenin
11 years ago

#8 @kovshenin
11 years ago

For the kind of data coming through image_add_caption negative lookahead isn't going to be a performance issue. If the alignment class is first in the list (which is usually the case) there won't be a lookahead at all because of the lazy quantifier. See 24225.4.diff

#9 @nacin
11 years ago

  • Milestone changed from 3.6 to Future Release

Seems like this is a safe punt at this time.

Is 24225.4.diff the only proposed fix at this point?

#10 @kovshenin
11 years ago

Yes this is safe to punt. .4 seems to be the only proposed solution so far.

#11 @wonderboymusic
9 years ago

  • Owner set to kovshenin
  • Status changed from new to assigned

Decide if you still care

#12 @SergeyBiryukov
9 years ago

I think we can close this as fixed in [24241].

@kovshenin
9 years ago

@kovshenin
9 years ago

#13 @kovshenin
9 years ago

  • Milestone changed from Future Release to 4.1

Found another occurrence, covered in 24225.5.diff.

Also added 24225.6.diff as an alternative to .4.diff. It matches non-quote characters unless a quote character is encountered. When one is encountered it uses negative lookahead to attempt to match a quote that is not the opening quote, so it should be pretty efficient. It's an extreme (read impossible) edge case and doesn't need to be committed, but good to have here for future reference.

#14 @DrewAPicture
9 years ago

  • Component changed from General to Formatting

#15 follow-up: @miqrogroove
9 years ago

([^\'"]|(?!\2).)*?

This is alternating between not a quote and not a quote?

#16 in reply to: ↑ 15 @kovshenin
9 years ago

Replying to miqrogroove: That's alternating between not a quote or a quote, but not the opening one that matched.

#17 follow-up: @miqrogroove
9 years ago

It makes more sense after reading comment 13. Regex is such a pain sometimes.

I'm wondering if this is going to be worth the effort though.

Cases like this wouldn't really work as expected with or without the patch:

value="hello class='world alignright" class="alignnone"

#18 in reply to: ↑ 17 ; follow-up: @kovshenin
9 years ago

Replying to miqrogroove: It should match the correct attribute if you added something like .*?\2 instead of \s? to the end of the regex. This works in Perl:

$html = 'value="hello class=\'world alignright" class="foo alignnone"';
if ( $html =~ /(class=(["\'])(?:[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/ ) {
	print "$1\n";
}

But it doesn't work in PHP :) I don't really know why. But you're right, this is not worth the effort, and as I said, the patch was just for reference as a way to match attributes with negative lookahead, just in case we ever want to revisit parsing HTML with regex again.

#19 in reply to: ↑ 18 @kovshenin
9 years ago

But it doesn't work in PHP :) I don't really know why.

Oh I know why, it's hitting the backtrack limit :) we can put the non-quote-or-quote match in an atomic group, like this:

/(class=(["\'])(?>[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/

#20 @johnbillion
9 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from 4.1 to Future Release

#21 @wonderboymusic
9 years ago

  • Owner changed from kovshenin to miqrogroove

I'm lost as to where this discussion ended - do we need to do anything?

Note: See TracTickets for help on using tickets.