WordPress.org

Make WordPress Core

Opened 22 months ago

Last modified 3 months ago

#24225 assigned enhancement

Improve regular expressions when matching attributes

Reported by: kovshenin Owned by: kovshenin
Milestone: Future Release Priority: normal
Severity: normal Version: 3.6
Component: Formatting Keywords: has-patch needs-unit-tests
Focuses: Cc:

Description

When working with HTML attributes in various functions we use something like [\'"](.+?)[\'"] to match the attribute value. Although not used much in attribute values, such an approach breaks when a single or double quote is intentionally part of the value:

  • attr="val'ue"
  • attr='val"ue'

Some of these are new 3.6 functions, mostly around media.

Attachments (6)

24225.diff (3.0 KB) - added by kovshenin 22 months ago.
24225.2.diff (3.0 KB) - added by kovshenin 22 months ago.
24225.3.diff (2.4 KB) - added by kovshenin 22 months ago.
24225.4.diff (776 bytes) - added by kovshenin 22 months ago.
24225.5.diff (537 bytes) - added by kovshenin 5 months ago.
24225.6.diff (604 bytes) - added by kovshenin 5 months ago.

Download all attachments as: .zip

Change History (26)

@kovshenin22 months ago

comment:1 @kovshenin22 months ago

  • Keywords has-patch added

24225.diff fixes some of these by using a backreference to the first matched single or double quote.

comment:2 follow-up: @SergeyBiryukov22 months ago

  • Milestone changed from Awaiting Review to 3.6
  • Version set to trunk

echo $item; appears to be left from debugging.

@kovshenin22 months ago

comment:3 in reply to: ↑ 2 @kovshenin22 months ago

Replying to SergeyBiryukov: Whoops! Thanks for catching that, updated in .2.

@kovshenin22 months ago

comment:4 @kovshenin22 months ago

Refreshed in 24225.3.diff:

  • Remove extra slashes in backreferences - not needed since we're using single quotes
  • Use the original regex in wp-admin/includes/media.php, it's trickier than I thought - can't use backreferences in character classes

comment:5 @markjaquith22 months ago

In 24241:

Improve regular expressions by using a backreference to match right quote of quote pair when matching attributes.

props kovshenin. see #24225.

comment:6 follow-up: @markjaquith22 months ago

it's trickier than I thought - can't use backreferences in character classes

Looks like a negative lookahead might work:

http://stackoverflow.com/questions/8055727/negating-a-backreference-in-regular-expressions

comment:7 in reply to: ↑ 6 @kovshenin22 months ago

Replying to markjaquith: Right, but that could be up to 5 times slower depending on the data :(

@kovshenin22 months ago

comment:8 @kovshenin22 months ago

For the kind of data coming through image_add_caption negative lookahead isn't going to be a performance issue. If the alignment class is first in the list (which is usually the case) there won't be a lookahead at all because of the lazy quantifier. See 24225.4.diff

comment:9 @nacin20 months ago

  • Milestone changed from 3.6 to Future Release

Seems like this is a safe punt at this time.

Is 24225.4.diff the only proposed fix at this point?

comment:10 @kovshenin20 months ago

Yes this is safe to punt. .4 seems to be the only proposed solution so far.

comment:11 @wonderboymusic5 months ago

  • Owner set to kovshenin
  • Status changed from new to assigned

Decide if you still care

comment:12 @SergeyBiryukov5 months ago

I think we can close this as fixed in [24241].

@kovshenin5 months ago

@kovshenin5 months ago

comment:13 @kovshenin5 months ago

  • Milestone changed from Future Release to 4.1

Found another occurrence, covered in 24225.5.diff.

Also added 24225.6.diff as an alternative to .4.diff. It matches non-quote characters unless a quote character is encountered. When one is encountered it uses negative lookahead to attempt to match a quote that is not the opening quote, so it should be pretty efficient. It's an extreme (read impossible) edge case and doesn't need to be committed, but good to have here for future reference.

comment:14 @DrewAPicture4 months ago

  • Component changed from General to Formatting

comment:15 follow-up: @miqrogroove4 months ago

([^\'"]|(?!\2).)*?

This is alternating between not a quote and not a quote?

comment:16 in reply to: ↑ 15 @kovshenin4 months ago

Replying to miqrogroove: That's alternating between not a quote or a quote, but not the opening one that matched.

comment:17 follow-up: @miqrogroove4 months ago

It makes more sense after reading comment 13. Regex is such a pain sometimes.

I'm wondering if this is going to be worth the effort though.

Cases like this wouldn't really work as expected with or without the patch:

value="hello class='world alignright" class="alignnone"

comment:18 in reply to: ↑ 17 ; follow-up: @kovshenin4 months ago

Replying to miqrogroove: It should match the correct attribute if you added something like .*?\2 instead of \s? to the end of the regex. This works in Perl:

$html = 'value="hello class=\'world alignright" class="foo alignnone"';
if ( $html =~ /(class=(["\'])(?:[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/ ) {
	print "$1\n";
}

But it doesn't work in PHP :) I don't really know why. But you're right, this is not worth the effort, and as I said, the patch was just for reference as a way to match attributes with negative lookahead, just in case we ever want to revisit parsing HTML with regex again.

comment:19 in reply to: ↑ 18 @kovshenin4 months ago

But it doesn't work in PHP :) I don't really know why.

Oh I know why, it's hitting the backtrack limit :) we can put the non-quote-or-quote match in an atomic group, like this:

/(class=(["\'])(?>[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/

comment:20 @johnbillion3 months ago

  • Keywords needs-unit-tests added
  • Milestone changed from 4.1 to Future Release
Note: See TracTickets for help on using tickets.