WordPress.org

Make WordPress Core

Opened 2 years ago

Last modified 5 months ago

#24225 assigned enhancement

Improve regular expressions when matching attributes

Reported by: kovshenin Owned by: kovshenin
Milestone: Future Release Priority: normal
Severity: normal Version: 3.6
Component: Formatting Keywords: has-patch needs-unit-tests
Focuses: Cc:

Description

When working with HTML attributes in various functions we use something like [\'"](.+?)[\'"] to match the attribute value. Although not used much in attribute values, such an approach breaks when a single or double quote is intentionally part of the value:

  • attr="val'ue"
  • attr='val"ue'

Some of these are new 3.6 functions, mostly around media.

Attachments (6)

24225.diff (3.0 KB) - added by kovshenin 2 years ago.
24225.2.diff (3.0 KB) - added by kovshenin 2 years ago.
24225.3.diff (2.4 KB) - added by kovshenin 2 years ago.
24225.4.diff (776 bytes) - added by kovshenin 2 years ago.
24225.5.diff (537 bytes) - added by kovshenin 7 months ago.
24225.6.diff (604 bytes) - added by kovshenin 7 months ago.

Download all attachments as: .zip

Change History (26)

@kovshenin2 years ago

comment:1 @kovshenin2 years ago

  • Keywords has-patch added

24225.diff fixes some of these by using a backreference to the first matched single or double quote.

comment:2 follow-up: @SergeyBiryukov2 years ago

  • Milestone changed from Awaiting Review to 3.6
  • Version set to trunk

echo $item; appears to be left from debugging.

@kovshenin2 years ago

comment:3 in reply to: ↑ 2 @kovshenin2 years ago

Replying to SergeyBiryukov: Whoops! Thanks for catching that, updated in .2.

@kovshenin2 years ago

comment:4 @kovshenin2 years ago

Refreshed in 24225.3.diff:

  • Remove extra slashes in backreferences - not needed since we're using single quotes
  • Use the original regex in wp-admin/includes/media.php, it's trickier than I thought - can't use backreferences in character classes

comment:5 @markjaquith2 years ago

In 24241:

Improve regular expressions by using a backreference to match right quote of quote pair when matching attributes.

props kovshenin. see #24225.

comment:6 follow-up: @markjaquith2 years ago

it's trickier than I thought - can't use backreferences in character classes

Looks like a negative lookahead might work:

http://stackoverflow.com/questions/8055727/negating-a-backreference-in-regular-expressions

comment:7 in reply to: ↑ 6 @kovshenin2 years ago

Replying to markjaquith: Right, but that could be up to 5 times slower depending on the data :(

@kovshenin2 years ago

comment:8 @kovshenin2 years ago

For the kind of data coming through image_add_caption negative lookahead isn't going to be a performance issue. If the alignment class is first in the list (which is usually the case) there won't be a lookahead at all because of the lazy quantifier. See 24225.4.diff

comment:9 @nacin22 months ago

  • Milestone changed from 3.6 to Future Release

Seems like this is a safe punt at this time.

Is 24225.4.diff the only proposed fix at this point?

comment:10 @kovshenin22 months ago

Yes this is safe to punt. .4 seems to be the only proposed solution so far.

comment:11 @wonderboymusic7 months ago

  • Owner set to kovshenin
  • Status changed from new to assigned

Decide if you still care

comment:12 @SergeyBiryukov7 months ago

I think we can close this as fixed in [24241].

@kovshenin7 months ago

@kovshenin7 months ago

comment:13 @kovshenin7 months ago

  • Milestone changed from Future Release to 4.1

Found another occurrence, covered in 24225.5.diff.

Also added 24225.6.diff as an alternative to .4.diff. It matches non-quote characters unless a quote character is encountered. When one is encountered it uses negative lookahead to attempt to match a quote that is not the opening quote, so it should be pretty efficient. It's an extreme (read impossible) edge case and doesn't need to be committed, but good to have here for future reference.

comment:14 @DrewAPicture6 months ago

  • Component changed from General to Formatting

comment:15 follow-up: @miqrogroove6 months ago

([^\'"]|(?!\2).)*?

This is alternating between not a quote and not a quote?

comment:16 in reply to: ↑ 15 @kovshenin6 months ago

Replying to miqrogroove: That's alternating between not a quote or a quote, but not the opening one that matched.

comment:17 follow-up: @miqrogroove6 months ago

It makes more sense after reading comment 13. Regex is such a pain sometimes.

I'm wondering if this is going to be worth the effort though.

Cases like this wouldn't really work as expected with or without the patch:

value="hello class='world alignright" class="alignnone"

comment:18 in reply to: ↑ 17 ; follow-up: @kovshenin6 months ago

Replying to miqrogroove: It should match the correct attribute if you added something like .*?\2 instead of \s? to the end of the regex. This works in Perl:

$html = 'value="hello class=\'world alignright" class="foo alignnone"';
if ( $html =~ /(class=(["\'])(?:[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/ ) {
	print "$1\n";
}

But it doesn't work in PHP :) I don't really know why. But you're right, this is not worth the effort, and as I said, the patch was just for reference as a way to match attributes with negative lookahead, just in case we ever want to revisit parsing HTML with regex again.

comment:19 in reply to: ↑ 18 @kovshenin6 months ago

But it doesn't work in PHP :) I don't really know why.

Oh I know why, it's hitting the backtrack limit :) we can put the non-quote-or-quote match in an atomic group, like this:

/(class=(["\'])(?>[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/

comment:20 @johnbillion5 months ago

  • Keywords needs-unit-tests added
  • Milestone changed from 4.1 to Future Release
Note: See TracTickets for help on using tickets.