Opened 11 years ago
Last modified 5 years ago
#24225 assigned enhancement
Improve regular expressions when matching attributes
Reported by: |
|
Owned by: |
|
---|---|---|---|
Milestone: | Priority: | normal | |
Severity: | normal | Version: | 3.6 |
Component: | Formatting | Keywords: | has-patch needs-unit-tests |
Focuses: | Cc: |
Description
When working with HTML attributes in various functions we use something like [\'"](.+?)[\'"]
to match the attribute value. Although not used much in attribute values, such an approach breaks when a single or double quote is intentionally part of the value:
attr="val'ue"
attr='val"ue'
Some of these are new 3.6 functions, mostly around media.
Attachments (6)
Change History (27)
#2
follow-up:
↓ 3
@
11 years ago
- Milestone changed from Awaiting Review to 3.6
- Version set to trunk
echo $item;
appears to be left from debugging.
#3
in reply to:
↑ 2
@
11 years ago
Replying to SergeyBiryukov: Whoops! Thanks for catching that, updated in .2.
#4
@
11 years ago
Refreshed in 24225.3.diff:
- Remove extra slashes in backreferences - not needed since we're using single quotes
- Use the original regex in wp-admin/includes/media.php, it's trickier than I thought - can't use backreferences in character classes
#6
follow-up:
↓ 7
@
11 years ago
it's trickier than I thought - can't use backreferences in character classes
Looks like a negative lookahead might work:
http://stackoverflow.com/questions/8055727/negating-a-backreference-in-regular-expressions
#7
in reply to:
↑ 6
@
11 years ago
Replying to markjaquith: Right, but that could be up to 5 times slower depending on the data :(
#8
@
11 years ago
For the kind of data coming through image_add_caption
negative lookahead isn't going to be a performance issue. If the alignment class is first in the list (which is usually the case) there won't be a lookahead at all because of the lazy quantifier. See 24225.4.diff
#9
@
10 years ago
- Milestone changed from 3.6 to Future Release
Seems like this is a safe punt at this time.
Is 24225.4.diff the only proposed fix at this point?
#11
@
9 years ago
- Owner set to kovshenin
- Status changed from new to assigned
Decide if you still care
#13
@
9 years ago
- Milestone changed from Future Release to 4.1
Found another occurrence, covered in 24225.5.diff.
Also added 24225.6.diff as an alternative to .4.diff. It matches non-quote characters unless a quote character is encountered. When one is encountered it uses negative lookahead to attempt to match a quote that is not the opening quote, so it should be pretty efficient. It's an extreme (read impossible) edge case and doesn't need to be committed, but good to have here for future reference.
#15
follow-up:
↓ 16
@
9 years ago
([^\'"]|(?!\2).)*?
This is alternating between not a quote and not a quote?
#16
in reply to:
↑ 15
@
9 years ago
Replying to miqrogroove: That's alternating between not a quote or a quote, but not the opening one that matched.
#17
follow-up:
↓ 18
@
9 years ago
It makes more sense after reading comment 13. Regex is such a pain sometimes.
I'm wondering if this is going to be worth the effort though.
Cases like this wouldn't really work as expected with or without the patch:
value="hello class='world alignright" class="alignnone"
#18
in reply to:
↑ 17
;
follow-up:
↓ 19
@
9 years ago
Replying to miqrogroove: It should match the correct attribute if you added something like .*?\2
instead of \s?
to the end of the regex. This works in Perl:
$html = 'value="hello class=\'world alignright" class="foo alignnone"'; if ( $html =~ /(class=(["\'])(?:[^\'"]|(?!\2).)*?)align(none|left|right|center).*?\2/ ) { print "$1\n"; }
But it doesn't work in PHP :) I don't really know why. But you're right, this is not worth the effort, and as I said, the patch was just for reference as a way to match attributes with negative lookahead, just in case we ever want to revisit parsing HTML with regex again.
24225.diff fixes some of these by using a backreference to the first matched single or double quote.