Make WordPress Core

Opened 14 years ago

Closed 14 years ago

Last modified 14 years ago

#14993 closed defect (bug) (fixed)

make_clickable issue with ! character in URL

Reported by: lancewillett's profile lancewillett Owned by: westi's profile westi
Milestone: 3.1 Priority: high
Severity: critical Version: 3.0.1
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

This comes to light with recent Twitter changes to how tweet permalinks work. Example: http://twitter.com/#!/wordpress/status/25907440233

When filtered with make_clickable the URL result stops at the # character. Resulting HTML:

<a href="http://twitter.com/#" rel="noreferrer nofollow">http://twitter.com/#</a>
!/wordpress/status/25907440233

Adding ! to the pattern would avoid this.

Attachments (6)

make-clickable-twitter.14993.diff (1.1 KB) - added by filosofo 14 years ago.
TestMakeClickable.php (11.4 KB) - added by filosofo 14 years ago.
make-clickable-twitter.14993.2.diff (1.1 KB) - added by filosofo 14 years ago.
make-clickable-twitter.14993.3.diff (1.1 KB) - added by filosofo 14 years ago.
make_clickable_segfault.php (8.8 KB) - added by westi 14 years ago.
Demo file - with the patch applied making the comment clickable segfaults
make-clickable-ultimate.14993.diff (1.6 KB) - added by filosofo 14 years ago.

Download all attachments as: .zip

Change History (51)

#1 @Viper007Bond
14 years ago

The drawback would be this:

Check out http://wordpress.org!

I'm not sure how the regex works, but it'd be nice if ! was included only if it had URL characters after it.

#2 @mdawaffe
14 years ago

This pattern will probably become increasingly popular: http://code.google.com/web/ajaxcrawling/docs/specification.html

#3 @nacin
14 years ago

  • Milestone changed from Awaiting Review to 3.1

#4 @westi
14 years ago

  • Owner set to westi
  • Status changed from new to accepted

We have some good test cases for this function so should be easy to test changes

I think we should require something after the ! though so as to not link in Vipers example

If only we could require email style url wrapping in < and > ;-)

#5 @lancewillett
14 years ago

  • Cc lance@… added

#6 follow-up: @hakre
14 years ago

This is a bug in wordpress. URL is defined by RFC3986 and the "!" character is part of the reserved group which makes it immune to transition. Simply said: It's part of the URL - in any case.

Space is not btw, which means, that if it's followed by space it can be both: an exclamation mark and the end of a sentence.

As we already do treat other chars of the reserved group as part of the URL (e.g. "?"), "!" should be treated the same.

If the discussion in this ticket changes some minds that treating characters of the reserved group differently as we currently do in make_clickable, then the behaviour should be changed for the other reserved characters as well.

For any decision, this will break backwards-compability.

Reserved characters are:

      reserved    = gen-delims / sub-delims

      gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"

      sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

Source: RFC 3986 page 13

A nice test-text:

Did it appear on http://www.google.com/?

Yes, it did appear on http://www.google.com/!

Example of the current implementation on wordpress.com: Link.

#7 in reply to: ↑ 6 @Viper007Bond
14 years ago

Replying to hakre:

Example of the current implementation on wordpress.com: Link.

I think ? being a part of the link in that example is a bug. I like how GMail formatted your example (I received your comment via e-mail). Google.com was properly links, but neither the ? nor the ! were.

#8 @hakre
14 years ago

About what the users are concerned is that ! is not linked (link ends prior to !) I think we should put that in because otherwise twitter users will spam trac all the time as they see the bug. See: #15070.

#9 @dossy
14 years ago

  • Cc dossy@… added

I think the behavior of least surprise is to treat "reserved" characters per RFC3986 as part of the URL *if and only if* it isn't followed by whitespace. This is necessary but not sufficient, as what would happen with this URL would be surprising:

http://www.google.com/?!

The URL would be treated as "http://www.google.com/?" - since "!" is followed by whitespace. However, did the user intend the URL to be "http://www.google.com/" followed by "?!" as punctuation? Or, should both the "?!" have been part of the linked URL?

This kind of "intention inference" is really tough. I think there will always be cases where it "does the wrong thing" (not what the author intended) ...

#10 @jane
14 years ago

  • Keywords needs-patch added

If someone wants to get this in for 3.1, please post a patch now, as freeze is right around the corner.

#11 @nacin
14 years ago

  • Priority changed from normal to high
  • Severity changed from normal to major

With http://code.google.com/web/ajaxcrawling/docs/specification.html and adoption especially by Twitter, I'm upping priority and severity on this. Tempted to say this is a must for 3.1. Leaving it as a bug, cause that's what it is.

#12 @nickmomrik
14 years ago

  • Cc nickmomrik added

#13 @norbertm
14 years ago

Comments above make a lot of sense.

Working on a patch for this.

#14 @norbertm
14 years ago

Looked deeply into this issue as well as the standards and it's much more about not having the proper URL matching pattern in make_clickable() rather than Twitter's new URLs.

Currently everything following the protocol part is matched with a single expression which is not correct. We may want to open another issue for getting RFC 3986 compliant by matching part specific patterns. Once we are standards compliant, we can add the two additonal layers on the top of it that we have right now, specifically XSS protection and logic for trailing periods, commas etc.

For a quick fix, submitting a patch very soon (including test cases) that adds the missing exclamation mark to the current pattern. This will avoid Twitter specific bug reports in 3.1.

Will submit a ticket for proposing standards compliance as well.

#15 @westi
14 years ago

Great. Thank You!

#16 @nickmomrik
14 years ago

  • Cc nickmomrik removed

#17 @nickmomrik
14 years ago

  • Cc mtdewvirus@… added

#18 @norbertm
14 years ago

  • Cc norbertm added

#19 @filosofo
14 years ago

  • Keywords has-patch added; needs-patch removed

make-clickable-twitter.14993.diff makes those kinds of URLs clickable and passes the WP unit tests as well as some from the other times we've dealt with make_clickable before (such as #10990).

#21 follow-up: @filosofo
14 years ago

TestMakeClickable.php is what I use to test make_clickable, with tests from the WP test suite and some gleaned from those other tickets.

Note that it says it's testing make_clickable but really it's just testing the making-clickableness for the http protocol, not the others.

You will need to change the include lines to point to your respective WP trunk files, then run:

phpunit --verbose TestMakeClickable

#22 @filosofo
14 years ago

s/protocol/scheme/

#23 in reply to: ↑ 21 @westi
14 years ago

Replying to filosofo:

TestMakeClickable.php is what I use to test make_clickable, with tests from the WP test suite and some gleaned from those other tickets.

Note that it says it's testing make_clickable but really it's just testing the making-clickableness for the http protocol, not the others.

You will need to change the include lines to point to your respective WP trunk files, then run:

phpunit --verbose TestMakeClickable

Going to add the extra test cases to the WP suite thanks :-)

#24 @westi
14 years ago

Opened a ticket for merging these extra test cases and merged a few of them.

http://unit-tests.trac.wordpress.org/ticket/8

Patches for extra test cases always welcome :-)

#25 follow-up: @westi
14 years ago

Committed filosofo's patch in [16111]

Added extra tests in http://unit-tests.trac.wordpress.org/changeset/314

It would still be nice to not link in the ! if after a url for emphasis as discussed above.

#26 in reply to: ↑ 25 @filosofo
14 years ago

Replying to westi:

Committed filosofo's patch in [16111]

Added extra tests in http://unit-tests.trac.wordpress.org/changeset/314

It would still be nice to not link in the ! if after a url for emphasis as discussed above.

OK, make-clickable-twitter.14993.2.diff makes it so the RFC 3986 sub-delims are ignored on either end of a URL, such as

This is a really good tweet http://twitter.com/#!/wordpress/status/25907440233!

except for a closing parenthesis ")" on the end, because from the WP tests the following is supposed to be the entire URL:

http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)

This nods to common practice rather than specification, and changing it would break backwards-compatibility.

#27 @westi
14 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

[16150] - Don't capture RFC3986 sub-delims when making urls clickable except for ). Fixes #14993 props filosofo.

#28 @westi
14 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

This chomps image urls and breaks shortcodes reverting.

#30 follow-up: @filosofo
14 years ago

Can you make test cases for those failures, or at least paste examples here so I can see what you mean?

#31 in reply to: ↑ 30 @westi
14 years ago

Replying to filosofo:

Can you make test cases for those failures, or at least paste examples here so I can see what you mean?

Yes I am working on those.

#32 @westi
14 years ago

http://unit-tests.trac.wordpress.org/changeset/315

Summary - urls quoted with single quotes were being made clickable which broke use as attributes of existing HTML elements :-)

#33 @filosofo
14 years ago

make-clickable-twitter.14993.3.diff passes the new tests.

#34 follow-up: @westi
14 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [16279]) Don't capture the pesky trailing punctuation. Fixes #14993 props filosofo

#35 in reply to: ↑ 34 @westi
14 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Replying to westi:

(In [16279]) Don't capture the pesky trailing punctuation. Fixes #14993 props filosofo

This can cause stack exhaustion in the pcre library leading to a segfault due to the recursive nature of the RegEx matching.

#36 @westi
14 years ago

(In [16692]) Revert [16279] - the recursive nature of this regex doesn't play well with stack space. See #14993

#37 @filosofo
14 years ago

Can you provide an example that leads to this problem?

#38 follow-up: @filosofo
14 years ago

Any more information available on this?

#39 in reply to: ↑ 38 @westi
14 years ago

Replying to filosofo:

Any more information available on this?

Sorry I forgot to come back and upload the test data :-(

Here it is

@westi
14 years ago

Demo file - with the patch applied making the comment clickable segfaults

#40 @westi
14 years ago

gdb backtrace from the segfault:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb6f7f6d0 (LWP 8029)]
match (
    eptr=0x9e71014 "][/url][url=http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/jess19_150.jpg[/img][/url][url=http://www.streamate."..., ecode=0xa88e8dc "M", 
    mstart=0x9e702c9 "http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/natalybaby_150.jpg[/img][/url][url=http://www.streamate.com/?DF="..., offset_top=2, md=0xbfb92df0, ims=5, eptrb=0x0, flags=0, rdepth=6795)
    at pcre_exec.c:405
405	pcre_exec.c: No such file or directory.
	in pcre_exec.c
(gdb) bt
#0  match (
    eptr=0x9e71014 "][/url][url=http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/jess19_150.jpg[/img][/url][url=http://www.streamate."..., ecode=0xa88e8dc "M", 
    mstart=0x9e702c9 "http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/natalybaby_150.jpg[/img][/url][url=http://www.streamate.com/?DF="..., offset_top=2, md=0xbfb92df0, ims=5, eptrb=0x0, flags=0, rdepth=6795)
    at pcre_exec.c:405
#1  0xb74dbfd8 in match (
    eptr=0x9e71014 "][/url][url=http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/jess19_150.jpg[/img][/url][url=http://www.streamate."..., ecode=0xa88e8d9 "]", 
    mstart=0x9e702c9 "http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/natalybaby_150.jpg[/img][/url][url=http://www.streamate.com/?DF="..., offset_top=2, md=0xbfb92df0, ims=5, eptrb=0x0, flags=0, rdepth=6794)
    at pcre_exec.c:749
#2  0xb74e0d22 in match (
    eptr=0x9e71014 "][/url][url=http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/jess19_150.jpg[/img][/url][url=http://www.streamate."..., ecode=0xa88e9c1 "U", 
    mstart=0x9e702c9 "http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/natalybaby_150.jpg[/img][/url][url=http://www.streamate.com/?DF="..., offset_top=2, md=0xbfb92df0, ims=5, eptrb=0x0, flags=0, rdepth=6793)
    at pcre_exec.c:1289
#3  0xb74dbfd8 in match (
    eptr=0x9e71013 "g][/url][url=http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/jess19_150.jpg[/img][/url][url=http://www.streamate"..., ecode=0xa88e8d9 "]", 
    mstart=0x9e702c9 "http://www.streamate.com/?DF=0&amp;AFNO=1-0-605655-342720&amp;UHNSMTY=303][img]http://won.images.streamray.com/images/streamray/won/jpg/natalybaby_150.jpg[/img][/url][url=http://www.streamate.com/?DF="..., offset_top=2, md=0xbfb92df0, ims=5, eptrb=0x0, flags=0, rdepth=6792)
    at pcre_exec.c:749

#41 @nacin
14 years ago

  • Keywords needs-patch added; has-patch removed
  • Severity changed from major to critical

Bumping again.

#42 @filosofo
14 years ago

  • Keywords has-patch added; needs-patch removed

make-clickable-ultimate.14993.diff passes every clickable test that I know of, and it should not cause a segmentation fault for texts like westi's example.

#43 @filosofo
14 years ago

I forgot to mention that I added a test to my personal clickable test based on how it fails in this comment. (The commenter later complains because WordPress mishandles the closing parenthesis.)

array of original and expected output:

array(
        'In his famous speech “You and Your research” (here:
        http://www.cs.virginia.edu/~robins/YouAndYourResearch.html)
        Richard Hamming wrote about people getting more done with their doors closed, but', 
        'In his famous speech “You and Your research” (here:
        <a href="http://www.cs.virginia.edu/~robins/YouAndYourResearch.html" rel="nofollow">http://www.cs.virginia.edu/~robins/YouAndYourResearch.html</a>)
        Richard Hamming wrote about people getting more done with their doors closed, but', 
),  

#44 follow-up: @westi
14 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

(In [16948]) Ultimate make_clickable. Fixes #14993 props filosofo

#45 in reply to: ↑ 44 @hakre
14 years ago

Replying to westi:

(In [16948]) Ultimate make_clickable. Fixes #14993 props filosofo

Segfaults are reported out of those lines, Related: #16892

Note: See TracTickets for help on using tickets.