Make WordPress Core

Opened 16 years ago

Closed 11 years ago

Last modified 11 years ago

#9064 closed defect (bug) (fixed)

URLs with commas are not pinged

Reported by: sirzooro's profile sirzooro Owned by: wonderboymusic's profile wonderboymusic
Milestone: 3.7 Priority: high
Severity: major Version: 2.7
Component: Pings/Trackbacks Keywords: has-patch
Focuses: Cc:

Description

I use following permalink format on my blog:
/%category%/%postname%,%post_id%
Unfortunately Wordpress doesn't ping them correctly - URLs extracted from post content don't have suffix with comma and post id.

Fix: in file wp-includes/comment.php change line 1400 from:
$punc = '.:?\-';

to:
$punc = '.:?\-,';

Attachments (8)

r10603_punc.diff (357 bytes) - added by FFEMTcJ 16 years ago.
Fix
comment.php.diff (274 bytes) - added by sirzooro 15 years ago.
Path for comma issue
comments.new_rx.diff (1.2 KB) - added by sirzooro 15 years ago.
Patch with new rx (slightly improved vs proposed earlier)
new-ping-urls-regex.diff (4.3 KB) - added by wonderboymusic 12 years ago.
test-extract-pingable-urls.diff (3.6 KB) - added by wonderboymusic 12 years ago.
9064-tests.diff (7.3 KB) - added by wonderboymusic 12 years ago.
9064.diff (4.0 KB) - added by wonderboymusic 12 years ago.
9064.2.diff (3.2 KB) - added by wonderboymusic 11 years ago.

Download all attachments as: .zip

Change History (42)

#1 follow-up: @tomontoast
16 years ago

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

@FFEMTcJ
16 years ago

Fix

#2 @FFEMTcJ
16 years ago

  • Keywords has-patch needs-testing added

Adds missing characters as per instructions above.

#3 @Denis-de-Bernardy
15 years ago

  • Keywords needs-patch added; has-patch needs-testing removed
  • Milestone changed from 2.7.2 to 2.8

patch is very fishy. in addition the parse error, it's going to introduce bugs due to the change in semantics.

it looks like gunk would be a better candidate to change, and the various characters should be escaped.

#4 @sirzooro
15 years ago

  • Cc sirzooro added
  • Keywords has-patch needs-testing added; needs-patch removed

From my perspective (as a bug discoverer) the most important thing at this moment is to fix problem with commas, described in this ticket. Last patch added by me is for this particular issue.

Underscores are already supported (\w includes letters, digits and underscores). The problem may be with remaining chars - parenthesises, apostrophe and asterisk - in particular we need to escape parenthesises and asterisk because they have special meaning in RegExp. Therefore I suggest to evaluate and apply my patch now, and submit another ticket to cover remaining chars. They are not commonly used in URLs as other characters, so they may wait a bit more :).

@sirzooro
15 years ago

Path for comma issue

#5 @sirzooro
15 years ago

One more thing to consider: looks like original regexp allows to extract all URLs from page, both clickable (wrapped in <a> tag) and non-clickable. I think that only clickable ones should be pinged. Therefore we can replace whole rx with something like this:

preg_match_all( '/\bhref\s*=\s*["\']?([^"\'\s]+)/', $content, $post_links_temp );

This will also require change in foreach loop below - it should iterate over $post_links_temp[1].

#6 @sirzooro
15 years ago

  • Keywords 2nd-opinion added; has-patch needs-testing removed

#7 @Denis-de-Bernardy
15 years ago

  • Keywords has-patch needs-testing dev-feedback added

@sirzooro
15 years ago

Patch with new rx (slightly improved vs proposed earlier)

#8 @ryan
15 years ago

  • Component changed from General to Pings/Trackbacks
  • Owner anonymous deleted

#9 @hakre
15 years ago

comments.new_rx.diff does not use $pung var. fail.

#10 @Denis-de-Bernardy
15 years ago

  • Milestone changed from 2.8 to 2.9

punting to 2.9, as changing the regex is probably too big a fix.

#11 @Denis-de-Bernardy
15 years ago

  • Keywords needs-patch added; has-patch needs-testing 2nd-opinion dev-feedback removed

#12 @hakre
15 years ago

Is it possible to define a test for this? Like steps to reproduce? I would like to look into that issue on a fresh install.

#13 @sirzooro
15 years ago

Testcase for my original problem:

  1. Set permalink format to /%category%/%postname%,%post_id% (or another one with comma)
  2. Create post no. 1
  3. Create post no. 2 with link to post no. 1
  4. Validate post no. 1 got pingback (or trackback? - I am not sure) from post no. 2

Current result: WordPress sends pingback to url truncated just before comma.

#14 @hakre
15 years ago

How can I check which actual data/request has been send? Any hint?

#15 @sirzooro
15 years ago

The simplest way is to print it and than die(). You can also log in to the file, or do something similar.

#16 @hakre
15 years ago

Where can I print it? Can you give me a direction, that would help. Thanks!

#17 @ryan
15 years ago

  • Milestone changed from 2.9 to 3.0

#18 @dd32
14 years ago

  • Milestone changed from 3.0 to 3.1

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I am going to move this to 3.1 as theres been no major traction on this ticket recently.

#19 @nacin
14 years ago

  • Milestone changed from Awaiting Triage to Future Release

#20 in reply to: ↑ 1 @hakre
13 years ago

Replying to tomontoast:

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

Right, rfc1738 Section 2.2 specifies the following characters as part of URLs:

";", "/", "?", ":", "@", "=" and "&"
"$-_.+!*'(),"

And in one list:

;/?:@=&$-_.+!*'(),

Replying to dd32:

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

The make_clickable() function has some advanced regexes to offer, I assume because of it's longer history of changes: #14993, #11211, #10990, #8300, #5081 and #3228. Probably that's something to connect to, subject to change, see #16892.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I think make_clickable() deals with all that. We have regexes and callback functions already (Related: #16892).

#21 @wonderboymusic
12 years ago

#15549 was marked as a duplicate.

#22 @wonderboymusic
12 years ago

  • Keywords has-patch dev-feedback added; needs-patch removed

I have some RegEx in a new patch that fixes both tickets

#23 @nacin
12 years ago

  • Keywords needs-unit-tests added; dev-feedback removed

Guess what _extract_pingable_urls() needs? :-)

#24 @wonderboymusic
12 years ago

Unit Test with a bunch of assertions added - requires _extract_pingable_urls

#25 @wonderboymusic
12 years ago

Another way to see how bad the current regex is - add a new post that has a bunch of URLs in its content to your local WP vhost with commas and &amp;.

2 examples:

http://wordpress-core/?346236346326&amp;2134362574863.437
http://wordpress-core/1,2,3,4,5,6/-1-2-3-4-/woo.html

WP will make HEAD requests to:

// breaks on the semicolon
"HEAD /?346236346326&amp HTTP/1.0"

// breaks after the first comma
HEAD /1 HTTP/1.0
Version 0, edited 12 years ago by wonderboymusic (next)

#26 @wonderboymusic
12 years ago

  • Milestone changed from Future Release to 3.6

@wonderboymusic
12 years ago

#28 @wonderboymusic
12 years ago

9064-tests.diff contains new Unit Tests that match a ton of IDN domains. 9064.diff contains new RegEx to make this possible. Since there really is no "URL grabber" function in core, I renamed mine to wp_extract_urls(). The only "missing" piece right now is IDN URLs that render RTL. I have identified 6 that are problematic because the chars line up like http:(RTL text)//. In this list (http://www.i18nguy.com/markup/idna-examples.html), these are the hard ones right now:

Comoros http://القمر.icom.museum
Cyprus http://κυπρος.icom.museum
Egypt http://مصر.icom.museum
Mauritania http://موريتانيا.icom.museum
Morocco http://المغرب.icom.museum
Oman http://عمان.icom.museum

I had to completely ditch the RegEx inspired by Gruber. If someone can figure out the RTL or some Punycode solution, have it at. Otherwise, this RegEx is light years better than the current script.

#29 @rmccue
12 years ago

(In the event that we need to decode to Punycode, I wrote a library to do that previously. Your regex looks good though.)

#30 @ryan
11 years ago

  • Milestone changed from 3.6 to Future Release

#31 @wonderboymusic
11 years ago

  • Milestone changed from Future Release to 3.7

#32 @wonderboymusic
11 years ago

Refreshed against trunk

#33 @wonderboymusic
11 years ago

  • Owner set to wonderboymusic
  • Resolution set to fixed
  • Status changed from new to closed

In 25313:

Replace the ancient phpfreaks.com RegEx to extract urls to ping with a more robust matcher. URLs with commas and things like &amp; were not being pinged. The new matcher even works for most IDN URLs. Adds unit tests.

Fixes #9064.

#34 @ericmann
11 years ago

  • Keywords needs-unit-tests removed
Note: See TracTickets for help on using tickets.