Opened 4 years ago

Last modified 3 days ago

#9064 new defect (bug)

URLs with commas are not pinged

Reported by: sirzooro Owned by:
Priority: high Milestone: Future Release
Component: Pings/Trackbacks Version: 2.7
Severity: major Keywords: has-patch needs-unit-tests
Cc: sirzooro

Description

I use following permalink format on my blog:
/%category%/%postname%,%post_id%
Unfortunately Wordpress doesn't ping them correctly - URLs extracted from post content don't have suffix with comma and post id.

Fix: in file wp-includes/comment.php change line 1400 from:
$punc = '.:?\-';

to:
$punc = '.:?\-,';

Attachments (7)

r10603_punc.diff (357 bytes) - added by FFEMTcJ 4 years ago.
Fix
comment.php.diff (274 bytes) - added by sirzooro 4 years ago.
Path for comma issue
comments.new_rx.diff (1.2 KB) - added by sirzooro 4 years ago.
Patch with new rx (slightly improved vs proposed earlier)
new-ping-urls-regex.diff (4.3 KB) - added by wonderboymusic 8 months ago.
test-extract-pingable-urls.diff (3.6 KB) - added by wonderboymusic 8 months ago.
9064-tests.diff (7.3 KB) - added by wonderboymusic 3 months ago.
9064.diff (4.0 KB) - added by wonderboymusic 3 months ago.

Download all attachments as: .zip

Change History (37)

comment:1 follow-up: ↓ 20   tomontoast4 years ago

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

FFEMTcJ4 years ago

Fix

  • Keywords has-patch needs-testing added

Adds missing characters as per instructions above.

  • Keywords needs-patch added; has-patch needs-testing removed
  • Milestone changed from 2.7.2 to 2.8

patch is very fishy. in addition the parse error, it's going to introduce bugs due to the change in semantics.

it looks like gunk would be a better candidate to change, and the various characters should be escaped.

  • Cc sirzooro added
  • Keywords has-patch needs-testing added; needs-patch removed

From my perspective (as a bug discoverer) the most important thing at this moment is to fix problem with commas, described in this ticket. Last patch added by me is for this particular issue.

Underscores are already supported (\w includes letters, digits and underscores). The problem may be with remaining chars - parenthesises, apostrophe and asterisk - in particular we need to escape parenthesises and asterisk because they have special meaning in RegExp. Therefore I suggest to evaluate and apply my patch now, and submit another ticket to cover remaining chars. They are not commonly used in URLs as other characters, so they may wait a bit more :).

Path for comma issue

One more thing to consider: looks like original regexp allows to extract all URLs from page, both clickable (wrapped in <a> tag) and non-clickable. I think that only clickable ones should be pinged. Therefore we can replace whole rx with something like this:

preg_match_all( '/\bhref\s*=\s*["\']?([^"\'\s]+)/', $content, $post_links_temp );

This will also require change in foreach loop below - it should iterate over $post_links_temp[1].

  • Keywords 2nd-opinion added; has-patch needs-testing removed
  • Keywords has-patch needs-testing dev-feedback added

Patch with new rx (slightly improved vs proposed earlier)

comment:8   ryan4 years ago

  • Component changed from General to Pings/Trackbacks
  • Owner anonymous deleted

comments.new_rx.diff does not use $pung var. fail.

  • Milestone changed from 2.8 to 2.9

punting to 2.9, as changing the regex is probably too big a fix.

  • Keywords needs-patch added; has-patch needs-testing 2nd-opinion dev-feedback removed

Is it possible to define a test for this? Like steps to reproduce? I would like to look into that issue on a fresh install.

Testcase for my original problem:

  1. Set permalink format to /%category%/%postname%,%post_id% (or another one with comma)
  2. Create post no. 1
  3. Create post no. 2 with link to post no. 1
  4. Validate post no. 1 got pingback (or trackback? - I am not sure) from post no. 2

Current result: WordPress sends pingback to url truncated just before comma.

How can I check which actual data/request has been send? Any hint?

The simplest way is to print it and than die(). You can also log in to the file, or do something similar.

Where can I print it? Can you give me a direction, that would help. Thanks!

  • Milestone changed from 2.9 to 3.0
  • Milestone changed from 3.0 to 3.1

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I am going to move this to 3.1 as theres been no major traction on this ticket recently.

  • Milestone changed from Awaiting Triage to Future Release

comment:20 in reply to: ↑ 1   hakre2 years ago

Replying to tomontoast:

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

Right, rfc1738 Section 2.2 specifies the following characters as part of URLs:

";", "/", "?", ":", "@", "=" and "&"
"$-_.+!*'(),"

And in one list:

;/?:@=&$-_.+!*'(),

Replying to dd32:

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

The make_clickable() function has some advanced regexes to offer, I assume because of it's longer history of changes: #14993, #11211, #10990, #8300, #5081 and #3228. Probably that's something to connect to, subject to change, see #16892.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I think make_clickable() deals with all that. We have regexes and callback functions already (Related: #16892).

#15549 was marked as a duplicate.

  • Keywords has-patch dev-feedback added; needs-patch removed

I have some RegEx in a new patch that fixes both tickets

  • Keywords needs-unit-tests added; dev-feedback removed

Guess what _extract_pingable_urls() needs? :-)

Unit Test with a bunch of assertions added - requires _extract_pingable_urls

Another way to see how bad the current regex is - add a new post that has a bunch of URLs in its content to your local WP vhost with commas and &amp;.

2 examples:

http://wordpress-core/?346236346326&amp;2134362574863.437
http://wordpress-core/1,2,3,4,5,6/-1-2-3-4-/woo.html

WP will make HEAD requests to:

// breaks on the semicolon
"HEAD /?346236346326&amp HTTP/1.0"

// breaks after the first comma
HEAD /1 HTTP/1.0
Version 0, edited 8 months ago by wonderboymusic (next)
  • Milestone changed from Future Release to 3.6

9064-tests.diff contains new Unit Tests that match a ton of IDN domains. 9064.diff contains new RegEx to make this possible. Since there really is no "URL grabber" function in core, I renamed mine to wp_extract_urls(). The only "missing" piece right now is IDN URLs that render RTL. I have identified 6 that are problematic because the chars line up like http:(RTL text)//. In this list (http://www.i18nguy.com/markup/idna-examples.html), these are the hard ones right now:

Comoros http://القمر.icom.museum
Cyprus http://κυπρος.icom.museum
Egypt http://مصر.icom.museum
Mauritania http://موريتانيا.icom.museum
Morocco http://المغرب.icom.museum
Oman http://عمان.icom.museum

I had to completely ditch the RegEx inspired by Gruber. If someone can figure out the RTL or some Punycode solution, have it at. Otherwise, this RegEx is light years better than the current script.

(In the event that we need to decode to Punycode, I wrote a library to do that previously. Your regex looks good though.)

comment:30   ryan3 days ago

  • Milestone changed from 3.6 to Future Release
Note: See TracTickets for help on using tickets.