Opened 4 years ago
Last modified 10 hours ago
#9064 new defect (bug)
URLs with commas are not pinged
| Reported by: |
|
Owned by: | |
|---|---|---|---|
| Priority: | high | Milestone: | Future Release |
| Component: | Pings/Trackbacks | Version: | 2.7 |
| Severity: | major | Keywords: | has-patch needs-unit-tests |
| Cc: | sirzooro |
Description
I use following permalink format on my blog:
/%category%/%postname%,%post_id%
Unfortunately Wordpress doesn't ping them correctly - URLs extracted from post content don't have suffix with comma and post id.
Fix: in file wp-includes/comment.php change line 1400 from:
$punc = '.:?\-';
to:
$punc = '.:?\-,';
Attachments (7)
Change History (37)
comment:1
follow-up:
↓ 20
tomontoast — 4 years ago
- Keywords has-patch needs-testing added
Adds missing characters as per instructions above.
- Keywords needs-patch added; has-patch needs-testing removed
- Milestone changed from 2.7.2 to 2.8
patch is very fishy. in addition the parse error, it's going to introduce bugs due to the change in semantics.
it looks like gunk would be a better candidate to change, and the various characters should be escaped.
- Cc sirzooro added
- Keywords has-patch needs-testing added; needs-patch removed
From my perspective (as a bug discoverer) the most important thing at this moment is to fix problem with commas, described in this ticket. Last patch added by me is for this particular issue.
Underscores are already supported (\w includes letters, digits and underscores). The problem may be with remaining chars - parenthesises, apostrophe and asterisk - in particular we need to escape parenthesises and asterisk because they have special meaning in RegExp. Therefore I suggest to evaluate and apply my patch now, and submit another ticket to cover remaining chars. They are not commonly used in URLs as other characters, so they may wait a bit more :).
One more thing to consider: looks like original regexp allows to extract all URLs from page, both clickable (wrapped in <a> tag) and non-clickable. I think that only clickable ones should be pinged. Therefore we can replace whole rx with something like this:
preg_match_all( '/\bhref\s*=\s*["\']?([^"\'\s]+)/', $content, $post_links_temp );
This will also require change in foreach loop below - it should iterate over $post_links_temp[1].
- Keywords has-patch needs-testing dev-feedback added
- Component changed from General to Pings/Trackbacks
- Owner anonymous deleted
- Milestone changed from 2.8 to 2.9
punting to 2.9, as changing the regex is probably too big a fix.
- Keywords needs-patch added; has-patch needs-testing 2nd-opinion dev-feedback removed
comment:12
hakre — 4 years ago
Is it possible to define a test for this? Like steps to reproduce? I would like to look into that issue on a fresh install.
comment:13
sirzooro — 4 years ago
Testcase for my original problem:
- Set permalink format to /%category%/%postname%,%post_id% (or another one with comma)
- Create post no. 1
- Create post no. 2 with link to post no. 1
- Validate post no. 1 got pingback (or trackback? - I am not sure) from post no. 2
Current result: WordPress sends pingback to url truncated just before comma.
comment:14
hakre — 4 years ago
How can I check which actual data/request has been send? Any hint?
comment:15
sirzooro — 4 years ago
The simplest way is to print it and than die(). You can also log in to the file, or do something similar.
comment:16
hakre — 4 years ago
Where can I print it? Can you give me a direction, that would help. Thanks!
comment:17
ryan — 3 years ago
- Milestone changed from 2.9 to 3.0
comment:18
dd32 — 3 years ago
- Milestone changed from 3.0 to 3.1
comment.php.diff
That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.
Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.
Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.
I am going to move this to 3.1 as theres been no major traction on this ticket recently.
comment:19
nacin — 3 years ago
- Milestone changed from Awaiting Triage to Future Release
comment:20
in reply to:
↑ 1
hakre — 2 years ago
Replying to tomontoast:
If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt
Right, rfc1738 Section 2.2 specifies the following characters as part of URLs:
";", "/", "?", ":", "@", "=" and "&" "$-_.+!*'(),"
And in one list:
;/?:@=&$-_.+!*'(),
Replying to dd32:
comment.php.diff
That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.
The make_clickable() function has some advanced regexes to offer, I assume because of it's longer history of changes: #14993, #11211, #10990, #8300, #5081 and #3228. Probably that's something to connect to, subject to change, see #16892.
Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.
Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.
I think make_clickable() deals with all that. We have regexes and callback functions already (Related: #16892).
wonderboymusic — 8 months ago
comment:21
wonderboymusic — 8 months ago
#15549 was marked as a duplicate.
comment:22
wonderboymusic — 8 months ago
- Keywords has-patch dev-feedback added; needs-patch removed
I have some RegEx in a new patch that fixes both tickets
comment:23
nacin — 8 months ago
- Keywords needs-unit-tests added; dev-feedback removed
Guess what _extract_pingable_urls() needs? :-)
wonderboymusic — 8 months ago
comment:24
wonderboymusic — 8 months ago
Unit Test with a bunch of assertions added - requires _extract_pingable_urls
comment:25
wonderboymusic — 8 months ago
Another way to see how bad the current regex is - add a new post that has a bunch of URLs in its content to your local WP vhost with commas and &. Tail the access_log to see the requests being made (ex tail -f /opt/local/apache2/logs/access_log).
2 examples:
http://wordpress-core/?346236346326&2134362574863.437 http://wordpress-core/1,2,3,4,5,6/-1-2-3-4-/woo.html
WP will make HEAD requests to:
// breaks on the semicolon "HEAD /?346236346326& HTTP/1.0" // breaks after the first comma HEAD /1 HTTP/1.0
comment:26
wonderboymusic — 5 months ago
- Milestone changed from Future Release to 3.6
comment:27
wonderboymusic — 4 months ago
Related [23329]
wonderboymusic — 3 months ago
wonderboymusic — 3 months ago
comment:28
wonderboymusic — 3 months ago
9064-tests.diff contains new Unit Tests that match a ton of IDN domains. 9064.diff contains new RegEx to make this possible. Since there really is no "URL grabber" function in core, I renamed mine to wp_extract_urls(). The only "missing" piece right now is IDN URLs that render RTL. I have identified 6 that are problematic because the chars line up like http:(RTL text)//. In this list (http://www.i18nguy.com/markup/idna-examples.html), these are the hard ones right now:
Comoros http://القمر.icom.museum
Cyprus http://κυπρος.icom.museum
Egypt http://مصر.icom.museum
Mauritania http://موريتانيا.icom.museum
Morocco http://المغرب.icom.museum
Oman http://عمان.icom.museum
I had to completely ditch the RegEx inspired by Gruber. If someone can figure out the RTL or some Punycode solution, have it at. Otherwise, this RegEx is light years better than the current script.
comment:29
rmccue — 3 months ago
(In the event that we need to decode to Punycode, I wrote a library to do that previously. Your regex looks good though.)
comment:30
ryan — 10 hours ago
- Milestone changed from 3.6 to Future Release

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt