#9064 closed defect (bug) (fixed)
URLs with commas are not pinged
Reported by: | sirzooro | Owned by: | wonderboymusic |
---|---|---|---|
Milestone: | 3.7 | Priority: | high |
Severity: | major | Version: | 2.7 |
Component: | Pings/Trackbacks | Keywords: | has-patch |
Focuses: | Cc: |
Description
I use following permalink format on my blog:
/%category%/%postname%,%post_id%
Unfortunately Wordpress doesn't ping them correctly - URLs extracted from post content don't have suffix with comma and post id.
Fix: in file wp-includes/comment.php change line 1400 from:
$punc = '.:?\-';
to:
$punc = '.:?\-,';
Attachments (8)
Change History (42)
#2
@
16 years ago
- Keywords has-patch needs-testing added
Adds missing characters as per instructions above.
#3
@
15 years ago
- Keywords needs-patch added; has-patch needs-testing removed
- Milestone changed from 2.7.2 to 2.8
patch is very fishy. in addition the parse error, it's going to introduce bugs due to the change in semantics.
it looks like gunk would be a better candidate to change, and the various characters should be escaped.
#4
@
15 years ago
- Cc sirzooro added
- Keywords has-patch needs-testing added; needs-patch removed
From my perspective (as a bug discoverer) the most important thing at this moment is to fix problem with commas, described in this ticket. Last patch added by me is for this particular issue.
Underscores are already supported (\w includes letters, digits and underscores). The problem may be with remaining chars - parenthesises, apostrophe and asterisk - in particular we need to escape parenthesises and asterisk because they have special meaning in RegExp. Therefore I suggest to evaluate and apply my patch now, and submit another ticket to cover remaining chars. They are not commonly used in URLs as other characters, so they may wait a bit more :).
#5
@
15 years ago
One more thing to consider: looks like original regexp allows to extract all URLs from page, both clickable (wrapped in <a> tag) and non-clickable. I think that only clickable ones should be pinged. Therefore we can replace whole rx with something like this:
preg_match_all( '/\bhref\s*=\s*["\']?([^"\'\s]+)/', $content, $post_links_temp );
This will also require change in foreach
loop below - it should iterate over $post_links_temp[1]
.
#10
@
15 years ago
- Milestone changed from 2.8 to 2.9
punting to 2.9, as changing the regex is probably too big a fix.
#11
@
15 years ago
- Keywords needs-patch added; has-patch needs-testing 2nd-opinion dev-feedback removed
#12
@
15 years ago
Is it possible to define a test for this? Like steps to reproduce? I would like to look into that issue on a fresh install.
#13
@
15 years ago
Testcase for my original problem:
- Set permalink format to /%category%/%postname%,%post_id% (or another one with comma)
- Create post no. 1
- Create post no. 2 with link to post no. 1
- Validate post no. 1 got pingback (or trackback? - I am not sure) from post no. 2
Current result: WordPress sends pingback to url truncated just before comma.
#15
@
15 years ago
The simplest way is to print it and than die(). You can also log in to the file, or do something similar.
#18
@
15 years ago
- Milestone changed from 3.0 to 3.1
comment.php.diff
That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.
Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.
Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.
I am going to move this to 3.1 as theres been no major traction on this ticket recently.
#20
in reply to:
↑ 1
@
14 years ago
Replying to tomontoast:
If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt
Right, rfc1738 Section 2.2 specifies the following characters as part of URLs:
";", "/", "?", ":", "@", "=" and "&" "$-_.+!*'(),"
And in one list:
;/?:@=&$-_.+!*'(),
Replying to dd32:
comment.php.diff
That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.
The make_clickable() function has some advanced regexes to offer, I assume because of it's longer history of changes: #14993, #11211, #10990, #8300, #5081 and #3228. Probably that's something to connect to, subject to change, see #16892.
Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.
Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.
I think make_clickable() deals with all that. We have regexes and callback functions already (Related: #16892).
#22
@
12 years ago
- Keywords has-patch dev-feedback added; needs-patch removed
I have some RegEx in a new patch that fixes both tickets
#23
@
12 years ago
- Keywords needs-unit-tests added; dev-feedback removed
Guess what _extract_pingable_urls() needs? :-)
#25
@
12 years ago
Another way to see how bad the current regex is - add a new post that has a bunch of URLs in its content to your local WP vhost with commas and &. Tail the access_log to see the requests being made (ex tail -f /opt/local/apache2/logs/access_log
).
2 examples:
http://wordpress-core/?346236346326&2134362574863.437 http://wordpress-core/1,2,3,4,5,6/-1-2-3-4-/woo.html
WP will make HEAD requests to:
// breaks on the semicolon "HEAD /?346236346326& HTTP/1.0" // breaks after the first comma HEAD /1 HTTP/1.0
#28
@
12 years ago
9064-tests.diff contains new Unit Tests that match a ton of IDN domains. 9064.diff contains new RegEx to make this possible. Since there really is no "URL grabber" function in core, I renamed mine to wp_extract_urls()
. The only "missing" piece right now is IDN URLs that render RTL. I have identified 6 that are problematic because the chars line up like http:(RTL text)//
. In this list (http://www.i18nguy.com/markup/idna-examples.html), these are the hard ones right now:
Comoros http://القمر.icom.museum
Cyprus http://κυπρος.icom.museum
Egypt http://مصر.icom.museum
Mauritania http://موريتانيا.icom.museum
Morocco http://المغرب.icom.museum
Oman http://عمان.icom.museum
I had to completely ditch the RegEx inspired by Gruber. If someone can figure out the RTL or some Punycode solution, have it at. Otherwise, this RegEx is light years better than the current script.
#29
@
12 years ago
(In the event that we need to decode to Punycode, I wrote a library to do that previously. Your regex looks good though.)
If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt