WordPress.org

Make WordPress Core

Opened 5 years ago

Closed 10 months ago

Last modified 9 months ago

#9064 closed defect (bug) (fixed)

URLs with commas are not pinged

Reported by: sirzooro Owned by: wonderboymusic
Milestone: 3.7 Priority: high
Severity: major Version: 2.7
Component: Pings/Trackbacks Keywords: has-patch
Focuses: Cc:

Description

I use following permalink format on my blog:
/%category%/%postname%,%post_id%
Unfortunately Wordpress doesn't ping them correctly - URLs extracted from post content don't have suffix with comma and post id.

Fix: in file wp-includes/comment.php change line 1400 from:
$punc = '.:?\-';

to:
$punc = '.:?\-,';

Attachments (8)

r10603_punc.diff (357 bytes) - added by FFEMTcJ 5 years ago.
Fix
comment.php.diff (274 bytes) - added by sirzooro 5 years ago.
Path for comma issue
comments.new_rx.diff (1.2 KB) - added by sirzooro 5 years ago.
Patch with new rx (slightly improved vs proposed earlier)
new-ping-urls-regex.diff (4.3 KB) - added by wonderboymusic 22 months ago.
test-extract-pingable-urls.diff (3.6 KB) - added by wonderboymusic 22 months ago.
9064-tests.diff (7.3 KB) - added by wonderboymusic 17 months ago.
9064.diff (4.0 KB) - added by wonderboymusic 17 months ago.
9064.2.diff (3.2 KB) - added by wonderboymusic 12 months ago.

Download all attachments as: .zip

Change History (42)

comment:1 follow-up: tomontoast5 years ago

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

FFEMTcJ5 years ago

Fix

comment:2 FFEMTcJ5 years ago

  • Keywords has-patch needs-testing added

Adds missing characters as per instructions above.

comment:3 Denis-de-Bernardy5 years ago

  • Keywords needs-patch added; has-patch needs-testing removed
  • Milestone changed from 2.7.2 to 2.8

patch is very fishy. in addition the parse error, it's going to introduce bugs due to the change in semantics.

it looks like gunk would be a better candidate to change, and the various characters should be escaped.

comment:4 sirzooro5 years ago

  • Cc sirzooro added
  • Keywords has-patch needs-testing added; needs-patch removed

From my perspective (as a bug discoverer) the most important thing at this moment is to fix problem with commas, described in this ticket. Last patch added by me is for this particular issue.

Underscores are already supported (\w includes letters, digits and underscores). The problem may be with remaining chars - parenthesises, apostrophe and asterisk - in particular we need to escape parenthesises and asterisk because they have special meaning in RegExp. Therefore I suggest to evaluate and apply my patch now, and submit another ticket to cover remaining chars. They are not commonly used in URLs as other characters, so they may wait a bit more :).

sirzooro5 years ago

Path for comma issue

comment:5 sirzooro5 years ago

One more thing to consider: looks like original regexp allows to extract all URLs from page, both clickable (wrapped in <a> tag) and non-clickable. I think that only clickable ones should be pinged. Therefore we can replace whole rx with something like this:

preg_match_all( '/\bhref\s*=\s*["\']?([^"\'\s]+)/', $content, $post_links_temp );

This will also require change in foreach loop below - it should iterate over $post_links_temp[1].

comment:6 sirzooro5 years ago

  • Keywords 2nd-opinion added; has-patch needs-testing removed

comment:7 Denis-de-Bernardy5 years ago

  • Keywords has-patch needs-testing dev-feedback added

sirzooro5 years ago

Patch with new rx (slightly improved vs proposed earlier)

comment:8 ryan5 years ago

  • Component changed from General to Pings/Trackbacks
  • Owner anonymous deleted

comment:9 hakre5 years ago

comments.new_rx.diff does not use $pung var. fail.

comment:10 Denis-de-Bernardy5 years ago

  • Milestone changed from 2.8 to 2.9

punting to 2.9, as changing the regex is probably too big a fix.

comment:11 Denis-de-Bernardy5 years ago

  • Keywords needs-patch added; has-patch needs-testing 2nd-opinion dev-feedback removed

comment:12 hakre5 years ago

Is it possible to define a test for this? Like steps to reproduce? I would like to look into that issue on a fresh install.

comment:13 sirzooro5 years ago

Testcase for my original problem:

  1. Set permalink format to /%category%/%postname%,%post_id% (or another one with comma)
  2. Create post no. 1
  3. Create post no. 2 with link to post no. 1
  4. Validate post no. 1 got pingback (or trackback? - I am not sure) from post no. 2

Current result: WordPress sends pingback to url truncated just before comma.

comment:14 hakre5 years ago

How can I check which actual data/request has been send? Any hint?

comment:15 sirzooro5 years ago

The simplest way is to print it and than die(). You can also log in to the file, or do something similar.

comment:16 hakre5 years ago

Where can I print it? Can you give me a direction, that would help. Thanks!

comment:17 ryan5 years ago

  • Milestone changed from 2.9 to 3.0

comment:18 dd324 years ago

  • Milestone changed from 3.0 to 3.1

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I am going to move this to 3.1 as theres been no major traction on this ticket recently.

comment:19 nacin4 years ago

  • Milestone changed from Awaiting Triage to Future Release

comment:20 in reply to: ↑ 1 hakre3 years ago

Replying to tomontoast:

If this piece of code does what I think it does then the charcahters *'()_ should also be included as valid punctuation in urls. See http://www.ietf.org/rfc/rfc1738.txt

Right, rfc1738 Section 2.2 specifies the following characters as part of URLs:

";", "/", "?", ":", "@", "=" and "&"
"$-_.+!*'(),"

And in one list:

;/?:@=&$-_.+!*'(),

Replying to dd32:

comment.php.diff

That seems to be the most straight forward method here, Clickable and non-clickable links should be pung IMO.

The make_clickable() function has some advanced regexes to offer, I assume because of it's longer history of changes: #14993, #11211, #10990, #8300, #5081 and #3228. Probably that's something to connect to, subject to change, see #16892.

Thats also the only proposed solution here which ignores trailing commas, "http://test.com/, something else" shouldn't match that comma.

Adding the other punctuation markers here (Well, underscore is already supported) specifically the ()'s could be tricky, (http://.../) vs http://wikipedia/some_article(other_ambiguation) doesnt match the 2nd properly - There are other parts of WordPress which suffer(or have suffered) that problem.

I think make_clickable() deals with all that. We have regexes and callback functions already (Related: #16892).

comment:21 wonderboymusic22 months ago

#15549 was marked as a duplicate.

comment:22 wonderboymusic22 months ago

  • Keywords has-patch dev-feedback added; needs-patch removed

I have some RegEx in a new patch that fixes both tickets

comment:23 nacin22 months ago

  • Keywords needs-unit-tests added; dev-feedback removed

Guess what _extract_pingable_urls() needs? :-)

comment:24 wonderboymusic22 months ago

Unit Test with a bunch of assertions added - requires _extract_pingable_urls

comment:25 wonderboymusic22 months ago

Another way to see how bad the current regex is - add a new post that has a bunch of URLs in its content to your local WP vhost with commas and &amp;. Tail the access_log to see the requests being made (ex tail -f /opt/local/apache2/logs/access_log).

2 examples:

http://wordpress-core/?346236346326&amp;2134362574863.437
http://wordpress-core/1,2,3,4,5,6/-1-2-3-4-/woo.html

WP will make HEAD requests to:

// breaks on the semicolon
"HEAD /?346236346326&amp HTTP/1.0"

// breaks after the first comma
HEAD /1 HTTP/1.0
Last edited 22 months ago by wonderboymusic (previous) (diff)

comment:26 wonderboymusic19 months ago

  • Milestone changed from Future Release to 3.6

wonderboymusic17 months ago

wonderboymusic17 months ago

comment:28 wonderboymusic17 months ago

9064-tests.diff contains new Unit Tests that match a ton of IDN domains. 9064.diff contains new RegEx to make this possible. Since there really is no "URL grabber" function in core, I renamed mine to wp_extract_urls(). The only "missing" piece right now is IDN URLs that render RTL. I have identified 6 that are problematic because the chars line up like http:(RTL text)//. In this list (http://www.i18nguy.com/markup/idna-examples.html), these are the hard ones right now:

Comoros http://القمر.icom.museum
Cyprus http://κυπρος.icom.museum
Egypt http://مصر.icom.museum
Mauritania http://موريتانيا.icom.museum
Morocco http://المغرب.icom.museum
Oman http://عمان.icom.museum

I had to completely ditch the RegEx inspired by Gruber. If someone can figure out the RTL or some Punycode solution, have it at. Otherwise, this RegEx is light years better than the current script.

comment:29 rmccue17 months ago

(In the event that we need to decode to Punycode, I wrote a library to do that previously. Your regex looks good though.)

comment:30 ryan14 months ago

  • Milestone changed from 3.6 to Future Release

wonderboymusic12 months ago

comment:31 wonderboymusic12 months ago

  • Milestone changed from Future Release to 3.7

comment:32 wonderboymusic12 months ago

Refreshed against trunk

comment:33 wonderboymusic10 months ago

  • Owner set to wonderboymusic
  • Resolution set to fixed
  • Status changed from new to closed

In 25313:

Replace the ancient phpfreaks.com RegEx to extract urls to ping with a more robust matcher. URLs with commas and things like &amp; were not being pinged. The new matcher even works for most IDN URLs. Adds unit tests.

Fixes #9064.

comment:34 ericmann9 months ago

  • Keywords needs-unit-tests removed
Note: See TracTickets for help on using tickets.