Make WordPress Core

Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#31645 closed defect (bug) (fixed)

Press This: Reject relative URLs when scraping source html

Reported by: kraftbj's profile kraftbj Owned by: azaozz's profile azaozz
Milestone: 4.2 Priority: normal
Severity: normal Version: 4.2
Component: Press This Keywords: has-patch
Focuses: Cc:

Description

We're relying on the <link rel="canonical" href=...> tag for the link rendered for a Pressed page. The standard allows relative URLs in that field, which are currently being allowed in PT.

Steps to duplicate:

  1. Create a test html page with
    <html>
    <head>
    	<link rel="canonical" href="/test.html">
    	<title>Test</title>
    </head>
    <body>
    	<h1>TEST</h1>
    </body>
    </html>
    
  1. Press that page
  2. Check URL used in the "source" link.

Expected: An absolute URL
Actual: Relative URL based on the href tag.

Attachments (10)

31645.patch (879 bytes) - added by kraftbj 10 years ago.
Exclude URLs that begin with /
31645.2.patch (621 bytes) - added by kraftbj 10 years ago.
Correct patch. Previous one included additional code submitted against different ticket.
31645.3.patch (854 bytes) - added by SergeyBiryukov 10 years ago.
31645.4.patch (1019 bytes) - added by azaozz 10 years ago.
31645.5.patch (1.2 KB) - added by azaozz 10 years ago.
31645-make-fully-qualified.diff (819 bytes) - added by stephdau 10 years ago.
Suggestion to make canonical links fully qualified.
31645.6.patch (1.1 KB) - added by azaozz 10 years ago.
31645.diff (1.3 KB) - added by stephdau 10 years ago.
31645.7.patch (1.3 KB) - added by stephdau 10 years ago.
Same as 31645.diff, but with .patch so it gets renamed with 31645.7.patch upon upload. :)
31645.8.patch (1.0 KB) - added by azaozz 10 years ago.

Download all attachments as: .zip

Change History (26)

@kraftbj
10 years ago

Exclude URLs that begin with /

#1 @kraftbj
10 years ago

  • Keywords has-patch added

esc_url_raw will purposely allow relative URLs despite having the allowed protocols defined.

31645.patch borrows the check on https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=31771#L3078 in esc_url that is exempting relative URLs from the protocol check and adds it to _limit_url as a trigger to return null.

@kraftbj
10 years ago

Correct patch. Previous one included additional code submitted against different ticket.

#2 @DrewAPicture
10 years ago

  • Milestone changed from Awaiting Review to 4.2

#3 @SergeyBiryukov
10 years ago

31645.3.patch makes the comment a bit clearer and fixes a typo.

@azaozz
10 years ago

#4 @azaozz
10 years ago

Thinking we should specifically reject <link rel="canonical" ...> URLs that are not absolute or protocol relative. For the rest (which are mostly image src), we should try to prepend the domain if they are root-relative. (Still not sure if should do that in PHP or JS, needs more testing).

@azaozz
10 years ago

#5 @azaozz
10 years ago

In 31645.5.patch: reject canonical URLs if they are not absolute or protocol relative before we clean them.

Last edited 10 years ago by azaozz (previous) (diff)

#6 @kraftbj
10 years ago

Tested .5 with the following variations of the canonical URL:

http://example.com/123.html
https://example.com/123.html
//example.com/123.html
/123.html
123.html

In the first three, the canonical URL is used and the latter two the URL is rejected, all as expected.

We're scraping canonical, shortlink, icon—is there a reason we're okay with relative URLs for shortlinks or the icon? Shortlinks appear to be a very last resort for a canonical URL. I don't see what we're using the icon so far.

#7 @stephdau
10 years ago

Why would we reject canonical links from root?
I've suggested dealing with them as so in https://core.trac.wordpress.org/attachment/ticket/31637/31637.2.diff (other ticket)

if ( preg_match( '/^[\/]{1}[^\/]+/', $link ) ) { 
    $link = preg_replace( '/^(https?:\/\/[^\/]+)(\/.*)?$/', '\1', $data['u'] ) . $link; 
} 

@stephdau
10 years ago

Suggestion to make canonical links fully qualified.

#8 @stephdau
10 years ago

So yes, I think we should deal with this inside get_canonical_link() instead, especially since it itself uses alternative data keys if a canonical link isn't available, some which might have the same symptoms.

I'd do with something along the lines of 31645-make-fully-qualified.diff, and then make sure that whatever uses get_canonical_link() tests for en empty return (which we should have anyways, if we did not yet).

My patch doesn't process blah.html, or ../blah.html, etc. Just /blah and blah.com/blah, but simply for the sake of giving a simple example. It could.

@azaozz
10 years ago

#9 @azaozz
10 years ago

In 31645.6.patch: prepend the protocol and domain to all root-relative URLs.

#10 follow-up: @stephdau
10 years ago

31645.6.patch will not work as is because $url goes through esc_url_raw() before being tested, which prepends http:// to whatever value is passed to it... So 123.html becomes http://123.html/.

Last edited 10 years ago by stephdau (previous) (diff)

@stephdau
10 years ago

@stephdau
10 years ago

Same as 31645.diff, but with .patch so it gets renamed with 31645.7.patch upon upload. :)

#12 @kraftbj
10 years ago

Tried the test URLs previously and threw in a couple of random WTH examples (bought -1 beers at the bar)

http://example.com/123.html -- WORKS (use canonical)
https://example.com/123.html -- WORKS (use canonical)
//example.com/123.html -- WORKS (use canonical, plus http:)
/123.html -- WORKS (use canonical, prepended by parent domain)
123.html -- WORKS (ignore canonical, uses parent URL)
mailto:example@example.com -- WORKS (ignore canonical, uses parent URL)
ftp://example.com -- WORKS (ignore canonical, uses parent URL)

#13 in reply to: ↑ 10 @azaozz
10 years ago

Replying to stephdau:

31645.6.patch will not work as is because $url goes through esc_url_raw() before being tested, which prepends http:// to whatever value is passed to it... So 123.html becomes http://123.html/.

Right, we have to run that test before esc_url_raw() as it will prepend http:// to some relative URLs.

Looking at 31645.7.patch:

  • '/^[\/]{1}[^\/]+/' is exactly the same as '%^/[^/]+%' and $url{0} === '/' && $url{1} !== '/' with the last one being much faster than any PCRE function.
  • This '/^[\/]{2}[^\/]+/' matches protocol-relative URLs, then we prepend the current protocol to them. Not sure this is desirable. URLs starting with // are the best choice for links, embeds, images, etc. (as long as the server supports both http and https). They will never trigger "Mixed/Insecure content" warnings. If the page is telling us to use these, we should :)
  • At the end we just return the whole URL passed by the user? So if there is an image src ../../assets/images/test.gif we will replace the src with the page's URL. We should be rejecting non-root relative URLs.

We can try to extrapolate the absolute URL out of the page's URL and an relative image src. We can attempt that or discard relative image sources. Shouldn't return a wrong src though.

@azaozz
10 years ago

#14 @azaozz
10 years ago

How about 31645.8.patch. It ensures we have an absolute or protocol relative URL before passing it through esc_url_raw(). It is dependent on having a proper u set, i.e. the page the user opened PT on or the URL they pasted. Also changed the (weird looking) string offset $str{0}, $str{1} checks with a regex, so it is easier to read :)

#15 @azaozz
10 years ago

  • Owner set to azaozz
  • Resolution set to fixed
  • Status changed from new to closed

In 31818:

Press This: prepend the site domain to root-relative URLs. Reject other relative URLs.
Part props stephdau, kraftbj. Fixes #31645.

#16 @SergeyBiryukov
10 years ago

In 31837:

Make a comment clearer (without implying inversed logic) and fix a typo.

see #31645.

Note: See TracTickets for help on using tickets.