WordPress.org

Make WordPress Core

Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#31645 closed defect (bug) (fixed)

Press This: Reject relative URLs when scraping source html

Reported by: kraftbj Owned by: azaozz
Milestone: 4.2 Priority: normal
Severity: normal Version: 4.2
Component: Press This Keywords: has-patch
Focuses: Cc:

Description

We're relying on the <link rel="canonical" href=...> tag for the link rendered for a Pressed page. The standard allows relative URLs in that field, which are currently being allowed in PT.

Steps to duplicate:

  1. Create a test html page with
    <html>
    <head>
    	<link rel="canonical" href="/test.html">
    	<title>Test</title>
    </head>
    <body>
    	<h1>TEST</h1>
    </body>
    </html>
    
  1. Press that page
  2. Check URL used in the "source" link.

Expected: An absolute URL
Actual: Relative URL based on the href tag.

Attachments (10)

31645.patch (879 bytes) - added by kraftbj 7 years ago.
Exclude URLs that begin with /
31645.2.patch (621 bytes) - added by kraftbj 7 years ago.
Correct patch. Previous one included additional code submitted against different ticket.
31645.3.patch (854 bytes) - added by SergeyBiryukov 7 years ago.
31645.4.patch (1019 bytes) - added by azaozz 7 years ago.
31645.5.patch (1.2 KB) - added by azaozz 7 years ago.
31645-make-fully-qualified.diff (819 bytes) - added by stephdau 7 years ago.
Suggestion to make canonical links fully qualified.
31645.6.patch (1.1 KB) - added by azaozz 7 years ago.
31645.diff (1.3 KB) - added by stephdau 7 years ago.
31645.7.patch (1.3 KB) - added by stephdau 7 years ago.
Same as 31645.diff, but with .patch so it gets renamed with 31645.7.patch upon upload. :)
31645.8.patch (1.0 KB) - added by azaozz 7 years ago.

Download all attachments as: .zip

Change History (26)

@kraftbj
7 years ago

Exclude URLs that begin with /

#1 @kraftbj
7 years ago

  • Keywords has-patch added

esc_url_raw will purposely allow relative URLs despite having the allowed protocols defined.

31645.patch borrows the check on https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=31771#L3078 in esc_url that is exempting relative URLs from the protocol check and adds it to _limit_url as a trigger to return null.

@kraftbj
7 years ago

Correct patch. Previous one included additional code submitted against different ticket.

#2 @DrewAPicture
7 years ago

  • Milestone changed from Awaiting Review to 4.2

#3 @SergeyBiryukov
7 years ago

31645.3.patch makes the comment a bit clearer and fixes a typo.

@azaozz
7 years ago

#4 @azaozz
7 years ago

Thinking we should specifically reject <link rel="canonical" ...> URLs that are not absolute or protocol relative. For the rest (which are mostly image src), we should try to prepend the domain if they are root-relative. (Still not sure if should do that in PHP or JS, needs more testing).

@azaozz
7 years ago

#5 @azaozz
7 years ago

In 31645.5.patch: reject canonical URLs if they are not absolute or protocol relative before we clean them.

Last edited 7 years ago by azaozz (previous) (diff)

#6 @kraftbj
7 years ago

Tested .5 with the following variations of the canonical URL:

http://example.com/123.html
https://example.com/123.html
//example.com/123.html
/123.html
123.html

In the first three, the canonical URL is used and the latter two the URL is rejected, all as expected.

We're scraping canonical, shortlink, icon—is there a reason we're okay with relative URLs for shortlinks or the icon? Shortlinks appear to be a very last resort for a canonical URL. I don't see what we're using the icon so far.

#7 @stephdau
7 years ago

Why would we reject canonical links from root?
I've suggested dealing with them as so in https://core.trac.wordpress.org/attachment/ticket/31637/31637.2.diff (other ticket)

if ( preg_match( '/^[\/]{1}[^\/]+/', $link ) ) { 
    $link = preg_replace( '/^(https?:\/\/[^\/]+)(\/.*)?$/', '\1', $data['u'] ) . $link; 
} 

@stephdau
7 years ago

Suggestion to make canonical links fully qualified.

#8 @stephdau
7 years ago

So yes, I think we should deal with this inside get_canonical_link() instead, especially since it itself uses alternative data keys if a canonical link isn't available, some which might have the same symptoms.

I'd do with something along the lines of 31645-make-fully-qualified.diff, and then make sure that whatever uses get_canonical_link() tests for en empty return (which we should have anyways, if we did not yet).

My patch doesn't process blah.html, or ../blah.html, etc. Just /blah and blah.com/blah, but simply for the sake of giving a simple example. It could.

@azaozz
7 years ago

#9 @azaozz
7 years ago

In 31645.6.patch: prepend the protocol and domain to all root-relative URLs.

#10 follow-up: @stephdau
7 years ago

31645.6.patch will not work as is because $url goes through esc_url_raw() before being tested, which prepends http:// to whatever value is passed to it... So 123.html becomes http://123.html/.

Last edited 7 years ago by stephdau (previous) (diff)

@stephdau
7 years ago

@stephdau
7 years ago

Same as 31645.diff, but with .patch so it gets renamed with 31645.7.patch upon upload. :)

#12 @kraftbj
7 years ago

Tried the test URLs previously and threw in a couple of random WTH examples (bought -1 beers at the bar)

http://example.com/123.html -- WORKS (use canonical)
https://example.com/123.html -- WORKS (use canonical)
//example.com/123.html -- WORKS (use canonical, plus http:)
/123.html -- WORKS (use canonical, prepended by parent domain)
123.html -- WORKS (ignore canonical, uses parent URL)
mailto:example@example.com -- WORKS (ignore canonical, uses parent URL)
ftp://example.com -- WORKS (ignore canonical, uses parent URL)

#13 in reply to: ↑ 10 @azaozz
7 years ago

Replying to stephdau:

31645.6.patch will not work as is because $url goes through esc_url_raw() before being tested, which prepends http:// to whatever value is passed to it... So 123.html becomes http://123.html/.

Right, we have to run that test before esc_url_raw() as it will prepend http:// to some relative URLs.

Looking at 31645.7.patch:

  • '/^[\/]{1}[^\/]+/' is exactly the same as '%^/[^/]+%' and $url{0} === '/' && $url{1} !== '/' with the last one being much faster than any PCRE function.
  • This '/^[\/]{2}[^\/]+/' matches protocol-relative URLs, then we prepend the current protocol to them. Not sure this is desirable. URLs starting with // are the best choice for links, embeds, images, etc. (as long as the server supports both http and https). They will never trigger "Mixed/Insecure content" warnings. If the page is telling us to use these, we should :)
  • At the end we just return the whole URL passed by the user? So if there is an image src ../../assets/images/test.gif we will replace the src with the page's URL. We should be rejecting non-root relative URLs.

We can try to extrapolate the absolute URL out of the page's URL and an relative image src. We can attempt that or discard relative image sources. Shouldn't return a wrong src though.

@azaozz
7 years ago

#14 @azaozz
7 years ago

How about 31645.8.patch. It ensures we have an absolute or protocol relative URL before passing it through esc_url_raw(). It is dependent on having a proper u set, i.e. the page the user opened PT on or the URL they pasted. Also changed the (weird looking) string offset $str{0}, $str{1} checks with a regex, so it is easier to read :)

#15 @azaozz
7 years ago

  • Owner set to azaozz
  • Resolution set to fixed
  • Status changed from new to closed

In 31818:

Press This: prepend the site domain to root-relative URLs. Reject other relative URLs.
Part props stephdau, kraftbj. Fixes #31645.

#16 @SergeyBiryukov
7 years ago

In 31837:

Make a comment clearer (without implying inversed logic) and fix a typo.

see #31645.

Note: See TracTickets for help on using tickets.