#31645 closed defect (bug) (fixed)
Press This: Reject relative URLs when scraping source html
Reported by: | kraftbj | Owned by: | azaozz |
---|---|---|---|
Milestone: | 4.2 | Priority: | normal |
Severity: | normal | Version: | 4.2 |
Component: | Press This | Keywords: | has-patch |
Focuses: | Cc: |
Description
We're relying on the <link rel="canonical" href=...>
tag for the link rendered for a Pressed page. The standard allows relative URLs in that field, which are currently being allowed in PT.
Steps to duplicate:
- Create a test html page with
<html> <head> <link rel="canonical" href="/test.html"> <title>Test</title> </head> <body> <h1>TEST</h1> </body> </html>
- Press that page
- Check URL used in the "source" link.
Expected: An absolute URL
Actual: Relative URL based on the href tag.
Attachments (10)
Change History (26)
#1
@
9 years ago
- Keywords has-patch added
esc_url_raw
will purposely allow relative URLs despite having the allowed protocols defined.
31645.patch borrows the check on https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php?rev=31771#L3078 in esc_url
that is exempting relative URLs from the protocol check and adds it to _limit_url
as a trigger to return null.
@
9 years ago
Correct patch. Previous one included additional code submitted against different ticket.
#3
@
9 years ago
31645.3.patch makes the comment a bit clearer and fixes a typo.
#4
@
9 years ago
Thinking we should specifically reject <link rel="canonical" ...>
URLs that are not absolute or protocol relative. For the rest (which are mostly image src
), we should try to prepend the domain if they are root-relative. (Still not sure if should do that in PHP or JS, needs more testing).
#5
@
9 years ago
In 31645.5.patch: reject canonical URLs if they are not absolute or protocol relative before we clean them.
#6
@
9 years ago
Tested .5 with the following variations of the canonical URL:
http://example.com/123.html https://example.com/123.html //example.com/123.html /123.html 123.html
In the first three, the canonical URL is used and the latter two the URL is rejected, all as expected.
We're scraping canonical, shortlink, icon—is there a reason we're okay with relative URLs for shortlinks or the icon? Shortlinks appear to be a very last resort for a canonical URL. I don't see what we're using the icon so far.
#7
@
9 years ago
Why would we reject canonical links from root?
I've suggested dealing with them as so in https://core.trac.wordpress.org/attachment/ticket/31637/31637.2.diff (other ticket)
if ( preg_match( '/^[\/]{1}[^\/]+/', $link ) ) { $link = preg_replace( '/^(https?:\/\/[^\/]+)(\/.*)?$/', '\1', $data['u'] ) . $link; }
#8
@
9 years ago
So yes, I think we should deal with this inside get_canonical_link()
instead, especially since it itself uses alternative data keys if a canonical link isn't available, some which might have the same symptoms.
I'd do with something along the lines of 31645-make-fully-qualified.diff, and then make sure that whatever uses get_canonical_link()
tests for en empty return (which we should have anyways, if we did not yet).
My patch doesn't process blah.html, or ../blah.html, etc. Just /blah and blah.com/blah, but simply for the sake of giving a simple example. It could.
#10
follow-up:
↓ 13
@
9 years ago
31645.6.patch will not work as is because $url
goes through esc_url_raw()
before being tested, which prepends http://
to whatever value is passed to it... So 123.html
becomes http://123.html/
.
@
9 years ago
Same as 31645.diff
, but with .patch
so it gets renamed with 31645.7.patch
upon upload. :)
#11
@
9 years ago
How about 31645.7.patch?
#12
@
9 years ago
Tried the test URLs previously and threw in a couple of random WTH examples (bought -1 beers at the bar)
http://example.com/123.html -- WORKS (use canonical) https://example.com/123.html -- WORKS (use canonical) //example.com/123.html -- WORKS (use canonical, plus http:) /123.html -- WORKS (use canonical, prepended by parent domain) 123.html -- WORKS (ignore canonical, uses parent URL) mailto:example@example.com -- WORKS (ignore canonical, uses parent URL) ftp://example.com -- WORKS (ignore canonical, uses parent URL)
#13
in reply to:
↑ 10
@
9 years ago
Replying to stephdau:
31645.6.patch will not work as is because
$url
goes throughesc_url_raw()
before being tested, which prependshttp://
to whatever value is passed to it... So123.html
becomeshttp://123.html/
.
Right, we have to run that test before esc_url_raw()
as it will prepend http://
to some relative URLs.
Looking at 31645.7.patch:
'/^[\/]{1}[^\/]+/'
is exactly the same as'%^/[^/]+%'
and$url{0} === '/' && $url{1} !== '/'
with the last one being much faster than any PCRE function.- This
'/^[\/]{2}[^\/]+/'
matches protocol-relative URLs, then we prepend the current protocol to them. Not sure this is desirable. URLs starting with//
are the best choice for links, embeds, images, etc. (as long as the server supports both http and https). They will never trigger "Mixed/Insecure content" warnings. If the page is telling us to use these, we should :) - At the end we just return the whole URL passed by the user? So if there is an image src
../../assets/images/test.gif
we will replace the src with the page's URL. We should be rejecting non-root relative URLs.
We can try to extrapolate the absolute URL out of the page's URL and an relative image src. We can attempt that or discard relative image sources. Shouldn't return a wrong src though.
#14
@
9 years ago
How about 31645.8.patch. It ensures we have an absolute or protocol relative URL before passing it through esc_url_raw()
. It is dependent on having a proper u
set, i.e. the page the user opened PT on or the URL they pasted. Also changed the (weird looking) string offset $str{0}
, $str{1}
checks with a regex, so it is easier to read :)
Exclude URLs that begin with /