#36638 closed task (blessed) (fixed)
Detect broken URLs in the editor
Reported by: | iseulde | Owned by: | azaozz |
---|---|---|---|
Milestone: | 4.6 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Editor | Keywords: | has-patch commit |
Focuses: | ui | Cc: |
Description
Chatted a bit with @azaozz. We could detect broken URLs in the editor when the user adds one by sending a request to the server and then doing a HEAD request and check the response code. If it's 4xx, we could mark the bad link in the editor and if the user clicks on it show the toolbar with a message. This is to prevent accidental user mistakes (wrong copy/paste, entering query by accident...)
Another thing we could do is warn for mixed content.
Attachments (14)
Change History (66)
#2
@
9 years ago
This patch currently checks only when a user adds a link, then when the post is saved it strips the attribute denoting that it's a bad link.
Should the links also be scanned & marked when a post is opened, or in any other scenarios?
This ticket was mentioned in Slack in #core by chriscct7. View the logs.
9 years ago
#4
@
9 years ago
- Keywords needs-unit-tests needs-patch added
- Owner set to azaozz
- Status changed from new to assigned
The patch needs some adjustments, $_POST['check_url']
should probably be isset checked before running with it.
#5
follow-up:
↓ 6
@
9 years ago
Flagging a url as "bad" when a HEAD request returns a 4xx
response would also flag sites that require authentication, and would result in false negatives.
Maybe limit to just 404
or allow a filter to disable.
#6
in reply to:
↑ 5
@
9 years ago
Replying to rachelbaker:
Flagging a url as "bad" when a HEAD request returns a
4xx
response would also flag sites that require authentication, and would result in false negatives.
Maybe limit to just
404
or allow a filter to disable.
I think going for (only) 404 is feasable, avoiding side-effects and still offering a huge help for detecting false links. We are talking about the 'everyday-user', not developer, right? :)
#7
@
9 years ago
I think we should check for both 404 and 410. The latter one is basically a permanent 404.
This ticket was mentioned in Slack in #core by voldemortensen. View the logs.
8 years ago
#9
@
8 years ago
- Milestone changed from 4.6 to Future Release
Still needs some work, punting from 4.6. This can definitely be brought back into the milestone if updates are made to the patch.
#10
@
8 years ago
- Keywords needs-testing added; needs-patch removed
- Milestone changed from Future Release to 4.6
In 36638.1.patch:
- Use capabilities check and a nonce for the AJAX request.
- Use
wp_safe_remote_get()
to check the URL. Using a HEAD request sounds better, however: https://core.trac.wordpress.org/browser/tags/4.5.2/src/wp-admin/includes/upgrade.php#L331. Can probably usewp_remote_get()
instead.get_headers()
(http://php.net/manual/en/function.get-headers.php) can possibly be used too however it has some drawbacks. - Only "fail" URLs when very malformed, domain cannot be found or the response is 404. This may need revising.
- Add the error message as title on the URL in the inline toolbar. TODO: is it worth keeping this?
- Make the link text white on red background and the URL in the inline toolbar red when the test is negative.
#11
follow-up:
↓ 12
@
8 years ago
The patch applied cleanly but I added a URL link into the main content and saved and I did not see any error messages. The test URL used was https://www.google.co.uk/new
#12
in reply to:
↑ 11
@
8 years ago
Replying to lukecavanagh:
It doesn't show error message (at least for now). It highlights the link (white text on red background) and turns the link in the "link view" toolbar red. This can still change pending UI/UX feedback.
This ticket was mentioned in Slack in #core-editor by azaozz. View the logs.
8 years ago
#15
@
8 years ago
Thanks for the info. One way or the other this is a very useful feature to have in core.
#19
@
8 years ago
Ugh, sorry @rockwell15, should have included you in the props above. Will on the next commit :)
#20
@
8 years ago
Above patch:
- Added white outline for contrast on darker backgrounds.
- Changed red colour in toolbar. Maybe should have a hove colour from https://codepen.io/hugobaeta/pen/RNOzoV just like normal links in the admin have it.
This ticket was mentioned in Slack in #core by ocean90. View the logs.
8 years ago
#23
@
8 years ago
Seeing the message "Not found, HTTP error 404." as a return value for an invalid page request, think we can improve upon this a bit.
Suggestions:
The target page does not exist.
The page could not be found.
#24
follow-up:
↓ 25
@
8 years ago
- Keywords needs-patch added
- Type changed from enhancement to task (blessed)
The current approach seems fine, but I think it requires a few tweaks.
- It doesn't detect broken URLs when pasting URLs over text.
- The visual feedback is rough, see 36638.highlighting.png. The red underline is coming from a spellchecker, the red dotted outline is for the broken URL.
- The title attribute might be not the best idea in terms of a11y. Maybe we should use
wp.a11y.speak()
? cc: @afercia
#25
in reply to:
↑ 24
@
8 years ago
Replying to ocean90:
- The title attribute might be not the best idea in terms of a11y. Maybe we should use
wp.a11y.speak()
?
Yep I'd recommend to don't add new title attributes. From an a11y perspective when the link is broken there's no audible feedback and a barely noticeable visible feedback (the title attribute "tooltip").
To get the error message, screen reader users should (unlikely) go through the following process:
- after the link is inserted, press Alt+F8 to move focus to the link toolbar
- explore the link toolbar using the arrow keys (not the tab key since the displayed link is not focusable)
- finally, only when the screen reader is reading out the link displayed in the toolbar they would get the error message in the title attribute, see below:
I'd consider @ocean90 suggestion about using wp.a11y.speak()
to convey an audible warning message.
For sighted users also, the title attribute is barely noticeable and not so useful on touch devices. It would be great to try to find a good idea for a better visual feedback.
About colors, worth reminding the Editor background may be styled, for example:
Lastly, I'd suggest to review the error messages wording. As an average users I'd be probably be confused by a message saying something about "host name" or "HTTP error 404". I'd just need to know the link is broken :) As far as I see, the current error messages are:
'Invalid host name.' 'Not found, HTTP error 404.'
Maybe using a generic, simpler, sentence followed by the "technical" error code could work?
#26
@
8 years ago
What about using contextual links within the url? say instead of entering http://mysite.org/page/subpage I just entered in /page/subpage would it automatically try to insert the home_url in front to check the validity of the url entered?
#27
@
8 years ago
In 36638.5.patch:
- Remove the title as suggested in #comment:25.
- Fix/improve the URL testing regex in the plugin and the wpLink script.
#28
follow-up:
↓ 30
@
8 years ago
- Keywords needs-patch removed
I'm having some "second thoughts" about this.
- We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:
- server is unreachable
- bad request (400)
- authentication error (401, 403)
- Internal Server Error and the rest of the 5xx codes, etc.
Even if we add some of there, there is no good way to account for all (some may be temporary, etc.).
- Many web servers block the
WordPress
user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.
In these terms we will miss a lot of broken URLs. This makes it an unreliable check. Furthermore immediately after inserting a link the URL is visible in the inline link toolbar. If in doubt the user can test it very easy by clicking on it.
Perhaps it will be better to just do a quick regex test for malformed URLs.
This ticket was mentioned in Slack in #core-editor by azaozz. View the logs.
8 years ago
#30
in reply to:
↑ 28
;
follow-up:
↓ 32
@
8 years ago
Replying to azaozz:
We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:
IMO the current checks are enough, reliable and cover the most common errors. But more important, those can be fixed by users themselves. Producing a 400 error seems pretty hard, it's usually the result of sending bad data to an API service. Authentication errors should be ignored because the current user and their readers might have access to it. 5xx errors are also unlikely and these should (hopefully) only be temporary.
- Many web servers block the
WordPress
user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.
This is the first time I'm hearing this. :) Do you happen to know what the status code is? It's probably not 404.
#31
@
8 years ago
I have the following in my nginx server block: if ($http_user_agent ~ WordPress) { return 444; }
http://serverfault.com/questions/787522/attempting-to-block-wordpress-useragent-is-overloading-cpu
blocking because of hacking attempts, especially XML-RPC exploit, which is history forever, not ;-) ?
#32
in reply to:
↑ 30
@
8 years ago
Replying to ocean90:
Replying to azaozz:
- Many web servers block the
WordPress
user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.This is the first time I'm hearing this. :) Do you happen to know what the status code is? It's probably not 404.
Medium.com returns a 403
.
$ curl -v --user-agent "WordPress" https://medium.com * Rebuilt URL to: https://medium.com/ * Trying 104.16.123.127... * Connected to medium.com (104.16.123.127) port 443 (#0) * TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 * Server certificate: medium.com * Server certificate: DigiCert SHA2 Extended Validation Server CA * Server certificate: DigiCert High Assurance EV Root CA > GET / HTTP/1.1 > Host: medium.com > User-Agent: WordPress > Accept: */* > < HTTP/1.1 403 Forbidden < Date: Wed, 20 Jul 2016 00:16:39 GMT < Content-Type: text/html; charset=UTF-8 < Transfer-Encoding: chunked < Connection: keep-alive < Set-Cookie: __cfduid=daac0e730a603793f2b0f7c5577dbd7d71468973799; expires=Thu, 20-Jul-17 00:16:39 GMT; path=/; domain=.medium.com; HttpOnly < Cache-Control: max-age=15 < Expires: Wed, 20 Jul 2016 00:16:54 GMT < X-Frame-Options: SAMEORIGIN < Strict-Transport-Security: max-age=15552000; includeSubDomains; preload < X-Content-Type-Options: nosniff < Server: cloudflare-nginx < CF-RAY: 2c5239c8f7aa10e7-ORD
#33
@
8 years ago
In 36638.7.patch: also added wp.a11y.speak()
as suggested above. However if we keep checking the URLs by AJAX, that speak()
will be triggered few seconds after inserting a link. It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.
Do you happen to know what the status code is? It's probably not 404.
No, never seen 404. There is (perhaps) a chance that some servers will let the specific link check UA WordPress URL Test
through, but still thinking it makes this unreliable.
#34
@
8 years ago
Quickly tested 36638.7.patch especially the wp.a11y.speak()
part. See screenshot below, using a small personal dev plugin to log messages in the console:
Technically everything works nicely but I'd recommend a few improvements.
- use the
assertive
parameter: since it's an error, it makes sense to force screen readers to announce this message as soon as possible - when there is an error, the following message "Link inserted." should not be dispatched: testing with VoiceOver, it gets confused and announces just the last message (no error announced at all)
It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.
maybe there's no need to use the URL in the message: users just entered that URL, they know what they typed. Yes, they could have typed incorrectly but maybe they need just to be informed the link destination is not reachable and asked to check the link?
#35
@
8 years ago
More related to wp_remote_get()
but worth reminding some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL. I usually change my DNS but I forgot to do that on the machine I was using for testing. Thus, with my ISP default DNS, any not existent URL I was intentionally typing was "good" and no error appeared. Specifically, I was getting a 302 followed by a 200:
I'm afraid there's no much we can do here other than blaming bad ISPs practices :(
About the JS part, I'm not sure I understand the checks for visible()
for example in setLinkError()
where toolbar.visible()
is always false for me.
#36
@
8 years ago
The only things changed in 36638.8.patch are the assertive
parameter for the URL check message and the message string. I'd propose to simplify this message and try to describe what actually happened: the link was inserted but the URL check returned an error. We can't guarantee the link destination is really unreachable (could be temporarily unreachable for example) so I'd say to change the "error" in a more relaxed "warning":
Warning: the link has been inserted but the link destination cannot be reached.
The assertive
parameter should guarantee that even if the "Link inserted." message is dispatched before the URL check (and as far as I see this can happen depending for example on the time required by the AJAX response), anything that is being announced by screen readers int hat moment will be interrupted by the warning.
This ticket was mentioned in Slack in #core by ocean90. View the logs.
8 years ago
#38
@
8 years ago
some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL.
I'm afraid there's no much we can do here other than blaming bad ISPs practices :(
Yes, actually most ISPs do that now (AFAIK). Yet another reason the URL test will fail. If we don't follow the redirect, the response will be "undecided", if we follow it, we get 200. In both cases we don't flag the URL as "bad".
I'm not sure I understand the checks for visible() for example in setLinkError() where toolbar.visible() is always false for me.
setLinkError()
is used in two places: one is after a regex check for well formed URL. As that is very fast, the toolbar.visible()
will always be false. The other is after the AJAX response. The toolbar will usually be visible there depending on how long the AJAX takes (if it's very very fast, it may not be visible yet).
36638.8.patch looks good. Going to commit it now to get the accessibility improvements in.
#39
follow-up:
↓ 41
@
8 years ago
Warning: the link has been inserted but the link destination cannot be reached
=>
Warning: the destination of the inserted link cannot be reached
the same a bit shorter, imho, or even
Warning: destination of inserted link cannot be reached
#41
in reply to:
↑ 39
@
8 years ago
Replying to Presskopp:
Shortened the warning a bit. We can revise it again in the next few days :)
This ticket was mentioned in Slack in #core by gma992. View the logs.
8 years ago
This ticket was mentioned in Slack in #core-editor by ocean90. View the logs.
8 years ago
#44
@
8 years ago
- Keywords needs-patch added
As mentioned on Slack, the current regex needs to be improved because it only supports ASCII chars. URLs like http://кто.рф/eng/, http://münchen.de or http://香港大學.香港/about/ shouldn't be marked as invalid.
#45
follow-up:
↓ 47
@
8 years ago
There seems to be no perfect solution, see links below
What about 'localhost' or 'https://localhost' and all those strange things?
https://mathiasbynens.be/demo/url-regex
http://hx3.de/software-webentwicklung-23/php-url-uri-validation-idn-internationalized-domain-name-support-17404/
http://hx3.de/software-webentwicklung-23/php-e-mail-validation-idn-internationalized-domain-name-support-17398/
#46
follow-up:
↓ 48
@
8 years ago
I'd just like to voice an opinion that performing a HTTP request to "validate" each URL isn't likely to be very workable in the long run.
The scenario's that I'm looking at are things like..
- Editors may have a URL which is currently 403/404 which they wish to have in a draft, causing them to just ignore those notifications all the time
- Linking to sites which intentionally block, or rate limit the WordPress UA would incorrectly fail (Leading to a bad user experience and inability to trust it going forward)
- The link would be marked inaccessible for sites which are temporarily unavailable, slow, or inaccessible from the servers network
- The WordPress UA might get rate limited or blocked on more sites/networks for being seen as making a lot of irrelevant requests. We already see this happening with blocking of the WordPress UA (as noted above) due to the XML-RPC attacks.
If the HTTP request is kept, then
- a HEAD should be performed instead IMHO
- The
limit_response_size
parameter should be set to avoid issues when linking to large documents. - The
Relative URL
handling could probably useWP_HTTP::make_absolute_url()
Personally I feel that anything more than a regular expression is probably overkill here for the limited benefit it provides.
#47
in reply to:
↑ 45
@
8 years ago
Replying to Presskopp:
What about 'localhost' or 'https://localhost' and all those strange things?
This is a valid URL however it should never appear on a "live" web page. Even think some browsers restrict such links.
I like the 8th and 11th regex from there (need to convert them to JS). They are listed as failing to detect some of the IPs but think this is a "feature". We don't want to flag URLs to private network segments as bad.
#48
in reply to:
↑ 46
@
8 years ago
Replying to dd32:
I'd just like to voice an opinion that performing a HTTP request to "validate" each URL isn't likely to be very workable in the long run.
Yeah, agreed.
In 36638.9.patch:
- Remove proxying through WordPress to test if an URL exists.
- Fix/enhance the regex that tests if the URL is well formed.
#49
@
8 years ago
- Keywords has-patch commit added; needs-unit-tests needs-testing needs-patch removed
36638.9.patch looks good to me.
#51
@
8 years ago
What about sites with no CORS headers? This feature cannot possibly be effective if the external request is performed in the browser.
This is the logical next step for enhancing how we handle links (and URLs in general) in the editor. Something like "spellcheck for URLs" :)
I'm hoping we will have time to implement it in this cycle, or at least start.