WordPress.org

Make WordPress Core

Opened 18 months ago

Closed 15 months ago

Last modified 8 months ago

#36638 closed task (blessed) (fixed)

Detect broken URLs in the editor

Reported by: iseulde Owned by: azaozz
Milestone: 4.6 Priority: normal
Severity: normal Version:
Component: Editor Keywords: has-patch commit
Focuses: ui Cc:

Description

Chatted a bit with @azaozz. We could detect broken URLs in the editor when the user adds one by sending a request to the server and then doing a HEAD request and check the response code. If it's 4xx, we could mark the bad link in the editor and if the user clicks on it show the toolbar with a message. This is to prevent accidental user mistakes (wrong copy/paste, entering query by accident...)

Another thing we could do is warn for mixed content.

Attachments (14)

36638.patch (5.4 KB) - added by rockwell15 18 months ago.
36638.1.patch (7.6 KB) - added by azaozz 16 months ago.
404.png (9.8 KB) - added by azaozz 16 months ago.
linethrough.jpg (23.7 KB) - added by Presskopp 16 months ago.
What about a line-through? Would still need some (color)-decision..
36638.2.patch (8.1 KB) - added by azaozz 16 months ago.
bad-link-patch-test.png (25.7 KB) - added by lukecavanagh 16 months ago.
Patch Tested
36638.3.patch (19.7 KB) - added by iseulde 16 months ago.
36638.4.patch (961 bytes) - added by iseulde 16 months ago.
36638.highlighting.png (41.8 KB) - added by ocean90 16 months ago.
36638.5.patch (4.5 KB) - added by azaozz 15 months ago.
36638.6.patch (6.1 KB) - added by azaozz 15 months ago.
36638.7.patch (6.2 KB) - added by azaozz 15 months ago.
36638.8.patch (6.2 KB) - added by afercia 15 months ago.
36638.9.patch (5.8 KB) - added by azaozz 15 months ago.

Download all attachments as: .zip

Change History (66)

#1 @azaozz
18 months ago

  • Milestone changed from Awaiting Review to 4.6

This is the logical next step for enhancing how we handle links (and URLs in general) in the editor. Something like "spellcheck for URLs" :)

I'm hoping we will have time to implement it in this cycle, or at least start.

Last edited 18 months ago by azaozz (previous) (diff)

@rockwell15
18 months ago

#2 @rockwell15
18 months ago

This patch currently checks only when a user adds a link, then when the post is saved it strips the attribute denoting that it's a bad link.

Should the links also be scanned & marked when a post is opened, or in any other scenarios?

This ticket was mentioned in Slack in #core by chriscct7. View the logs.


17 months ago

#4 @chriscct7
17 months ago

  • Keywords needs-unit-tests needs-patch added
  • Owner set to azaozz
  • Status changed from new to assigned

The patch needs some adjustments, $_POST['check_url'] should probably be isset checked before running with it.

#5 follow-up: @rachelbaker
17 months ago

Flagging a url as "bad" when a HEAD request returns a 4xx response would also flag sites that require authentication, and would result in false negatives.

Maybe limit to just 404 or allow a filter to disable.

#6 in reply to: ↑ 5 @Presskopp
17 months ago

Replying to rachelbaker:

Flagging a url as "bad" when a HEAD request returns a 4xx response would also flag sites that require authentication, and would result in false negatives.

Maybe limit to just 404 or allow a filter to disable.

I think going for (only) 404 is feasable, avoiding side-effects and still offering a huge help for detecting false links. We are talking about the 'everyday-user', not developer, right? :)

#7 @thomaswm
17 months ago

I think we should check for both 404 and 410. The latter one is basically a permanent 404.

This ticket was mentioned in Slack in #core by voldemortensen. View the logs.


17 months ago

#9 @voldemortensen
17 months ago

  • Milestone changed from 4.6 to Future Release

Still needs some work, punting from 4.6. This can definitely be brought back into the milestone if updates are made to the patch.

@azaozz
16 months ago

#10 @azaozz
16 months ago

  • Keywords needs-testing added; needs-patch removed
  • Milestone changed from Future Release to 4.6

In 36638.1.patch:

  • Use capabilities check and a nonce for the AJAX request.
  • Use wp_safe_remote_get() to check the URL. Using a HEAD request sounds better, however: https://core.trac.wordpress.org/browser/tags/4.5.2/src/wp-admin/includes/upgrade.php#L331. Can probably use wp_remote_get() instead. get_headers() (http://php.net/manual/en/function.get-headers.php) can possibly be used too however it has some drawbacks.
  • Only "fail" URLs when very malformed, domain cannot be found or the response is 404. This may need revising.
  • Add the error message as title on the URL in the inline toolbar. TODO: is it worth keeping this?
  • Make the link text white on red background and the URL in the inline toolbar red when the test is negative.
Last edited 16 months ago by azaozz (previous) (diff)

#11 follow-up: @lukecavanagh
16 months ago

The patch applied cleanly but I added a URL link into the main content and saved and I did not see any error messages. The test URL used was https://www.google.co.uk/new

#12 in reply to: ↑ 11 @azaozz
16 months ago

Replying to lukecavanagh:

It doesn't show error message (at least for now). It highlights the link (white text on red background) and turns the link in the "link view" toolbar red. This can still change pending UI/UX feedback.

Last edited 16 months ago by azaozz (previous) (diff)

@azaozz
16 months ago

#13 @azaozz
16 months ago

  • Focuses ui added

This ticket was mentioned in Slack in #core-editor by azaozz. View the logs.


16 months ago

#15 @lukecavanagh
16 months ago

Thanks for the info. One way or the other this is a very useful feature to have in core.

@Presskopp
16 months ago

What about a line-through? Would still need some (color)-decision..

@azaozz
16 months ago

#16 @azaozz
16 months ago

In 36638.2.patch:

  • Refresh.
  • Add checking from the wpLink modal.

@lukecavanagh
16 months ago

Patch Tested

#17 @lukecavanagh
16 months ago

Patch applies cleanly and works well.

@iseulde
16 months ago

#18 @azaozz
16 months ago

In 37741:

Editor: after inserting a link detect if the URL is broken, first run.

Props iseulde, azaozz.
See #36638.

Last edited 16 months ago by azaozz (previous) (diff)

#19 @azaozz
16 months ago

Ugh, sorry @rockwell15, should have included you in the props above. Will on the next commit :)

@iseulde
16 months ago

#20 @iseulde
16 months ago

Above patch:

  • Added white outline for contrast on darker backgrounds.
  • Changed red colour in toolbar. Maybe should have a hove colour from https://codepen.io/hugobaeta/pen/RNOzoV just like normal links in the admin have it.

#21 @azaozz
16 months ago

In 37751:

Editor, link checker:

  • Add white outline for contrast on darker backgrounds.
  • Change red colour in toolbar.

Props iseulde. Props rockwell15 for the initial patch.
See #36638.

This ticket was mentioned in Slack in #core by ocean90. View the logs.


16 months ago

#23 @jipmoors
16 months ago

Seeing the message "Not found, HTTP error 404." as a return value for an invalid page request, think we can improve upon this a bit.

Suggestions:
The target page does not exist.
The page could not be found.

#24 follow-up: @ocean90
16 months ago

  • Keywords needs-patch added
  • Type changed from enhancement to task (blessed)

The current approach seems fine, but I think it requires a few tweaks.

  • It doesn't detect broken URLs when pasting URLs over text.
  • The visual feedback is rough, see 36638.highlighting.png. The red underline is coming from a spellchecker, the red dotted outline is for the broken URL.
  • The title attribute might be not the best idea in terms of a11y. Maybe we should use wp.a11y.speak()? cc: @afercia

#25 in reply to: ↑ 24 @afercia
16 months ago

Replying to ocean90:

  • The title attribute might be not the best idea in terms of a11y. Maybe we should use wp.a11y.speak()?

Yep I'd recommend to don't add new title attributes. From an a11y perspective when the link is broken there's no audible feedback and a barely noticeable visible feedback (the title attribute "tooltip").

To get the error message, screen reader users should (unlikely) go through the following process:

  • after the link is inserted, press Alt+F8 to move focus to the link toolbar
  • explore the link toolbar using the arrow keys (not the tab key since the displayed link is not focusable)
  • finally, only when the screen reader is reading out the link displayed in the toolbar they would get the error message in the title attribute, see below:

https://cldup.com/Ek5r3dTHfX.png

I'd consider @ocean90 suggestion about using wp.a11y.speak() to convey an audible warning message.

For sighted users also, the title attribute is barely noticeable and not so useful on touch devices. It would be great to try to find a good idea for a better visual feedback.

About colors, worth reminding the Editor background may be styled, for example:

https://cldup.com/OoCWmaYS2Q.png

Lastly, I'd suggest to review the error messages wording. As an average users I'd be probably be confused by a message saying something about "host name" or "HTTP error 404". I'd just need to know the link is broken :) As far as I see, the current error messages are:

'Invalid host name.'
'Not found, HTTP error 404.'

Maybe using a generic, simpler, sentence followed by the "technical" error code could work?

#26 @majemedia
16 months ago

What about using contextual links within the url? say instead of entering http://mysite.org/page/subpage I just entered in /page/subpage would it automatically try to insert the home_url in front to check the validity of the url entered?

@azaozz
15 months ago

#27 @azaozz
15 months ago

In 36638.5.patch:

  • Remove the title as suggested in #comment:25.
  • Fix/improve the URL testing regex in the plugin and the wpLink script.

#28 follow-up: @azaozz
15 months ago

  • Keywords needs-patch removed

I'm having some "second thoughts" about this.

  1. We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:
    • server is unreachable
    • bad request (400)
    • authentication error (401, 403)
    • Internal Server Error and the rest of the 5xx codes, etc.
      Even if we add some of there, there is no good way to account for all (some may be temporary, etc.).
  2. Many web servers block the WordPress user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.

In these terms we will miss a lot of broken URLs. This makes it an unreliable check. Furthermore immediately after inserting a link the URL is visible in the inline link toolbar. If in doubt the user can test it very easy by clicking on it. Then none of the above restrictions will apply as it will be a standard browser request.

Perhaps it will be better to just do a quick regex test for malformed URLs.

Last edited 15 months ago by azaozz (previous) (diff)

This ticket was mentioned in Slack in #core-editor by azaozz. View the logs.


15 months ago

#30 in reply to: ↑ 28 ; follow-up: @ocean90
15 months ago

Replying to azaozz:

We return an error only when we are sure the URL is wrong: either the domain name cannot be resolved or it returns 404. This leaves several other common cases out, for example:

IMO the current checks are enough, reliable and cover the most common errors. But more important, those can be fixed by users themselves. Producing a 400 error seems pretty hard, it's usually the result of sending bad data to an API service. Authentication errors should be ignored because the current user and their readers might have access to it. 5xx errors are also unlikely and these should (hopefully) only be temporary.

  1. Many web servers block the WordPress user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.

This is the first time I'm hearing this. :) Do you happen to know what the status code is? It's probably not 404.

#31 @Presskopp
15 months ago

I have the following in my nginx server block: if ($http_user_agent ~ WordPress) { return 444; }

http://serverfault.com/questions/787522/attempting-to-block-wordpress-useragent-is-overloading-cpu

blocking because of hacking attempts, especially XML-RPC exploit, which is history forever, not ;-) ?

#32 in reply to: ↑ 30 @rachelbaker
15 months ago

Replying to ocean90:

Replying to azaozz:

  1. Many web servers block the WordPress user agent for different reasons. We probably could use something else in the UA, but... Doesn't sound right.

This is the first time I'm hearing this. :) Do you happen to know what the status code is? It's probably not 404.

Medium.com returns a 403.

$ curl -v --user-agent "WordPress" https://medium.com
* Rebuilt URL to: https://medium.com/
*   Trying 104.16.123.127...
* Connected to medium.com (104.16.123.127) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: medium.com
* Server certificate: DigiCert SHA2 Extended Validation Server CA
* Server certificate: DigiCert High Assurance EV Root CA
> GET / HTTP/1.1
> Host: medium.com
> User-Agent: WordPress
> Accept: */*
>
< HTTP/1.1 403 Forbidden
< Date: Wed, 20 Jul 2016 00:16:39 GMT
< Content-Type: text/html; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=daac0e730a603793f2b0f7c5577dbd7d71468973799; expires=Thu, 20-Jul-17 00:16:39 GMT; path=/; domain=.medium.com; HttpOnly
< Cache-Control: max-age=15
< Expires: Wed, 20 Jul 2016 00:16:54 GMT
< X-Frame-Options: SAMEORIGIN
< Strict-Transport-Security: max-age=15552000; includeSubDomains; preload
< X-Content-Type-Options: nosniff
< Server: cloudflare-nginx
< CF-RAY: 2c5239c8f7aa10e7-ORD

@azaozz
15 months ago

@azaozz
15 months ago

#33 @azaozz
15 months ago

In 36638.7.patch: also added wp.a11y.speak() as suggested above. However if we keep checking the URLs by AJAX, that speak() will be triggered few seconds after inserting a link. It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.

Do you happen to know what the status code is? It's probably not 404.

No, never seen 404. There is (perhaps) a chance that some servers will let the specific link check UA WordPress URL Test through, but still thinking it makes this unreliable.

#34 @afercia
15 months ago

Quickly tested 36638.7.patch especially the wp.a11y.speak() part. See screenshot below, using a small personal dev plugin to log messages in the console:

https://cldup.com/kjEax8A8H0.png

Technically everything works nicely but I'd recommend a few improvements.

  • use the assertive parameter: since it's an error, it makes sense to force screen readers to announce this message as soon as possible
  • when there is an error, the following message "Link inserted." should not be dispatched: testing with VoiceOver, it gets confused and announces just the last message (no error announced at all)

It "speaks" the URL but still a bit unsure if this won't be confusing for screen readers.

maybe there's no need to use the URL in the message: users just entered that URL, they know what they typed. Yes, they could have typed incorrectly but maybe they need just to be informed the link destination is not reachable and asked to check the link?

#35 @afercia
15 months ago

More related to wp_remote_get() but worth reminding some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL. I usually change my DNS but I forgot to do that on the machine I was using for testing. Thus, with my ISP default DNS, any not existent URL I was intentionally typing was "good" and no error appeared. Specifically, I was getting a 302 followed by a 200:

https://cldup.com/ozZUcq4R8i.png

I'm afraid there's no much we can do here other than blaming bad ISPs practices :(

About the JS part, I'm not sure I understand the checks for visible() for example in setLinkError() where toolbar.visible() is always false for me.

@afercia
15 months ago

#36 @afercia
15 months ago

The only things changed in 36638.8.patch are the assertive parameter for the URL check message and the message string. I'd propose to simplify this message and try to describe what actually happened: the link was inserted but the URL check returned an error. We can't guarantee the link destination is really unreachable (could be temporarily unreachable for example) so I'd say to change the "error" in a more relaxed "warning":
Warning: the link has been inserted but the link destination cannot be reached.

The assertive parameter should guarantee that even if the "Link inserted." message is dispatched before the URL check (and as far as I see this can happen depending for example on the time required by the AJAX response), anything that is being announced by screen readers int hat moment will be interrupted by the warning.

This ticket was mentioned in Slack in #core by ocean90. View the logs.


15 months ago

#38 @azaozz
15 months ago

some ISPs redirect to a "courtesy page" (for marketing purposes!) when you request a not existent URL.
I'm afraid there's no much we can do here other than blaming bad ISPs practices :(

Yes, actually most ISPs do that now (AFAIK). Yet another reason the URL test will fail. If we don't follow the redirect, the response will be "undecided", if we follow it, we get 200. In both cases we don't flag the URL as "bad".

I'm not sure I understand the checks for visible() for example in setLinkError() where toolbar.visible() is always false for me.

setLinkError() is used in two places: one is after a regex check for well formed URL. As that is very fast, the toolbar.visible() will always be false. The other is after the AJAX response. The toolbar will usually be visible there depending on how long the AJAX takes (if it's very very fast, it may not be visible yet).

36638.8.patch looks good. Going to commit it now to get the accessibility improvements in.

#39 follow-up: @Presskopp
15 months ago

Warning: the link has been inserted but the link destination cannot be reached

=>

Warning: the destination of the inserted link cannot be reached

the same a bit shorter, imho, or even

Warning: destination of inserted link cannot be reached

#40 @azaozz
15 months ago

In 38126:

TinyMCE, link check:

  • Use wp.a11y.speak() to announce bad URLs.
  • Do not add a title to the link toolbar.
  • Better error message.

Props afercia, azaozz.
See #36638.

#41 in reply to: ↑ 39 @azaozz
15 months ago

Replying to Presskopp:

Shortened the warning a bit. We can revise it again in the next few days :)

This ticket was mentioned in Slack in #core by gma992. View the logs.


15 months ago

This ticket was mentioned in Slack in #core-editor by ocean90. View the logs.


15 months ago

#44 @ocean90
15 months ago

  • Keywords needs-patch added

As mentioned on Slack, the current regex needs to be improved because it only supports ASCII chars. URLs like http://кто.рф/eng/, http://münchen.de or http://香港大學.香港/about/ shouldn't be marked as invalid.

#46 follow-up: @dd32
15 months ago

I'd just like to voice an opinion that performing a HTTP request to "validate" each URL isn't likely to be very workable in the long run.

The scenario's that I'm looking at are things like..

  • Editors may have a URL which is currently 403/404 which they wish to have in a draft, causing them to just ignore those notifications all the time
  • Linking to sites which intentionally block, or rate limit the WordPress UA would incorrectly fail (Leading to a bad user experience and inability to trust it going forward)
  • The link would be marked inaccessible for sites which are temporarily unavailable, slow, or inaccessible from the servers network
  • The WordPress UA might get rate limited or blocked on more sites/networks for being seen as making a lot of irrelevant requests. We already see this happening with blocking of the WordPress UA (as noted above) due to the XML-RPC attacks.

If the HTTP request is kept, then

  • a HEAD should be performed instead IMHO
  • The limit_response_size parameter should be set to avoid issues when linking to large documents.
  • The Relative URL handling could probably use WP_HTTP::make_absolute_url()

Personally I feel that anything more than a regular expression is probably overkill here for the limited benefit it provides.

#47 in reply to: ↑ 45 @azaozz
15 months ago

Replying to Presskopp:

What about 'localhost' or 'https://localhost' and all those strange things?

This is a valid URL however it should never appear on a "live" web page. Even think some browsers restrict such links.

https://mathiasbynens.be/demo/url-regex

I like the 8th and 11th regex from there (need to convert them to JS). They are listed as failing to detect some of the IPs but think this is a "feature". We don't want to flag URLs to private network segments as bad.

Last edited 15 months ago by azaozz (previous) (diff)

@azaozz
15 months ago

#48 in reply to: ↑ 46 @azaozz
15 months ago

Replying to dd32:

I'd just like to voice an opinion that performing a HTTP request to "validate" each URL isn't likely to be very workable in the long run.

Yeah, agreed.

In 36638.9.patch:

  • Remove proxying through WordPress to test if an URL exists.
  • Fix/enhance the regex that tests if the URL is well formed.

#49 @ocean90
15 months ago

  • Keywords has-patch commit added; needs-unit-tests needs-testing needs-patch removed

36638.9.patch looks good to me.

#50 @azaozz
15 months ago

  • Resolution set to fixed
  • Status changed from assigned to closed

In 38159:

TinyMCE, inline link:

  • Remove proxying through WordPress to test if an URL exists.
  • Fix and enhance the regex that tests if the URL is well formed.

Fixes #36638.

#51 @prometh
13 months ago

What about sites with no CORS headers? This feature cannot possibly be effective if the external request is performed in the browser.

Last edited 13 months ago by prometh (previous) (diff)

This ticket was mentioned in Slack in #forums by lukecavanagh. View the logs.


8 months ago

Note: See TracTickets for help on using tickets.