WordPress.org

Make WordPress Core

Opened 3 years ago

Last modified 8 months ago

#16881 new defect (bug)

Remove all unwanted 'nofollow' attributes from 'reply to comment' links

Reported by: joelhardi Owned by:
Milestone: Future Release Priority: normal
Severity: minor Version: 3.1
Component: Comments Keywords: has-patch commit
Focuses: Cc:

Description

This should be a trivial fix. r16230 removed the rel="nofollow" attribute from "reply to comment" links in the comments list, for reasons discussed in #10550. In brief, the reasons are that these are not external links that we want to tell search engines not to follow in the first place, and there is actually an SEO penalty from adding the "nofollow" to internal links.

Unfortunately, the "nofollow" was not removed from all of the "reply to comment" and "reply to post" links (for instance, in code branches when the site requires login/registration to comment). This patch corrects the rest of them.

It doesn't delete the "nofollow" attributes we want. For instance, in get_comment_author_link() which escapes links to the comment author's URL.

Additionally, the patch (see line 1110) fixes a minor inconsistency between the return values of get_comment_reply_link() and get_post_reply_link() when get_option('comment_registration') is TRUE.

The version in get_post_reply_link() is missing the class="comment-reply-login" and esc_url() wrap. I've added them.

Attachments (3)

comment-template.17521.php.diff (2.3 KB) - added by joelhardi 3 years ago.
general-template.php.17522.diff (501 bytes) - added by joelhardi 3 years ago.
one-liner to show the robots noindex,nofollow meta tag on replytocom pages (v2)
16881.patch (2.2 KB) - added by c3mdigital 8 months ago.
refresh of remove rel="nofollow" from comment-template.php

Download all attachments as: .zip

Change History (19)

comment:1 scribu3 years ago

  • Keywords dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release

Looks good.

comment:2 scribu3 years ago

  • Keywords commit added

comment:3 solarissmoke3 years ago

  • Keywords 2nd-opinion added

Before committing, please note that there are objections to removing the nofollows - see comment 11 onwards in #10550 and also #16709.

comment:4 joelhardi3 years ago

I submitted the patch thinking that the "policy side" of this decision had been made with r16230 and just wanted to follow through and finish implementing it.

WordPress sets the robots meta tag to "noindex,nofollow" (just verified this in trunk) on the actual ?replytocom pages, so crawlers should not be indexing these pages. I believe that issue is resolved.

I'm actually the last commenter on #10550 (with the robots.txt tip) and I can tell you that in my case those URLs were being crawled only because I don't call wp_head() in my theme, so there's no meta tag. In my case, there's no issue with core.

Although this is OT for this issue, I disagree with the idea of making these links into POST forms, that would break HTTP (not sure where filosofo's interpretation is coming from). These are GET, a parameterized lookup analogous to a search query. The user hasn't submitted any information, only requested a blank form. GET means that these URLs can be cached (by server, by proxy, by CDN, by browser) and controlled with URL rewriting, robots.txt etc. Making them POST would cause load issues, and it would cause people to see the "Do you want to submit a form again?" popup when they click the back button in their browsers.

comment:5 follow-up: nacin3 years ago

Should #10550 be re-opened? It doesn't appear r16230 had enough review.

comment:6 joelhardi3 years ago

Whoops, WordPress actually isn't setting the robots meta tag. It was the sitewide privacy settings doing that.

My thought process was:

  1. WordPress should just set the robots meta tag on ?replytocom pages, that'll address the problem.
  1. I'll look and see if it is. Eureka, it already is.
  1. Facepalm.

Anyway, to me that seems like the obvious solution to these pages being indexed. Should I submit as a separate issue/patch or do you want to reopen #10550? I guess I would suggest:

  • commit this patch to settle the nofollow inconsistency
  • separate issue to insert the robots meta tag on ?replytocom pages to address the indexing issue

comment:7 follow-up: joelhardi3 years ago

The patch I came up with is a one-liner so I attached it here.

There may be a better way organizationally to implement this but it wasn't obvious so I thought I'd post this and take a second look tomorrow. The only other code that checks whether replytocom is set is in comments-template.php -- not sure it makes sense to add an action to wp_head from there.

The one-liner does what I was talking about to address the double-indexing issue.

Incidentally, there are other URLs we might want to protect from indexing always, for instance wp-login.php (I always put /*.php in robots.txt).

Last edited 3 years ago by joelhardi (previous) (diff)

joelhardi3 years ago

one-liner to show the robots noindex,nofollow meta tag on replytocom pages (v2)

comment:8 in reply to: ↑ 7 solarissmoke3 years ago

Replying to joelhardi:

The patch I came up with is a one-liner so I attached it here.

The one-liner does what I was talking about to address the double-indexing issue.

rel=canonical already makes sure content isn't indexed twice. That isn't the problem.

The problem is the significant overhead caused by bots accessing these duplicate pages when there isn't really any need for them. That was why the POST method was suggested in #10550.

comment:9 follow-up: joelhardi3 years ago

Not exactly ... canonical isn't meant to stop double indexing, it just provides info that Google uses when determining how to return results. See here.

Also, it's a recent Google creation, not (yet) a standard.

If you don't want your pages crawled, the robots exclusion standard is the one you want to use. That's precisely what it's for, and Google established nofollow and canonical for other, different reasons.

The noindex meta tag will mean the pages aren't indexed/stored. Will it eliminate crawling? No, since the pages have to be downloaded for the meta tag to be read. But, it may still significantly ameliorate the crawling problem -- we haven't actually implemented it so we don't know, but my guess is that Googlebot at minimum isn't going to re-crawl these URLs as frequently.

So, I think adding the noindex to these pages is an appropriate improvement regardless of whatever the links to these pages look like.

The obvious solution to totally stop unwanted crawling is Disallow: /*/?replytocom= in robots.txt. When the "correct" fix is such a trivial webmaster task, I don't think it's a good idea for WordPress to do lots of incorrect things to try to mimic the same function. It shouldn't abuse HTTP or nofollow specs the way it was doing with nofollow, or would be doing with POST to simulate links.

comment:10 in reply to: ↑ 9 ; follow-up: solarissmoke3 years ago

Replying to joelhardi:

Not exactly ... canonical isn't meant to stop double indexing, it just provides info that Google uses when determining how to return results. See here.

Ok, my mistake.

So, I think adding the noindex to these pages is an appropriate improvement regardless of whatever the links to these pages look like.

Agreed.

The obvious solution to totally stop unwanted crawling is Disallow: /*/?replytocom= in robots.txt. When the "correct" fix is such a trivial webmaster task, I don't think it's a good idea for WordPress to do lots of incorrect things to try to mimic the same function. It shouldn't abuse HTTP or nofollow specs the way it was doing with nofollow, or would be doing with POST to simulate links.

Maybe you're right - after all the only people that will notice the added overhead are likely to be those that are able to create/manage robots.txt files. I still don't like that we're generating a whole new bunch of pages for crawlers to have to visit, though.

comment:11 in reply to: ↑ 10 joelhardi3 years ago

Maybe you're right - after all the only people that will notice the added overhead are likely to be those that are able to create/manage robots.txt files. I still don't like that we're generating a whole new bunch of pages for crawlers to have to visit, though.

Cool. I just brain-dumped into #10550. Basically, I suspect that the "Google is hitting these useless URLs on my website!" reaction will go way down in volume once these pages aren't indexed anymore (which the meta tag does). Once people don't see them anymore when they log onto Webmaster Tools.

Sure, Googlebot may still hit these URLs (in a rate-limited way in the middle of the night), but is it hurting anything? WordPress already does a good job of affirmatively saying "index me" when you publish a post, so it's not like people are waiting around for Googlebot's nightly visit for their pages to show up in Google News or whatever.

comment:12 in reply to: ↑ 5 ; follow-up: hakre3 years ago

Replying to nacin:

Should #10550 be re-opened? It doesn't appear r16230 had enough review.

The nofollow attribute and robots.txt are two different things. The parties discussing in #10550 are in the process to differentiate more. I suggest to open another ticket for another issue to keep things apart.

The review looks good in my eyes.

comment:13 in reply to: ↑ 12 joelhardi3 years ago

Replying to hakre:

The nofollow attribute and robots.txt are two different things. The parties discussing in #10550 are in the process to differentiate more. I suggest to open another ticket for another issue to keep things apart.

I agree, the issues are just being muddied. This ticket and #10550 should be closed.

I can start a new ticket for general-template.php.17522.diff and addressing the "search engine spiders are crawling my ?replytocom URLs" issue.

comment:14 scribu3 years ago

  • Keywords 2nd-opinion removed

This ticket should be closed, after the patch is commited. ;)

comment:15 joelhardi3 years ago

Yes! Commit attachment:comment-template.17521.php.diff first, then close I meant. :)

FYI, I've filed #16893 on the "bots are crawling my ?replytocom pages. stop them!" issue.

c3mdigital8 months ago

refresh of remove rel="nofollow" from comment-template.php

comment:16 c3mdigital8 months ago

Commit the refreshed patch then close.

Note: See TracTickets for help on using tickets.