Make WordPress Core

Opened 14 years ago

Closed 9 years ago

Last modified 9 years ago

#16881 closed defect (bug) (wontfix)

Remove all unwanted 'nofollow' attributes from 'reply to comment' links

Reported by: joelhardi's profile joelhardi Owned by:
Milestone: Priority: normal
Severity: minor Version: 3.1
Component: Comments Keywords: has-patch close
Focuses: Cc:

Description

This should be a trivial fix. r16230 removed the rel="nofollow" attribute from "reply to comment" links in the comments list, for reasons discussed in #10550. In brief, the reasons are that these are not external links that we want to tell search engines not to follow in the first place, and there is actually an SEO penalty from adding the "nofollow" to internal links.

Unfortunately, the "nofollow" was not removed from all of the "reply to comment" and "reply to post" links (for instance, in code branches when the site requires login/registration to comment). This patch corrects the rest of them.

It doesn't delete the "nofollow" attributes we want. For instance, in get_comment_author_link() which escapes links to the comment author's URL.

Additionally, the patch (see line 1110) fixes a minor inconsistency between the return values of get_comment_reply_link() and get_post_reply_link() when get_option('comment_registration') is TRUE.

The version in get_post_reply_link() is missing the class="comment-reply-login" and esc_url() wrap. I've added them.

Attachments (3)

comment-template.17521.php.diff (2.3 KB) - added by joelhardi 14 years ago.
general-template.php.17522.diff (501 bytes) - added by joelhardi 14 years ago.
one-liner to show the robots noindex,nofollow meta tag on replytocom pages (v2)
16881.patch (2.2 KB) - added by c3mdigital 11 years ago.
refresh of remove rel="nofollow" from comment-template.php

Download all attachments as: .zip

Change History (27)

#1 @scribu
14 years ago

  • Keywords dev-feedback removed
  • Milestone changed from Awaiting Review to Future Release

Looks good.

#2 @scribu
14 years ago

  • Keywords commit added

#3 @solarissmoke
14 years ago

  • Keywords 2nd-opinion added

Before committing, please note that there are objections to removing the nofollows - see comment 11 onwards in #10550 and also #16709.

#4 @joelhardi
14 years ago

I submitted the patch thinking that the "policy side" of this decision had been made with r16230 and just wanted to follow through and finish implementing it.

WordPress sets the robots meta tag to "noindex,nofollow" (just verified this in trunk) on the actual ?replytocom pages, so crawlers should not be indexing these pages. I believe that issue is resolved.

I'm actually the last commenter on #10550 (with the robots.txt tip) and I can tell you that in my case those URLs were being crawled only because I don't call wp_head() in my theme, so there's no meta tag. In my case, there's no issue with core.

Although this is OT for this issue, I disagree with the idea of making these links into POST forms, that would break HTTP (not sure where filosofo's interpretation is coming from). These are GET, a parameterized lookup analogous to a search query. The user hasn't submitted any information, only requested a blank form. GET means that these URLs can be cached (by server, by proxy, by CDN, by browser) and controlled with URL rewriting, robots.txt etc. Making them POST would cause load issues, and it would cause people to see the "Do you want to submit a form again?" popup when they click the back button in their browsers.

#5 follow-up: @nacin
14 years ago

Should #10550 be re-opened? It doesn't appear r16230 had enough review.

#6 @joelhardi
14 years ago

Whoops, WordPress actually isn't setting the robots meta tag. It was the sitewide privacy settings doing that.

My thought process was:

  1. WordPress should just set the robots meta tag on ?replytocom pages, that'll address the problem.
  1. I'll look and see if it is. Eureka, it already is.
  1. Facepalm.

Anyway, to me that seems like the obvious solution to these pages being indexed. Should I submit as a separate issue/patch or do you want to reopen #10550? I guess I would suggest:

  • commit this patch to settle the nofollow inconsistency
  • separate issue to insert the robots meta tag on ?replytocom pages to address the indexing issue

#7 follow-up: @joelhardi
14 years ago

The patch I came up with is a one-liner so I attached it here.

There may be a better way organizationally to implement this but it wasn't obvious so I thought I'd post this and take a second look tomorrow. The only other code that checks whether replytocom is set is in comments-template.php -- not sure it makes sense to add an action to wp_head from there.

The one-liner does what I was talking about to address the double-indexing issue.

Incidentally, there are other URLs we might want to protect from indexing always, for instance wp-login.php (I always put /*.php in robots.txt).

Last edited 14 years ago by joelhardi (previous) (diff)

@joelhardi
14 years ago

one-liner to show the robots noindex,nofollow meta tag on replytocom pages (v2)

#8 in reply to: ↑ 7 @solarissmoke
14 years ago

Replying to joelhardi:

The patch I came up with is a one-liner so I attached it here.

The one-liner does what I was talking about to address the double-indexing issue.

rel=canonical already makes sure content isn't indexed twice. That isn't the problem.

The problem is the significant overhead caused by bots accessing these duplicate pages when there isn't really any need for them. That was why the POST method was suggested in #10550.

#9 follow-up: @joelhardi
14 years ago

Not exactly ... canonical isn't meant to stop double indexing, it just provides info that Google uses when determining how to return results. See here.

Also, it's a recent Google creation, not (yet) a standard.

If you don't want your pages crawled, the robots exclusion standard is the one you want to use. That's precisely what it's for, and Google established nofollow and canonical for other, different reasons.

The noindex meta tag will mean the pages aren't indexed/stored. Will it eliminate crawling? No, since the pages have to be downloaded for the meta tag to be read. But, it may still significantly ameliorate the crawling problem -- we haven't actually implemented it so we don't know, but my guess is that Googlebot at minimum isn't going to re-crawl these URLs as frequently.

So, I think adding the noindex to these pages is an appropriate improvement regardless of whatever the links to these pages look like.

The obvious solution to totally stop unwanted crawling is Disallow: /*/?replytocom= in robots.txt. When the "correct" fix is such a trivial webmaster task, I don't think it's a good idea for WordPress to do lots of incorrect things to try to mimic the same function. It shouldn't abuse HTTP or nofollow specs the way it was doing with nofollow, or would be doing with POST to simulate links.

#10 in reply to: ↑ 9 ; follow-up: @solarissmoke
14 years ago

Replying to joelhardi:

Not exactly ... canonical isn't meant to stop double indexing, it just provides info that Google uses when determining how to return results. See here.

Ok, my mistake.

So, I think adding the noindex to these pages is an appropriate improvement regardless of whatever the links to these pages look like.

Agreed.

The obvious solution to totally stop unwanted crawling is Disallow: /*/?replytocom= in robots.txt. When the "correct" fix is such a trivial webmaster task, I don't think it's a good idea for WordPress to do lots of incorrect things to try to mimic the same function. It shouldn't abuse HTTP or nofollow specs the way it was doing with nofollow, or would be doing with POST to simulate links.

Maybe you're right - after all the only people that will notice the added overhead are likely to be those that are able to create/manage robots.txt files. I still don't like that we're generating a whole new bunch of pages for crawlers to have to visit, though.

#11 in reply to: ↑ 10 @joelhardi
14 years ago

Maybe you're right - after all the only people that will notice the added overhead are likely to be those that are able to create/manage robots.txt files. I still don't like that we're generating a whole new bunch of pages for crawlers to have to visit, though.

Cool. I just brain-dumped into #10550. Basically, I suspect that the "Google is hitting these useless URLs on my website!" reaction will go way down in volume once these pages aren't indexed anymore (which the meta tag does). Once people don't see them anymore when they log onto Webmaster Tools.

Sure, Googlebot may still hit these URLs (in a rate-limited way in the middle of the night), but is it hurting anything? WordPress already does a good job of affirmatively saying "index me" when you publish a post, so it's not like people are waiting around for Googlebot's nightly visit for their pages to show up in Google News or whatever.

#12 in reply to: ↑ 5 ; follow-up: @hakre
14 years ago

Replying to nacin:

Should #10550 be re-opened? It doesn't appear r16230 had enough review.

The nofollow attribute and robots.txt are two different things. The parties discussing in #10550 are in the process to differentiate more. I suggest to open another ticket for another issue to keep things apart.

The review looks good in my eyes.

#13 in reply to: ↑ 12 @joelhardi
14 years ago

Replying to hakre:

The nofollow attribute and robots.txt are two different things. The parties discussing in #10550 are in the process to differentiate more. I suggest to open another ticket for another issue to keep things apart.

I agree, the issues are just being muddied. This ticket and #10550 should be closed.

I can start a new ticket for general-template.php.17522.diff and addressing the "search engine spiders are crawling my ?replytocom URLs" issue.

#14 @scribu
14 years ago

  • Keywords 2nd-opinion removed

This ticket should be closed, after the patch is commited. ;)

#15 @joelhardi
14 years ago

Yes! Commit attachment:comment-template.17521.php.diff first, then close I meant. :)

FYI, I've filed #16893 on the "bots are crawling my ?replytocom pages. stop them!" issue.

@c3mdigital
11 years ago

refresh of remove rel="nofollow" from comment-template.php

#16 @c3mdigital
11 years ago

Commit the refreshed patch then close.

#17 @wonderboymusic
9 years ago

  • Keywords commit removed

can someone who knows SEO well weigh in here with a fresh perspective? I am not sure what current best practices here are

#18 @joostdevalk
9 years ago

OK so as far as I'm concerned, this change is nonsense. To make sure that it wasn't just me I pinged a contact at Google and he said the same thing.

There are no penalty's for internal nofollows, despite some SEOs claiming the opposite, because they think that'd make sense. In this particular case, on a blog with registration required for commenting, we'd create as many extra crawlable URLs as there are posts in a site. This happens simply because they all link to the login page with the post as the return parameter and that makes it a "unique" URL. Extra value in terms of SEO: none. In fact, because search engines would spend the time crawling those URLs, they wouldn't be crawling other pages on the site, so the actual benefit is negative.

My advice: close as wontfix.

#19 @joostdevalk
9 years ago

A better case for spending time increasing the SEO would be #22889 by the way.

#20 @samuelsidler
9 years ago

  • Keywords close added

Based on comment:18, I agree with wontfix.

#21 @helen
9 years ago

  • Milestone Future Release deleted
  • Resolution set to wontfix
  • Status changed from new to closed

#22 @joelhardi
9 years ago

  • Resolution wontfix deleted
  • Status changed from closed to reopened

Hi ... thanks for the interest in this ticket. However I think the original intent of the ticket was lost in the decision to close it out. My original proposal was to delete this attribute because it was being misused and has no effect on internal links. It's technical debt and a waste of bits. It also causes confusion. People think nofollow controls crawler behavior, when it doesn't.

nofollow is meant to be used on external links to avoid them being indexed, given PageRank or other mythical SEO pixie dust. You can look at the [W3C spec](http://www.w3.org/TR/html5/links.html#linkTypes) for instance.

Maybe ~8 years or so ago, there was an SEO technique of "pagesculpting" to liberally sprinkle rel="nofollow" everywhere, except for a few select links, because at the time this resulted in additional relative pagerank being applied to those remaining links. Google et al. quickly adjusted and have been ignoring "nofollow" on internal links for at least a year or two before when I originally submitted this patch. Since then including it on these particular links has had no purpose.

So I 100% agree with @joostdevalk that there is no SEO purpose to adding this to these links ... there hasn't been for a long time. That is why I want to remove this attribute that is being used for the wrong purpose. #22889 is the right way to deal with the ?replytocom crawling issue.

If you are in favor of retaining the attribute can you please explain why? I am perfectly willing to listen.

#23 @joostdevalk
9 years ago

  • Resolution set to wontfix
  • Status changed from reopened to closed

@joelhardi: Sorry, you're understanding me wrongly.

I'm absolutely not in favor of removing them. Nofollow still works. In fact, the third case in the examples on this page shows exactly why. If you have 1,000 posts, you require commenters to be logged in, Googlebot sees 1,000 different links to the login URL. Having nofollow on these links prevents Google from crawling tons of unneeded URLs.

This improves crawling enormously, therefor the nofollow should stay. As I said above, I've discussed this with Google, they agree, end of story.

#24 @SergeyBiryukov
9 years ago

In 33047:

Restore rel='nofollow' for comment reply links to reduce extra crawling by search engines.

This reverts [16230], which didn't have enough review at the time of commit.

props joostdevalk.
see #22889, #18547, #16881.

Note: See TracTickets for help on using tickets.