Make WordPress Core

Opened 14 years ago

Last modified 5 months ago

#14773 reviewing defect (bug)

Error in slug parsing leads to unlimited URLs for the same article = duplicate content

Reported by: thermoman's profile thermoman Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 2.5
Component: Canonical Keywords: has-patch needs-testing
Focuses: Cc:

Description

an example says more than 1000 words:

right url:

http://ahmongwoman.wordpress.com/2010/09/02/i-am-not-hmong-and-i-dont-speak-spanish/

wrong urls:

http://ahmongwoman.wordpress.com/2010/09/02/i-am-not------hmong-and-i-dont-speak-spanish/

http://ahmongwoman.wordpress.com/2010/09/02/i----am-not-hmong-and-i-dont-speak-spanish/

http://ahmongwoman.wordpress.com/2010/09/02/i-am-not-----hmong-and-i-------dont-speak---------spanish/

Problem:

Wordpress returns the article with HTTP status code 200 for the wrong URLs.

This is a serious issue for people regarding search engine optimization (duplicate content).

Expected results:

Wordpress returns HTTP Status code 301 with Location-Header and right URL to the client.

Attachments (7)

12456.diff (1.2 KB) - added by wonderboymusic 11 years ago.
12456.2.diff (1.3 KB) - added by wonderboymusic 11 years ago.
tests.14773.1.diff (1.7 KB) - added by atimmer 11 years ago.
12456.3.diff (1.3 KB) - added by atimmer 11 years ago.
tests.14773.2.diff (970 bytes) - added by atimmer 11 years ago.
12456.4.diff (2.2 KB) - added by atimmer 10 years ago.
12456.5.diff (2.3 KB) - added by boonebgorges 9 years ago.
Contains .4.patch plus the extracted existing test. See #30284.

Download all attachments as: .zip

Change History (37)

#1 @thermoman
14 years ago

Tested with several wordpress blogs - not only with wordpress.com hosted blogs.

Bug is confirmed to be an issue in 3.0 and 2.5

#2 @kawauso
13 years ago

Discussed in #wordpress and explained <meta rel="canonical" />.

Behaviour with multiple dashes is inconsistent with behaviour for incomplete URLs.

#3 @scribu
13 years ago

  • Component changed from Permalinks to Canonical

Shouldn't redirect_canonical() be taking care of this?

#4 @dd32
13 years ago

  • Keywords 3.2-early added; dash slug duplicate content removed
  • Milestone changed from Awaiting Review to Future Release
  • Owner set to dd32
  • Status changed from new to accepted

Shouldn't redirect_canonical() be taking care of this?

Thats where it should be handled, However, Canonical at present mainly covers redirecting query vars to rewritten url's, along with common duplicate content url's (such as those using %category%).

My patch on #12456 should be able to deal with this case.

#5 @wonderboymusic
11 years ago

  • Milestone changed from Future Release to 3.6

Kudos to all of us for letting these bugs sit so long - the dd32 patch fixes this as well

#6 @markjaquith
11 years ago

What about HTTP vs HTTPS? Wouldn't this change redirect from HTTPS to HTTP? That would be a breaking change. I think the check has to be more conservative and just make sure the post slug exists in the URL, or do a direct slug comparison.

#7 @wonderboymusic
11 years ago

Now wrapped in set_url_scheme

#8 @dd32
11 years ago

  • Owner dd32 deleted
  • Status changed from accepted to assigned

#9 @ericlewis
11 years ago

  • Keywords has-patch added

#10 @nacin
11 years ago

  • Keywords needs-unit-tests added
  • Milestone changed from 3.6 to Future Release
  • Version set to 2.5

Needs unit tests like whoa.

#11 @wonderboymusic
11 years ago

  • Milestone changed from Future Release to 3.7

#12 follow-up: @atimmer
11 years ago

I have added tests.14773.1.diff which tests common wordpress canonical functionality and adds this ticket to it as well, for which the tests fail.

According to the tests it only happens if all the words in the URL are already present, when only 1 word of the page_name is present the URL will be right because the page_name is not complete.

Edit:
Adding to this, the patch 12456.2.diff fixes only 1 of the testcases.
It fixes: "/this--should-be-resolved-" -> "/this-should-be-resolved/"
It does not fix: "/this----should---be---resolved-" -> "/this-should-be-resolved/"

Last edited 11 years ago by atimmer (previous) (diff)

#13 @atimmer
11 years ago

  • Cc atimmermans@… added

#14 in reply to: ↑ 12 @duck_
11 years ago

Replying to atimmer:

I have added tests.14773.1.diff which tests common wordpress canonical functionality and adds this ticket to it as well, for which the tests fail.

Thanks for the tests! Not a big deal, but I don't really like the name of the class. Maybe we just need to increase the number of tests in Tests_Canonical which actually already has a test for this ticket.

Edit:
Adding to this, the patch 12456.2.diff fixes only 1 of the testcases.
It fixes: "/this--should-be-resolved-" -> "/this-should-be-resolved/"
It does not fix: "/this----should---be---resolved-" -> "/this-should-be-resolved/"

This isn't a problem with the patch since these URLs 404 on trunk. It looks like the problem is that the multiple -s are being picked up as octet placeholders by sanitize_title_with_dashes(), i.e. ---be--- becomes %be. Maybe we should open a ticket to make a better placeholder. (Edit: #25021)

Last edited 11 years ago by duck_ (previous) (diff)

@atimmer
11 years ago

#15 @atimmer
11 years ago

12456.3.diff refreshes the patch against current trunk and the new repository structure.

Last edited 11 years ago by atimmer (previous) (diff)

#16 @atimmer
11 years ago

Added tests.14773.2.diff

It seemed like the canonical test class already has a test for this ticket, I have added tests to test the dashes at the front and the back of the url. I also refreshed the patch and the patch fixes all 3 test cases.

#17 @nacin
10 years ago

  • Milestone changed from 3.7 to 3.8

Hi atimmer, could you combine your tests and the patch together?

@atimmer
10 years ago

#18 @atimmer
10 years ago

12456.4.diff combines the unit tests and the fix.

#19 @nacin
10 years ago

  • Keywords needs-unit-tests removed
  • Milestone changed from 3.8 to Future Release
  • Owner set to dd32
  • Status changed from assigned to reviewing

@boonebgorges
9 years ago

Contains .4.patch plus the extracted existing test. See #30284.

#20 @chriscct7
8 years ago

  • Keywords needs-refresh added; 3.2-early removed

#21 @SergeyBiryukov
7 years ago

#39833 was marked as a duplicate.

#23 @ivantipov
7 years ago

What would be the repercussions of turning off sanitization on query to fix this, as such?

function raw_sanitize_title($title, $raw_title, $context) {
    if ($context === 'query') {
        return $raw_title;
    }

    return $title;
}
add_filter('sanitize_title', 'raw_sanitize_title', 10, 3);

#24 @SergeyBiryukov
2 years ago

#54471 was marked as a duplicate.

#25 @SergeyBiryukov
16 months ago

#57177 was marked as a duplicate.

#26 @SergeyBiryukov
16 months ago

#57177 was marked as a duplicate.

#27 @SergeyBiryukov
12 months ago

#57980 was marked as a duplicate.

This ticket was mentioned in PR #313 on WordPress/wporg-main-2022 by @ryelle.


7 months ago
#28

  • Keywords needs-refresh removed

This will reverse the logic for loading the redesign theme, so that pages are opted-out of the new theme, rather than opted in. It also makes the page check case-insensitive. This fixes some edge cases where technically-valid page URLs are used, but the theme switcher fails to load the correct template.

For example, the following URLs should now load the correct theme & content.

  • Capitalized slugs: wordpress.org/About (or Blocks, as the original issue found, but there's a workaround for that page in place now)
  • Slugs with dots: wordpress.org/download/releases/6.3/ (IIRC this was reported in slack but not followed up on)
  • Arbitrary dashes: wordpress.org/download/beta----nightly/
  • More dots: wordpress.org/download/beta....nightly/

Ideally these URLs should redirect to the real canonical URL, but since they are functional links, the theme should at least not be broken. There are core tickets for the arbitrary dashes and dots in permalinks (which also applies to the 6.3 issue despite seeming related to the post title).

See https://github.com/WordPress/wporg-main-2022/issues/245#issuecomment-1675270164, https://meta.trac.wordpress.org/ticket/7204

### How to test the changes in this Pull Request:

  1. Check out the branch, test locally or sync to sandbox (mu-plugins/main-network/theme-switcher.php)
  2. Try visiting any of the example URLs
  3. The page should load the new theme and show the correct content
  4. Try other valid pages, they should continue to work
  5. Visit /hosting/ or another "old theme" page
  6. It should still use the old theme & content

#29 @dd32
7 months ago

  • Owner dd32 deleted

#30 @oglekler
5 months ago

  • Keywords needs-testing added
Note: See TracTickets for help on using tickets.