Make WordPress Core

Opened 13 years ago

Closed 7 years ago

Last modified 4 years ago

#20383 closed defect (bug) (fixed)

Strip trailing punctuation with canonical URLs

Reported by: nacin's profile nacin Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 4.9 Priority: normal
Severity: normal Version:
Component: Canonical Keywords:
Focuses: Cc:

Description

A follow-up to #7537 where we removed %20 and " " from the end of URLs, we should try to remove all sorts of punctuation from the end of a URL, both URL-encoded and decoded.

Example bad URLs we should surely be able to resolve:

http://ma.tt/2012/03/productivity-per-square-inch/%7B

http://ma.tt/2012/03/productivity-per-square-inch/)

http://ma.tt/2012/03/productivity-per-square-inch/,

http://ma.tt/2012/03/productivity-per-square-inch/!

More difficult when there is no trailing slash in the permalink structure (or the link requested, regardless of permalink structure), but if is_404() is taken into account, it should be doable to trail various pieces of punctuation and see if we can get things to resolve.

Attachments (3)

20383.patch (659 bytes) - added by joostdevalk 11 years ago.
20383.2.patch (1.4 KB) - added by lancewillett 8 years ago.
Updated patch to catch more punctuation; adds unit tests
20383.3.patch (7.8 KB) - added by SergeyBiryukov 8 years ago.

Download all attachments as: .zip

Change History (30)

#1 @n3k4
12 years ago

  • Cc hellokane@… added

@joostdevalk
11 years ago

#2 @joostdevalk
11 years ago

  • Keywords has-patch reporter-feedback added
  • Owner set to joostdevalk
  • Status changed from new to accepted

I added the suggested ones and a couple more into the same rewrite in which we fixed the trailing space. WordPress doesn't allow for these characters in the URL so I think it's safe to assume that if they're in there, at the end of the URL, it's a mistake.

Please do tell me if you disagree, if so I could move this to redirect_guess_404_permalink without issue of course.

#3 @chriscct7
9 years ago

  • Keywords dev-feedback added; reporter-feedback removed

@lancewillett
8 years ago

Updated patch to catch more punctuation; adds unit tests

#4 @lancewillett
8 years ago

  • Keywords has-unit-tests added

.2 patch adds a bit more cases for trailing punctuation, and adds unit tests.

#5 @joostdevalk
8 years ago

That looks like what I meant. @SergeyBiryukov could you have a look?

#6 @lancewillett
8 years ago

@SergeyBiryukov Can you review and test the patch, please?

#7 @SergeyBiryukov
8 years ago

  • Milestone changed from Future Release to 4.8
  • Owner changed from joostdevalk to SergeyBiryukov

#8 @SergeyBiryukov
8 years ago

  • Keywords commit added; dev-feedback removed

I think the regex in 20383.2.patch might accidentally catch international characters in some environments, so I'd like to keep an explicit list of punctuation symbols to remove.

20383.3.patch adds more symbols to the list and more tests.

#9 @SergeyBiryukov
8 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

In 40256:

Canonical: Strip trailing punctuation from permalinks.

Props joostdevalk, lancewillett, SergeyBiryukov.
Fixes #20383.

#10 @ocean90
8 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

I can't get the curly quote tests from [40256] passing:

1) Tests_Canonical_NoRewrite::test with data set #25 ('/?p=358“', array('/?p=358', array('358')), 20383)
Ticket #20383
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'/?p=358'
+'/?p=358�__'

I'm using OS X and already tried several PHP versions: pre-installed PHP 5.6.28 or via homebrew 5.6.29, 7.0.15, 7.1.2 and 7.2.0-dev.

It seems like parse_url() is (sometimes?) not multibyte aware

Using PHP's built-in server I get the following output for http://localhost:8000/test.php?p=358“

$_GET
array(1) {
  ["p"]=>
  string(6) "358”"
}

$_SERVER['REQUEST_URI']
string(24) "/test.php?p=358%E2%80%9D"

parse_url( $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] )
array(4) {
  ["host"]=>
  string(9) "localhost"
  ["port"]=>
  int(8000)
  ["path"]=>
  string(9) "/test.php"
  ["query"]=>
  string(14) "p=358%E2%80%9D"
}

// What the tests are doing:
parse_url( "http://example.org/?p=358”" )
array(4) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(11) "example.org"
  ["path"]=>
  string(1) "/"
  ["query"]=>
  string(8) "p=358�__"
}

// Example form https://github.com/fguillot/picoFeed/issues/167
parse_url( "https://ru.wikipedia.org/wiki/Преступление_и_наказание" )
array(3) {
  ["scheme"]=>
  string(5) "https"
  ["host"]=>
  string(16) "ru.wikipedia.org"
  ["path"]=>
  string(52) "/wiki/�_�_е�_�_�_пление_и_наказание"
}

$_SERVER['REQUEST_URI'] contains the encoded URI which is also used by redirect_canonical(). The encoding is performed client-side, like in your browser or with wget. curl uses the raw URL which produced an "Invalid request (Malformed HTTP request)" error for me.
I'm wondering if we really have to test the decoded strings here.

#11 follow-up: @gitlost
8 years ago

It seems like parse_url() is (sometimes?) not multibyte aware

Another one of those incredibly tedious locale gotchas, this time only affecting non-Linux systems it seems.

If you run this on OSX or Windows 10:

<?php
error_log( "default LC_CTYPE=" . setlocale( LC_CTYPE, '0' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

$before = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "before=$before" );
error_log( "before === " . ( '/wiki/Преступление_и_наказание' === $before ? 'true' : 'false' ) );

error_log( "setting LC_CTYPE=" . setlocale( LC_CTYPE, 'C' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

$after = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "after=$after" );
error_log( "after === " . ( '/wiki/Преступление_и_наказание' === $after ? 'true' : 'false' ) );

then if your default locale is "UTF-8ish" the $before will be corrupted while the $after, with LC_CTYPE set to 'C', won't.

It's due to the libc function iscntrl() returning true for certain non-ASCII characters like \x90 on these systems depending on locale, which makes PHP replace them with underscores (ext/standard/url.c:75).

To get around it one could use the Joomla function, or maybe just wrap parse_url() with before and after setlocales, eg:

<?php
function parse_url_utf8( $url, $component = -1 ) {
        $prev_ctype_locale = setlocale( LC_CTYPE, '0' );
        setlocale( LC_CTYPE, 'C' );
        $ret = parse_url( $url, $component );
        setlocale( LC_CTYPE, $prev_ctype_locale );
        return $ret;
}

#12 in reply to: ↑ 11 @ocean90
8 years ago

  • Keywords needs-patch added; has-patch has-unit-tests commit removed

Replying to gitlost:

To get around it one could use the Joomla function, or maybe just wrap parse_url() with before and after setlocales, eg:

Sounds like something which we could add to wp_parse_url().

#13 @johnbillion
8 years ago

[40518] removes our left over behaviour where canonical tests which are connected to open Trac tickets are skipped, so the tests introduced in [40256] will start failing on OS X now.

This ticket was mentioned in Slack in #core by ocean90. View the logs.


8 years ago

#15 @azaozz
8 years ago

[40256] seems to introduce some tests failures in some environments. Lets revert it for beta-1 and add it later if we can figure out the exact reasons.

#16 @azaozz
8 years ago

In 40661:

Revert [40256] for now as some tests are failing in some environments.

See #20383.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


8 years ago

#18 @SergeyBiryukov
8 years ago

  • Milestone changed from 4.8 to 4.8.1

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 years ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


7 years ago

#21 @jbpaul17
7 years ago

  • Milestone changed from 4.8.1 to 4.9

Punting per today's bug scrub.

This ticket was mentioned in Slack in #core by westonruter. View the logs.


7 years ago

#23 @westonruter
7 years ago

  • Milestone changed from 4.9 to Future Release

#24 @SergeyBiryukov
7 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 41991:

Canonical: Strip trailing punctuation from permalinks.

Previously attempted in [40256], which caused the test for decoded curly quotes to fail in some environments.

$_SERVER['REQUEST_URI'] contains the encoded URI, so this version removes the failing tests and only checks for encoded curly quotes.

Props joostdevalk, lancewillett, SergeyBiryukov.
Fixes #20383.

#25 @SergeyBiryukov
7 years ago

  • Milestone changed from Future Release to 4.9

#26 @adnan.limdi
7 years ago

I have face issue after i update 4.9.1 my url not work here

site url/property-details/MDM2NXwwMDAwMDAwMDAxMXw1OA..

Now My URL remove automatic site url/property-details/MDM2NXwwMDAwMDAwMDAxMXw1OA(double dot value remove automatic)

anyone have solutions for resolve this.

i find this in caonical.php so please provide here hook to change this

	$punctuation_pattern = implode( '|', array_map( 'preg_quote', array(
		' ', '%20',  // space
		'!', '%21',  // exclamation mark
		'"', '%22',  // double quote
		"'", '%27',  // single quote
		'(', '%28',  // opening bracket
		')', '%29',  // closing bracket
		',', '%2C',  // comma
		';', '%3B',  // semicolon
		'{', '%7B',  // opening curly bracket
		'}', '%7D',  // closing curly bracket
		'%E2%80%9C', // opening curly quote
		'%E2%80%9D', // closing curly quote
	) ) );
Last edited 4 years ago by SergeyBiryukov (previous) (diff)

#27 @SergeyBiryukov
4 years ago

  • Keywords needs-patch removed

Follow-up: #52865

Note: See TracTickets for help on using tickets.