WordPress.org

Make WordPress Core

Opened 5 years ago

Last modified 4 weeks ago

#20383 reopened defect (bug)

Strip trailing punctuation with canonical URLs

Reported by: nacin Owned by: SergeyBiryukov
Milestone: 4.9 Priority: normal
Severity: normal Version:
Component: Canonical Keywords: needs-patch
Focuses: Cc:

Description

A follow-up to #7537 where we removed %20 and " " from the end of URLs, we should try to remove all sorts of punctuation from the end of a URL, both URL-encoded and decoded.

Example bad URLs we should surely be able to resolve:

http://ma.tt/2012/03/productivity-per-square-inch/%7B

http://ma.tt/2012/03/productivity-per-square-inch/)

http://ma.tt/2012/03/productivity-per-square-inch/,

http://ma.tt/2012/03/productivity-per-square-inch/!

More difficult when there is no trailing slash in the permalink structure (or the link requested, regardless of permalink structure), but if is_404() is taken into account, it should be doable to trail various pieces of punctuation and see if we can get things to resolve.

Attachments (3)

20383.patch (659 bytes) - added by joostdevalk 4 years ago.
20383.2.patch (1.4 KB) - added by lancewillett 9 months ago.
Updated patch to catch more punctuation; adds unit tests
20383.3.patch (7.8 KB) - added by SergeyBiryukov 5 months ago.

Download all attachments as: .zip

Change History (24)

#1 @n3k4
5 years ago

  • Cc hellokane@… added

@joostdevalk
4 years ago

#2 @joostdevalk
4 years ago

  • Keywords has-patch reporter-feedback added
  • Owner set to joostdevalk
  • Status changed from new to accepted

I added the suggested ones and a couple more into the same rewrite in which we fixed the trailing space. WordPress doesn't allow for these characters in the URL so I think it's safe to assume that if they're in there, at the end of the URL, it's a mistake.

Please do tell me if you disagree, if so I could move this to redirect_guess_404_permalink without issue of course.

#3 @chriscct7
2 years ago

  • Keywords dev-feedback added; reporter-feedback removed

@lancewillett
9 months ago

Updated patch to catch more punctuation; adds unit tests

#4 @lancewillett
9 months ago

  • Keywords has-unit-tests added

.2 patch adds a bit more cases for trailing punctuation, and adds unit tests.

#5 @joostdevalk
9 months ago

That looks like what I meant. @SergeyBiryukov could you have a look?

#6 @lancewillett
6 months ago

@SergeyBiryukov Can you review and test the patch, please?

#7 @SergeyBiryukov
6 months ago

  • Milestone changed from Future Release to 4.8
  • Owner changed from joostdevalk to SergeyBiryukov

#8 @SergeyBiryukov
5 months ago

  • Keywords commit added; dev-feedback removed

I think the regex in 20383.2.patch might accidentally catch international characters in some environments, so I'd like to keep an explicit list of punctuation symbols to remove.

20383.3.patch adds more symbols to the list and more tests.

#9 @SergeyBiryukov
5 months ago

  • Resolution set to fixed
  • Status changed from accepted to closed

In 40256:

Canonical: Strip trailing punctuation from permalinks.

Props joostdevalk, lancewillett, SergeyBiryukov.
Fixes #20383.

#10 @ocean90
5 months ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

I can't get the curly quote tests from [40256] passing:

1) Tests_Canonical_NoRewrite::test with data set #25 ('/?p=358“', array('/?p=358', array('358')), 20383)
Ticket #20383
Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'/?p=358'
+'/?p=358�__'

I'm using OS X and already tried several PHP versions: pre-installed PHP 5.6.28 or via homebrew 5.6.29, 7.0.15, 7.1.2 and 7.2.0-dev.

It seems like parse_url() is (sometimes?) not multibyte aware

Using PHP's built-in server I get the following output for http://localhost:8000/test.php?p=358“

$_GET
array(1) {
  ["p"]=>
  string(6) "358”"
}

$_SERVER['REQUEST_URI']
string(24) "/test.php?p=358%E2%80%9D"

parse_url( $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] )
array(4) {
  ["host"]=>
  string(9) "localhost"
  ["port"]=>
  int(8000)
  ["path"]=>
  string(9) "/test.php"
  ["query"]=>
  string(14) "p=358%E2%80%9D"
}

// What the tests are doing:
parse_url( "http://example.org/?p=358”" )
array(4) {
  ["scheme"]=>
  string(4) "http"
  ["host"]=>
  string(11) "example.org"
  ["path"]=>
  string(1) "/"
  ["query"]=>
  string(8) "p=358�__"
}

// Example form https://github.com/fguillot/picoFeed/issues/167
parse_url( "https://ru.wikipedia.org/wiki/Преступление_и_наказание" )
array(3) {
  ["scheme"]=>
  string(5) "https"
  ["host"]=>
  string(16) "ru.wikipedia.org"
  ["path"]=>
  string(52) "/wiki/�_�_е�_�_�_пление_и_наказание"
}

$_SERVER['REQUEST_URI'] contains the encoded URI which is also used by redirect_canonical(). The encoding is performed client-side, like in your browser or with wget. curl uses the raw URL which produced an "Invalid request (Malformed HTTP request)" error for me.
I'm wondering if we really have to test the decoded strings here.

#11 follow-up: @gitlost
5 months ago

It seems like parse_url() is (sometimes?) not multibyte aware

Another one of those incredibly tedious locale gotchas, this time only affecting non-Linux systems it seems.

If you run this on OSX or Windows 10:

<?php
error_log( "default LC_CTYPE=" . setlocale( LC_CTYPE, '0' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

$before = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "before=$before" );
error_log( "before === " . ( '/wiki/Преступление_и_наказание' === $before ? 'true' : 'false' ) );

error_log( "setting LC_CTYPE=" . setlocale( LC_CTYPE, 'C' ) );
error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) );

$after = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH );
error_log( "after=$after" );
error_log( "after === " . ( '/wiki/Преступление_и_наказание' === $after ? 'true' : 'false' ) );

then if your default locale is "UTF-8ish" the $before will be corrupted while the $after, with LC_CTYPE set to 'C', won't.

It's due to the libc function iscntrl() returning true for certain non-ASCII characters like \x90 on these systems depending on locale, which makes PHP replace them with underscores (ext/standard/url.c:75).

To get around it one could use the Joomla function, or maybe just wrap parse_url() with before and after setlocales, eg:

<?php
function parse_url_utf8( $url, $component = -1 ) {
        $prev_ctype_locale = setlocale( LC_CTYPE, '0' );
        setlocale( LC_CTYPE, 'C' );
        $ret = parse_url( $url, $component );
        setlocale( LC_CTYPE, $prev_ctype_locale );
        return $ret;
}

#12 in reply to: ↑ 11 @ocean90
5 months ago

  • Keywords needs-patch added; has-patch has-unit-tests commit removed

Replying to gitlost:

To get around it one could use the Joomla function, or maybe just wrap parse_url() with before and after setlocales, eg:

Sounds like something which we could add to wp_parse_url().

#13 @johnbillion
4 months ago

[40518] removes our left over behaviour where canonical tests which are connected to open Trac tickets are skipped, so the tests introduced in [40256] will start failing on OS X now.

This ticket was mentioned in Slack in #core by ocean90. View the logs.


3 months ago

#15 @azaozz
3 months ago

[40256] seems to introduce some tests failures in some environments. Lets revert it for beta-1 and add it later if we can figure out the exact reasons.

#16 @azaozz
3 months ago

In 40661:

Revert [40256] for now as some tests are failing in some environments.

See #20383.

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


3 months ago

#18 @SergeyBiryukov
3 months ago

  • Milestone changed from 4.8 to 4.8.1

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


5 weeks ago

This ticket was mentioned in Slack in #core by jeffpaul. View the logs.


4 weeks ago

#21 @jbpaul17
4 weeks ago

  • Milestone changed from 4.8.1 to 4.9

Punting per today's bug scrub.

Note: See TracTickets for help on using tickets.