#20383 closed defect (bug) (fixed)
Strip trailing punctuation with canonical URLs
Reported by: | nacin | Owned by: | SergeyBiryukov |
---|---|---|---|
Milestone: | 4.9 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Canonical | Keywords: | |
Focuses: | Cc: |
Description
A follow-up to #7537 where we removed %20 and " " from the end of URLs, we should try to remove all sorts of punctuation from the end of a URL, both URL-encoded and decoded.
Example bad URLs we should surely be able to resolve:
http://ma.tt/2012/03/productivity-per-square-inch/%7B
http://ma.tt/2012/03/productivity-per-square-inch/)
http://ma.tt/2012/03/productivity-per-square-inch/,
http://ma.tt/2012/03/productivity-per-square-inch/!
More difficult when there is no trailing slash in the permalink structure (or the link requested, regardless of permalink structure), but if is_404() is taken into account, it should be doable to trail various pieces of punctuation and see if we can get things to resolve.
Attachments (3)
Change History (30)
#2
@
11 years ago
- Keywords has-patch reporter-feedback added
- Owner set to joostdevalk
- Status changed from new to accepted
#4
@
8 years ago
- Keywords has-unit-tests added
.2 patch adds a bit more cases for trailing punctuation, and adds unit tests.
#7
@
8 years ago
- Milestone changed from Future Release to 4.8
- Owner changed from joostdevalk to SergeyBiryukov
#8
@
8 years ago
- Keywords commit added; dev-feedback removed
I think the regex in 20383.2.patch might accidentally catch international characters in some environments, so I'd like to keep an explicit list of punctuation symbols to remove.
20383.3.patch adds more symbols to the list and more tests.
#10
@
8 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
I can't get the curly quote tests from [40256] passing:
1) Tests_Canonical_NoRewrite::test with data set #25 ('/?p=358“', array('/?p=358', array('358')), 20383) Ticket #20383 Failed asserting that two strings are equal. --- Expected +++ Actual @@ @@ -'/?p=358' +'/?p=358�__'
I'm using OS X and already tried several PHP versions: pre-installed PHP 5.6.28 or via homebrew 5.6.29, 7.0.15, 7.1.2 and 7.2.0-dev.
It seems like parse_url()
is (sometimes?) not multibyte aware
- https://bugs.php.net/bug.php?id=52923
- http://bluebones.net/2013/04/parse_url-is-not-utf-8-safe/
- https://github.com/fguillot/picoFeed/issues/167
Using PHP's built-in server I get the following output for http://localhost:8000/test.php?p=358“
$_GET array(1) { ["p"]=> string(6) "358”" } $_SERVER['REQUEST_URI'] string(24) "/test.php?p=358%E2%80%9D" parse_url( $_SERVER['HTTP_HOST'] . $_SERVER['REQUEST_URI'] ) array(4) { ["host"]=> string(9) "localhost" ["port"]=> int(8000) ["path"]=> string(9) "/test.php" ["query"]=> string(14) "p=358%E2%80%9D" } // What the tests are doing: parse_url( "http://example.org/?p=358”" ) array(4) { ["scheme"]=> string(4) "http" ["host"]=> string(11) "example.org" ["path"]=> string(1) "/" ["query"]=> string(8) "p=358�__" } // Example form https://github.com/fguillot/picoFeed/issues/167 parse_url( "https://ru.wikipedia.org/wiki/Преступление_и_наказание" ) array(3) { ["scheme"]=> string(5) "https" ["host"]=> string(16) "ru.wikipedia.org" ["path"]=> string(52) "/wiki/�_�_е�_�_�_пление_и_наказание" }
$_SERVER['REQUEST_URI']
contains the encoded URI which is also used by redirect_canonical()
. The encoding is performed client-side, like in your browser or with wget
. curl
uses the raw URL which produced an "Invalid request (Malformed HTTP request)" error for me.
I'm wondering if we really have to test the decoded strings here.
#11
follow-up:
↓ 12
@
8 years ago
It seems like parse_url() is (sometimes?) not multibyte aware
Another one of those incredibly tedious locale gotchas, this time only affecting non-Linux systems it seems.
If you run this on OSX or Windows 10:
<?php error_log( "default LC_CTYPE=" . setlocale( LC_CTYPE, '0' ) ); error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) ); $before = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH ); error_log( "before=$before" ); error_log( "before === " . ( '/wiki/Преступление_и_наказание' === $before ? 'true' : 'false' ) ); error_log( "setting LC_CTYPE=" . setlocale( LC_CTYPE, 'C' ) ); error_log( "ctype_cntrl=" . ( ctype_cntrl( "\x90" ) ? 'true' : 'false' ) ); $after = parse_url( 'https://ru.wikipedia.org/wiki/Преступление_и_наказание', PHP_URL_PATH ); error_log( "after=$after" ); error_log( "after === " . ( '/wiki/Преступление_и_наказание' === $after ? 'true' : 'false' ) );
then if your default locale is "UTF-8ish" the $before
will be corrupted while the $after
, with LC_CTYPE
set to 'C'
, won't.
It's due to the libc function iscntrl()
returning true for certain non-ASCII characters like \x90
on these systems depending on locale, which makes PHP replace them with underscores (ext/standard/url.c:75).
To get around it one could use the Joomla function, or maybe just wrap parse_url()
with before and after setlocales, eg:
<?php function parse_url_utf8( $url, $component = -1 ) { $prev_ctype_locale = setlocale( LC_CTYPE, '0' ); setlocale( LC_CTYPE, 'C' ); $ret = parse_url( $url, $component ); setlocale( LC_CTYPE, $prev_ctype_locale ); return $ret; }
#12
in reply to:
↑ 11
@
8 years ago
- Keywords needs-patch added; has-patch has-unit-tests commit removed
Replying to gitlost:
To get around it one could use the Joomla function, or maybe just wrap
parse_url()
with before and after setlocales, eg:
Sounds like something which we could add to wp_parse_url()
.
This ticket was mentioned in Slack in #core by ocean90. View the logs.
8 years ago
#15
@
8 years ago
[40256] seems to introduce some tests failures in some environments. Lets revert it for beta-1 and add it later if we can figure out the exact reasons.
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
8 years ago
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
7 years ago
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
7 years ago
This ticket was mentioned in Slack in #core by westonruter. View the logs.
7 years ago
#26
@
7 years ago
I have face issue after i update 4.9.1 my url not work here
site url/property-details/MDM2NXwwMDAwMDAwMDAxMXw1OA..
Now My URL remove automatic site url/property-details/MDM2NXwwMDAwMDAwMDAxMXw1OA(double dot value remove automatic)
anyone have solutions for resolve this.
i find this in caonical.php so please provide here hook to change this
$punctuation_pattern = implode( '|', array_map( 'preg_quote', array( ' ', '%20', // space '!', '%21', // exclamation mark '"', '%22', // double quote "'", '%27', // single quote '(', '%28', // opening bracket ')', '%29', // closing bracket ',', '%2C', // comma ';', '%3B', // semicolon '{', '%7B', // opening curly bracket '}', '%7D', // closing curly bracket '%E2%80%9C', // opening curly quote '%E2%80%9D', // closing curly quote ) ) );
I added the suggested ones and a couple more into the same rewrite in which we fixed the trailing space. WordPress doesn't allow for these characters in the URL so I think it's safe to assume that if they're in there, at the end of the URL, it's a mistake.
Please do tell me if you disagree, if so I could move this to
redirect_guess_404_permalink
without issue of course.