WordPress.org

Make WordPress Core

Opened 2 years ago

Closed 17 months ago

Last modified 15 months ago

#21013 closed defect (bug) (fixed)

Badly truncated slugs in posts with similar, long non-latin titles

Reported by: nevma Owned by: SergeyBiryukov
Milestone: 3.6 Priority: normal
Severity: normal Version: 3.4
Component: Permalinks Keywords: has-patch 3.6-early commit
Focuses: Cc:

Description

When WP automatically generates a slug for a post with non-latin title AND a number has to be appended to the slug to avoid duplicate slugs, then the resulting slug may be badly truncated.

To reproduce the problem:

Create and publish a post with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι"
Create and publish another one with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι, και άλλα τραγούδια"

In both cases, let WP create the slug.

You will see that the second slug is broken and the URL results in a 404 error.

It seems that, in this scenario, WP does not take into account the fact that the new slug may contain url-encoded characters from multibyte character sets, when truncating it.

Attachments (12)

21013.patch (2.3 KB) - added by SergeyBiryukov 2 years ago.
21013.2.patch (4.3 KB) - added by SergeyBiryukov 2 years ago.
21013.3.patch (4.6 KB) - added by SergeyBiryukov 2 years ago.
21013.4.patch (4.6 KB) - added by SergeyBiryukov 2 years ago.
Replaced urlencode() with utf8_uri_encode()
21013.5.patch (2.6 KB) - added by SergeyBiryukov 2 years ago.
21013.6.patch (3.0 KB) - added by SergeyBiryukov 2 years ago.
21013.7.patch (3.0 KB) - added by SergeyBiryukov 2 years ago.
With trimming last hyphen
21013.8.patch (2.6 KB) - added by SergeyBiryukov 2 years ago.
21013.9.patch (2.6 KB) - added by SergeyBiryukov 2 years ago.
21013.10.patch (2.6 KB) - added by SergeyBiryukov 23 months ago.
21013.11.patch (2.7 KB) - added by SergeyBiryukov 17 months ago.
Refreshed
21013.fix-existing-slugs.php (1.9 KB) - added by SergeyBiryukov 15 months ago.

Download all attachments as: .zip

Change History (33)

comment:1 SergeyBiryukov2 years ago

Related: #10483 (this specific issue is mentioned in ticket:10483:18).

comment:2 SergeyBiryukov2 years ago

The resulting post name:

αρνάκι-άσπρο-και-παχύ-της-μάνας-του-κα�%b-2

SergeyBiryukov2 years ago

comment:3 SergeyBiryukov2 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.5

We should be able to use mb_substr() to cut the string properly, since we have it in compat.php.

comment:4 nacin2 years ago

  • Keywords needs-unit-tests added

Looks good at a glance. Unit tests would be helpful here.

SergeyBiryukov2 years ago

SergeyBiryukov2 years ago

comment:5 SergeyBiryukov2 years ago

21013.patch didn't work for two reasons:

  1. When we run the query to check if the slug already exists, the slug should stay urlencoded.
  2. The decoded string is shorter than 200 characters, so the slug wasn't truncated. We would need a compat version of mb_strlen() as well.

21013.2.patch fixes that. The compat mb_strlen() implementation is based on mb_substr() code:
http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/compat.php#L16

21013.3.patch adds rtrim( $alt_post_name, '-' ), in case the string we get ends with a hyphen:

предлагаем-металлообрабатывающее--2
Last edited 2 years ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov2 years ago

Replaced urlencode() with utf8_uri_encode()

SergeyBiryukov2 years ago

SergeyBiryukov2 years ago

comment:7 follow-up: SergeyBiryukov2 years ago

As an alternative approach, we could attempt to truncate slugs without re-encoding.

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches, so the unit tests from [UT740] should be updated if 21013.5.patch gets approved.

21013.6.patch also handles 3-octet sequences.

Last edited 2 years ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov2 years ago

With trimming last hyphen

SergeyBiryukov2 years ago

comment:8 in reply to: ↑ 7 SergeyBiryukov2 years ago

  • Keywords needs-unit-tests removed

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches

21013.8.patch brings the same accuracy to the mb_substr() approach, so both approaches should be functionally the same now. Turned out mb_strlen() is not needed.

Updated unit tests: [UT741]

comment:9 SergeyBiryukov2 years ago

  • Keywords early added

SergeyBiryukov2 years ago

comment:10 SergeyBiryukov2 years ago

21013.9.patch adds a urldecode( $slug ) === $slug check to see if we could get away with substr() (as per nacin's suggestion).

comment:11 rfair4042 years ago

  • Cc rfair404 added

The latest patch seems to work for me and does seem to solve the problem.

In my case, publishing the first post with the title works as expected. url generated was pretty much as expected, something like :
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1%CE%BC/

Publishing the second post results in an error 400 "Bad Request" and a url that looks like this:
http://local.dev/blog/2012/08/06/%ce%b1%cf%81%ce%bd%ce%ac%ce%ba%ce%b9-%ce%ac%cf%83%cf%80%cf%81%ce%bf-%ce%ba%ce%b1%ce%b9-%cf%80%ce%b1%cf%87%cf%8d-%cf%84%ce%b7%cf%82-%ce%bc%ce%ac%ce%bd%ce%b1%cf%82-%cf%84%ce%bf%cf%85-%ce%ba%ce%b1%ce%b-2/

After applying the patch, I get a url like this:
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1-2/

I even tried to add several additional posts and pages with a similar name, and it appears to work across post types including pages.

SergeyBiryukov23 months ago

comment:12 SergeyBiryukov23 months ago

21013.10.patch simplifies the function by eliminating the cycle and passing $length to utf8_uri_encode() instead. Still passes the unit test.

comment:13 SergeyBiryukov21 months ago

#22263 was marked as a duplicate.

comment:14 meloniq20 months ago

  • Cc meloniq@… added

comment:16 nacin20 months ago

  • Keywords punt added

This looks good but I still feel a bit weary of committing this so late in a cycle. Early 3.6?

comment:17 SergeyBiryukov20 months ago

  • Keywords 3.6-early added; early punt removed
  • Milestone changed from 3.5 to Future Release

comment:18 SergeyBiryukov19 months ago

  • Milestone changed from Future Release to 3.6

SergeyBiryukov17 months ago

Refreshed

comment:19 SergeyBiryukov17 months ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 23420:

Properly truncate UTF-8 post slugs in wp_unique_post_slug(). fixes #21013.

comment:20 follow-up: alex-ye15 months ago

  • Cc nashwan.doaqan@… added

Great that this bug solved in 3.6

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?! , and sorry because I comment in a closed ticket but I am not sure that I should open a new one .

comment:21 in reply to: ↑ 20 SergeyBiryukov15 months ago

Replying to alex-ye:

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?!

Try this script to fix the existing slugs: 21013.fix-existing-slugs.php. Place it in the root directory of your site. Remember to back up your database before running the script.

Note: See TracTickets for help on using tickets.