Make WordPress Core

Opened 12 years ago

Closed 12 years ago

Last modified 12 years ago

#21013 closed defect (bug) (fixed)

Badly truncated slugs in posts with similar, long non-latin titles

Reported by: nevma's profile nevma Owned by: sergeybiryukov's profile SergeyBiryukov
Milestone: 3.6 Priority: normal
Severity: normal Version: 3.4
Component: Permalinks Keywords: has-patch 3.6-early commit
Focuses: Cc:

Description

When WP automatically generates a slug for a post with non-latin title AND a number has to be appended to the slug to avoid duplicate slugs, then the resulting slug may be badly truncated.

To reproduce the problem:

Create and publish a post with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι"
Create and publish another one with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι, και άλλα τραγούδια"

In both cases, let WP create the slug.

You will see that the second slug is broken and the URL results in a 404 error.

It seems that, in this scenario, WP does not take into account the fact that the new slug may contain url-encoded characters from multibyte character sets, when truncating it.

Attachments (12)

21013.patch (2.3 KB) - added by SergeyBiryukov 12 years ago.
21013.2.patch (4.3 KB) - added by SergeyBiryukov 12 years ago.
21013.3.patch (4.6 KB) - added by SergeyBiryukov 12 years ago.
21013.4.patch (4.6 KB) - added by SergeyBiryukov 12 years ago.
Replaced urlencode() with utf8_uri_encode()
21013.5.patch (2.6 KB) - added by SergeyBiryukov 12 years ago.
21013.6.patch (3.0 KB) - added by SergeyBiryukov 12 years ago.
21013.7.patch (3.0 KB) - added by SergeyBiryukov 12 years ago.
With trimming last hyphen
21013.8.patch (2.6 KB) - added by SergeyBiryukov 12 years ago.
21013.9.patch (2.6 KB) - added by SergeyBiryukov 12 years ago.
21013.10.patch (2.6 KB) - added by SergeyBiryukov 12 years ago.
21013.11.patch (2.7 KB) - added by SergeyBiryukov 12 years ago.
Refreshed
21013.fix-existing-slugs.php (1.9 KB) - added by SergeyBiryukov 12 years ago.

Download all attachments as: .zip

Change History (33)

#1 @SergeyBiryukov
12 years ago

Related: #10483 (this specific issue is mentioned in ticket:10483:18).

#2 @SergeyBiryukov
12 years ago

The resulting post name:

αρνάκι-άσπρο-και-παχύ-της-μάνας-του-κα�%b-2

#3 @SergeyBiryukov
12 years ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.5

We should be able to use mb_substr() to cut the string properly, since we have it in compat.php.

#4 @nacin
12 years ago

  • Keywords needs-unit-tests added

Looks good at a glance. Unit tests would be helpful here.

#5 @SergeyBiryukov
12 years ago

21013.patch didn't work for two reasons:

  1. When we run the query to check if the slug already exists, the slug should stay urlencoded.
  2. The decoded string is shorter than 200 characters, so the slug wasn't truncated. We would need a compat version of mb_strlen() as well.

21013.2.patch fixes that. The compat mb_strlen() implementation is based on mb_substr() code:
http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/compat.php#L16

21013.3.patch adds rtrim( $alt_post_name, '-' ), in case the string we get ends with a hyphen:

предлагаем-металлообрабатывающее--2
Last edited 12 years ago by SergeyBiryukov (previous) (diff)

@SergeyBiryukov
12 years ago

Replaced urlencode() with utf8_uri_encode()

#7 follow-up: @SergeyBiryukov
12 years ago

As an alternative approach, we could attempt to truncate slugs without re-encoding.

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches, so the unit tests from [UT740] should be updated if 21013.5.patch gets approved.

21013.6.patch also handles 3-octet sequences.

Last edited 12 years ago by SergeyBiryukov (previous) (diff)

@SergeyBiryukov
12 years ago

With trimming last hyphen

#8 in reply to: ↑ 7 @SergeyBiryukov
12 years ago

  • Keywords needs-unit-tests removed

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches

21013.8.patch brings the same accuracy to the mb_substr() approach, so both approaches should be functionally the same now. Turned out mb_strlen() is not needed.

Updated unit tests: [UT741]

#9 @SergeyBiryukov
12 years ago

  • Keywords early added

#10 @SergeyBiryukov
12 years ago

21013.9.patch adds a urldecode( $slug ) === $slug check to see if we could get away with substr() (as per nacin's suggestion).

#11 @rfair404
12 years ago

  • Cc rfair404 added

The latest patch seems to work for me and does seem to solve the problem.

In my case, publishing the first post with the title works as expected. url generated was pretty much as expected, something like :
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1%CE%BC/

Publishing the second post results in an error 400 "Bad Request" and a url that looks like this:
http://local.dev/blog/2012/08/06/%ce%b1%cf%81%ce%bd%ce%ac%ce%ba%ce%b9-%ce%ac%cf%83%cf%80%cf%81%ce%bf-%ce%ba%ce%b1%ce%b9-%cf%80%ce%b1%cf%87%cf%8d-%cf%84%ce%b7%cf%82-%ce%bc%ce%ac%ce%bd%ce%b1%cf%82-%cf%84%ce%bf%cf%85-%ce%ba%ce%b1%ce%b-2/

After applying the patch, I get a url like this:
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1-2/

I even tried to add several additional posts and pages with a similar name, and it appears to work across post types including pages.

#12 @SergeyBiryukov
12 years ago

21013.10.patch simplifies the function by eliminating the cycle and passing $length to utf8_uri_encode() instead. Still passes the unit test.

#13 @SergeyBiryukov
12 years ago

#22263 was marked as a duplicate.

#14 @meloniq
12 years ago

  • Cc meloniq@… added

#16 @nacin
12 years ago

  • Keywords punt added

This looks good but I still feel a bit weary of committing this so late in a cycle. Early 3.6?

#17 @SergeyBiryukov
12 years ago

  • Keywords 3.6-early added; early punt removed
  • Milestone changed from 3.5 to Future Release

#18 @SergeyBiryukov
12 years ago

  • Milestone changed from Future Release to 3.6

@SergeyBiryukov
12 years ago

Refreshed

#19 @SergeyBiryukov
12 years ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 23420:

Properly truncate UTF-8 post slugs in wp_unique_post_slug(). fixes #21013.

#20 follow-up: @alex-ye
12 years ago

  • Cc nashwan.doaqan@… added

Great that this bug solved in 3.6

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?! , and sorry because I comment in a closed ticket but I am not sure that I should open a new one .

#21 in reply to: ↑ 20 @SergeyBiryukov
12 years ago

Replying to alex-ye:

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?!

Try this script to fix the existing slugs: 21013.fix-existing-slugs.php. Place it in the root directory of your site. Remember to back up your database before running the script.

Note: See TracTickets for help on using tickets.