WordPress.org

Make WordPress Core

Opened 22 months ago

Closed 14 months ago

Last modified 12 months ago

#21013 closed defect (bug) (fixed)

Badly truncated slugs in posts with similar, long non-latin titles

Reported by: nevma Owned by: SergeyBiryukov
Milestone: 3.6 Priority: normal
Severity: normal Version: 3.4
Component: Permalinks Keywords: has-patch 3.6-early commit
Focuses: Cc:

Description

When WP automatically generates a slug for a post with non-latin title AND a number has to be appended to the slug to avoid duplicate slugs, then the resulting slug may be badly truncated.

To reproduce the problem:

Create and publish a post with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι"
Create and publish another one with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι, και άλλα τραγούδια"

In both cases, let WP create the slug.

You will see that the second slug is broken and the URL results in a 404 error.

It seems that, in this scenario, WP does not take into account the fact that the new slug may contain url-encoded characters from multibyte character sets, when truncating it.

Attachments (12)

21013.patch (2.3 KB) - added by SergeyBiryukov 22 months ago.
21013.2.patch (4.3 KB) - added by SergeyBiryukov 22 months ago.
21013.3.patch (4.6 KB) - added by SergeyBiryukov 22 months ago.
21013.4.patch (4.6 KB) - added by SergeyBiryukov 22 months ago.
Replaced urlencode() with utf8_uri_encode()
21013.5.patch (2.6 KB) - added by SergeyBiryukov 22 months ago.
21013.6.patch (3.0 KB) - added by SergeyBiryukov 22 months ago.
21013.7.patch (3.0 KB) - added by SergeyBiryukov 22 months ago.
With trimming last hyphen
21013.8.patch (2.6 KB) - added by SergeyBiryukov 22 months ago.
21013.9.patch (2.6 KB) - added by SergeyBiryukov 21 months ago.
21013.10.patch (2.6 KB) - added by SergeyBiryukov 20 months ago.
21013.11.patch (2.7 KB) - added by SergeyBiryukov 14 months ago.
Refreshed
21013.fix-existing-slugs.php (1.9 KB) - added by SergeyBiryukov 12 months ago.

Download all attachments as: .zip

Change History (33)

comment:1 SergeyBiryukov22 months ago

Related: #10483 (this specific issue is mentioned in ticket:10483:18).

comment:2 SergeyBiryukov22 months ago

The resulting post name:

αρνάκι-άσπρο-και-παχύ-της-μάνας-του-κα�%b-2

SergeyBiryukov22 months ago

comment:3 SergeyBiryukov22 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.5

We should be able to use mb_substr() to cut the string properly, since we have it in compat.php.

comment:4 nacin22 months ago

  • Keywords needs-unit-tests added

Looks good at a glance. Unit tests would be helpful here.

SergeyBiryukov22 months ago

SergeyBiryukov22 months ago

comment:5 SergeyBiryukov22 months ago

21013.patch didn't work for two reasons:

  1. When we run the query to check if the slug already exists, the slug should stay urlencoded.
  2. The decoded string is shorter than 200 characters, so the slug wasn't truncated. We would need a compat version of mb_strlen() as well.

21013.2.patch fixes that. The compat mb_strlen() implementation is based on mb_substr() code:
http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/compat.php#L16

21013.3.patch adds rtrim( $alt_post_name, '-' ), in case the string we get ends with a hyphen:

предлагаем-металлообрабатывающее--2
Last edited 22 months ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov22 months ago

Replaced urlencode() with utf8_uri_encode()

SergeyBiryukov22 months ago

SergeyBiryukov22 months ago

comment:7 follow-up: SergeyBiryukov22 months ago

As an alternative approach, we could attempt to truncate slugs without re-encoding.

This allows 1 character longer slugs than the previous patches due to more accurate string handling, so the unit tests from [UT740] should be updated if 21013.5.patch gets approved.

21013.6.patch also handles 3-octet sequences.

Version 0, edited 22 months ago by SergeyBiryukov (next)

SergeyBiryukov22 months ago

With trimming last hyphen

SergeyBiryukov22 months ago

comment:8 in reply to: ↑ 7 SergeyBiryukov22 months ago

  • Keywords needs-unit-tests removed

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches

21013.8.patch brings the same accuracy to the mb_substr() approach, so both approaches should be functionally the same now. Turned out mb_strlen() is not needed.

Updated unit tests: [UT741]

comment:9 SergeyBiryukov22 months ago

  • Keywords early added

SergeyBiryukov21 months ago

comment:10 SergeyBiryukov21 months ago

21013.9.patch adds a urldecode( $slug ) === $slug check to see if we could get away with substr() (as per nacin's suggestion).

comment:11 rfair40421 months ago

  • Cc rfair404 added

The latest patch seems to work for me and does seem to solve the problem.

In my case, publishing the first post with the title works as expected. url generated was pretty much as expected, something like :
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1%CE%BC/

Publishing the second post results in an error 400 "Bad Request" and a url that looks like this:
http://local.dev/blog/2012/08/06/%ce%b1%cf%81%ce%bd%ce%ac%ce%ba%ce%b9-%ce%ac%cf%83%cf%80%cf%81%ce%bf-%ce%ba%ce%b1%ce%b9-%cf%80%ce%b1%cf%87%cf%8d-%cf%84%ce%b7%cf%82-%ce%bc%ce%ac%ce%bd%ce%b1%cf%82-%cf%84%ce%bf%cf%85-%ce%ba%ce%b1%ce%b-2/

After applying the patch, I get a url like this:
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1-2/

I even tried to add several additional posts and pages with a similar name, and it appears to work across post types including pages.

SergeyBiryukov20 months ago

comment:12 SergeyBiryukov20 months ago

21013.10.patch simplifies the function by eliminating the cycle and passing $length to utf8_uri_encode() instead. Still passes the unit test.

comment:13 SergeyBiryukov18 months ago

#22263 was marked as a duplicate.

comment:14 meloniq18 months ago

  • Cc meloniq@… added

comment:16 nacin18 months ago

  • Keywords punt added

This looks good but I still feel a bit weary of committing this so late in a cycle. Early 3.6?

comment:17 SergeyBiryukov18 months ago

  • Keywords 3.6-early added; early punt removed
  • Milestone changed from 3.5 to Future Release

comment:18 SergeyBiryukov16 months ago

  • Milestone changed from Future Release to 3.6

SergeyBiryukov14 months ago

Refreshed

comment:19 SergeyBiryukov14 months ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 23420:

Properly truncate UTF-8 post slugs in wp_unique_post_slug(). fixes #21013.

comment:20 follow-up: alex-ye13 months ago

  • Cc nashwan.doaqan@… added

Great that this bug solved in 3.6

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?! , and sorry because I comment in a closed ticket but I am not sure that I should open a new one .

comment:21 in reply to: ↑ 20 SergeyBiryukov12 months ago

Replying to alex-ye:

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?!

Try this script to fix the existing slugs: 21013.fix-existing-slugs.php. Place it in the root directory of your site. Remember to back up your database before running the script.

Note: See TracTickets for help on using tickets.