WordPress.org

Make WordPress Core

Opened 12 months ago

Closed 4 months ago

Last modified 2 months ago

#21013 closed defect (bug) (fixed)

Badly truncated slugs in posts with similar, long non-latin titles

Reported by: nevma Owned by: SergeyBiryukov
Priority: normal Milestone: 3.6
Component: Permalinks Version: 3.4
Severity: normal Keywords: has-patch 3.6-early commit
Cc: rfair404, meloniq@…, nashwan.doaqan@…

Description

When WP automatically generates a slug for a post with non-latin title AND a number has to be appended to the slug to avoid duplicate slugs, then the resulting slug may be badly truncated.

To reproduce the problem:

Create and publish a post with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι"
Create and publish another one with this title (without the quotes): "Αρνάκι άσπρο και παχύ της μάνας του καμάρι, και άλλα τραγούδια"

In both cases, let WP create the slug.

You will see that the second slug is broken and the URL results in a 404 error.

It seems that, in this scenario, WP does not take into account the fact that the new slug may contain url-encoded characters from multibyte character sets, when truncating it.

Attachments (12)

21013.patch (2.3 KB) - added by SergeyBiryukov 12 months ago.
21013.2.patch (4.3 KB) - added by SergeyBiryukov 12 months ago.
21013.3.patch (4.6 KB) - added by SergeyBiryukov 12 months ago.
21013.4.patch (4.6 KB) - added by SergeyBiryukov 12 months ago.
Replaced urlencode() with utf8_uri_encode()
21013.5.patch (2.6 KB) - added by SergeyBiryukov 12 months ago.
21013.6.patch (3.0 KB) - added by SergeyBiryukov 12 months ago.
21013.7.patch (3.0 KB) - added by SergeyBiryukov 12 months ago.
With trimming last hyphen
21013.8.patch (2.6 KB) - added by SergeyBiryukov 12 months ago.
21013.9.patch (2.6 KB) - added by SergeyBiryukov 11 months ago.
21013.10.patch (2.6 KB) - added by SergeyBiryukov 10 months ago.
21013.11.patch (2.7 KB) - added by SergeyBiryukov 4 months ago.
Refreshed
21013.fix-existing-slugs.php (1.9 KB) - added by SergeyBiryukov 2 months ago.

Download all attachments as: .zip

Change History (33)

comment:1 SergeyBiryukov12 months ago

Related: #10483 (this specific issue is mentioned in ticket:10483:18).

comment:2 SergeyBiryukov12 months ago

The resulting post name:

αρνάκι-άσπρο-και-παχύ-της-μάνας-του-κα�%b-2

SergeyBiryukov12 months ago

comment:3 SergeyBiryukov12 months ago

  • Keywords has-patch added
  • Milestone changed from Awaiting Review to 3.5

We should be able to use mb_substr() to cut the string properly, since we have it in compat.php.

comment:4 nacin12 months ago

  • Keywords needs-unit-tests added

Looks good at a glance. Unit tests would be helpful here.

SergeyBiryukov12 months ago

SergeyBiryukov12 months ago

comment:5 SergeyBiryukov12 months ago

21013.patch didn't work for two reasons:

  1. When we run the query to check if the slug already exists, the slug should stay urlencoded.
  2. The decoded string is shorter than 200 characters, so the slug wasn't truncated. We would need a compat version of mb_strlen() as well.

21013.2.patch fixes that. The compat mb_strlen() implementation is based on mb_substr() code:
http://core.trac.wordpress.org/browser/tags/3.4/wp-includes/compat.php#L16

21013.3.patch adds rtrim( $alt_post_name, '-' ), in case the string we get ends with a hyphen:

предлагаем-металлообрабатывающее--2
Last edited 12 months ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov12 months ago

Replaced urlencode() with utf8_uri_encode()

SergeyBiryukov12 months ago

SergeyBiryukov12 months ago

comment:7 follow-up: SergeyBiryukov12 months ago

As an alternative approach, we could attempt to truncate slugs without re-encoding.

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches, so the unit tests from [UT740] should be updated if 21013.5.patch gets approved.

21013.6.patch also handles 3-octet sequences.

Last edited 12 months ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov12 months ago

With trimming last hyphen

SergeyBiryukov12 months ago

comment:8 in reply to: ↑ 7 SergeyBiryukov12 months ago

  • Keywords needs-unit-tests removed

Due to more accurate string handling, this allows 1 character longer slugs (still within the limit) than the previous patches

21013.8.patch brings the same accuracy to the mb_substr() approach, so both approaches should be functionally the same now. Turned out mb_strlen() is not needed.

Updated unit tests: [UT741]

comment:9 SergeyBiryukov12 months ago

  • Keywords early added

SergeyBiryukov11 months ago

comment:10 SergeyBiryukov11 months ago

21013.9.patch adds a urldecode( $slug ) === $slug check to see if we could get away with substr() (as per nacin's suggestion).

comment:11 rfair40411 months ago

  • Cc rfair404 added

The latest patch seems to work for me and does seem to solve the problem.

In my case, publishing the first post with the title works as expected. url generated was pretty much as expected, something like :
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1%CE%BC/

Publishing the second post results in an error 400 "Bad Request" and a url that looks like this:
http://local.dev/blog/2012/08/06/%ce%b1%cf%81%ce%bd%ce%ac%ce%ba%ce%b9-%ce%ac%cf%83%cf%80%cf%81%ce%bf-%ce%ba%ce%b1%ce%b9-%cf%80%ce%b1%cf%87%cf%8d-%cf%84%ce%b7%cf%82-%ce%bc%ce%ac%ce%bd%ce%b1%cf%82-%cf%84%ce%bf%cf%85-%ce%ba%ce%b1%ce%b-2/

After applying the patch, I get a url like this:
http://local.dev/blog/2012/08/06/%CE%B1%CF%81%CE%BD%CE%AC%CE%BA%CE%B9-%CE%AC%CF%83%CF%80%CF%81%CE%BF-%CE%BA%CE%B1%CE%B9-%CF%80%CE%B1%CF%87%CF%8D-%CF%84%CE%B7%CF%82-%CE%BC%CE%AC%CE%BD%CE%B1%CF%82-%CF%84%CE%BF%CF%85-%CE%BA%CE%B1-2/

I even tried to add several additional posts and pages with a similar name, and it appears to work across post types including pages.

SergeyBiryukov10 months ago

comment:12 SergeyBiryukov10 months ago

21013.10.patch simplifies the function by eliminating the cycle and passing $length to utf8_uri_encode() instead. Still passes the unit test.

comment:13 SergeyBiryukov8 months ago

#22263 was marked as a duplicate.

comment:14 meloniq8 months ago

  • Cc meloniq@… added

comment:16 nacin7 months ago

  • Keywords punt added

This looks good but I still feel a bit weary of committing this so late in a cycle. Early 3.6?

comment:17 SergeyBiryukov7 months ago

  • Keywords 3.6-early added; early punt removed
  • Milestone changed from 3.5 to Future Release

comment:18 SergeyBiryukov6 months ago

  • Milestone changed from Future Release to 3.6

SergeyBiryukov4 months ago

Refreshed

comment:19 SergeyBiryukov4 months ago

  • Owner set to SergeyBiryukov
  • Resolution set to fixed
  • Status changed from new to closed

In 23420:

Properly truncate UTF-8 post slugs in wp_unique_post_slug(). fixes #21013.

comment:20 follow-up: alex-ye3 months ago

  • Cc nashwan.doaqan@… added

Great that this bug solved in 3.6

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?! , and sorry because I comment in a closed ticket but I am not sure that I should open a new one .

comment:21 in reply to: ↑ 20 SergeyBiryukov2 months ago

Replying to alex-ye:

but I have many posts with this bug (more than 1000), Should I change the slug for each one ?!

Try this script to fix the existing slugs: 21013.fix-existing-slugs.php. Place it in the root directory of your site. Remember to back up your database before running the script.

Note: See TracTickets for help on using tickets.