WordPress.org

Make WordPress Core

Opened 7 years ago

Last modified 2 weeks ago

#10483 reviewing enhancement

Change post_name's length from 200 to 400

Reported by: elnur Owned by: SergeyBiryukov
Milestone: 4.7 Priority: normal
Severity: minor Version:
Component: Permalinks Keywords: dev-feedback needs-patch early
Focuses: Cc:

Description

Hello, guys! Thank you very much for providing such a great piece of software! I love WordPress very much! :)

I use WordPress in Russian language and the URLs on my blog consist of Russian characters. There is a post with not such a long URL in Russian, but since it gets encoded to special characters it becomes too long to get fit into post_name field of post table.

I've found what code needs to be changed to increase the length. I make these changes every time a new version is released. I think it would be better to submit a patch here so that others people can benefit from it and I will not need to make those changes every release.

I'm attaching the patch to this ticket and asking you to apply it to the code.

Thank you very much again, guys! You do a great job! :)

Cheers,
Elnur

Attachments (1)

post_name_patch.diff (1.1 KB) - added by elnur 7 years ago.

Download all attachments as: .zip

Change History (34)

#1 @Denis-de-Bernardy
7 years ago

if my memory serves me well, the protocol actually assumes a uri is never longer than 255 chars.

#2 follow-up: @elnur
7 years ago

This link http://www.ielnur.com/blog/2009/05/%d1%81%d0%bd%d0%be%d0%b2%d0%b0-%d0%b1%d1%80%d0%be%d1%81%d0%b8%d1%82%d1%8c-%d0%ba%d1%83%d1%80%d0%b8%d1%82%d1%8c-30-%d1%82%d0%b8%d0%b4%d0%bd%d0%b5%d0%b2%d0%bd%d0%be%d0%b5-%d0%b8%d1%81%d0%bf%d1%8b%d1%82%d0%b0%d0%bd%d0%b8%d0%b5/ is 258 chars long, post_name part is 223. It works fine with our local search engine http://www.yandex.ru and with http://www.google.ru as well.

But I might be wrong in trying to change post_name's length to 400. I remembered that varchar can't hold more than 255 chars. Isn't that true?

#3 in reply to: ↑ 2 @Denis-de-Bernardy
7 years ago

Replying to elnur:

But I might be wrong in trying to change post_name's length to 400. I remembered that varchar can't hold more than 255 chars. Isn't that true?

yeah, until Mysql 5

#4 follow-up: @elnur
7 years ago

So, wouldn't someone apply this patch to the code? :)

#5 in reply to: ↑ 4 @Denis-de-Bernardy
7 years ago

  • Component changed from General to Permalinks
  • Milestone changed from 2.8.3 to Future Release
  • Owner set to ryan
  • Priority changed from normal to low
  • Severity changed from normal to minor

Replying to elnur:

So, wouldn't someone apply this patch to the code? :)

We, no... Not until MySQL 5 is the default, anyway. And then we'd need to worry about url length...

Punting to Future in the meanwhile.

#6 @elnur
7 years ago

But what if we limit it with 255 chars?

#7 @Denis-de-Bernardy
7 years ago

sure, but then there's the domain too, and the second point raised above.

#8 @solarissmoke
5 years ago

  • Keywords close added

close as maybelater?

#9 @RyanMurphy
5 years ago

  • Keywords dev-feedback added

Since we're going to MySQL5 in 3.2, can't this be considered for commit?

#10 @nacin
5 years ago

  • Milestone Future Release deleted
  • Resolution set to maybelater
  • Status changed from new to closed

We're not adjusting anything for MySQL 5 at this time. Closing as maybelater.

#11 @nacin
5 years ago

  • Keywords close removed
  • Milestone set to Future Release
  • Resolution maybelater deleted
  • Status changed from closed to reopened

Reopening for discussion.

#12 follow-up: @linuxologos
5 years ago

  • Cc linuxologos@… added

Thanks for giving the opportunity to discuss this.

This is quite a big problem for languages with an alphabet totally different from English. We can't have a post_name with more than ~38 letters, since every letter is urlencoded to be stored in the database, so every letter is converted into many more characters, dramatically cutting down the maximum possible length of the "real" post-name.

#13 in reply to: ↑ 12 ; follow-up: @hakre
5 years ago

Replying to linuxologos:

Thanks for giving the opportunity to discuss this.

This is quite a big problem for languages with an alphabet totally different from English. We can't have a post_name with more than ~38 letters, since every letter is urlencoded to be stored in the database, so every letter is converted into many more characters, dramatically cutting down the maximum possible length of the "real" post-name.

It's probably worth to drop the urlencoding then inside the storage layer. AFAIK MySQL should be able to store UTF8 in colums, so to have 200 true UTF8 characters instead of 38 to 200 urlencoded, subset of us-ascii ones.

The related refactorings could benefit the overall UTF8 support of the application as a bonus.

#14 in reply to: ↑ 13 ; follow-up: @linuxologos
5 years ago

Replying to hakre:

Replying to linuxologos:

Thanks for giving the opportunity to discuss this.

This is quite a big problem for languages with an alphabet totally different from English. We can't have a post_name with more than ~38 letters, since every letter is urlencoded to be stored in the database, so every letter is converted into many more characters, dramatically cutting down the maximum possible length of the "real" post-name.

It's probably worth to drop the urlencoding then inside the storage layer. AFAIK MySQL should be able to store UTF8 in colums, so to have 200 true UTF8 characters instead of 38 to 200 urlencoded, subset of us-ascii ones.

The related refactorings could benefit the overall UTF8 support of the application as a bonus.

MySQL is able to store UTF8 indeed and that is already the fact for post_content and post_title in (wp_)posts table. They don't get urlencoded before stored in the db. post_name is urlencoded though, and I'm not sure if it's technically safe to alter this.

#15 in reply to: ↑ 14 @hakre
5 years ago

Replying to linuxologos:

Replying to hakre:

Replying to linuxologos:

[...]

MySQL is able to store UTF8 indeed and that is already the fact for post_content and post_title in (wp_)posts table. They don't get urlencoded before stored in the db. post_name is urlencoded though, and I'm not sure if it's technically safe to alter this.

I have not said that this is a trivial change and in fact, I can not even say if the project would be able to perform such changes and a refactoring properly at all.


Replying to Denis-de-Bernardy:

if my memory serves me well, the protocol actually assumes a URI is never longer than 255 chars.

Indeed, RFC 2616 suggests to avoid URIs longer than 255 chars:

The HTTP protocol does not place any a priori limit on the length of
a URI. Servers MUST be able to handle the URI of any resource they
serve, and SHOULD be able to handle URIs of unbounded length if they
provide GET-based forms that could generate such URIs. A server
SHOULD return 414 (Request-URI Too Long) status if a URI is longer
than the server can handle (see section 10.4.15).

Note: Servers ought to be cautious about depending on URI lengths
above 255 bytes, because some older client or proxy
implementations might not properly support these lengths.

from: 3.2.1 General Syntax

Next to that 255 char limit by caution, there is a physical one for the browsers. Microsoft Internet Explorer is introducing the lowest limit which is a little bit up to 2000 characters according to WWW FAQs: What is the maximum length of a URL?. Generally these lengths relate to one-char = one-byte in us-ASCII encoding of an (urlencoded) URL, a subset of URI.

I think the 414 response is something WP don't do so far, which is classified as SHOULD. I have no idea about the overall parameters this is related to, I think those are undocumented so far which need to reveal those from the code-base first before coping with that problem which is out of the scope of this ticket as well.

#16 follow-up: @ldebrouwer
5 years ago

  • Cc info@… added

I once did a clean URL conversion from the cyrillic alphabet to the 'regular' alphabet in a different CMS. 'Дистрибьюторы' would become 'distributori' as a URL slug. This would solve the problem because I believe the latter is still valid Russian. To me this seems a good solution because this can also be applied to other character sets like ancient Greek. Please let me know if this is desired because then I will look into a WordPress patch for it.

#17 @SergeyBiryukov
5 years ago

  • Keywords needs-patch added

Related: #16230

#18 in reply to: ↑ 16 ; follow-up: @SergeyBiryukov
5 years ago

Replying to ldebrouwer:

I once did a clean URL conversion from the cyrillic alphabet to the 'regular' alphabet in a different CMS. 'Дистрибьюторы' would become 'distributori' as a URL slug. This would solve the problem because I believe the latter is still valid Russian.

It's not Cyrillic, so it's not valid Russian. Transliteration is acceptable for some people (including me), and there are some plugins which transliterate post and term slugs, but it's only a workaround and not a long-term solution for everyone.

The best solution here would be to store slugs as is, not in urlencoded form, since it only allows 33 chars for non-English slugs, which is noticeably less than the original 200 characters limit.

The problem is not only the length, though. Create two posts on 3.3-trunk with the same title:

Предлагаем супер металлообрабатывающее оборудование

The first one will just have a truncated slug:

предлагаем-супер-металлообрабатываю

But the second one will have a broken slug:

предлагаем-супер-металлообрабатыва�%-2

#20 @ryan
5 years ago

http://dev.mysql.com/doc/refman/5.0/en/string-type-overview.html

"In MySQL 5.0, the range of M is 0 to 255 before MySQL 5.0.3, and 0 to 65,535 in MySQL 5.0.3 and later."

$required_mysql_version = '5.0';

#21 follow-up: @nacin
5 years ago

As discussed during the 3.2 cycle, we had aimed to choose 5.0.22. Version 5.0.15 was the first production version, and 5.0.22 was the first with any real usage. We didn't push it because we did not identify any version-specific things we would have wanted.

42.5% of all installs are on 5.0. However, our stats show that there are 59 total installs on 5.0.0, 5.0.1, and 5.0.2. And that number is probably inflated, based on how our stats collection works. I think it would be safe to bump the required version up a bit.

#22 in reply to: ↑ 18 @SergeyBiryukov
4 years ago

Replying to SergeyBiryukov:

But the second one will have a broken slug:

предлагаем-супер-металлообрабатыва�%-2

Related: #21013

This ticket was mentioned in IRC in #wordpress-dev by johnbillion. View the logs.


2 years ago

This ticket was mentioned in IRC in #wordpress-dev by simonwheatley. View the logs.


2 years ago

#25 @ryan
2 years ago

  • Owner ryan deleted
  • Status changed from reopened to assigned

#26 @archon810
16 months ago

I'm a bit confused. WP 4.2 changes the posts table and specifically the index on post_name to be 191 long, (which takes a long time on large tables, which is why I noticed). Reading over the reasoning, I was expecting the post_name field to be 400 long, but I still see it as 200 on all upgraded 4.2 instances.

Should I be expecting the field name to change to 400 on existing tables or is this only for new ones created from 4.2 on?

#27 in reply to: ↑ 21 ; follow-up: @netweb
3 weeks ago

Replying to nacin:

42.5% of all installs are on 5.0. However, our stats show that there are 59 total installs on 5.0.0, 5.0.1, and 5.0.2. And that number is probably inflated, based on how our stats collection works. I think it would be safe to bump the required version up a bit.

MySQL 5.0.x today is at 4.2% https://wordpress.org/about/stats/

Replying to Denis-de-Bernardy:

Replying to elnur:

So, wouldn't someone apply this patch to the code? :)

We, no... Not until MySQL 5 is the default, anyway. And then we'd need to worry about url length...

Punting to Future in the meanwhile.

Is now the future?

Related: #meta1884

#28 @SergeyBiryukov
3 weeks ago

  • Keywords early added
  • Milestone changed from Future Release to 4.7
  • Owner set to SergeyBiryukov
  • Priority changed from low to normal
  • Status changed from assigned to reviewing

#29 in reply to: ↑ 27 @dd32
3 weeks ago

Replying to netweb:

Replying to nacin:

42.5% of all installs are on 5.0. However, our stats show that there are 59 total installs on 5.0.0, 5.0.1, and 5.0.2. And that number is probably inflated, based on how our stats collection works. I think it would be safe to bump the required version up a bit.

MySQL 5.0.x today is at 4.2% https://wordpress.org/about/stats/
..
Is now the future?

Although MySQL 5.0.x is only 4.2%, the install count is still in the millions.

MySQL 5.0.3 was the first to support > 255char for VARCHAR.
Stats currently show there are 3 sites on MySQL 5.0.2 (running WordPress 2.8 & 3.0.1), and 36 sites on < MySQL 5.0.15 (Ranges from WordPress 2.8 ~ 4.5.3). To clarify, We have no sites which are recorded as currently running MySQL 5.0.0/5.0.1/5.0.3 in the WordPress.org stats.

If we wanted to bump the requirements in order to get >255 char for VARCHAR, then it's currently safe to bump it to 5.0.4 without any real user impact.

#30 @SergeyBiryukov
3 weeks ago

The suggested limit of 400 seems still not enough, as it only allows up to 66 encoded characters. Would like to explore our options here.

#31 @dd32
3 weeks ago

It's also worth remembering that we're limited by the 767-byte index limits, so post_name can only ever be indexed by the first 767bytes / 4bytes in utf8mb4 = 191 characters, so for encoded character urls, only by the first 31 characters if they're stored encoded in the database.

#32 @netweb
3 weeks ago

Quoting SergeyBiryukov in #meta1884]:

For a language where most words are longer than in English, this is often insufficient. English topic slugs, on the other hand, allow up to 80 characters (which is the maxlength attribute on the input).

500 (rounded up from 484) would allow for 80 characters, kind of bringing it line with the above English max of 80.

#33 @SergeyBiryukov
2 weeks ago

80 was a reference to the topic title input on support forums, it looks like the title input on Edit Post screen doesn't have a maxlength attribute at all. 80 is obviously better than 33 though :)

Note: See TracTickets for help on using tickets.