WordPress.org

Make WordPress Core

Opened 5 years ago

Last modified 2 years ago

#10249 accepted defect (bug)

Page slug in cyrillic = Error 404 - Not Found!

Reported by: kalifi Owned by: westi
Milestone: Future Release Priority: normal
Severity: major Version: 2.7
Component: Permalinks Keywords: has-patch dev-feedback needs-unit-tests 3.5-early
Focuses: Cc:

Description

When I create a page with page slug for example "киро"
then when I try to open domain/киро - Error 404 - Not Found

The permalinks are %postname%

Post slug with this slug is working just fine, the same BUG exists in 2.7, 2.7.1 and 2.8

Attachments (6)

10249.diff (677 bytes) - added by westi 4 years ago.
A fix of sorts... needs alot of testing
page-slugs-urldecode-other-way-round.diff (1.0 KB) - added by nbachiyski 3 years ago.
10249.1.patch (1.8 KB) - added by SergeyBiryukov 3 years ago.
Escape $in_string
10249.2.patch (2.0 KB) - added by SergeyBiryukov 3 years ago.
Escape $post_type, avoid $wpdb->prepare()
10249.3.patch (2.0 KB) - added by SergeyBiryukov 3 years ago.
10249.4.patch (1.2 KB) - added by SergeyBiryukov 3 years ago.

Download all attachments as: .zip

Change History (70)

comment:1 ryan5 years ago

  • Milestone changed from 2.8.1 to 2.9

Not a 2.8 regression so moving to 2.9.

comment:2 miqrogroove4 years ago

  • Keywords needs-patch added

Example "киро" works when using Month and Name permalinks. Setting the Custom Structure value to "/%postname%/" does cause the test page to throw status 404.

comment:3 ryan4 years ago

  • Milestone changed from 2.9 to 3.0

comment:4 dd324 years ago

  • Keywords needs-patch removed
  • Milestone 3.0 deleted
  • Resolution set to worksforme
  • Status changed from new to closed

This is currently working for me in trunk (3.0).

Tested with both /%postname%/ and /%year%/%postname%/

Re-open if this is still around, I believe this may have been fixed by a similar ticket near the start of the dev cycle.

comment:5 Ragnaripea4 years ago

  • Resolution worksforme deleted
  • Status changed from closed to reopened

I have no idea how did you manage to get this issue working with just /%postname%/ but I just installed 3.0b2 and confirmed it's not working. It is working with /%year%/%postname% but with the /%year%, it's also working in 2.9.2.

So I'll reopen it.

comment:6 Ragnaripea4 years ago

Ok, we managed to find a "solution" for it. I'm not sure if this is the proper way to do it and it might mess up some other stuff but for what we needed it, it's working.

In query.php (/wp-includes), add the piece of code specified on next line after

function parse_query($query) {

It's (in 3.0 beta 2) on line 1246.

The code that did the job:

if ( !$query['name'] && !$query['pagename'] && $query['category_name']) {
           $query['pagename']=$query['category_name'];
         }

I'm no programmer and I really have no idea what's the magic behind it but it's working. Working as in you can use cyrillic (most probably all other non-latin alphabets, too) slugs perfectly both on posts and pages with the /%postname% /%category%/%postname% permalinks (we're using the last one). No /%year% is needed to make it work.

Smart guys out here probably know what's going on when looking at this code, so they can either improve it or if it's known to cause some issues, tell us here that it's not working and shouldn't be used.

comment:7 nacin4 years ago

  • Milestone set to 3.0

comment:8 westi4 years ago

  • Owner changed from ryan to westi
  • Status changed from reopened to accepted

I can reproduce this..

Investigating.

comment:9 westi4 years ago

Ok this caused by the url encoded slug being in the explicit page rules but us matching against a rawurldecoded string.

comment:10 nacin4 years ago

  • Severity changed from normal to blocker

comment:11 westi4 years ago

It seems this was introduced for #3727 to fix a similar issue.

It seems rather bogus here as we need the string to be urlencoded to match against our rewrite rules

comment:12 westi4 years ago

  • Cc ryan added
  • Keywords has-patch 2nd-opinion added
  • Milestone changed from 3.0 to 3.1
  • Priority changed from high to normal
  • Severity changed from blocker to normal
  • Version set to 2.7

Not convinced this is a blocker as it works the same in 2.9

attaching a patch which fixes this for me but needs alot of review first.

Would prefer to leave till 3.1 as this is not a regression.

westi4 years ago

A fix of sorts... needs alot of testing

comment:13 Ragnaripea4 years ago

Ok, confirmed working. Non-latin slugs now work perfectly fine (tested with Russian and Greek). We removed the piece of code I mentioned earlier, added Westi's patch and it all works. So thanks, Westi! Nice to see that issue was noticed again. :)

comment:14 follow-up: miqrogroove4 years ago

Keep in mind Apache usually treats these URLs as identical:

/киро
/%D0%BA%D0%B8%D1%80%D0%BE

Most UAs will transmit the latter, but the former is also commonly used and completely valid when transmitted as a UTF-8 byte stream. As such, I doubt that removing a call to urldecode will fix this problem correctly. It is more likely that the stored permalink needs to be decoded and then matched against the decoded request value.

comment:15 Ragnaripea4 years ago

Hmm, now I have to correct myself that it's still not properly working. Sometimes it's working, sometimes it's not working. Totally unpredictable of how or why or when it's working or not.

comment:16 in reply to: ↑ 14 westi4 years ago

Replying to miqrogroove:

Keep in mind Apache usually treats these URLs as identical:

/киро
/%D0%BA%D0%B8%D1%80%D0%BE

Most UAs will transmit the latter, but the former is also commonly used and completely valid when transmitted as a UTF-8 byte stream. As such, I doubt that removing a call to urldecode will fix this problem correctly. It is more likely that the stored permalink needs to be decoded and then matched against the decoded request value.

Erm. Wouldn't that break all kinds of RFCs - URIs all have to use US-ASCII only chars and even then a limited set of such chars AFAIK

All UserAgents I've ever checked will always ensure that a URI used in an HTTP request has been urlencoded - they will usually now show you the pretty version of the url as that is much more user friendly.

comment:17 follow-up: miqrogroove4 years ago

I'm not aware of any such requirement. In the wild, UTF-8 requests are common if not frequent and are properly supported by servers. If WordPress is only intended to support the UAs you've checked, then I guess it's not a concern?

comment:18 in reply to: ↑ 17 westi4 years ago

Replying to miqrogroove:

I'm not aware of any such requirement. In the wild, UTF-8 requests are common if not frequent and are properly supported by servers. If WordPress is only intended to support the UAs you've checked, then I guess it's not a concern?

Can you give an example of a UA which sends UTF8 on the wire in requests?

comment:19 miqrogroove4 years ago

Yep, Baiduspider.

comment:20 SergeyBiryukov4 years ago

  • Cc sergeybiryukov.ru@… added

comment:21 SergeyBiryukov4 years ago

I can confirm the patch is working in my case (WordPress 3.0, Firefox 3.6.3, page slug is “детали”). Ragnaripea, is it not working with certains slugs or certain browsers?

comment:22 zx04 years ago

Yes, patch is working (WP 3.0).

comment:23 miqrogroove4 years ago

More thoughts on the UTF-8 headers issue: I haven't had a chance to test PHP's REQUEST_URI value yet. My idea to urldecode the requested and stored paths was based on the behavior of mod_rewrite, where the curly apostrophe is always represented as "\xe2\x80\xa6" instead of the urlencoded version. What might be a simpler solution if PHP is providing mixed encodings is to urlencode only the bytes >= 128 before comparing it to the stored version. That would convert any UTF-8 byte stream into the WordPress format of url-encoded UTF-8.

comment:24 hakre4 years ago

Oh that sounds like an old friend. Interestingly things get traction when Matt needs to do stuff for his customers. Well, good that finally then.

Related: #11175
Related: #11619
Related: #11669

But in the end I suspect this is related to the romanization / latinization component which creates the slug. Could benefit from an overhaul but everytime this was on topic it was denied to do so because of backwards compat.

comment:25 ivoapostolov4 years ago

OK, to add something on the issue. The patch provided westi solves the issue partly. Yes, it allows the usage of UTF-8 slugs for pages, but... it works only if you open the page from the current site. So if someone links to that page from other web site, it will return 404.
So in order to solve the issue, there should be some additional enhancement in the routing of the URLs.

comment:26 ivoapostolov4 years ago

More details on the issue. With the patch provided there are two options that I tested:
/%postname%/ - then it works fine if the link to the page is WITHOUT the trailing slash
/%postname% - then it works fine if the link to the page is WITH the trailing slash

However in both cases, if the page slug is "киро" in case one it will not open the page "example.com/киро/" if referred from external site.
In case two it will not open "example.com/киро" if referred from external site.

comment:27 westi3 years ago

(In [15960]) Attempt at a fix for url encoding issues with UTF8 chars. See #10249.

comment:28 nbachiyski3 years ago

westi's patch looks good. It shouldn't break anything, because later we are matching the regular expression against both the decoded and the raw URI values.

The only difference should be that if we decode it, we might lose the encoded version.

If we want to be extra sure, we could try matching against the encoded version, too, accounting for web servers, which put UTF-8 in the request URI and we end up with not-encoded version, only.

@ivoapostolov, I couldn't reproduce the slash problems—it works for me with and without slashes. If there isn't a slash, WordPress does a 301 redirect to the slashed URL. Moreover, there isn't any difference in routing based on referer. Do you still have this problem with a vanilla install and all plugins disabled?

comment:29 follow-up: unsalkorkmaz3 years ago

Still we cant use utf-8 chars in permalinks.
http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love
"şŞ İı Ğğ Üü Öö çÇ" chars important for me and those are utf-8 actually.

comment:30 in reply to: ↑ 29 hakre3 years ago

Replying to westi:

(In [15960]) Attempt at a fix for url encoding issues with UTF8 chars. See #10249.

IIRC while testing for #13413 I had exactly problems with removing the rawurldecode. I do not exactly remember what this breaks but I think it had something to do with backwards compatibility because the wp-rewrite-rule is expected to have this encoded.


Replying to unsalkorkmaz:

Still we cant use utf-8 chars in permalinks.
http://wordpress.org/extend/ideas/topic/non-latin-characters-need-love
"şŞ İı Ğğ Üü Öö çÇ" chars important for me and those are utf-8 actually.

Have you tried on trunk? If not, please do so.

Probably you need install/deactivate Normalize URLs (Wordpress Plugin) to make this go.

comment:31 follow-up: hakre3 years ago

Isn't this ticket a duplicate of #9591 (or better say to be closed as duplicate as being in the family)?

comment:32 in reply to: ↑ 31 nbachiyski3 years ago

Replying to hakre:

Isn't this ticket a duplicate of #9591 (or better say to be closed as duplicate as being in the family)?

Not really. The two problems are triggered by different parts of the code.

comment:33 nbachiyski3 years ago

If there is a chance that leaving the URI urlencoded, why don't we do it the other way round.

We leave the rawurldecode() call and when we match, we try to match by URI, urlencoded URI and urldecoded URI.

The patch is attached, it solves the problem in the ticket for me, and I can't imagine how it can break anything.

comment:34 ivoapostolov3 years ago

Latest patch solves the bug.

comment:35 unsalkorkmaz3 years ago

Well, I think i cant explain myself enough good (sorry for my english)

An example.. "примерно" you can use these russian chars in permalinks but "şŞ İı Ğğ Üü Öö çÇ" becoming "ss-ii-gg-uu-oo-cc" and this is a problem actually. You can try this yourself too. those chars are UTF8.

comment:36 unsalkorkmaz3 years ago

and after some tries on formatting.php i see removing those lines works in 823:
$title = remove_accents($title);

"şŞ İı Ğğ Üü Öö çÇ" becoming "şş-iı-ğğ-üü-öö-çç"

comment:37 unsalkorkmaz3 years ago

Opened a suggestion ticket about it:
http://core.trac.wordpress.org/ticket/15248

Version 0, edited 3 years ago by unsalkorkmaz (next)

comment:38 nacin3 years ago

  • Milestone changed from Awaiting Triage to Future Release

comment:39 scribu3 years ago

Related: #12897

comment:40 dragunoff3 years ago

  • Severity changed from normal to major
  • Version changed from 2.7 to 3.3

I thought that this bug was gone for good but it reappeared after a nightly build upgrade about a week ago. Pages with cyrillic slug throw a 404 error, while posts and taxonomies (tags, cats) work fine.

comment:41 scribu3 years ago

  • Version changed from 3.3 to 2.7

Thanks for the report, but please don't change the Version attribute. It should actually be called "First seen in:".

SergeyBiryukov3 years ago

Escape $in_string

SergeyBiryukov3 years ago

Escape $post_type, avoid $wpdb->prepare()

comment:42 SergeyBiryukov3 years ago

  • Keywords dev-feedback added; 2nd-opinion removed
  • Milestone changed from Future Release to 3.3

[18541] indeed broke get_page_by_path() for non-English slugs.

This is what Debug Bar shows:

WARNING: wp-includes/wp-db.php:890 - vsprintf() [function.vsprintf]: Too few arguments

SQL query is:

SELECT ID, post_name, post_parent FROM trunk_posts WHERE post_name IN ('%d0%ba%d0%b8%d1%80%d0%be') AND (post_type = '%s' OR post_type = 'attachment') AND trunk_posts.post_type IN ('post', 'page', 'attachment')

vsprintf() in $wpdb->prepare() thinks those encoded chars are placeholders.

Before [18541], only a single page slug was queried here, and it was escaped properly.

10249.1.patch fixes the issue by escaping % chars before passing $in_string to $wpdb->prepare().

10249.2.patch escapes $post_type, which makes $wpdb->prepare() unnecessary, since queried post slugs are also escaped earlier.

Both patches also contain the refreshed page-slugs-urldecode-other-way-round.diff.

comment:43 ocean903 years ago

Related: #16687

comment:44 follow-up: SergeyBiryukov3 years ago

Marked #17450 as a duplicate.

[15960] affected all PATH_INFO permalinks (not only pages) with non-ASCII chars. The latest patches fix that too.

comment:45 follow-up: ryan3 years ago

$in_string probably should be substituted in. A variable inside prepare() is usually a case of doing it wrong.

comment:46 ryan3 years ago

Of course, we need a way to pass in a string without it being automatically quoted.

comment:47 ryan3 years ago

Perhaps %r for raw. That is not already used by PHP.

comment:48 in reply to: ↑ 45 SergeyBiryukov3 years ago

Replying to ryan:

$in_string probably should be substituted in.

I thought of that, but $wpdb->prepare() adds slashes to inner quotes:

WHERE post_name IN ('%d0%ba%d0%b8%d1%80%d0%be\',\'%d0%ba%d0%b8%d0%ba%d0%b8%d1%80%d0%be')

And if we omit them in implode(), they are not added back:

WHERE post_name IN ('%d0%ba%d0%b8%d1%80%d0%be,%d0%ba%d0%b8%d0%ba%d0%b8%d1%80%d0%be')

Am I missing something?

Trying to modify prepare() now to use %r, as per your second comment.

comment:49 SergeyBiryukov3 years ago

Do we still need prepare() if page slugs and post type are escaped by esc_sql()?

comment:50 scribu3 years ago

Nope. esc_sql() is just a shorthand for $wpdb->prepare().

comment:51 ryan3 years ago

%r would be no outside quotes added as is done with %s and no escape_by_ref(). Writing a mini format parser to extract %r could become a fairly big chore.

Actually, esc_sql() is a weak escape. It should not be used here. We probably need to make a public version of wpdb::_escape(). Maybe esc().

comment:52 follow-up: ryan3 years ago

I prefer to have all queries go through prepare(), just in case it does something more someday, but let's just do a simple fix for now. How about:

$post_type_sql = $post_type;
$wpdb->escape_by_ref( $post_type_sql );
Last edited 3 years ago by ryan (previous) (diff)

SergeyBiryukov3 years ago

comment:53 in reply to: ↑ 52 SergeyBiryukov3 years ago

Done in 10249.3.patch.

We could possibly escape $post_type itself without introducing a new variable, but later it is passed to get_post(), though it's not actually used a post type there, but rather as a filter.

comment:54 ryan3 years ago

In [18652]:

Don't send page slugs through prepare() to avoid breaking octets in i18n page slugs. Props SergeyBiryukov. see #10249 #166687

comment:55 toscho3 years ago

  • Cc info@… added

comment:56 in reply to: ↑ 44 SergeyBiryukov3 years ago

As for including page-slugs-urldecode-other-way-round.diff:

[15960] affected all PATH_INFO permalinks (not only pages) with non-ASCII chars.

rawurldecode() of $req_uri was introduced in [4893]/[4979] as a fix for #3727. So Nikolay's patch (refreshed in 10249.3.patch) is a more proper fix than [15960].

Last edited 3 years ago by SergeyBiryukov (previous) (diff)

SergeyBiryukov3 years ago

comment:57 SergeyBiryukov3 years ago

Refreshed the remaining part. This would fix #17450 as well.

comment:58 SergeyBiryukov3 years ago

  • Keywords needs-unit-tests added

comment:59 westi2 years ago

  • Keywords 3.4-early added
  • Milestone changed from 3.3 to Future Release

While I would love to finally close this ticket I will only be happy to do so when we have good unit test coverage.

As this stage in the 3.3 cycle I don't think we should be tweaking this code any more.

comment:60 dragunoff2 years ago

  • Cc dragunoff@… added

comment:61 SergeyBiryukov2 years ago

  • Milestone changed from Future Release to 3.4

Will prepare unit tests in a week or two, if no one beats me to it.

comment:62 ryan2 years ago

  • Milestone changed from 3.4 to Future Release

comment:63 nacin2 years ago

  • Keywords 3.5-early added; 3.4-early removed

comment:64 nacin2 years ago

Similar bug, same solution: #17450.

Note: See TracTickets for help on using tickets.