WordPress.org

Make WordPress Core

Opened 6 years ago

Last modified 12 months ago

#10690 reopened defect (bug)

WordPress does not support non-ascii characters in the domain name

Reported by: paddya Owned by: markjaquith
Milestone: Awaiting Review Priority: normal
Severity: normal Version: 2.8.4
Component: Canonical Keywords: needs-unit-tests has-patch
Focuses: Cc:

Description

WordPress' clean_url() strips out most characters, which are non-ascii for security reasons. This is causing problems, if you want to run a WordPress blog on a domain containing non-ascii-characters (e.g. müller.com), because WordPress generates wrong URLs on redirects.

Attachments (1)

29170.2.diff (106.6 KB) - added by extendwings 12 months ago.
Patch from #29170

Download all attachments as: .zip

Change History (47)

comment:1 @paddya6 years ago

wp_sanitize_redirect() is affected by this issue, too.

comment:2 @scribu6 years ago

  • Component changed from General to Canonical
  • Keywords needs-patch added; URLs removed
  • Owner set to markjaquith

comment:3 @scribu6 years ago

  • Keywords changed from needs-patch, sanitizing to needs-patch sanitizing

comment:4 @azaozz6 years ago

  • Milestone changed from 2.9 to Future Release

No patch.

comment:5 @matt6 years ago

Part of the reason for being conservative in those functions when they were written is that we had security issues in the past with characters allowing for header splitting when included with redirects, and probably an assumption that all i18n stuff would be encoded and compatible with our current methods. Any fix for this should look at all the security implications.

comment:6 @nbachiyski6 years ago

  • Status changed from new to reviewing

The problem is not in clean_url(), which works perfectly fine with unicode URLs (see TestCleanUrl::test_not_ascii() in http://svn.automattic.com/wordpress-tests/wp-testcase/test_includes_formatting.php).

The problem is in wp_sanitize_redirect(), which uses a lot more restrictive filter than clean_url().

@markjaquith is there a reason wp_sanitize_redirect() to be different from clean_url()?

comment:7 @dwright6 years ago

  • Cc david_v_wright@… added

duplicate of http://core.trac.wordpress.org/ticket/10298
I posted a possible solution there. (patch) (it works but is not 'pretty')

comment:8 follow-up: @hakre6 years ago

  • Keywords needs-patch sanitizing removed

URLs per definition can only conain ASCII chars. If clean_url() removes non ascii (7bit I hope) chars, that's fine. Please use the PUNYCODE notation of your domain instead of the UTF8-representation.

I suggest to close as invalid.

comment:9 in reply to: ↑ 8 @dwright6 years ago

Replying to hakre:

URLs per definition can only conain ASCII chars. If clean_url() removes non ascii (7bit I hope) chars, that's fine. Please use the PUNYCODE notation of your domain instead of the UTF8-representation.

I suggest to close as invalid.

I believe that Wordpress has enough international usage to warrant proper IDN support.
It seems sub-optimal for Mr/Mrs. Müller to have to send users to their website: xn--mller-kva.com, rather than müller.com

An IDNA-enabled application is able to convert between the internationalised and ASCII representations of a domain name. It uses the ASCII form for DNS lookups but can present the internationalised form to users who presumably prefer to read and write domain names in non-ASCII scripts such as Arabic or Hiragana. Applications that do not support IDNA will not be able to handle domain names with non-ASCII characters, but will still be able to access such domains if given the (usually rather cryptic) ASCII equivalent. - http://en.wikipedia.org/wiki/Internationalized_domain_name

comment:10 @scribu6 years ago

  • Keywords needs-patch added

xref: #10298

comment:11 @scribu6 years ago

+1 on IDNA support.

comment:12 @hakre6 years ago

Just some pseudocode and thoughts:

$url_unknown = getoption('siteurl');
list($prefix, $domain, $suffix) = extract_domain_from_url($url);
$domain_ascii = ( seems_utf8( $domain ) ?  utf8_to_punycode($domain) : $domain ) 
$url_ascii =  $prefix . $domain_ascii . $suffix;

It might be an option to hook the option siteurl when read but that might break the admin output. the rest should work then.

Or: once entered, the domain will automatically converted and the users do not need to do it on their own (Safest way I suppose).

comment:13 @nbachiyski6 years ago

Most of the time we can use the UTF-8 representation of both the domain and the path. Only the headers impose an ASCII restriction, so we can run the domain through the punycode converter and urlencode the path.

Have you found other places, where the UTF-8 representation causes problems?

comment:14 @hakre6 years ago

There is a problem where the punycode representation causes a problem, I tagged it as related because it somehow is: #11734

For the rest of your comment: I think that's a good way to go. We assume that the domain names are UTF-8 encoded. For URLs, as long as we do not read them URL encoded they are not punycode-encoded as well. Should pyncode-encoding be traded as URLencoded? URLs can be properly split anyway, so this can be decently applied. maybe a helper function that does the job (add a paramter for the from / to type).

comment:15 @sirzooro6 years ago

  • Cc sirzooro added

comment:16 @dwright5 years ago

  • Resolution set to worksforme
  • Status changed from reviewing to closed

I am closing this ticket as works-for-me as there is an now IDNA plugin which enables one to use IDN's (domain name's that contains non-ascii characters) in WordPress http://wordpress.org/extend/plugins/idna/

comment:17 @scribu5 years ago

  • Milestone Future Release deleted

comment:18 follow-ups: @codestyling4 years ago

  • Resolution worksforme deleted
  • Status changed from closed to reopened
  • Version changed from 2.8.4 to 3.1

IDN handling is different related to Browsers! WebKit based browser like Safari and Chrome work with PunyCode URL's but others like IE, Firefox and Opera doesn't.
This is a problem of Cross Site Scripting detection and can be realize and tested, if the Blog is configured to an PunyCode Url.

example out of a case I did investigate:
IDN: http://с-проект.рф
PunyCode: http://xn----jtbpoegeo.xn--p1ai

If you try to call a JSON request like this example with the generated admin_url() out of WordPress, which would become the PunyCode one:

	new Ajax.Request('http://xn----jtbpoegeo.xn--p1ai/wp-admin/admin-ajax.php' ?>', 
		{  
			parameters: {
				action: 'get_download_section'
			},
			onSuccess: function(transport) {		
				elem.title=transport.responseJSON.title;
			},
			onFailure: function(transport) {
				alert('JSON security bug')
			}
		}
	);

and the answer is correct 'application/json' with correct JSON content, than this fails on all browsers except WebKit based!
If you try it with the original IDN Url like:

	new Ajax.Request('http://с-проект.рф/wp-admin/admin-ajax.php' ?>', 

it works now for all other browsers but fails now on WebKit based.

My suggestion will be a conditional convertion back to IDN, if browser is not WebKit based.
I did this inside my WordPress plugin "Codestyling Localization" and it works now in any case. I did use the class idna_convert from Matthias Sommerfeld for easy decode of PunyCode admin url's in such a case.

Please check it also in relation to #11734 / #10690 / #14648 because this may also affect the flash uploader feeded with PunyCode url's instead of IDN for some browser!

comment:19 @ocean904 years ago

  • Milestone set to Awaiting Review
  • Version changed from 3.1 to 2.8.4

comment:20 in reply to: ↑ 18 ; follow-up: @hakre4 years ago

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

comment:21 @toscho4 years ago

  • Cc info@… added

comment:22 in reply to: ↑ 20 ; follow-up: @codestyling4 years ago

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

comment:23 @codestyling4 years ago

  • Cc codestyling added

comment:24 in reply to: ↑ 22 ; follow-up: @hakre4 years ago

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

comment:25 in reply to: ↑ 24 ; follow-up: @codestyling4 years ago

Replying to hakre:

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

Here ist the default option description of Prototype.js Ajax classes: http://api.prototypejs.org/ajax/

Please look at the evalJS property set as default to true which results into automatic population of responseJSON instead of responseText if the returned content type is application/json.
As also noticed, this is strongly related to same origin policy and this will not work at all browsers the same way unsing IDN based URL's like the admin_url() do at WordPress for PunyCode URL's.

Even if this could be a bug at Prototype.js it affects function of WordPress plugins/themes using the onboard shipped Prototype.js functionality to solve the goals.

comment:26 in reply to: ↑ 25 @hakre4 years ago

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

Here ist the default option description of Prototype.js Ajax classes: http://api.prototypejs.org/ajax/

The link you provided says nothing about which encoding/format is expected/supported by that function for which browser. Until this information is not provided I can not say here exactly the cause of the problem is. Normally I only fix problems if I know their cause, so that's why I was asking. Until provided I will step back from the discussion a bit.

Even if this could be a bug at Prototype.js it affects function of WordPress plugins/themes using the onboard shipped Prototype.js functionality to solve the goals.

That's the case for any library used: If the library has a bug, that bug might get transparently transported into the application making use of the library.

If you see a problem in using Prototype.js Ajax across different browsers, please report the problem back to the prototype library project so they can fix it and every user of the library can benefit from the fix. Thanks.

Last edited 4 years ago by hakre (previous) (diff)

comment:27 in reply to: ↑ 18 @SergeyBiryukov4 years ago

Replying to codestyling:

IDN handling is different related to Browsers! WebKit based browser like Safari and Chrome work with PunyCode URL's but others like IE, Firefox and Opera doesn't.
This is a problem of Cross Site Scripting detection and can be realize and tested, if the Blog is configured to an PunyCode Url.

Related: #18952

comment:28 @SergeyBiryukov4 years ago

  • Summary changed from Wordpress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in URLs

comment:29 @SergeyBiryukov4 years ago

Closed #19279 as a duplicate.

comment:30 @espsjurs3 years ago

Hello there,

This is still an issue :-(

Is someone working on this?

Espen

comment:31 @SergeyBiryukov3 years ago

Now that AJAX for IDN is fixed in #18952, I don't see any issues other than #11734 when using the Punycode representation of URLs.

We should probably convert home and siteurl to Punycode automatically, as mentioned in #19279.

Last edited 3 years ago by SergeyBiryukov (previous) (diff)

comment:32 @espsjurs2 years ago

If you intend to use multisite you can forget special characters in the domain name. It will work as long as you use Punycode (the translated version illegible) of the domain name, but all the links generated from the WP will be 'puny'. Example url in registration mail and site generation page..

comment:33 @espsjurs2 years ago

  • Cc espsjurs added

comment:34 @mdawaffe2 years ago

New patch attached to #11734. We still need a patch here to convert home and siteurl to punycode/ascii.

comment:35 @joostdevalk16 months ago

The issue with sanitize_redirect is actually rather annoying as well, so it's not just home and siteurl...

comment:36 @nacin16 months ago

  • Keywords needs-unit-tests added
  • Summary changed from WordPress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in URLs

comment:37 @szepe.viktor16 months ago

This small wanna-be-plugin works on production sites:
https://gist.github.com/szepeviktor/9682764
(you shouldn't name your menus with non-ascii characters, because they will go into a CSS class)

Last edited 16 months ago by szepe.viktor (previous) (diff)

comment:38 @SergeyBiryukov12 months ago

#29170 was marked as a duplicate.

@extendwings12 months ago

Patch from #29170

comment:39 @extendwings12 months ago

  • Keywords has-patch added; needs-patch removed

comment:40 follow-up: @szepe.viktor12 months ago

@extendwings As I see it is an IDN domain name patch.
My mu-plugin handles non-ascii characters in the post slug (the URL path)
https://github.com/szepeviktor/wordpress-plugin-construction/blob/master/mu-latin-accent-urls/latin-accent-urls.php

comment:41 @szepe.viktor12 months ago

@SergeyBiryukov So I think it is not a duplicate of #29170

Last edited 12 months ago by SergeyBiryukov (previous) (diff)

comment:42 in reply to: ↑ 40 @SergeyBiryukov12 months ago

Replying to szepe.viktor:

@extendwings As I see it is an IDN domain name patch.

Yes, and this ticket is about IDN domains (see the description).

My mu-plugin handles non-ascii characters in the post slug (the URL path)

Looks like your plugin just disables half of remove_accents() for slugs. There are probably use cases for that, but the function is there for a reason, so I don't think we want to do that in core. It also doesn't appear to resolve the wp_sanitize_redirect() issue mentioned here.

comment:43 @szepe.viktor12 months ago

Thank you.

This is the use case:
http://esküvői-videók.hu/esk%C3%BCv%C5%91i-id%C3%A9zetek-7-r%C3%A9sz/
IDN domain name + non-ascii URL path

Last edited 12 months ago by szepe.viktor (previous) (diff)

comment:44 @szepe.viktor12 months ago

  • Summary changed from WordPress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in the domain name

comment:45 @rmccue12 months ago

Just noticed this ticket. Requests includes an IDNA encoder written from scratch by myself. It also has 97.9% test coverage; (relevant tests). Might be worth integrating?

comment:46 @extendwings12 months ago

@rmccue RFC 3490(IDNA2003) was obsoleted by 5890 and 5891(IDNA2008).
I guess that the library supporting IDNA2008 is suited.

Note: See TracTickets for help on using tickets.