Make WordPress Core

Opened 15 years ago

Last modified 5 years ago

#10690 reopened defect (bug)

WordPress does not support non-ascii characters in the domain name

Reported by: paddya's profile paddya Owned by: markjaquith's profile markjaquith
Milestone: Priority: normal
Severity: normal Version: 2.8.4
Component: Canonical Keywords: needs-unit-tests needs-patch
Focuses: Cc:

Description

WordPress' clean_url() strips out most characters, which are non-ascii for security reasons. This is causing problems, if you want to run a WordPress blog on a domain containing non-ascii-characters (e.g. müller.com), because WordPress generates wrong URLs on redirects.

Attachments (1)

29170.2.diff (106.6 KB) - added by extendwings 10 years ago.
Patch from #29170

Download all attachments as: .zip

Change History (50)

#1 @paddya
15 years ago

wp_sanitize_redirect() is affected by this issue, too.

#2 @scribu
15 years ago

  • Component changed from General to Canonical
  • Keywords needs-patch added; URLs removed
  • Owner set to markjaquith

#3 @scribu
15 years ago

  • Keywords changed from needs-patch, sanitizing to needs-patch sanitizing

#4 @azaozz
15 years ago

  • Milestone changed from 2.9 to Future Release

No patch.

#5 @matt
15 years ago

Part of the reason for being conservative in those functions when they were written is that we had security issues in the past with characters allowing for header splitting when included with redirects, and probably an assumption that all i18n stuff would be encoded and compatible with our current methods. Any fix for this should look at all the security implications.

#6 @nbachiyski
15 years ago

  • Status changed from new to reviewing

The problem is not in clean_url(), which works perfectly fine with unicode URLs (see TestCleanUrl::test_not_ascii() in http://svn.automattic.com/wordpress-tests/wp-testcase/test_includes_formatting.php).

The problem is in wp_sanitize_redirect(), which uses a lot more restrictive filter than clean_url().

@markjaquith is there a reason wp_sanitize_redirect() to be different from clean_url()?

#7 @dwright
15 years ago

  • Cc david_v_wright@… added

duplicate of http://core.trac.wordpress.org/ticket/10298
I posted a possible solution there. (patch) (it works but is not 'pretty')

#8 follow-up: @hakre
15 years ago

  • Keywords needs-patch sanitizing removed

URLs per definition can only conain ASCII chars. If clean_url() removes non ascii (7bit I hope) chars, that's fine. Please use the PUNYCODE notation of your domain instead of the UTF8-representation.

I suggest to close as invalid.

#9 in reply to: ↑ 8 @dwright
15 years ago

Replying to hakre:

URLs per definition can only conain ASCII chars. If clean_url() removes non ascii (7bit I hope) chars, that's fine. Please use the PUNYCODE notation of your domain instead of the UTF8-representation.

I suggest to close as invalid.

I believe that Wordpress has enough international usage to warrant proper IDN support.
It seems sub-optimal for Mr/Mrs. Müller to have to send users to their website: xn--mller-kva.com, rather than müller.com

An IDNA-enabled application is able to convert between the internationalised and ASCII representations of a domain name. It uses the ASCII form for DNS lookups but can present the internationalised form to users who presumably prefer to read and write domain names in non-ASCII scripts such as Arabic or Hiragana. Applications that do not support IDNA will not be able to handle domain names with non-ASCII characters, but will still be able to access such domains if given the (usually rather cryptic) ASCII equivalent. - http://en.wikipedia.org/wiki/Internationalized_domain_name

#10 @scribu
15 years ago

  • Keywords needs-patch added

xref: #10298

#11 @scribu
15 years ago

+1 on IDNA support.

#12 @hakre
15 years ago

Just some pseudocode and thoughts:

$url_unknown = getoption('siteurl');
list($prefix, $domain, $suffix) = extract_domain_from_url($url);
$domain_ascii = ( seems_utf8( $domain ) ?  utf8_to_punycode($domain) : $domain ) 
$url_ascii =  $prefix . $domain_ascii . $suffix;

It might be an option to hook the option siteurl when read but that might break the admin output. the rest should work then.

Or: once entered, the domain will automatically converted and the users do not need to do it on their own (Safest way I suppose).

#13 @nbachiyski
15 years ago

Most of the time we can use the UTF-8 representation of both the domain and the path. Only the headers impose an ASCII restriction, so we can run the domain through the punycode converter and urlencode the path.

Have you found other places, where the UTF-8 representation causes problems?

#14 @hakre
15 years ago

There is a problem where the punycode representation causes a problem, I tagged it as related because it somehow is: #11734

For the rest of your comment: I think that's a good way to go. We assume that the domain names are UTF-8 encoded. For URLs, as long as we do not read them URL encoded they are not punycode-encoded as well. Should pyncode-encoding be traded as URLencoded? URLs can be properly split anyway, so this can be decently applied. maybe a helper function that does the job (add a paramter for the from / to type).

#15 @sirzooro
15 years ago

  • Cc sirzooro added

#16 @dwright
15 years ago

  • Resolution set to worksforme
  • Status changed from reviewing to closed

I am closing this ticket as works-for-me as there is an now IDNA plugin which enables one to use IDN's (domain name's that contains non-ascii characters) in WordPress http://wordpress.org/extend/plugins/idna/

#17 @scribu
15 years ago

  • Milestone Future Release deleted

#18 follow-ups: @codestyling
14 years ago

  • Resolution worksforme deleted
  • Status changed from closed to reopened
  • Version changed from 2.8.4 to 3.1

IDN handling is different related to Browsers! WebKit based browser like Safari and Chrome work with PunyCode URL's but others like IE, Firefox and Opera doesn't.
This is a problem of Cross Site Scripting detection and can be realize and tested, if the Blog is configured to an PunyCode Url.

example out of a case I did investigate:
IDN: http://с-проект.рф
PunyCode: http://xn----jtbpoegeo.xn--p1ai

If you try to call a JSON request like this example with the generated admin_url() out of WordPress, which would become the PunyCode one:

	new Ajax.Request('http://xn----jtbpoegeo.xn--p1ai/wp-admin/admin-ajax.php' ?>', 
		{  
			parameters: {
				action: 'get_download_section'
			},
			onSuccess: function(transport) {		
				elem.title=transport.responseJSON.title;
			},
			onFailure: function(transport) {
				alert('JSON security bug')
			}
		}
	);

and the answer is correct 'application/json' with correct JSON content, than this fails on all browsers except WebKit based!
If you try it with the original IDN Url like:

	new Ajax.Request('http://с-проект.рф/wp-admin/admin-ajax.php' ?>', 

it works now for all other browsers but fails now on WebKit based.

My suggestion will be a conditional convertion back to IDN, if browser is not WebKit based.
I did this inside my WordPress plugin "Codestyling Localization" and it works now in any case. I did use the class idna_convert from Matthias Sommerfeld for easy decode of PunyCode admin url's in such a case.

Please check it also in relation to #11734 / #10690 / #14648 because this may also affect the flash uploader feeded with PunyCode url's instead of IDN for some browser!

#19 @ocean90
14 years ago

  • Milestone set to Awaiting Review
  • Version changed from 3.1 to 2.8.4

#20 in reply to: ↑ 18 ; follow-up: @hakre
14 years ago

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

#21 @toscho
14 years ago

  • Cc info@… added

#22 in reply to: ↑ 20 ; follow-up: @codestyling
14 years ago

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

#23 @codestyling
14 years ago

  • Cc codestyling added

#24 in reply to: ↑ 22 ; follow-up: @hakre
14 years ago

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

#25 in reply to: ↑ 24 ; follow-up: @codestyling
14 years ago

Replying to hakre:

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

Here ist the default option description of Prototype.js Ajax classes: http://api.prototypejs.org/ajax/

Please look at the evalJS property set as default to true which results into automatic population of responseJSON instead of responseText if the returned content type is application/json.
As also noticed, this is strongly related to same origin policy and this will not work at all browsers the same way unsing IDN based URL's like the admin_url() do at WordPress for PunyCode URL's.

Even if this could be a bug at Prototype.js it affects function of WordPress plugins/themes using the onboard shipped Prototype.js functionality to solve the goals.

#26 in reply to: ↑ 25 @hakre
14 years ago

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Replying to hakre:

Replying to codestyling:

Can you specify Ajax.Request? Is that of some javascript library?

Yes I can. This is belongs to Prototype.js library shipped with WordPress as one of the standard libraries. Documentation can be found here: http://api.prototypejs.org/ajax/Ajax/Request/

Thanks for providing the information. Can you provide a link to the prototype documentation of that Ajax.Request that is specifying the encoding of the URL parameter? The link you provided does not say in which encoding/format the URL has to be provided. This might just be a bug in prototype, not Wordpress.

Here ist the default option description of Prototype.js Ajax classes: http://api.prototypejs.org/ajax/

The link you provided says nothing about which encoding/format is expected/supported by that function for which browser. Until this information is not provided I can not say here exactly the cause of the problem is. Normally I only fix problems if I know their cause, so that's why I was asking. Until provided I will step back from the discussion a bit.

Even if this could be a bug at Prototype.js it affects function of WordPress plugins/themes using the onboard shipped Prototype.js functionality to solve the goals.

That's the case for any library used: If the library has a bug, that bug might get transparently transported into the application making use of the library.

If you see a problem in using Prototype.js Ajax across different browsers, please report the problem back to the prototype library project so they can fix it and every user of the library can benefit from the fix. Thanks.

Last edited 14 years ago by hakre (previous) (diff)

#27 in reply to: ↑ 18 @SergeyBiryukov
13 years ago

Replying to codestyling:

IDN handling is different related to Browsers! WebKit based browser like Safari and Chrome work with PunyCode URL's but others like IE, Firefox and Opera doesn't.
This is a problem of Cross Site Scripting detection and can be realize and tested, if the Blog is configured to an PunyCode Url.

Related: #18952

#28 @SergeyBiryukov
13 years ago

  • Summary changed from Wordpress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in URLs

#29 @SergeyBiryukov
13 years ago

Closed #19279 as a duplicate.

#30 @espsjurs
12 years ago

Hello there,

This is still an issue :-(

Is someone working on this?

Espen

#31 @SergeyBiryukov
12 years ago

Now that AJAX for IDN is fixed in #18952, I don't see any issues other than #11734 when using the Punycode representation of URLs.

We should probably convert home and siteurl to Punycode automatically, as mentioned in #19279.

Last edited 12 years ago by SergeyBiryukov (previous) (diff)

#32 @espsjurs
12 years ago

If you intend to use multisite you can forget special characters in the domain name. It will work as long as you use Punycode (the translated version illegible) of the domain name, but all the links generated from the WP will be 'puny'. Example url in registration mail and site generation page..

#33 @espsjurs
12 years ago

  • Cc espsjurs added

#34 @mdawaffe
11 years ago

New patch attached to #11734. We still need a patch here to convert home and siteurl to punycode/ascii.

#35 @joostdevalk
11 years ago

The issue with sanitize_redirect is actually rather annoying as well, so it's not just home and siteurl...

#36 @nacin
11 years ago

  • Keywords needs-unit-tests added
  • Summary changed from WordPress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in URLs

#37 @szepe.viktor
11 years ago

This small wanna-be-plugin works on production sites:
https://gist.github.com/szepeviktor/9682764
(you shouldn't name your menus with non-ascii characters, because they will go into a CSS class)

Last edited 11 years ago by szepe.viktor (previous) (diff)

#38 @SergeyBiryukov
10 years ago

#29170 was marked as a duplicate.

@extendwings
10 years ago

Patch from #29170

#39 @extendwings
10 years ago

  • Keywords has-patch added; needs-patch removed

#40 follow-up: @szepe.viktor
10 years ago

@extendwings As I see it is an IDN domain name patch.
My mu-plugin handles non-ascii characters in the post slug (the URL path)
https://github.com/szepeviktor/wordpress-plugin-construction/blob/master/mu-latin-accent-urls/latin-accent-urls.php

#41 @szepe.viktor
10 years ago

@SergeyBiryukov So I think it is not a duplicate of #29170

Last edited 10 years ago by SergeyBiryukov (previous) (diff)

#42 in reply to: ↑ 40 @SergeyBiryukov
10 years ago

Replying to szepe.viktor:

@extendwings As I see it is an IDN domain name patch.

Yes, and this ticket is about IDN domains (see the description).

My mu-plugin handles non-ascii characters in the post slug (the URL path)

Looks like your plugin just disables half of remove_accents() for slugs. There are probably use cases for that, but the function is there for a reason, so I don't think we want to do that in core. It also doesn't appear to resolve the wp_sanitize_redirect() issue mentioned here.

#43 @szepe.viktor
10 years ago

Thank you.

This is the use case:
http://esküvői-videók.hu/esk%C3%BCv%C5%91i-id%C3%A9zetek-7-r%C3%A9sz/
IDN domain name + non-ascii URL path

Last edited 10 years ago by szepe.viktor (previous) (diff)

#44 @szepe.viktor
10 years ago

  • Summary changed from WordPress does not support non-ascii characters in URLs to WordPress does not support non-ascii characters in the domain name

#45 @rmccue
10 years ago

Just noticed this ticket. Requests includes an IDNA encoder written from scratch by myself. It also has 97.9% test coverage; (relevant tests). Might be worth integrating?

#46 @extendwings
10 years ago

@rmccue RFC 3490(IDNA2003) was obsoleted by 5890 and 5891(IDNA2008).
I guess that the library supporting IDNA2008 is suited.

#47 @chriscct7
9 years ago

  • Keywords needs-patch added; has-patch removed

#48 @SergeyBiryukov
9 years ago

#34965 was marked as a duplicate.

#49 @SergeyBiryukov
9 years ago

#35719 was marked as a duplicate.

Note: See TracTickets for help on using tickets.