Opened 11 years ago
Last modified 4 weeks ago
#31992 accepted defect (bug)
Unicode Email Addresses
| Reported by: |
|
Owned by: |
|
|---|---|---|---|
| Milestone: | 7.0 | Priority: | normal |
| Severity: | normal | Version: | |
| Component: | Formatting | Keywords: | has-test-info has-patch has-screenshots has-unit-tests |
| Focuses: | Cc: |
Description (last modified by )
Tested against trunk (2015-04-16)
Test case
$target_email = 'dummy-üñîçøðé.y.!#$%@üñîçøðé.gmail.com'; echo $target_email.'<br>'; echo sanitize_email($target_email).'<br>'; echo 'is_email : '.is_email($target_email);
Return
dummy-üñîçøðé.y.!#$%@üñîçøðé.gmail.com dummy-email.y.!#$%@gmail.com is_email :
Function is_email @ /wp-includes/formatting.php line 2177
Preg_replace @ line 2211 is not correct.
if ( !preg_match( '/^[a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]+$/', $local ) ) {
Function sanitize_email() @ /wp-includes/formatting.php line 2430
Preg_replace @ line 2460 is not correct.
$local = preg_replace( '/[^a-zA-Z0-9!#$%&\'*+\/=?^_`{|}~\.-]/', '', $local );
Attachments (2)
Change History (76)
#2
@
11 years ago
ugh... sorry. I actually pasted the email that was sanitized.
The test I made was
$target_email = 'dummy-üñîçøðé.y.!#$%@üñîçøðé.gmail.com'; echo $target_email.'<br>'; echo sanitize_email($target_email).'<br>'; echo 'is_email : '.is_email($target_email);
with return as
dummy-üñîçøðé.y.!#$%@üñîçøðé.gmail.com dummy-email.y.!#$%@gmail.com is_email :
The unicode characters were all removed. For international emails this can be a real problem.
ps. I actually used the Wiki page you sent as a base for my ticket. I tried a mix of one of the last examples in the "Valid email Examples" list.
#3
@
11 years ago
I had looked into email validation when I wrote my SMTP plugin, and eventually concluded it is nearly impossible to validate an email address at all. Most email validators are much more restrictive than any RFC requires.
I think currently my code doesn't even call sanitize_text_field because it failed too many of my test cases.
https://wordpress.org/plugins/postman-smtp/
http://girders.org/blog/2013/01/31/dont-rfc-validate-email-addresses/
#4
@
11 years ago
- Keywords reporter-feedback removed
- Summary changed from sanitize_email() and is_email() preg_replace/preg_match problems to Unicode Email Addresses
- Version trunk deleted
Replying to ysalame:
ugh... sorry. I actually pasted the email that was sanitized.
We need an admin to update the ticket description then.
#8
@
5 years ago
Hi!
I opened #51732 ticket to propose a patch to change email validation rules to work with EAI, but it was recognized a duplicate of this one.
As I can see, nothing was changed with EAI for 6 years from the time this ticket was open and Unicode EAI cannot be used in WP. I would like to propose some things to fix it:
1) Let's modify sanitize_email() and is_email() to use special constants in default-constants.php for regex check username and domain name (WP_IDN_LOCAL_MAIL_REGEX and WP_IDN_DOMAIN_REGEX) which every admin can change to validate IDN domains rules for language he uses. For example I added Russian IDN rules to these regexes.
2) Let's also add decode_punycode() to sanitize_email() function which check if e-mail domain part in Punycode from and convert it to Unicode as required by uasg.tech
3) In addition to all these things we can modify sanitize_user() in the same manner to allow Unicode usernames to WP Universal Acceptance ready.
I attached the patch for all of these.
#9
@
5 years ago
This patch does not solve the problem with e.g. "测试5@普遍接受-测试.世界"
A more simpler approach would be to use php FILTER_VALIDATE_EMAIL (which doesn't work for unicode, but we add the second if for that) here. An example validation function could look like this:
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
return true;
} elseif (preg_match("/^([\w\-\.\+]+@([\w\-]+\.)+[\w\-]{2,63})?$/u", $email)) {
return true;
} else {
return false;
}
I guess even more simple would be to use just the regex:
if (preg_match("/^([\w\-\.\+]+@([\w\-]+\.)+[\w\-]{2,63})?$/u", $email)) {
return true;
} else {
return false;
}
In any case is_email currently blocks unicode email addresses and should get fixed.
@
3 years ago
Patch to add support for unicode email addresses in is_email, but making no other changes and resolving no other limitations
#10
@
2 years ago
- Keywords has-testing-info has-patch has-screenshots added; is-email removed
Test Report
Patch tested: https://core.trac.wordpress.org/attachment/ticket/31992/31992-icann.patch
Steps to Test
- In WP admin, navigate to Users > Add New.
- Enter a Username in standard ASCII characters (no need to confuse with related #5918).
- Enter a valid Email with extended ASCII or unicode characters, e.g.:
grå@grå.org. - 🐞 Observe error returned:
Error: The email address is not correct.
Expected Results
- ✅ An email with extended ASCII or unicode characters should be accepted.
Environment
- Hardware: MacBook Pro Apple M1 Pro
- OS: macOS 13.5
- Browser: Safari 16.6
- Server: nginx/1.25.2
- PHP: 8.2.9
- Database: SQLite
- WordPress: 6.4-alpha-56267-src
Actual Results
- ✅ The Add New User form accepts the email with special characters.
- ❌ The email stored in the database (and subsequently displayed on the page) has special characters stripped out. E.g. (also see Figure 1):
grå@grå.org=>gr@gr.org测试5@普遍接受-测试.世界=>(empty!)
Supplemental Artifacts
#11
@
2 years ago
Thanks for testing, and thanks for not confusing it with that other issue.
My patch fixes only email address validation, it doesn't touch database storage. I'll be happy to work on database storage too, but do you want it on this issue? I lean towards fixing only is_email in this issue, and fixing other related issues in separate patches.
(JFYI, my most important goal is to accept unicode email addresses in forms managed by plugins, CF7 for example.)
#12
@
2 years ago
- Keywords changes-requested added
@agulbra, yes, it's fine to work on the DB portion with a separate patch file, but please do include it with this ticket. Based on testing 31992-icann.patch, addressing the DB storage issue is first required to prevent the issue noted in comment:10. Also, there are other workflows that will need testing, like during site creation, etc, so this could very well touch even more areas in Core.
I wanted to note that GitHub pull requests are the preferred mechanism for code review, as they facilitate easier reviews and collaboration. (For instance, see this branch with your first patch applied.) PRs are often faster to update, and additional requirements can be addressed through additional commits as they reveal themselves, such as the database storage update.
I've marked this ticket with changes-requested to signify that additional work is needed on top of the existing patch.
(In addition to the handbook, I invite you to join Make WordPress Slack if you have questions about the contribution workflow, particularly the #core channel.)
This ticket was mentioned in PR #5237 on WordPress/wordpress-develop by arnt.
2 years ago
#13
- Keywords has-unit-tests added
This adds support for unicode email addresses in is_email and sanitise_email. It is intended to be sufficient to accept unicode email addresses in contact forms etc.,
it does *not* aspire to unicode support everywhere, in particular it doesn't deal with unicode in logins or passwords.
It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one, but don't quite understand your testing policies. It looks like you want to keep the tests fast by avoiding unnecessary tests, I'd say, but I don't understand what you consider necessary and unnecessary. Anyway, if you want a file I'm happy to add one.
2 years ago
#14
I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.
@ironprogrammer commented on PR #5237:
2 years ago
#15
In response to @arnt:
It also does not add a tests/formatting/sanitizeEmail.php; I wouldn't mind adding one...
Absolutely, please do add unit tests (or updates to existing tests) for sanitize_email(). While Core test code coverage is pretty low today 😓, it is a project policy that new Core merges include appropriate tests.
I don't understand the test failures. They appear to be unrelated to these code changes. Could you take a look, @ironprogrammer ? Thanks.
It's not uncommon for E2E test timeouts to occur in this repo's automated jobs. These can be ignored.
@ironprogrammer commented on PR #5237:
2 years ago
#16
Thanks for the updated PR! The updates to sanitize_email() appear to have resolved the issue with storing the correct email in the database 🎉.
This isn't a formal test report (there are lots of scenarios needed), but I ran through a few smoke tests so as to get back to you quickly on what could be addressed next. It would be great to get additional reviewers/testers involved to cover what I've missed:
- ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
- ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
- ❓ Do new user creation, email change confirmation, or password change emails go to the correct address?
- ❓ Do links in these emails work correctly?
- ❓ Does the forgotten password workflow work correctly with a unicode address?
- ✅ Login: logging in with a unicode email works as expected.
- ✅ Add post comment: email with unicode is accepted for post comments.
- ❌ Comments (approval list): emails with unicode are not displayed correctly (Figure 1).
- ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
- ❓ Site setup wizard: can the initial admin use a unicode address, and do the emails work correctly?
- ❓ Do various author dropdowns display unicode addresses as expected?
- ❓ Are unicode emails in feeds presented accurately?
_Figure 1: Display of commenter unicode email addresses._
2 years ago
#17
Sigh. I see what you mean.
My own focus is/was quite different: I wanted Contact Form 7 and similar to accept unicode email addresses, which requires some utility functions in Wordpress to do the right thing. I didn't aspire to complete support in Wordpress. The thinking behind your response seems to be that either Wordpress supports it properly or not at all, there's no inbetween.
And, sadly, I think you're right and I was wrong.
I'll extend the PR. It'll take a few weeks. I'll also find some testers.
15 months ago
#18
Hi,
you asked for wider testing. Around 20 sites have now run the patch in production, the longest-running one for more than six months already. Nothing has broken. The sites are real (University of Somewhere, not someone's private five-visitor site).
If you want, I can send you the email addresses of some testers. When I pinged them last week, around half said it was okay to give you their addresses, I'm sure most will say the same if I ask a few times ;)
@ironprogrammer commented on PR #5237:
15 months ago
#20
Thanks for the update, @arnt, and for seeking wider testing! There are still some outstanding questions and one previous failure to consider, but it's great to see this moving forward! 🙌🏻
A good next step would be to rebase this PR on the latest trunk, and to resolve the coding standards issues to tidy things up for an updated review and testing. When the CI checks are in a happy state, then it will help attract more attention.
15 months ago
#21
Finding testers who were willing and able to run a patch on sites that mattered to them was hard work. That took a good long while. Do you want their names? If so, please send me mail.
Rebasing is not a problem, of course. The other stuff was a little difficult. I only recently found a good workaround for IntlChar's lack of support for UAX24.
Can you suggest a model end-to-end test that I could emulate and write automatic tests for the desired features? Manual testing is possible, but I'd much prefer to have complete automatic test coverage.
13 months ago
#22
Hi,
the remaining problems, as far as I can tell:
- The code now requires PHP 7.4, which I thought was OK but now I see that 7.4 is merely recommended, not required. Oops. Not clear to me how to do mb_str_split in 7.2.
- I've clicked around, but don't feel sure that I have all of the "various dropdowns". The filters in default-filters.php are okay, but I don't feel confident that every third-party plugin will be okay, or that I've looked at everything in the UI. Wordpress is large, there are nooks and corners…
12 months ago
#23
I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().
A bit on testing:
- ✅ Add New User: email with unicode characters is accepted and saved to the database correctly.
- ✅ Profile (edit user): email can be changed with unicode characters, and when saved, persists correctly.
- ✅ New user creation, email change confirmation, and password change emails go to the correct address.
- ✅ Links in all messages work correctly AFAICT (plugins can add new kinds of course…).
- ✅ The forgotten password workflow works correctly with a unicode address.
- ✅ Login: logging in with a unicode email works as expected.
- ✅ Add post comment: email with unicode is accepted for post comments.
- ✅ Comments (approval list): emails with unicode are displayed correctly.
- ✅ In the Comments list, the Email field under Quick Edit does appear correctly.
- ✅ WordPress works on an IPv6-only server (oh, you weren't wondering?)
And… oh drat… Must retest the site setup wizard, too much code has changed.
Two questions remain.
- What author dropdowns display email addresses? I can't find anything. Lots of user names displayed but not email addresses. Maybe I misunderstand. I checked for usage of email in the code that reads the database and use of that code in display code.
- How can email addresses ever appear in feeds? The feed generators don't seem to reveal authors etc.
12 months ago
#24
Just set up a new site from scratch using 'rød@grå.org' as admin address; it works.
11 months ago
#25
Hi @ironprogrammer,
would you mind having a look?
Also, on an unrelated matter, I'm with you. I've worked as a full-time maintainer on a widely used open source project and got a lot of flak. It hurt. Some people can be really shitty. Don't let it grind you down.
@ironprogrammer commented on PR #5237:
11 months ago
#26
Thanks a lot, @arnt, and apologies for not being involved with this the past several months 🙏🏻. I've got some other work priorities at this time, but hope to be able to dive back in soon!
In the meantime, for code review purposes, I would suggest raising this Trac ticket during #core Dev Chat, or dropping in a comment on the next agenda post to add it to open floor discussion. This will help spread some awareness of the initiative. If you're available on Slack and sync during that time to answer questions, that would be ideal.
Glad you could wrangle additional testing for the updates! If this info can be made public through test reports in Trac, that would be amazing -- and also prerequisite to merging as Core committers begin taking a look at the ticket.
You could get additional visibility by sharing the ticket in the #core-test channel. There may also be contributors willing to help put test instructions together based on what you've got so far. This update covers a lot of territory, so the more detailed the tests for each use case, the better.
#29
@
7 months ago
Hi @desrosj,
thanks for looking at this. I think reviewing/merging finally makes sense, since the required version of PHPMailer was released recently.
What can I do to help the review along? Would it help if I took a short trip to Basel for Wordcamp? Should https://github.com/WordPress/wordpress-develop/pull/5237 have any more documentation, or, or?
This ticket was mentioned in Slack in #core by agulbra. View the logs.
7 months ago
This ticket was mentioned in Slack in #core by francina. View the logs.
6 months ago
#32
@
5 months ago
This is the first of n test reports for this. All are basically the same — do something involving an å using unpatched WordPress 6.8, get a failure, do the same thing using this patch, it works. If you like Hindi, you can use अ instead of my å, if Chinese…
The domain grå.org belongs to me. Feel free to send mail there if you don't have an IDN yourself. Mail to rød@grå.org lands in my inbox, I won't complain, mail to blå@grå.org lands nowhere.
Environment: WordPress 6.8, in my case Brave as browser, both on linux, no plugins, WordPress unpacked but the installer has not yet run.
To pass this test, you need https://github.com/WordPress/wordpress-develop/pull/5237 and PHPMailer 6.10.
Steps to test: Open the installer web page. Enter rød@grå.org as the admin email address in the WordPress installer.
Test results: All is well if WordPress completes its installation and a message is sent to rød@grå.org. With 6.8, the email address is rejected and you are told to enter a valid email address.
#33
@
5 months ago
Test report 2/n.
Environment: WordPress 6.8, in my case Brave as browser, both on linux, no plugins, WordPress freshly installed, no additional plugins, no pages added.
To pass this test, you need https://github.com/WordPress/wordpress-develop/pull/5237 and PHPMailer 6.10.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Test results: All is well if WordPress adds the user and sends an email to the address you specified (and I suggest checking the log file of the outgoing MTA to see whether the message was actually sent). Later, WordPress will make an author page that looks like "/author/grå" and call the author "grå" on posts, which is a step forward from "gra" (one generally prefers not to call blog authors "gra").
With 6.8, you get an error message instead, no message is sent, no user is added.
(Feel free to use the grå.org domain for testing if you don't have one. It belongs to me (Arnt Gulbrandsen) and I don't mind. rød@ is forwarded to me, grå@ is an autoresponder.)
This ticket was mentioned in Slack in #core-test by sirlouen. View the logs.
5 months ago
This ticket was mentioned in Slack in #core-test by agulbra. View the logs.
5 months ago
@tusharbharti commented on PR #5237:
5 months ago
#37
Hi @arnt, I was curious if this PR would also handle things related to rfc 5322.
As I have a port of this ready in JavaScript ( Initially for @wordpress/url's isEmail ).
If we are fine with this, I can help with that in form of reviews.
Regards.
5 months ago
#38
I know 5322 very well, and also its replacement and am not aware of any related problems in Wordpress or PHPMailer. (My name is in both documents.) I'd be very happy to fix any related problems, if you'll point out anything specific.
I'm also more than happy to look at your javascript code.
@tusharbharti commented on PR #5237:
5 months ago
#39
I know 5322 very well, and also its replacement and am not aware of any related problems in Wordpress or PHPMailer. (My name is in both documents.) I'd be very happy to fix any related problems, if you'll point out anything specific.
(CORRECTION: No, my name isn't in 5322bis. The editor removed the names, the list grew too long.)
Ah my bad then, I was just doing tests using the current patch and saw some failing tests for comments, quotes and short domains ( I think that's what they called ? ), hence I wanted to add some improvements on them. 😄
I'm also more than happy to look at your javascript code.
Sure! here is the code Link
main focus points would be: #L161, #L175, #L189-L241
and then I guess supporting ip addresses.
This was what I was able to support from my JavaScript version
### Screenshot
#### Current Patch Converage
cc: @arnt
5 months ago
#40
That's admirably thorough testing!
By coinidence, Pete Resnick (the author of RFC 5322) and I were in the same room when I saw your table. We went over the cases together and agree about most of it.
First, a bit about the email address syntaxes. There are four:
- 5322 defines a very permissive syntax for interacting with legacy software (to be used e.g. when you parse old mail archives)
- 5322 defines a stricter syntax for generating mail today
- 5321 defines an even stricter syntax for addresses used with SMTP
- HTML defines an even stricter syntax) for using addresses in HTML forms
In general it's best to use addresses that meet the requirements of all four. Addresses that are valid according to only some rulesets tend to cause problems for users.
Now, to the addresses, with my opinions for how to handle each:
to..to@couc.ouandto.@couc.ouare invalid according to RFC 5321, therefore Wordpress cannot send mail to them. Rejecting these is a good idea.
(to)to@couc.ouandto@couc.ou(to)will be mangled during transmission by 5321-compliant software. Both addresses will arrive asto@couc.ou. Rejecting these is a good idea (while allowingto@couc.ouof course).
cow@[dead::beef],1@23456789and a few others use address literals, which have been used for attacks. The same attacks apply to Wordpress (you could use this to probe for IP addresses that the Wordpress host can reach, but which are firewalled away from the general network). Rejecting these is a good idea.
toto@coandadmin@mailserver1are local addresses, not world-reachable. (They could also be addresses within a TLD, but none of the TLDs allow that right now.) You could ssh to a Wordpress server and run 'ping co', because ping allows a local hostname. However, it's almost certainly bad to let users enter a local address in a Wordpress contact form (an attacker could use that to probe the local networkl). Rejecting these addresses makes sense.
1234567890123456789012345678901234567890123456789012345678901234+x@example.comis longer than one of the RFC limits. Most software supports longer localparts today. However, I surveyed a corpus of about 500,000 addresses, and *only* software used longer localparts, no users.issue-1234567890123456789012345678901234567890123456789012345678901234@support.example.comis a good example. If Wordpress allows longer localparts, bots will use that to spam the contact forms, and it will not help people. My opinion is that it's _probably_ best to reject these. Pete doesn't have an opinion on this address.
user@[IPv6:2001:db8::1]:8080passes, and I think it should not. It's not deliverable via SMTP on the public internet. However, it's also not a unicode matter. I'm uncertain whether _this PR_ should remove support for it.
Thanks for this testing! I'll add some of the addresses to the unit tests you used to the unit tests and push a new revision. I may change the handling of the :8080 address too.
@tusharbharti commented on PR #5237:
5 months ago
#41
Thanks for sharing such a detailed explanation!.
(to)to@… and to@…(to) will be mangled during transmission by 5321-compliant software. Both addresses will arrive as to@…. Rejecting these is a good idea (while allowing to@… of course).
O.O, I did think that () being treated as comment would cause issue like same address on arrival, but I thought as we are only doing looks like an email ( actual description for functionality of isEmail), accepting them would have been good Idea, but yeah rejecting them would make a lot of sense from a practical point of view.
cow@[dead::beef], 1@23456789 and a few others use address literals, which have been used for attacks. The same attacks apply to Wordpress (you could use this to probe for IP addresses that the Wordpress host can reach, but which are firewalled away from the general network). Rejecting these is a good idea.
toto@co and admin@mailserver1 are local addresses, not world-reachable. (They could also be addresses within a TLD, but none of the TLDs allow that right now.) You could ssh to a Wordpress server and run 'ping co', because ping allows a local hostname. However, it's almost certainly bad to let users enter a local address in a Wordpress contact form (an attacker could use that to probe the local networkl). Rejecting these addresses makes sense.
Oh, then that rejecting them make more sense, I was thinking of them as from local perspective like using tailscale vpn which allows shortdomains.
1234567890123456789012345678901234567890123456789012345678901234+x@… is longer than one of the RFC limits. Most software supports longer localparts today. However, I surveyed a corpus of about 500,000 addresses, and only software used longer localparts, no users. issue-1234567890123456789012345678901234567890123456789012345678901234@… is a good example. If Wordpress allows longer localparts, bots will use that to spam the contact forms, and it will not help people. My opinion is that it's probably best to reject these. Pete doesn't have an opinion on this address.
From what i have seen while searching and going through rfc 5321 & 2821, 64 seems to be the size that the local part of email address are allowed and I think that's still a bit far for from practically so ig sticking to 64 size looks better to me.
user@[IPv6:2001:db8::1]:8080 passes, and I think it should not. It's not deliverable via SMTP on the public internet. However, it's also not a unicode matter. I'm uncertain whether this PR should remove support for it.
Uh bit confused on this one, is this supposed to be valid email address as you said It's not deliverable via SMTP on the public internet. and right now both implementation is counting it as invalid.
---
Thanks for sharing the details again, I will update my implementation to reflect these. 😃
5 months ago
#42
I was confused about user@[IPv6:2001:db8::1]:8080. Looked at the table, switched tabs, wrote several paragraphs and by then I had forgotten what the table line says. That address is handled correctly IMO.
I'm not sure whether I get to do anything more on this today. But those addresses definitely need to go into a unit test, with comments as above.
peteresnick commented on PR #5237:
5 months ago
#43
Just two quick comments:
to..to@couc.ouandto.@couc.ouare invalid according to RFC 5321, therefore Wordpress cannot send mail to them. Rejecting these is a good idea.
Also invalid according to 5322 section 3 (the more restrictive) syntax.
(to)to@couc.ouandto@couc.ou(to)will be mangled during transmission by 5321-compliant software. Both addresses will arrive asto@couc.ou. Rejecting these is a good idea (while allowingto@couc.ouof course).
While OK according to 5322, you'd have to strip the comments before sending to an SMTP (5321) server, and I suspect software that is getting these address from a WordPress form does not do that all the time, so I would avoid.
@tusharbharti commented on PR #5237:
5 months ago
#44
Hi so I was checking on how to support quoted local emails and IPs on php side and updated my tests to follow the suggestions that were made during the discussion.
Did some updates on the patch and was able to make it work.
If it is fine to support quoted local and IPs then I can add review comments.
PS: updated the table format so it is easier to read now 😅
#45
@
4 months ago
- Milestone changed from Awaiting Review to 6.9
- Owner set to SergeyBiryukov
- Status changed from new to accepted
This ticket was mentioned in Slack in #core-test by sirlouen. View the logs.
4 months ago
#47
@
4 months ago
Test 3/n
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25. The installer has been run, but nothing added.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Click Users, select the user and click edit. You're now on .../user-edit.php. Scroll down to contact info and change the email address to something else, perhaps rød@grå.org, then click "update user" at the bottom.
Test results: All is well if WordPress sends email to the user to notify about the email address change. WordPress 6.8 will not succeed in sending this email.
#48
@
4 months ago
Test 4/n
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25. The installer has been run, but nothing added.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Click Users, select the user and click edit. You're now on .../user-edit.php. Scroll down to Account Management and click 'Password Reset'.
Test results: All is well if WordPress sends email to grå@grå.org with a new link, and clicking the link brings up the password reset page. WordPress 6.8 will not send the mail.
#49
@
4 months ago
Test 5/n
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25. The installer has been run, but nothing added.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Open an incognito/guest browser window, open wp-admin and log in as grå with the correct password. Click Profile, then edit the email address to blå@grå.org and click 'Update Profile'
Test results: All is well if WordPress sends email to blå@grå.org with a link to confirm the email address, and clicking the link brings works. WordPress 6.8 will not send the mail.
#50
@
4 months ago
Test 6/n
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Open an incognito/guest browser window, open wp-admin and log in as grå with the correct password. Click Profile, then edit the email address to blå@grå.org and click 'Update Profile'.
Test results: All is well if WordPress sends email to blå@grå.org with a link to confirm the email address, and clicking the link leads to the expected page works, and the new addres can then be used to log in. WordPress 6.8 will not send the mail.
#51
@
4 months ago
Test 7/n
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25. The installer has been run, but nothing added.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page. Select "users" on the left-hand bar, then "add new user" near the top of the screen. Enter "grå" as username, "grå@grå.org" as email address. Choose any name and role. Click "add new user" near the bottom. (Grå is Norwegian, it means grey, and I use that as a test since I'm Norwegian. It's pronounced as org backwards.)
Open an incognito/guest browser window, open wp-admin and click 'lost your password', then enter grå@grå.org as email address and click 'get new password'.
Test results: All is well if WordPress sends email to grå@grå.org with a link to confirm the email address, and clicking the link brings you to the page where you can set a new password. WordPress 6.8 will not send the mail.
#52
@
4 months ago
Test 8/8 (I think these eight are enough)
Environment: WordPress on the tip, after August 20 or so and with https://github.com/WordPress/wordpress-develop/pull/5237 applied. You need no plugins. You may use Mailpit 1.25.0 or later to pick up outgoing mail. Mailpit needs to listen on localhost port 25. The installer has been run, but nothing added.
I used Brave as browser and ran both on linux.
Steps to test: Open the WordPress admin page.
Open an incognito/guest browser window, open the blog home page, click on the initial posting (the one with "welcome to workpress, this is your first post"). Scroll down, write a comment (try to make it rhyme), then enter address grønn@grå.org as address and click 'post comment'.
Return to the WordPress admin page. There should be a red circle next to 'Comments' in the left-hand-bar. Click 'Comments'. You should see your new comment (along with one supplied by the installer).
Test results: All is well if WordPress displays the author address correctly. WordPress 6.8 will display a remarkably garbled address (assuming you manage to post the comment).
Steps to test: Click Quick Edit.
Test results: All is well if the email address is displayed correctly under the comment body text.
This ticket was mentioned in Slack in #core-test by jon_bossenger. View the logs.
6 weeks ago
#54
@
6 weeks ago
@agulbra based on your 8 tests, would you say the PR is ready, or is further testing needed?
@SirLouen commented on PR #5237:
5 weeks ago
#57
@arnt I sent you a PR with some improvements.
Also good to see that @dmsnell has self-assigned this. It will get through soon.
5 weeks ago
#58
I'm grateful for the nitpicks. Seriously. I also wish this could be shipped; that would make life simpler for me.
I'd like to have a talk about growing/shrinking it. I'll keep an eye on Slack tomorrow. Have several meetings though, including one which I chair, so I might be pressed for time, depending on your timezone.
I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.
@SirLouen commented on PR #5237:
5 weeks ago
#59
I decided not to rebase/push now. @SirLouen's comment change clearly should go in but it's minor, and I'd rather like to improve that cast but don't see how to write a unit test. I think we can continue talking without a new push.
Btw, also while I was reviewing this patch, I sent you a little PR with some little extra formatting tweaks for the tests part. Nothing big, just improving CS a bit and setting a wider group to test this ticket at once.
5 weeks ago
#60
@SirLouen I'll go and have some food now, and merge your PR either when I return or tomorrow morning. I really don't want to break anything due to jetlag and lack of focus.
5 weeks ago
#61
jetlag
having just jumped ahead by nine hours, I feel something similar @arnt — hope you recover quickly.
---
here are some things I reviewed yesterday:
- WordPress appears to call
sanitize_user()everywhere before detecting duplicate usernames, so changing that function should not accidentally allow duplicating user logins. - the
schema.phpindicates a 60-character login and 100-character email. forutf8mb4configurations that means people will still be able to enter emails with “100 characters”.
I was left with a few more questions in general though:
- for what reason are we adjusting
sanitize_user()if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects. - what should we be doing if the database table cannot store the email addresses? for instance, if a table is configured as
latin1we have a problem. while that allows us to store the raw bytes just fine, unless we ensure that PHP and MySQL agree on what’s transferring through the query we will end up with corruption. - even if we ensure that we store non-ASCII characters properly, we end up with a length restriction for _some_ email addresses based on the fact that
VARCHAR(100)means 100 bytes for alatin1table but up to 400 bytes for autf8mb4table. this is something to consider otherwise people might end up with some confusing error messages when submitting a valid email that looks like it’s shorter than WordPress and the database count it.
CREATE TABLE test_l1 (t varchar(100) character set latin1 collate latin1_swedish_ci) DEFAULT CHARSET=latin1;
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES ('a\u{1f170}b\u{1f171}')" );
$wpdb->query( "INSERT INTO test_latin1 (t) VALUES (_latin1'a\u{1f170}b\u{1f171}')" );
MariaDB [wordpress]> SELECT * FROM test_l1; +----------------------+ || t || +----------------------+ || a?b? || || a🅰b🅱 || +----------------------+ 2 rows in set (0.000 sec)
In the first case, we sent UTF-8 data to the database, which knows that the table is storing latin1, so it transparently re-encoded from UTF-8 to latin1, replacing unrepresentable characters as ?. In the second case we told the database that the UTF-8 bytes should be explicitly interpreted as latin1 and so it did store the raw bytes, however…
php > var_dump( $wpdb->get_col( 'SELECT t FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(20) "a🅰b🅱"
}
php > var_dump( $wpdb->get_col( 'SELECT CAST(t as binary) FROM test_l1' ) );
array(2) {
[0]=>
string(4) "a?b?"
[1]=>
string(10) "a🅰b🅱"
}
…unless we also do something on the read query side then the database will perform the same transparent re-encoding on read turning those UTF-8 bytes _back into UTF-8_ and thus double-encoding them.
This seems problematic to me in a _new_ way because after this change we are allowing people to create accounts with Unicode login names and emails, but these database queries won’t fail; they will corrupt the data and someone might lose access to the account they just created, or to an account for which they just changed their email address. If we want to ensure this doesn’t happen we would need to trace the data through all places that access these and ensure we properly manage the character encoding in the SQL session, something that doesn’t tend to go well in WordPress due to the distributed nature of code contributions and plugins.
To me this highlights the need for deprecating non-UTF-8 support (Core-62172), though I guarantee we will be doing whatever we need to to support Unicode emails before we do that.
---
talk about growing/shrinking it
As named this PR seems very grand in purpose: what does it even mean to support RFC 6530? I feel like as a Core tracking ticket that is appropriate, but the work is likely to involve many independent PRs and changes, including some big semantic changes around usernames and emails, possibly involving some database schema changes.
For now, however, if we assume that this PR should move forward, here are some things I propose, pending further security review:
- Drop the detection for cross-script spoofing. I would love to see some evidence that the risk is substantial enough that it’s worth the complexity, but I suspect that having rules which change based on the content of the input is going to be confusing for users and developers.
- Missing from all of the changes are checks that the input is valid UTF-8. If we start by rejecting invalid UTF-8 then we can make a lot of assumptions and drop safety-checks without increasing risk. From my reading of the RFC I don’t see any allowance for addresses with invalid UTF-8.
- Maintain existing behavior in the antispambot function. That is, let’s not skip non-ASCII characters just because we don’t see them as being present as threats. I think we can probably deprecate that function entirely and skip this, but we will create another discussion for that.
Further, I think we need to talk about what email characters are allowable. Our current set of rejected ASCII character includes octets which could be allowable email addresses. If our intention is to be restrictive, say to reject non-printable characters, or ambiguous whitespace, then we probably ought to see if we can replicate that experience for the upper Unicode planes too and not just turn away from those goals because it’s not ASCII.
Please enjoy your conference. There’s no rush; as this work will probably take a lot of cooperation. We’ll bring it up next week in the developer chat.
This ticket was mentioned in Slack in #core by welcher. View the logs.
5 weeks ago
#63
@
5 weeks ago
- Milestone changed from 6.9 to 7.0
This was reviewed in the 6.9 bug scrub today. We're 1 day from releasing RC 1 so we're going to move this to the 7.0 milestone.
@SergeyBiryukov commented on PR #5237:
4 weeks ago
#64
@arnt
I added some code to prevent PHP 7.2.24-7.3.x from choking on mb_str_split().
At a glance, would a polyfill for mb_str_split() be helpful here, so that we could remove the function_exists() checks and make the behavior consistent across all supported PHP versions?
4 weeks ago
#65
@SergeyBiryukov I would say that a polyfill makes sense if WP is to support PHP 7.2-3 for a while and a polyfill is simple. The support table makes it look as if support might be dropped in WP 7.0?
@dmsnell that's a long and very welcome message. Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.
4 weeks ago
#66
Do you think I should attend to that now, or leave you in peace until 6.9 is released? I'd really like to merge this soon, for several reasons, but I imagine the 6.9 release needs developer focus.
@arnt you don’t need to worry about me or my schedule. just like anticipated objections, it will likely be more advantageous to just move at your own pace and let others work at theirs, be it fast or slow.
---
I am guessing the most important thing right now is figuring out the issue of database corruption since this creates a new avenue for that, and for potentially locking users out of their accounts because of it. we don’t necessarily have to solve it, but it’s a noteworthy impact from this change and creates a problem we didn’t previously have.
This ticket was mentioned in Slack in #core by dmsnell. View the logs.
4 weeks ago
4 weeks ago
#68
I'll write several separate comments here, on separate topics, and push a new revision at the end. First, cross-script usernames.
There's effectively no evidence of cross-script usernames being used, either for good or for bad. There's an exception and ha half for good: ① Q and other Latin letters appear to be used a little bit in Chinese, including the well-known name 阿Q, and ② Arabic digits (0-9) are used with many writing systems, even ones that have other digits. (Arabic itself is an example of that, in the areas east of Lebanon/Jordan.)
I know about web forums etc. that allow non-ASCII usernames. Unfortunately for this, the ones I know only allow one script.
There's domains, though. Domains in many TLDs are like self-service usernames, so that experience is instructive. ICANN receives abuse reports for domains and forwards them to registrars; I don't read that feed closely but I think I would hear about examples of cross-script similarity abuse. I haven't so far.
Visually similar letters have been used in real attacks, but I don't know about any cross-script use. Rather, i has been used for l (PaypaI a long time ago, phiIIips recently) and 0 for o, and one clever attack used a little-known diacritic and an all-Latin domain.
Impersonators mostly rely on non-letter similarity as far as I can tell. "Look, my email or web page has your bank's logo so I must be a legit representive of your bank." "My bluesky account has a photo of that person so you can trust that I really am that person." It's similarity but it's not letters.
However, I suspect that if it's possible to impersonate a wordpress user on the same site and use the impersonation for a swindle, someone will eventually do it. It's more likely to be done by copying a profile photo than by leveraging cross-script similarity.
So… I think it's reasonable to drop single_unicode_script, either nor or later. The only substantive reason to keep is that having it from the start is IMO easier than adding it later.
4 weeks ago
#69
About UTF8 and other database encodings.
As I understand it, Wordpress 2.2 started using UTF8 for new databases and 4.2 converted old databases to UTF8, except when admins chose to override that. Overriding is possible to this day. There will be a few sites that do override.
I read ticket 62172 and it makes a great deal of sense to me. Agree about the brittleness. However, that change probably shouldn't be conflated with this.
It seems likely that a non-UTF8 database will suffer corruption in other tables, not just the usernames. That may be less serious. It won't lock anyone out. However, this problem could affect the core value of the site. When you publish a page, saving it to a non-UTF8 database can turn a dozen different currency symbols into question marks. Easy to overlook. Far from good. It seems… less urgent than usernames but worth solving.
Having written all this, I think I've formed an opion:
- This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.
- If the check fails, the error message should say something like "This name/password/address cannot be stored in the database. Note: DB_CHARSET other than UTF8 is deprecated since 7.0 and will be removed in a future version of Wordpress" without naming a version number.
- The difference between utf8 and utf8mb4 is best ignored.
Does that sound good?
4 weeks ago
#70
Next up: The scope of this PR.
As @dmsnell asks:
For what reason are we adjusting sanitize_user() if this is about Unicode email addresses? I wouldn’t assume we have to do anything to the user objects."
The first version of the PR did indeed touch only addresses. I extended it by request because as contributor, I try to conform to whatever is expected. At this point I'd prefer to keep the agglomerated PR because:
- The unit test and review issues are mostly similar.
- There are test scripts, props and commits that would either be difficult to split, or that would be bulky repetition.
4 weeks ago
#71
Email address length:
Wordpress currently supports 100 bytes reliably, more unreliably. With the change I suggest above it'll store 100 characters reliably or reject the address at entry time.
Email address length limits are a bit of a mess in general. This is partly due to the specifications. The length limit for an email address differs from the length limit for an email address on the public internet. Too subtle for most people. Then there's widespread disobedience. and even worse, there's software and SaaS vendors that judge email addresses and have their own unpublished length limits. If you enter a 70-byte email address in a newsletter signup form, you may well be rejected because a SaaS vendor classifies the address as a bot.
Both 100 bytes and 100 characters are so high that any human who tries to use such an address will run into problems quickly. Much software has shorter limits (a well-known CRM has 80) and few people can spell reliably enough to type a really long address correctly from a business card.
It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.
FWIW, my name is in both RFC 5321 (length on net) and 5322 (length).
4 weeks ago
#72
Does that sound good?
I’d like to hear the thoughts from some others who have more experience dealing with text encoding issues, particularly in places where we create new ones. Perhaps @mdawaffe has some or knows someone with thoughts.
A check for round-tripping seems reasonable and awkward at the same time, because making those checks everywhere would be excessive. For an email address though, maybe it’s worth the diversion.
It's sad, I wish the situation were cleaner, but 100 is a safe length in the world as it stands, and there isn't really any number that's obviously right.
We’re discussing a new reality which has no prior in WordPress and which can be dramatically different. The moment someone uses the flag of England in their email address, that’s 28 characters alone. Meaning that if someone uses the US flag in their email they are allowed 92 other ASCII characters but for the England flag it only leaves 72.
If we only allow storing Unicode emails on database tables with UTF-8 support then we could avoid this problem. Maybe that’s a reasonable feature-gate for this: don’t enable non-ASCII email address unless the {$wpdb->prefix}users table is stored as utf8mb4.
Were we to pursue that angle, it could be worth imagining a new global toggle or check like wp_supports_unicode_email_addresses() or something table-based. I know I have discussed the idea with @adamziel about checking database encodings at startup and caching them so we can reliably work with database tables regardless of the character set or collation. 🤔
I extended it https://github.com/WordPress/wordpress-develop/pull/5237#issuecomment-1736475880 because as contributor, I try to conform to whatever is expected.
Perhaps it’s my poor reading skills, but I re-read that comment multiple times and couldn’t find the request you’re referring to. Maybe @ironprogrammer left another comment on an older version of the code?
You are likely underselling your opinions here. This is a collaborative effort and we are all working together; I know from your previous comments that there are times you read more into things than people leave when they comment. It’s okay to ask questions and engage in the back and forth — if we all instantly did whatever anyone asked us then we’d have chaos, or LLM-level-slop.
---
I’d really love to consider this only in the context of sites with a utf8mb4 users table. I don’t know what to recommend as a way to do that, other than one obvious and necessary-but-insufficient task which is to verify that the emails are valid UTF-8.
---
There are a few places email addresses are read from query args: I think in the initial site installation and in some new user requests. This is not relatively important compared to the other feedback, but it will be something we want to make sure we check at some point, largely to ensure that when decoded they are valid UTF-8 and that they are properly percent-decoded. Perhaps it’s fine as-is and the email functions will verify that when passed the information (apart from the percent-encoding).
Either way, it’d be great to confirm what happens when some emails are sent with invalid query args.
---
If anyone would be interested in testing the same steps that others like @USERSATOSHI and @ironprogrammer have performed, but with a site setup with latin1 configured as the blog_charset and as the default database table charset, that would be helpful.
It would be valuable to test emails containing “normal” Unicode like non-latin scripts (which use only up to three bytes) as well as addresses containing characters from the supplemental planes, such as emoji (requiring four bytes).
@mdawaffe commented on PR #5237:
4 weeks ago
#73
This is a cool proposal. Note that I have not made the time to fully understand the conversation above. I don't think anything I've written below is particularly new: apologies if it is obviously repetitive :)
This PR should check roundtrip conversion via DB_CHARSET and refuse to store user names, passwords and email addresses if that fails.
wpdb->strip_invalid_text() (and friends) does something similar:
- allows anything in
latin1orbinarycolumns/tables. - allows 3-byte UTF-8 in
utf8(mb3)columns/tables. - allows all UTF-8 in
utf8mb4columns/tables. - Does an actual DB query for other encodings.
As you'd imagine from the method name, it strips characters (and optionally truncates the string) rather than bailing. I think, though, you could do a clever SELECT statement that would check charsets against the correct columns/tables, check if the result is different than the input, then bail as necessary before doing the UPDATE/INSERT.
I don't think that's necessary, though, since I think we should instead only allow non-ASCII email addresses in utf8mb4 tables: no roundtrip necessary. Just reject non-valid UTF-8 byte sequences in PHP and check column(/connection?) charsets. Any other DB setups are stuck with ASCII email addresses.
The difference between utf8 and utf8mb4 is best ignored.
I don't think we can ignore the difference. We should require utf8mb4 (supposing we take the direction I mentioned in my previous paragraph).
I don't have a strong opinion about limiting strings to having characters from only one script (plus ASCII). On small sites, it likely doesn't matter at all. On large sites (like WordPress.com or sites where WordPress+bbPress is used for public forums) with user registration available to anyone, spoofing is likely a big concern. I agree that uses_single_unicode_script() is easy to add now, easy to stop using later, and hard to add later, so it might be worth just doing now and reevaluating later. Note that (as you all know) domains are a special case. Many TLDs do not allow IDNs, and some only allow certain scripts. Email addresses contain domains :) I'm not sure if we should care about individual TLD limits. We probably shouldn't care.
I do have one strong opinion: non-WordPress code should be able to make "raw" SELECTs to the database and retrieve the correct username/email address. That is, we should not write "corrupted" or otherwise transformed data to these columns even if WordPress correctly uncorrupts/untransforms the values when it read them. That means no charset shenanigans or hacky encodings (e.g., Base64).
For usernames, another (non-?)option would be to allow any characters, but to reject new usernames that are too "close" to existing ones. I don't know how to do that efficiently without adding a new (indexed) column to the users table, which I think is a bad idea.
Fly-by suggestions:
- start with email addresses only. I'm reasonably confident that non-ASCII usernames will break something, but I have no clue what :)
- consider making non-ASCII values opt-in (with a WP filter), so we can get real world experience with sites that want it without altering the behavior of sites that don't care. If it all works: make it opt-out in some future version. (I don't know if we ever do that sort of thing. Perhaps I'm being too conservative.)
4 weeks ago
#74
I just pushed a new revision.
I discovered that wpdb already protects against not being able to store the data. The database may corrupt something, but wpdb already tries really hard to avoid getting that far. I added an extra unit test, since locking people out is worse than most failures.
Side note: That function uses latin1 in a way that looks incompatible with selecting using ilike at first glance. I didn't look closer into that, though.
FWIW I would be unhappy with a toggle for email address support. Those toggles are trouble magnets and I strongly advise against going that way. A toggle for username support is another thing.
I know a data set that can be used for username similarity detection… but that's an unpleasant slippery slope, with much scope for pedantry about details. Some of the decisions in the data set puzzled me. Best avoided, I'd say.
I've no opinion on restricting this to UTF8 databases. If WP were my baby, I'd announce that support for other databases will be sunset in the middlish future and thereby make the smaller question go away, along with the risk caused by brittle code.
I didn't have time to rebase, I'm afraid, but I wanted to get this pushed before I go on a trip tomorrow morning.
Can you specify exactly what the bug is? As far as I can see,
dummy-email.y.!#$%@gmail.comis a valid email address. See eg http://en.wikipedia.org/wiki/Email_address#Local_part.