#30130 closed enhancement (fixed)
Normalize characters with combining marks to precomposed characters
Reported by: | zodiac1978 | Owned by: | SergeyBiryukov
---|---|---|---
Milestone: | 6.1 | Priority: | normal
Severity: | normal | Version: | |
Component: | Formatting | Keywords: | has-patch
Focuses: | | Cc: |
Description
I ran into a weird little problem which I wanted to solve. Here it is:
I have a PDF file with German umlauts (üöäÜÖÄ), and if I copy & paste them into WordPress I get the plain vowel (uoaUOA) followed by a combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of a single precomposed character.
This results in some problems:
- Search for words with umlauts doesn't work
- Proofreading fails
- W3C validation fails with the warning "Text run is not in Unicode Normalization Form C." because precomposed characters are preferred (see: http://www.w3.org/International/docs/charmod-norm/#choice-of-normalization-form)
Solution: I made a proof-of-concept with the "content_save_pre" filter and it works. In this proof-of-concept I just replaced the two characters with the precomposed character:
// Replace the decomposed sequences (base vowel + U+0308 combining
// diaeresis) with their precomposed equivalents.
$content = str_replace( "a\xCC\x88", "ä", $content );
$content = str_replace( "o\xCC\x88", "ö", $content );
$content = str_replace( "u\xCC\x88", "ü", $content );
$content = str_replace( "A\xCC\x88", "Ä", $content );
$content = str_replace( "O\xCC\x88", "Ö", $content );
$content = str_replace( "U\xCC\x88", "Ü", $content );
If we could (I know we can't, because WP is still supporting PHP 5.2) rely on PHP 5.3+ we could use a function for that:
http://php.net/manual/de/normalizer.normalize.php
So the above code (also used in the upcoming patch) would be just one line and much more general:
$content = normalizer_normalize( $content, Normalizer::FORM_C );
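For reference, here is a minimal sketch of that one-liner hooked into the "content_save_pre" filter mentioned above (the function name is made up for illustration; the sketch assumes PHP 5.3+ with the intl extension loaded):

// A sketch, not the actual patch: normalize post content to NFC on save.
// Assumes the intl extension provides normalizer_normalize().
function ticket30130_normalize_content( $content ) {
	if ( function_exists( 'normalizer_normalize' )
		&& ! normalizer_is_normalized( $content, Normalizer::FORM_C ) ) {
		$content = normalizer_normalize( $content, Normalizer::FORM_C );
	}
	return $content;
}
add_filter( 'content_save_pre', 'ticket30130_normalize_content' );

The normalizer_is_normalized() pre-check is a cheap fast path: for text that is already NFC (the common case) nothing is rewritten.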
Fun facts:
For me the problem occurs only on Mac OS X (Lion, 10.7.5); on Ubuntu 14.04 or Windows 7 I couldn't reproduce it.
Maybe this is an edge case and/or plugin territory.
Attachments (17)
Change History (78)
#1
@
10 years ago
- Component changed from General to Formatting
- Keywords has-patch dev-feedback added
#2
@
10 years ago
- Keywords needs-testing added
Would be interesting to hear if this is solved with different PDF creator or viewer software and/or in newer versions of Mac OS X.
#3
@
10 years ago
- Component changed from Formatting to Editor
Copy/paste issues are usually in the Editor component. Can you reproduce the problem on the front page demo of http://www.tinymce.com/ ?
#4
@
10 years ago
Yes, on Mac OS X Lion 10.7.5.
It is browser-agnostic: Firefox shows the two characters separately, so you see the problem easily. Safari and Chrome show the two characters combined, but search/proofreading/validation still fails, because it should be a single precomposed character. Not easy to understand ...
#5
@
10 years ago
- Component changed from Editor to TinyMCE
Patch seems reasonable, but you should also report the bug at http://www.tinymce.com/develop/bugtracker_bugs.php
#6
@
10 years ago
Done: http://www.tinymce.com/develop/bugtracker_view.php?id=7243
Reproduced the bug with Preview app on Mac OS X 10.9.5.
Adobe Reader and Acrobat CS5 are doing a better job. The test PDF contains three words; just the first word has the problem. Weird.
This ticket was mentioned in Slack in #core by zodiac1978. View the logs.
10 years ago
#8
follow-ups:
↓ 9
↓ 11
@
10 years ago
The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?
#9
in reply to:
↑ 8
@
10 years ago
Replying to ocean90:
The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?
It could be an issue with all characters which can be combined from two characters, so I think "ß" is not an issue.
But all characters with accents or things like that:
http://www.fileformat.info/info/unicode/char/search.htm?q=combining&preview=entity
So the patch should be extended with all of these characters (which could really be a huge number of lines ...)
Just to be sure: in Chrome and Safari you don't see the problem from the screenshot, but the wrong character (better: characters) is still used, so searching for the word(s) doesn't work.
Is it really working on Mac OS X 10.10?
#11
in reply to:
↑ 8
@
10 years ago
Replying to ocean90:
The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?
Confirmed, it is an issue with nearly all characters you mentioned. The normal "ß", "e" or "u" is fine, but all other characters are affected. And there are of course more.
The screenshots from Caspar show that the problem exists on Mac OS X 10.10 too.
With PHP 5.3 it would be just one line to fix it with the normalize function (which would fix every non-precomposed character), but I'm afraid we have to do this manually for PHP 5.2, and there are many combinations we would have to check. :(
Furthermore, I forgot about the title. In the patch I just repair the content, but this problem could appear on every single input everywhere.
Maybe this should be fixed by Apple and not by us?
#12
follow-up:
↓ 13
@
10 years ago
To summarize:
- Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.
- Doesn't depend on where the copied content is pasted, textarea or contentEditable.
- Doesn't happen when copying from Acrobat?
Perhaps some additional tests:
- What if copying from the same PDF file that is viewed in the internal viewer in Chrome?
- Does it happen for all PDF files?
If we decide to fix this, thinking it should probably be fixed from JS on one of the events fired in the paste TinyMCE plugin. There we can run it only on pasting on MacOS, etc.
#13
in reply to:
↑ 12
@
10 years ago
Replying to azaozz:
- Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.
- It does not work properly in 10.10 either (it is not easy to catch: in Chrome search works and the appearance in the text editor looks right, but paste it into the visual editor or into Firefox and you can see the problem)
- Doesn't happen when copying from Acrobat?
- Acrobat and Adobe Reader also have this problem but just with the first character (I don't know why - weird behavior)
Perhaps some additional tests:
- What if copying from the same PDF file that is viewed in the internal viewer in Chrome?
The problem is not there when pasting from the internal PDF viewer in Firefox or Chrome, or from the Adobe Reader plugin in Safari.
- Does it happen for all PDF files?
Well, I can't test every PDF file ... ;)
My test PDF is from LibreOffice 4.2 (PDF version 1.4).
If we decide to fix this, thinking it should probably be fixed from JS on one of the events fired in the paste TinyMCE plugin. There we can run it only on pasting on MacOS, etc.
I can't help with that (besides testing), but maybe this is the better way to solve this.
If someone wants to do more tests: if you turn on permalinks and paste the words from the PDF into the title, WordPress replaces "ü" with "ue", "ä" with "ae" and "ö" with "oe" for the permalink. This replacement doesn't work if the character isn't precomposed but a vowel followed by a diaeresis.
In the database the post_name should look something like this "fuenf-gaebe-schoen-direct-enter" - if you have the wrong characters you see something like this: "fu%cc%88nf-ga%cc%88be-scho%cc%88n-firefox".
%cc%88 is the diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm)
Seems to be broken in FF and Chrome. Preview app to Safari seems to be okay.
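To illustrate the slug behavior described above, a sketch under two assumptions: a de_DE locale (where remove_accents() maps "ü" to "ue"), and the behavior at the time, where the combining mark was not in the replacement map:

// Assumes a de_DE locale. remove_accents() knows the precomposed U+00FC
// but not the decomposed sequence u + U+0308.
echo remove_accents( "f\xC3\xBCnf" );  // "fuenf"
echo remove_accents( "fu\xCC\x88nf" ); // "fu" . "\xCC\x88" . "nf" - the
                                       // combining diaeresis survives and is
                                       // later percent-encoded as %cc%88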
#15
follow-up:
↓ 16
@
10 years ago
I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...
#16
in reply to:
↑ 15
@
10 years ago
Replying to Kainer:
I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...
You can try my plugin: https://github.com/Zodiac1978/tl-normalizer
For me it is a huge timesaver if I have to copy&paste a list of workshops from a PDF to an online calendar.
#17
@
10 years ago
I've run into this several times; it would be nice if WP took care of it automatically. Huge thanks to @zodiac1978 for that plugin!
#18
follow-up:
↓ 19
@
10 years ago
- Component changed from TinyMCE to Formatting
- Milestone changed from Awaiting Review to Future Release
We could use normalize in JavaScript, but with limited browser support.
This is not a TinyMCE problem though, it affects all input. A PHP solution might be better, but we could consider also cleaning this up for the editor on paste.
Moving back to formatting.
#19
in reply to:
↑ 18
@
10 years ago
Replying to iseulde:
We could use normalize in JavaScript, but with limited browser support.
This is not a TinyMCE problem though, it affects all input. A PHP solution might be better, but we could consider also cleaning this up for the editor on paste.
Moving back to formatting.
Yes, it affects all inputs.
My PHP 5.3+ solution (for titles, excerpt, content and comments) is now in the repo, too:
https://wordpress.org/plugins/normalizer/
#20
follow-up:
↓ 21
@
10 years ago
- Keywords needs-testing removed
So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?
#21
in reply to:
↑ 20
@
10 years ago
Replying to iseulde:
So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?
The filter precombines the characters just before save, so an additional normalization on paste in TinyMCE would be beneficial, because until the post is saved you have broken proofreading, search, etc.
I don't know if we should add this with a check for PHP 5.3 in core, because I don't know if this check for Normalization Form C is a performance problem, so I added the "needs-testing" tag.
Maybe some more advanced PHP devs can have a look at this for the performance question. I mean, it solves a Mac-only problem which is just there if you copy&paste from a PDF. I wouldn't add a normalization just for this case if it would slow down all other sites without any reason.
#22
@
10 years ago
I wouldn't add a normalization just for this case if it would slow down all other sites without any reason.
Note that this would only run when writing to the database.
#23
@
9 years ago
The new patch is using the Normalizer function from PHP 5.3+ (http://php.net/manual/de/normalizer.normalize.php). Unfortunately it is only available if the intl extension is loaded, so we have to check if the function exists.
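A sketch of that guard (the wrapper name is hypothetical; it only assumes that the intl extension may or may not be loaded):

// Fall back to the raw string when the intl extension is missing.
function ticket30130_maybe_normalize( $string ) {
	if ( ! function_exists( 'normalizer_normalize' ) ) {
		return $string;
	}
	$normalized = normalizer_normalize( $string, Normalizer::FORM_C );
	// normalizer_normalize() returns false on failure.
	return ( false === $normalized ) ? $string : $normalized;
}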
#24
@
9 years ago
The slug is still generated in the wrong way and the patch does not change that at the moment.
Additionally, there are many more input fields in WordPress which could be in need of the function (e.g. widgets).
#25
@
9 years ago
- Keywords needs-patch added; has-patch removed
As discussed during contributor day, it totally makes sense to normalize such strings on post save (and the patch does that well).
However, from a UX perspective we definitely need to do this in JavaScript as well. Otherwise you'll paste the messed up text and correct it by hand without knowing that WordPress will fix it for you. See comment 18.
Luckily, there's a new normalize() method in the ECMAScript 2015 (ES6) standard we can leverage. It's supported by Chrome, Firefox, Opera and IE (11+). See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize for details.
#26
@
9 years ago
Awesome work!
If normalization makes it into core, it should also be applied to filenames of uploaded files - see issue #22363.
#27
follow-up:
↓ 28
@
9 years ago
There's a polyfill for the PHP Normalizer component in Symfony https://github.com/Zodiac1978/tl-normalizer, and also one for String.prototype.normalize at https://github.com/walling/unorm, as mentioned by @zodiac1978 in his TinyMCE ticket (now at https://github.com/tinymce/tinymce/issues/1971).
As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This means there are now no restrictions on the PHP installation, or on browsers (such as IE9) that lack normalize(), when normalizing text on pasting into TinyMCE.
#28
in reply to:
↑ 27
@
9 years ago
- Keywords has-patch added; needs-patch removed
Replying to gitlost:
As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This means there are now no restrictions on the PHP installation, or on browsers (such as IE9) that lack normalize(), when normalizing text on pasting into TinyMCE.
This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.
I expanded my previous patch and added the normalize function to all sanitize filters in addition to the save_pre filters.
Unfortunately there are still some fields not normalized:
- Text Widget content
- Website title and slug
- vita on profile page
- Add New Tag/Category
- ALT-text for images in media upload panel
- Add Tags metabox
Any help is much appreciated. Ping @swissspidy
Maybe we can use this approach as a first step and leave the ticket open for the javascript solution.
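For reference, the shape of that multi-filter approach as a sketch (the hook list is abridged and illustrative, reusing the hypothetical ticket30130_maybe_normalize() wrapper sketched earlier):

// Run the same NFC normalization on several save-time filters.
foreach ( array(
	'content_save_pre',
	'title_save_pre',
	'excerpt_save_pre',
	'pre_comment_content',
) as $hook ) {
	add_filter( $hook, 'ticket30130_maybe_normalize' );
}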
#29
follow-up:
↓ 30
@
9 years ago
This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.
The javascript stuff as uploaded was just for paste in TinyMCE, but can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea plus some media panels (attachment details and settings). I'd be interested if you could try it out to see if it works.
On the server side, that seems like an awful lot of filters! - and more required. Perhaps javascript is the way to go for most stuff?
#30
in reply to:
↑ 29
@
9 years ago
Replying to gitlost:
This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.
The javascript stuff as uploaded was just for paste in TinyMCE, but can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea plus some media panels (attachment details and settings). I'd be interested if you could try it out to see if it works.
On the server side, that seems like an awful lot of filters! - and more required. Perhaps javascript is the way to go for most stuff?
Not everyone has JavaScript enabled, so we would definitely need some PHP equivalent where possible.
Exploring this in a plugin is a good idea. I'd actually suggest to go further down that route and turn it into more of a feature project/plugin.
#31
@
9 years ago
Not everyone has JavaScript enabled, so we would definitely need some PHP equivalent where possible.
True enough, I suppose it was just the amount of filters, and that I haven't got my head around the various pre_XXX and XXX_save_pre and edit_XXX and XXX_edit_pre and sanitize_XXX etc. and which do what when why...
Exploring this in a plugin is a good idea. I'd actually suggest to go further down that route and turn it into more of a feature project/plugin.
Excellent, I'd be up for that. I'll ping @zodiac1978 on Slack to discuss...
I updated the fork again. Apart from a major bug where it was adding the filters too late, there was a dependency in the Symfony Normalizer on PCRE with UTF-8 support, which I've belatedly discovered can't be relied on to be available, so I added a workaround. If it's good it should be useful for tickets #22363 and #24661 also. The big things left to do are to work out which filters to use, customizer support, and unit tests.
This ticket was mentioned in Slack in #core by swissspidy. View the logs.
9 years ago
#34
@
8 years ago
The fork mentioned is now available from the WP repository as UNFC Nörmalizer (thanks anonymous plugin reviewer!).
As to how or if to normalize input in core, I'm really not sure. Adding all those filters still doesn't seem right.
Also normalization may not always be desirable, eg in the case of CJK compatibility ideographs (they get mapped to unified ideographs under both NFC and NFD normalization), although it's hard to get a definite read on this as the Unicode Consortium seem to suggest it's not a problem - see eg Isn't it true that some Japanese can't write their own names in Unicode?, and I suppose it could be made locale dependent if necessary.
Anyway I'd lean towards a javascript only fix, only added for pasting in Chrome and Firefox under Mac OS X (and iOS?). These browsers support the normalize() method so no polyfill would be needed, and (presumably) encompass the vast bulk of use cases. I can extract and adapt the code used in the plugin as a patch if there's interest.
Another option - use Safari!
PS In case people have difficulty replicating this bug, note that the paste needs to be done in Chrome or Firefox - ironically Safari normalizes pastings (and upload filenames, but that's another story) to NFC, while the others just take what they're given. The copying from copy-paste-test.pdf can be done from Preview or from Adobe Reader, and as noted above (at least on the versions that come with Mountain Lion 10.8) Preview (Version 6) decomposes all the umlauted characters, while Adobe Reader (Version 11.0.13) decomposes only the u-umlaut. Think different.
PPS I do think the ability to normalize (via the Symfony polyfill) would be a worthwhile addition to core (eg for sanitize_file_name(), remove_accents()) and will open a ticket suggesting it.
PPPS Implementing the plugin threw up a number of issues, eg admin javascript in a lot of cases does not check and refresh its data based on what comes back from the server, and meta keys aren't sanitized - I'll (hopefully) open tickets for each of these.
#35
@
7 years ago
A processwire module:
https://github.com/justb3a/processwire-textformatter-normalizeutf8
This is using Patchwork as a compatibility layer
https://github.com/tchwork/utf8
Other CMSes, like Concrete5, are using this too.
Here is an issue for TYPO3 with the same topic:
https://forge.typo3.org/issues/57695
After tweeting to other CMS projects and getting some answers, it seems this could be solved with the latest Apple macOS 10.13.x.
Can anyone with 10.13.x check this?
@
7 years ago
Normalize test in Firefox after re-saving the post with UNFC Nörmalize plugin activated
#36
@
7 years ago
(Oh, I didn't mean to spam this ticket with attachments, I thought I could attach the images and use them in one post without creating a post for each…)
I did a quick test using macOS 10.13.3. I opened the copy-paste-test.pdf file attached to this ticket in Preview 10.0, copied the text and saved in a new post using Firefox (59.0). The same issue still remains.
Chrome (65.0.3325.162) and Safari (11.0.3) render the text correctly but Firefox does not. See the above screenshots.
I then activated the UNFC Nörmalize plugin and re-saved the post. The normalized text then looks fine in Firefox as well, see the last screenshot above.
#37
@
7 years ago
Thanks for testing @AdrianB !
Did you try to search or proofread the strings in Chrome or Safari? Does this work correctly? (For me it is still broken in Chrome & Safari too)
#40
@
5 years ago
- Milestone changed from Future Release to 5.3
- Owner set to SergeyBiryukov
- Status changed from new to reviewing
#41
follow-up:
↓ 42
@
5 years ago
- Keywords close added
Don't seem to be able to reproduce this in MacOS 10.14 Firefox and Chrome (unless I'm doing something wrong). Perhaps this was (finally) fixed at the OS level?
Is anybody still able to reproduce?
#42
in reply to:
↑ 41
@
5 years ago
Replying to azaozz:
Don't seem to be able to reproduce this in MacOS 10.14 Firefox and Chrome (unless I'm doing something wrong). Perhaps this was (finally) fixed at the OS level?
Is anybody still able to reproduce?
I'm on holiday, so I can't test myself, but maybe you are doing something wrong indeed. Gutenberg fixes this for the content area, in browsers that support it, with some ES6 JavaScript.
The problem remains for older browsers or if JS is disabled and unfortunately in the title (and therefore in the permalink) too.
Here are my slides with some additional info:
https://speakerdeck.com/zodiac1978/special-characters-and-where-to-find-them
#43
@
5 years ago
- Milestone changed from 5.3 to 5.4
This is going to miss the deadline for version 5.3 Beta 1. Punting to 5.4.
#44
@
5 years ago
- Keywords close removed
As this is milestoned for 5.4 I would like to propose a "roadmap" to normalization to get this ticket moving.
At the moment we have a "solution" in the block editor (Gutenberg) which is normalizing via JS in modern browsers:
https://github.com/WordPress/gutenberg/pull/6880
There are some problems with this approach. I have shown @soean at WordCamp Stuttgart that this is not working reliably. If you copy NFD text (= text with decomposed characters) from within the block editor, the normalization (via JS on paste) does not happen.
Additionally, this is not normalizing the title and therefore not the permalink either.
See this open issue: https://github.com/WordPress/gutenberg/issues/14178
I would like to propose exploring the idea of my latest patch a little bit more:
https://core.trac.wordpress.org/attachment/ticket/30130/30130.2.diff
There are still some places left, as mentioned in https://core.trac.wordpress.org/ticket/30130#comment:28
I would like to find those missing filters and maybe this can be a first step to be included in core.
Unfortunately the needed normalizer function is not there by default. PHP 5.3+ has to be used (which is not a problem anymore), but the needed PHP module intl is optional and not every host provides it.
I've opened an issue on the Health Check plugin about checking for the module:
https://github.com/WordPress/health-check/issues/333
But to get this finally started I propose to add the approach from my last patch for those who have the module installed - Fallbacks, polyfills could be added later if necessary.
I have tested this on macOS 10.14 and this is still the case, therefore I am removing the close keyword.
So please leave your feedback on this patch. Any problems with adding this function to so many filter hooks? Any performance problems with it?
#45
follow-up:
↓ 46
@
5 years ago
I just wanted to throw up a contrary view of this ticket.
I just spent a day fighting with this problem in reverse. Renaming a file to a string stored in a MySQL database that included a precomposed character (U+0161) caused the OS (macOS) to convert that character to the compound form (U+0073 U+030C). WordPress then couldn't find the file because file_exists() was always false. I had to change the string in the db to the compound form to get it to work.
The Unicode Standard says that
Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.
Apple now enforces that. I could find no way to use U+0161 in my file; it was forced to the compound form even if I entered the hex directly.
MSDN also recommends compound characters, saying that
Pre-composed characters may also be decomposed. For example, an application importing a text file containing the pre-composed character "ü" may decompose that character into a "u" followed by the non-spacing character "¨". This allows easy alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode standard defines decomposition for all pre-composed characters.
I haven't checked if Windows forces the decomposition or not but Microsoft clearly thinks you should decompose wherever possible.
I should also point out that the w3 document linked in the first post of this issue has been updated since 2014 and the latest version recommends NFC but admits it's not always appropriate or even available. (see https://www.w3.org/TR/charmod-norm/#normalizationChoice)
#46
in reply to:
↑ 45
;
follow-up:
↓ 47
@
5 years ago
Replying to a8bit:
I just wanted to throw up a contrary view of this ticket.
Hi @a8bit and thank you for your feedback!
I just spent a day fighting with this problem in reverse. Renaming a file to a string stored in a MySQL database that included a precomposed character (U+0161) caused the OS (macOS) to convert that character to the compound form (U+0073 U+030C). WordPress then couldn't find the file because file_exists() was always false. I had to change the string in the db to the compound form to get it to work.
That shows IMHO exactly why everything should be normalized to NFC. Because then we have a common ground. macOS is using NFD (decomposed characters) internally and that's why Safari does normalize files on upload. But Chrome/Firefox are not doing this. We could wait for the browsers to fix it or we can fix it in WordPress.
The Unicode Standard says that
Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.
Apple now enforces that. I could find no way to use U+0161 in my file; it was forced to the compound form even if I entered the hex directly.
That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.
MSDN also recommends compound characters, saying that
Pre-composed characters may also be decomposed. For example, an application importing a text file containing the pre-composed character "ü" may decompose that character into a "u" followed by the non-spacing character "¨". This allows easy alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode standard defines decomposition for all pre-composed characters.
I haven't checked if Windows forces the decomposition or not but Microsoft clearly thinks you should decompose wherever possible.
Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text; can you please share the link, so that I can check the source myself?
I should also point out that the w3 document linked in the first post of this issue has been updated since 2014 and the latest version recommends NFC but admits it's not always appropriate or even available. (see https://www.w3.org/TR/charmod-norm/#normalizationChoice)
Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:
Authoring tools SHOULD provide a means of normalizing resources and warn the user when a given resource is not in Unicode Normalization Form C.
or we could normalize locale-specific, because the biggest problem seems to be that other languages may have a problem with normalization:
Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.
I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.
I am still recommending getting this patch in and then seeing what breaks (if anything breaks).
#47
in reply to:
↑ 46
;
follow-up:
↓ 48
@
5 years ago
Replying to zodiac1978:
Replying to a8bit:
That shows IMHO exactly why everything should be normalized to NFC. Because then we have a common ground. macOS is using NFD (decomposed characters) internally and that's why Safari does normalize files on upload. But Chrome/Firefox are not doing this. We could wait for the browsers to fix it or we can fix it in WordPress.
IMO it shows that everything should be normalized, just not necessarily to NFC. There is no way Apple is going to adopt NFC; NFC is described by Unicode as being for legacy systems. The future appears to be NFD.
That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.
This means that if all text in WordPress is normalized to NFC, any file comparison with files on APFS that have multi-byte characters is going to fail.
I solved my problem today by writing a function to check the existence of files using both forms, doubling the file io's in the process. Not exactly optimal.
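For illustration, a sketch of that kind of dual-form lookup (the helper name is hypothetical; the normalized lookups assume the intl extension):

// Check the stored form first, then the NFD and NFC forms of the path.
function ticket30130_file_exists_any_form( $path ) {
	if ( file_exists( $path ) ) {
		return true;
	}
	if ( ! function_exists( 'normalizer_normalize' ) ) {
		return false;
	}
	$nfd = normalizer_normalize( $path, Normalizer::FORM_D );
	$nfc = normalizer_normalize( $path, Normalizer::FORM_C );
	return ( is_string( $nfd ) && file_exists( $nfd ) )
		|| ( is_string( $nfc ) && file_exists( $nfc ) );
}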
Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text; can you please share the link, so that I can check the source myself?
It was quoted as a source on the wikipedia page for precomposed characters http://msdn.microsoft.com/en-us/library/aa911606.aspx
Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:
The alternative would be NFD.
or we could normalize locale-specific, because the biggest problem seems to be that other languages may have a problem with normalization:
That would be great if no one ever read a website outside of their own country.
I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.
They said that about 4-digit years ;)
I could mention that search and sort become more flexible with NFD, because you can then choose to do those things with or without the compound characters. I don't see how proofreading is improved with NFC, though?
I am still recommending getting this patch in and then seeing what breaks (if anything breaks).
I hope it all goes well; I don't have any skin in this game, I was merely flagging up one of the edge cases I actually hit today in case no one had thought of it. Apple not allowing NFC is going to cause issues for international macOS users when comparing source and destination data; it remains to be seen how big of an issue that will be, but I accept it's likely to be quite small.
#48
in reply to:
↑ 47
@
5 years ago
Replying to a8bit:
IMO it shows that everything should be normalized, just not necessarily to NFC. There is no way Apple is going to adopt NFC; NFC is described by Unicode as being for legacy systems. The future appears to be NFD.
That is not true. NFC is not described as for legacy systems. The linked document shows that NFD/NFC are just two different ways of doing it.
That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.
This means that if all text in WordPress is normalized to NFC, any file comparison with files on APFS that have multi-byte characters is going to fail.
No, it means IMHO we have to normalize every input to NFC (as this is the recommendation from the W3C) to have a common ground. This is exactly the reason why normalization exists - to make comparison working again in those cases.
I solved my problem today by writing a function to check the existence of files using both forms, doubling the file io's in the process. Not exactly optimal.
If everything is NFC there is no need for this anymore.
Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text; can you please share the link, so that I can check the source myself?
It was quoted as a source on the wikipedia page for precomposed characters http://msdn.microsoft.com/en-us/library/aa911606.aspx
This text is from 2010, is outdated, and is about Windows Embedded CE.
Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:
The alternative would be NFD.
We would have the same things to do, because the problem exists in the same way in the other direction. Many other OSes are using NFC.
or we could normalize locale-specific, because the biggest problem seems to be that other languages may have a problem with normalization:
That would be great if no one ever read a website outside of their own country.
I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.
They said that about 4-digit years ;)
Both very funny, but you are not providing a solution to the problems mentioned.
I could mention that search and sort become more flexible with NFD, because you can then choose to do those things with or without the compound characters. I don't see how proofreading is improved with NFC, though?
It is broken with NFD. Please see my talk and watch the slides where I show all the problems:
https://wordpress.tv/2019/08/28/torsten-landsiedel-special-characters-and-where-to-find-them/
I am still recommending getting this patch in and then seeing what breaks (if anything breaks).
I hope it all goes well; I don't have any skin in this game, I was merely flagging up one of the edge cases I actually hit today in case no one had thought of it. Apple not allowing NFC is going to cause issues for international macOS users when comparing source and destination data; it remains to be seen how big of an issue that will be, but I accept it's likely to be quite small.
macOS is using NFD *internally* and it knows that, so every native API normalizes text to NFC (as this is what is coming from a keyboard in most cases). For example, Safari normalizes text to NFC on input or uploads. If you are using just native APIs everything is fine, but if NFD is getting through we have a problem. Firefox and Chrome are NOT normalizing on input or upload, and that's what creates these problems.
We also maybe need to differentiate between normalizing URLs, normalizing filenames and normalizing content. Maybe we end up with a different approach for filenames, but as I look at https://core.trac.wordpress.org/ticket/24661 it seems to be the best solution to normalize to NFC here too.
This ticket was mentioned in Slack in #core by david.baumwald. View the logs.
5 years ago
#50
@
5 years ago
- Keywords needs-refresh added
- Milestone changed from 5.4 to Future Release
This ticket still needs a decision and a refreshed patch. With 5.4 Beta 1 approaching, this is being moved to Future Release. If any maintainer or committer feels this can be resolved in time or wishes to assume ownership during a specific cycle, feel free to update the milestone accordingly.
#51
follow-up:
↓ 52
@
5 years ago
I will happily provide the refreshed patch if I get some feedback from a maintainer/committer on the existing patch.
Pinging @azaozz @ocean90 @swissspidy @SergeyBiryukov
#52
in reply to:
↑ 51
@
4 years ago
Replying to zodiac1978:
I will happily provide the refreshed patch if I get some feedback from a maintainer/committer on the existing patch.
Pinging @azaozz @ocean90 @swissspidy @SergeyBiryukov
No reply in 11 months. Does this mean you don't want to fix it? Should we close it?
#53
@
3 years ago
Is there any news regarding the resolution of this ticket?
The problem is still present and in some circumstances it still causes malfunctions.
Thanks!
#54
@
3 years ago
- Keywords dev-feedback removed
- Milestone changed from Future Release to 6.1
maybe need to differentiate between normalizing URLs, normalizing filenames and normalizing content.
Yes, the PR on https://core.trac.wordpress.org/ticket/24661 fixes the same issue using the same method when removing accents. Still unsure if this has to be applied to all the text submitted by the users, everywhere, and if it will need to be limited just for text submitted from Macs (and perhaps iPhones).
This ticket was mentioned in Slack in #core by jeffpaul. View the logs.
2 years ago
#57
@
2 years ago
- Resolution set to fixed
- Status changed from reviewing to closed
As discussed ahead of today's scheduled 6.1 Beta 1 release party (see the Slack link above), it was deemed that this ticket could be closed as fixed with r53754. @zodiac1978 if your review shows otherwise, please re-open for whatever was missed... thanks!
#58
@
2 years ago
- Keywords needs-refresh removed
Hey @JeffPaul @azaozz
I tried all the missing parts from https://core.trac.wordpress.org/ticket/30130#comment:28 and can confirm it is all fixed.
Using the text from the PDF and using copy from the Preview app I can still see the bug pasting on https://www.tiny.cloud, but pasting in pages, posts (content *and* title) and all other input fields mentioned in the comment linked above is not showing the bug anymore.
Still unsure if this has to be applied to all the text submitted by the users, everywhere, and if it will need to be limited just for text submitted from Macs (and perhaps iPhones).
As NFC is the recommended way (from the W3C) this should not be needed. From my point of view this is the correct way to fix this.
Thanks @audrasjb for the commit!
#60
@
2 years ago
Hi @gitlost ,
Thanks for your plugin, which has been helpful for many years (UNFC Nörmalize: https://wordpress.org/plugins/unfc-normalize/). Can you check your plugin against the new WordPress 6.1 update? Do we still need your plugin?
Thanks
#61
@
2 years ago
Maybe I missed something, but this 6.1 fix only fixed the problem related to remove_accents, right?
I stumbled onto these issues while working on a very old (8+ years) large French website, and remove_accents works well now. So that's cool; thanks to everyone who contributed.
But macOS users are still submitting content that, for example, can't be found by others through the search, because the excerpt/title uses UTF-8 NFD normalization and not NFC. Accents are not normalized in the fields listed or in the editor: https://core.trac.wordpress.org/ticket/30130#comment:28
As a test I created a post simply titled "L’Araignée" (you can see here https://www.fontspace.com/unicode/analyzer#e=ZcyB that it's using a decomposed/combining accent).
When I search "araignée" using standard UTF-8 NFC accents, it's absent from the search results.
The W3C validator will still output warnings about "text run not in Unicode Normalization Form C".
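A minimal illustration of why that search misses (a sketch assuming the intl extension; the two byte sequences are the precomposed and decomposed encodings of the same word):

// "é" precomposed (U+00E9) vs decomposed (e + U+0301): same glyph,
// different bytes, so a byte-level comparison (or SQL LIKE) fails.
$nfc = "Araign\xC3\xA9e";  // precomposed
$nfd = "Araigne\xCC\x81e"; // decomposed
var_dump( $nfc === $nfd );                                              // bool(false)
var_dump( $nfc === normalizer_normalize( $nfd, Normalizer::FORM_C ) ); // bool(true)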
What should be the documented/recommended route for people having this issue (some of them might even stumble on this issue through searches, like I did)?
- use @gitlost plugin?
- recommend macOS users to use Safari instead of Firefox. I can't test it; is Safari still the only browser converting the strings to NFC?
- should we still think about a fix in WordPress in a separate issue?
PDF for testing purpose