Make WordPress Core

Opened 9 years ago

Closed 18 months ago

Last modified 13 months ago

#30130 closed enhancement (fixed)

Normalize characters with combining marks to precomposed characters

Reported by: zodiac1978 Owned by: SergeyBiryukov
Milestone: 6.1 Priority: normal
Severity: normal Version:
Component: Formatting Keywords: has-patch
Focuses: Cc:

Description

I ran into a weird little problem that I wanted to solve. Here it is:

I have a PDF file with German umlauts (üöäÜÖÄ), and if I copy & paste them into WordPress I get the bare vowel (uoaUOA) followed by a combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of a single precomposed character.

This results in some problems (broken search, proofreading and validation; see below).

Solution: I made a proof-of-concept with the "content_save_pre" filter and it works. In this proof-of-concept I just replaced the two-character sequences with the precomposed character:

// Replace the decomposed sequences (base vowel + U+0308 combining diaeresis)
// with their precomposed equivalents.
$content = str_replace( "a\xCC\x88", "ä", $content );
$content = str_replace( "o\xCC\x88", "ö", $content );
$content = str_replace( "u\xCC\x88", "ü", $content );
$content = str_replace( "A\xCC\x88", "Ä", $content );
$content = str_replace( "O\xCC\x88", "Ö", $content );
$content = str_replace( "U\xCC\x88", "Ü", $content );

If we could rely on PHP 5.3+ (I know we can't, because WP still supports PHP 5.2), we could use a built-in function for that:
http://php.net/manual/de/normalizer.normalize.php

So the above code (also used in the upcoming patch) would be just one line and much more general:
$content = normalizer_normalize( $content, Normalizer::FORM_C );
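
For illustration, a minimal sketch of how that one-liner might be hooked up (the function name is hypothetical, and the guard anticipates a caveat raised later in this ticket: normalizer_normalize() only exists when the intl extension is loaded):

// Sketch: normalize post content to Unicode NFC just before saving.
function my_normalize_content_to_nfc( $content ) {
	// normalizer_normalize() is only available with the intl extension.
	if ( function_exists( 'normalizer_normalize' ) ) {
		$normalized = normalizer_normalize( $content, Normalizer::FORM_C );
		if ( false !== $normalized ) {
			$content = $normalized;
		}
	}
	return $content;
}
add_filter( 'content_save_pre', 'my_normalize_content_to_nfc' );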

Fun fact:
For me the problem occurs only on Mac OS X (Lion, 10.7.5); I couldn't reproduce it on Ubuntu 14.04 or Windows 7.

Maybe this is an edge case and/or plugin territory.

Attachments (17)

copy-paste-test.pdf (4.5 KB) - added by zodiac1978 9 years ago.
PDF for testing purpose
patch.diff (709 bytes) - added by zodiac1978 9 years ago.
tinymce-front-page-screenshot-paste.png (12.3 KB) - added by zodiac1978 9 years ago.
Screenshot of the problem - pasting from PDF in textarea on http://www.tinymce.com/
from-osx-preview-to-visual.png (19.9 KB) - added by anonymized_7658014 9 years ago.
from osx 10.10 preview to visual editor
from-osx-preview-to-visual-html.png (48.3 KB) - added by anonymized_7658014 9 years ago.
from osx 10.10 preview to visual editor (HTML output)
from-acrobat10-to-visual.png (20.8 KB) - added by anonymized_7658014 9 years ago.
from osx 10.10 Acrobat 10 to visual editor
from-acrobat10-to-visual-html.png (38.9 KB) - added by anonymized_7658014 9 years ago.
from osx 10.10 Acrobat 10 to visual editor (HTML output)
Bildschirmfoto 2014-11-03 um 12.17.18.png (24.9 KB) - added by zodiac1978 9 years ago.
Correct transliteration if you enter the word directly
Bildschirmfoto 2014-11-03 um 12.16.59.png (24.0 KB) - added by zodiac1978 9 years ago.
Missing transliteration for copy/pasted word from PDF in Chrome and Firefox
30130.diff (1.7 KB) - added by zodiac1978 8 years ago.
Better approach using a PHP 5.3 function
30130.1.diff (1.7 KB) - added by zodiac1978 8 years ago.
Fixed tabs vs. spaces
Bildschirmfoto 2015-12-07 um 09.56.47.png (81.3 KB) - added by zodiac1978 8 years ago.
Still to do: Slug is not normalized before it is generated the first time
30130.2.diff (1.9 KB) - added by zodiac1978 8 years ago.
normalize-test-adrian-chrome.png (25.3 KB) - added by AdrianB 6 years ago.
Normalize test in Chrome
normalize-test-adrian-safari.png (25.0 KB) - added by AdrianB 6 years ago.
Normalize test in Safari
normalize-test-adrian-firefox.png (24.1 KB) - added by AdrianB 6 years ago.
Normalize test in Firefox
normalize-test-adrian-firefox-resave-with-normalize-plugin.png (24.7 KB) - added by AdrianB 6 years ago.
Normalize test in Firefox after re-saving the post with UNFC Nörmalize plugin activated


Change History (78)

@zodiac1978
9 years ago

PDF for testing purpose

@zodiac1978
9 years ago

#1 @zodiac1978
9 years ago

  • Component changed from General to Formatting
  • Keywords has-patch dev-feedback added

#2 @zodiac1978
9 years ago

  • Keywords needs-testing added

It would be interesting to hear whether this is solved by different PDF creation or viewer software and/or in newer versions of Mac OS X.

#3 @miqrogroove
9 years ago

  • Component changed from Formatting to Editor

Copy/paste issues are usually in the Editor component. Can you reproduce the problem on the front page demo of http://www.tinymce.com/ ?

#4 @zodiac1978
9 years ago

Yes, on Mac OS X Lion 10.7.5.

It is browser-agnostic: Firefox shows two characters, so you see the problem easily. Safari and Chrome show the two characters combined, but search/proofreading/validation fails too, because it should be a single precomposed character. Not easy to understand ...

Last edited 9 years ago by zodiac1978

@zodiac1978
9 years ago

Screenshot of the problem - pasting from PDF in textarea on http://www.tinymce.com/

#5 @miqrogroove
9 years ago

  • Component changed from Editor to TinyMCE

Patch seems reasonable, but you should also report the bug at http://www.tinymce.com/develop/bugtracker_bugs.php

#6 @zodiac1978
9 years ago

Done: http://www.tinymce.com/develop/bugtracker_view.php?id=7243

Reproduced the bug with the Preview app on Mac OS X 10.9.5.
Adobe Reader and Acrobat CS5 do a better job. The test PDF contains three words; only the first word has the problem. Weird.

This ticket was mentioned in Slack in #core by zodiac1978. (9 years ago)

#8 follow-ups: @ocean90
9 years ago

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

#9 in reply to: ↑ 8 @zodiac1978
9 years ago

Replying to ocean90:

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

It could be an issue with all characters that can be composed from two characters, so I think "ß" is not an issue.
But it affects all characters with accents or similar marks:
http://www.fileformat.info/info/unicode/char/search.htm?q=combining&preview=entity

So the patch would have to be extended with all of these characters (which could really be a huge number of lines ...)

Just to be sure: in Chrome and Safari you don't see the problem from the screenshot, but the wrong characters are still used, so searching for the word(s) doesn't work.

Is it really working on Mac OS X 10.10?

@anonymized_7658014
9 years ago

from osx 10.10 preview to visual editor

@anonymized_7658014
9 years ago

from osx 10.10 preview to visual editor (HTML output)

@anonymized_7658014
9 years ago

from osx 10.10 Acrobat 10 to visual editor

@anonymized_7658014
9 years ago

from osx 10.10 Acrobat 10 to visual editor (HTML output)

#10 @anonymized_7658014
9 years ago

Those screenshots were all made in Chrome, btw.

#11 in reply to: ↑ 8 @zodiac1978
9 years ago

Replying to ocean90:

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

Confirmed, it is an issue with nearly all the characters you mentioned. The plain "ß", "e" and "u" are fine, but all the other characters are affected. And there are of course more.

The screenshots from Caspar show that the problem exists on Mac OS X 10.10 too.

With PHP 5.3 it would be just one line to fix this with the normalize function (it would fix every non-precomposed character), but I'm afraid we would have to do this manually for PHP 5.2, and there are many combinations we would have to check. :(

Furthermore, I forgot about the title. The patch only repairs the content, but this problem can appear in every single input field everywhere.

Maybe this should be fixed by Apple and not by us?

#12 follow-up: @azaozz
9 years ago

To summarize:

  • Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.
  • Doesn't depend on where the copied content is pasted, textarea or contentEditable.
  • Doesn't happen when copying from Acrobat?

Perhaps some additional tests:

  • What if copying from the same PDF file that is viewed in the internal viewer in Chrome?
  • Does it happen for all PDF files?

If we decide to fix this, I think it should probably be fixed in JS on one of the events fired by the TinyMCE paste plugin. There we can run it only when pasting on macOS, etc.

#13 in reply to: ↑ 12 @zodiac1978
9 years ago

Replying to azaozz:

  • Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.
  • It does not work properly in 10.10 either (it is not easy to catch, because in Chrome search works and the appearance in the text editor looks right, but try pasting it into the visual editor or into Firefox and you can see the problem)
  • Doesn't happen when copying from Acrobat?
  • Acrobat and Adobe Reader also have this problem, but only with the first character (I don't know why - weird behavior)


Perhaps some additional tests:

  • What if copying from the same PDF file that is viewed in the internal viewer in Chrome?

The problem does not occur when pasting from the internal PDF viewer in Firefox or Chrome, or from the Adobe Reader plugin in Safari.

  • Does it happen for all PDF files?

Well, I can't test every PDF file ... ;)
My test PDF was created with LibreOffice 4.2 (PDF version 1.4).

If we decide to fix this, I think it should probably be fixed in JS on one of the events fired by the TinyMCE paste plugin. There we can run it only when pasting on macOS, etc.

I can't help with that (beside of testing), but maybe this is the better way to solve this.

If someone wants to do more tests: if you turn on permalinks and paste the words from the PDF into the title, WordPress replaces "ü" with "ue", "ä" with "ae" and "ö" with "oe" for the permalink. This replacement doesn't work if the character isn't precomposed but is a vowel followed by a diaeresis.

In the database the post_name should look something like "fuenf-gaebe-schoen-direct-enter"; with the wrong characters you instead see something like "fu%cc%88nf-ga%cc%88be-scho%cc%88n-firefox".

%cc%88 is the URL-encoded combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm)

Seems to be broken in FF and Chrome. Preview app to Safari seems to be okay.
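
To make the transliteration failure concrete, here is a minimal illustration (it assumes WordPress is loaded so remove_accents() is available, and a German locale, where core maps "ü" to "ue"):

// "fünf" with a precomposed ü (U+00FC) vs. a decomposed u + U+0308:
$nfc = "f\xC3\xBCnf";
$nfd = "fu\xCC\x88nf";
echo remove_accents( $nfc ); // "fuenf" - the umlaut is transliterated.
echo remove_accents( $nfd ); // "fu" . "\xCC\x88" . "nf" - the combining mark
                             // survives and is later URL-encoded as %cc%88.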

@zodiac1978
9 years ago

Correct transliteration if you enter the word directly

@zodiac1978
9 years ago

Missing transliteration for copy/pasted word from PDF in Chrome and Firefox

#14 @boonebgorges
9 years ago

  • Version trunk deleted

#15 follow-up: @Kainer
9 years ago

I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...

#16 in reply to: ↑ 15 @zodiac1978
9 years ago

Replying to Kainer:

I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...

You can try my plugin: https://github.com/Zodiac1978/tl-normalizer
For me it is a huge timesaver if I have to copy&paste a list of workshops from a PDF to an online calendar.

#17 @AdrianB
9 years ago

I've run into this several times; it would be nice if WP took care of it automatically. Huge thanks to @zodiac1978 for that plugin!

#18 follow-up: @iseulde
9 years ago

  • Component changed from TinyMCE to Formatting
  • Milestone changed from Awaiting Review to Future Release

We could use normalize in JavaScript, but with limited browser support.

This is not a TinyMCE problem though; it affects all input. A PHP solution might be better, but we could also consider cleaning this up for the editor on paste.

Moving back to formatting.

#19 in reply to: ↑ 18 @zodiac1978
9 years ago

Replying to iseulde:

We could use normalize in JavaScript, but with limited browser support.

This is not a TinyMCE problem though; it affects all input. A PHP solution might be better, but we could also consider cleaning this up for the editor on paste.

Moving back to formatting.

Yes, it affects all inputs.

My PHP 5.3+ solution (for titles, excerpt, content and comments) is now in the repo, too:
https://wordpress.org/plugins/normalizer/

#20 follow-up: @iseulde
9 years ago

  • Keywords needs-testing removed

So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?

#21 in reply to: ↑ 20 @zodiac1978
9 years ago

Replying to iseulde:

So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?

The filter precomposes the characters just before saving, so an additional normalization on paste in TinyMCE would be beneficial, because until the post is saved you have broken proofreading, search, etc.

I don't know if we should add this to core with a check for PHP 5.3, because I don't know whether this check for Normalization Form C is a performance problem, so I added the "needs-testing" tag.

Maybe some more advanced PHP devs can have a look at the performance question. I mean, it solves a Mac-only problem that only occurs if you copy & paste from a PDF. I wouldn't add normalization just for this case if it would slow down all other sites without any reason.
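
One common way to keep the cost negligible, sketched here assuming the intl extension is present (the helper name is hypothetical): check isNormalized() first, so the expensive normalize() only runs on input that is actually decomposed.

// Sketch: fast-path NFC normalization helper.
function my_maybe_normalize_nfc( $text ) {
	if ( ! is_string( $text ) || ! class_exists( 'Normalizer' ) ) {
		return $text;
	}
	// isNormalized() is a cheap scan for text that is already NFC,
	// which is the overwhelmingly common case.
	if ( Normalizer::isNormalized( $text, Normalizer::FORM_C ) ) {
		return $text;
	}
	$normalized = Normalizer::normalize( $text, Normalizer::FORM_C );
	return ( false === $normalized ) ? $text : $normalized;
}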

#22 @iseulde
9 years ago

I wouldn't add normalization just for this case if it would slow down all other sites without any reason.

Note that this would only run when writing to the database.

Last edited 9 years ago by iseulde

@zodiac1978
8 years ago

Better approach using a PHP 5.3 function

#23 @zodiac1978
8 years ago

The new patch uses the Normalizer class from PHP 5.3+ (http://php.net/manual/de/normalizer.normalize.php). Unfortunately it is only available if the intl extension is loaded, so we have to check whether the function exists.

@zodiac1978
8 years ago

Fixed tabs vs. spaces

@zodiac1978
8 years ago

Still to do: Slug is not normalized before it is generated the first time

#24 @zodiac1978
8 years ago

The slug is still generated the wrong way, and the patch does not change that at the moment.

Additionally, there are many more input fields in WordPress that could need the function (e.g. widgets).

#25 @swissspidy
8 years ago

  • Keywords needs-patch added; has-patch removed

As discussed during contributor day, it totally makes sense to normalize such strings on post save (and the patch does that well).

However, from a UX perspective we definitely need to do this in JavaScript as well. Otherwise you'll paste the messed up text and correct it by hand without knowing that WordPress will fix it for you. See comment 18.

Luckily, there's a new normalize() method in the ECMAScript 2015 (ES6) standard we can leverage. It's supported by Chrome, Firefox, Opera and IE (11+). See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize for details.

#26 @tar.gz
8 years ago

Awesome work!

If normalization makes it into core, it should also be applied to filenames of uploaded files - see issue #22363.

#27 follow-up: @gitlost
8 years ago

There's a polyfill for the PHP Normalizer component in Symfony (https://github.com/symfony/polyfill), and also one for String.prototype.normalize at https://github.com/walling/unorm, as mentioned by @zodiac1978 in his TinyMCE ticket (now at https://github.com/tinymce/tinymce/issues/1971).

As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This means there are now no restrictions on the PHP installation, or on browsers (such as IE9) that lack normalize(), when normalizing text on paste into TinyMCE.

#28 in reply to: ↑ 27 @zodiac1978
8 years ago

  • Keywords has-patch added; needs-patch removed

Replying to gitlost:

As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This means there are now no restrictions on the PHP installation, or on browsers (such as IE9) that lack normalize(), when normalizing text on paste into TinyMCE.

This is great, but it would just work for the TinyMCE editor, am I right? It doesn't help us with all the other input fields.

I expanded my previous patch and added the normalize function to all the sanitize filters in addition to the save_pre filters.

Unfortunately there are still some fields that are not normalized:

  • Text Widget content
  • Website title and slug
  • vita on profile page
  • Add New Tag/Category
  • ALT-text for images in media upload panel
  • Add Tags metabox

Any help is much appreciated. Ping @swissspidy

Maybe we can use this approach as a first step and leave the ticket open for the JavaScript solution.

@zodiac1978
8 years ago

#29 follow-up: @gitlost
8 years ago

This is great, but it would just work for the TinyMCE editor, am I right? It doesn't help us with all the other input fields.

The JavaScript stuff as uploaded was just for paste in TinyMCE, but it can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea, plus some media panels (attachment details and settings). I'd be interested to hear if you could try it out to see if it works.

On the server side, that seems like an awful lot of filters! - and more are required. Perhaps JavaScript is the way to go for most stuff?

#30 in reply to: ↑ 29 @swissspidy
8 years ago

Replying to gitlost:

This is great, but it would just work for the TinyMCE editor, am I right? It doesn't help us with all the other input fields.

The JavaScript stuff as uploaded was just for paste in TinyMCE, but it can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea, plus some media panels (attachment details and settings). I'd be interested to hear if you could try it out to see if it works.

On the server side, that seems like an awful lot of filters! - and more are required. Perhaps JavaScript is the way to go for most stuff?

Not everyone has JavaScript enabled, so we would definitely need some PHP equivalent where possible.

Exploring this in a plugin is a good idea. I'd actually suggest going further down that route and turning it into more of a feature project/plugin.

#31 @gitlost
8 years ago

Not everyone has JavaScript enabled, so we would definitely need some PHP equivalent where possible.

True enough, I suppose it was just the amount of filters, and that I haven't got my head around the various pre_XXX and XXX_save_pre and edit_XXX and XXX_edit_pre and sanitize_XXX etc. and which do what, when, and why...

Exploring this in a plugin is a good idea. I'd actually suggest going further down that route and turning it into more of a feature project/plugin.

Excellent, I'd be up for that. I'll ping @zodiac1978 on Slack to discuss...

I updated the fork again - apart from a major bug where it was adding the filters too late, there was a dependency in the Symfony Normalizer on PCRE with UTF-8 support, which I've belatedly discovered can't be relied on to be available, so I added a workaround ... if it's good it should be useful for tickets #22363 and #24661 also. The big things left to do are to work out which filters to use, customizer support, and unit tests.

This ticket was mentioned in Slack in #core by swissspidy. (8 years ago)

#33 @ocean90
8 years ago

Related: #35951

#34 @gitlost
8 years ago

The fork mentioned is now available from the WP repository as UNFC Nörmalizer (thanks anonymous plugin reviewer!).

As to how or if to normalize input in core, I'm really not sure. Adding all those filters still doesn't seem right.

Also, normalization may not always be desirable, e.g. in the case of CJK compatibility ideographs (they get mapped to unified ideographs under both NFC and NFD normalization), although it's hard to get a definite read on this, as the Unicode Consortium seems to suggest it's not a problem - see e.g. "Isn't it true that some Japanese can't write their own names in Unicode?", and I suppose it could be made locale-dependent if necessary.

Anyway, I'd lean towards a JavaScript-only fix, only added for pasting in Chrome and Firefox under Mac OS X (and iOS?). These browsers support the normalize() method, so no polyfill would be needed, and they (presumably) encompass the vast bulk of use cases. I can extract and adapt the code used in the plugin as a patch if there's interest.

Another option - use Safari!

PS In case people have difficulty replicating this bug, note that the paste needs to be done in Chrome or Firefox - ironically, Safari normalizes pastes (and upload filenames, but that's another story) to NFC, while the others just take what they're given. The copying from copy-paste-test.pdf can be done from Preview or from Adobe Reader, and as noted above (at least on the versions that come with Mountain Lion 10.8), Preview (version 6) decomposes all the umlauted characters, while Adobe Reader (version 11.0.13) decomposes only the u-umlaut. Think different.

PPS I do think the ability to normalize (via the Symfony polyfill) would be a worthwhile addition to core (eg for sanitize_file_name(), remove_accents()) and will open a ticket suggesting it.

PPPS Implementing the plugin threw up a number of issues, e.g. admin JavaScript in a lot of cases does not check and refresh its data based on what comes back from the server, and meta keys aren't sanitized - I'll (hopefully) open tickets for each of these.

#35 @zodiac1978
6 years ago

A ProcessWire module:
https://github.com/justb3a/processwire-textformatter-normalizeutf8

This uses Patchwork as a compatibility layer:
https://github.com/tchwork/utf8

Other CMSes, like Concrete5, are using this too.

Here is an issue for TYPO3 on the same topic:
https://forge.typo3.org/issues/57695

After tweeting at other CMS communities and getting some answers, it seems this could be solved with the latest Apple macOS 10.13.x.

Can anyone with 10.13.x check this?

@AdrianB
6 years ago

Normalize test in Chrome

@AdrianB
6 years ago

Normalize test in Safari

@AdrianB
6 years ago

Normalize test in Firefox

@AdrianB
6 years ago

Normalize test in Firefox after re-saving the post with UNFC Nörmalize plugin activated

#36 @AdrianB
6 years ago

(Oh, I didn't mean to spam this ticket with attachments; I thought I could attach the images and use them in one post without creating a post for each…)

I did a quick test using macOS 10.13.3. I opened the copy-paste-test.pdf file attached to this ticket in Preview 10.0, copied the text, and saved it in a new post using Firefox (59.0). The same issue still remains.

Chrome (65.0.3325.162) and Safari (11.0.3) render the text correctly, but Firefox does not. See the above screenshots.

I then activated the UNFC Nörmalize plugin and re-saved the post. The normalized text then looks fine in Firefox as well; see the last screenshot above.

#37 @zodiac1978
6 years ago

Thanks for testing @AdrianB !

Did you try to search or proofread the strings in Chrome or Safari? Does this work correctly? (For me it is still broken in Chrome & Safari too)

#40 @SergeyBiryukov
5 years ago

  • Milestone changed from Future Release to 5.3
  • Owner set to SergeyBiryukov
  • Status changed from new to reviewing

#41 follow-up: @azaozz
4 years ago

  • Keywords close added

I don't seem to be able to reproduce this in macOS 10.14 Firefox and Chrome (unless I'm doing something wrong). Perhaps this was (finally) fixed at the OS level?

Is anybody still able to reproduce?

#42 in reply to: ↑ 41 @zodiac1978
4 years ago

Replying to azaozz:

I don't seem to be able to reproduce this in macOS 10.14 Firefox and Chrome (unless I'm doing something wrong). Perhaps this was (finally) fixed at the OS level?

Is anybody still able to reproduce?

I'm on holiday, so I can't test myself, but maybe you are indeed doing something wrong. Gutenberg fixes this for the content area, in browsers that support it, with some ES6 JavaScript.

The problem remains for older browsers, or if JS is disabled, and unfortunately also in the title (and therefore in the permalink).

Here are my slides with some additional info:
https://speakerdeck.com/zodiac1978/special-characters-and-where-to-find-them

#43 @davidbaumwald
4 years ago

  • Milestone changed from 5.3 to 5.4

This is going to miss the deadline for version 5.3 Beta 1. Punting to 5.4.

#44 @zodiac1978
4 years ago

  • Keywords close removed

As this is milestoned for 5.4 I would like to propose a "roadmap" to normalization to get this ticket moving.

At the moment we have a "solution" in the block editor (Gutenberg) which is normalizing via JS in modern browsers:
https://github.com/WordPress/gutenberg/pull/6880

There are some problems with this approach. I showed @soean at WordCamp Stuttgart that it does not work reliably. If you copy NFD text (= text with decomposed characters) from within the block editor, the normalization (via JS on paste) does not happen.
Additionally, this does not normalize the title, and therefore not the permalink either.
See this open issue: https://github.com/WordPress/gutenberg/issues/14178

I would like to propose exploring the idea of my latest patch a little bit more:
https://core.trac.wordpress.org/attachment/ticket/30130/30130.2.diff

There are still some places left, as mentioned in https://core.trac.wordpress.org/ticket/30130#comment:28
I would like to find those missing filters, and maybe this can be a first step towards inclusion in core.

Unfortunately the needed normalizer function is not there by default. PHP 5.3+ has to be used (which is not a problem anymore), but the required PHP extension, intl, is optional and not every host has it installed.

I've opened an issue on the Health Check plugin for checking for the extension:
https://github.com/WordPress/health-check/issues/333

But to finally get this started, I propose adding the approach from my last patch for those who have the extension installed - fallbacks/polyfills could be added later if necessary.

I have tested this on macOS 10.14 and the bug is still present, therefore I am removing the close keyword.

So please leave your feedback on this patch. Any problems with adding this function to so many filter hooks? Any performance problems with it?

#45 follow-up: @a8bit
4 years ago

I just wanted to offer a contrary view on this ticket.

I just spent a day fighting with this problem in reverse. Renaming a file to a string stored in a MySQL database that included a precomposed character (U+0161) caused the OS (macOS) to convert that character to the decomposed form (U+0073 U+030C). WordPress then couldn't find the file because file_exists() was always false. I had to change the string in the db to the decomposed form to get it to work.

The Unicode Standard says that

Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.

Apple now enforces that. I could find no way to use U+0161 in my file; it was forced to the decomposed form even if I entered the hex directly.

MSDN also recommends compound characters, saying that

Pre-composed characters may also be decomposed. For example, an application importing a text file containing the pre-composed character "ü" may decompose that character into a "u" followed by the non-spacing character "¨". This allows easy alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode standard defines decomposition for all pre-composed characters.

I haven't checked if Windows forces the decomposition or not but Microsoft clearly thinks you should decompose wherever possible.

I should also point out that the W3C document linked in the first post of this issue has been updated since 2014, and the latest version recommends NFC but admits it's not always appropriate or even available (see https://www.w3.org/TR/charmod-norm/#normalizationChoice).

#46 in reply to: ↑ 45 ; follow-up: @zodiac1978
4 years ago

Replying to a8bit:

I just wanted to throw up a contrary view of this ticket.

Hi @a8bit and thank you for your feedback!

I just spent a day fighting with this problem in reverse. Renaming a file to a string stored in a MySQL database that included a precomposed character (U+0161) caused the OS (macOS) to convert that character to the decomposed form (U+0073 U+030C). WordPress then couldn't find the file because file_exists() was always false. I had to change the string in the db to the decomposed form to get it to work.

That shows, IMHO, exactly why everything should be normalized to NFC: then we have a common ground. macOS uses NFD (decomposed characters) internally, and that's why Safari normalizes files on upload. But Chrome/Firefox are not doing this. We could wait for the browsers to fix it, or we can fix it in WordPress.

The Unicode Standard says that

Many compatibility decomposable characters are included in the Unicode Standard solely to represent distinctions in other base standards. They support transmission and processing of legacy data. Their use is discouraged other than for legacy data or other special circumstances.

Apple now enforces that. I could find no way to use U+0161 in my file; it was forced to the decomposed form even if I entered the hex directly.

That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.

MSDN also recommends compound characters, saying that

Pre-composed characters may also be decomposed. For example, an application importing a text file containing the pre-composed character "ü" may decompose that character into a "u" followed by the non-spacing character "¨". This allows easy alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode standard defines decomposition for all pre-composed characters.

I haven't checked if Windows forces the decomposition or not but Microsoft clearly thinks you should decompose wherever possible.

Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text. Can you please share the link, so that I can check the source myself?

I should also point out that the w3 document linked in the first post of this issue has been updated since 2014 and the latest version recommends NFC but admits it's not always appropriate or even available. (see https://www.w3.org/TR/charmod-norm/#normalizationChoice)

Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:

Authoring tools SHOULD provide a means of normalizing resources and warn the user when a given resource is not in Unicode Normalization Form C.

or we could normalize locale-specifically, because the biggest problem seems to be that some languages may have a problem with normalization:

Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content. Note that NFC is not always appropriate to the content or even available to content authors in some languages.

I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.

I still recommend getting this patch in and then seeing what breaks (if anything breaks).

#47 in reply to: ↑ 46 ; follow-up: @a8bit
4 years ago

Replying to zodiac1978:

Replying to a8bit:

That shows, IMHO, exactly why everything should be normalized to NFC: then we have a common ground. macOS uses NFD (decomposed characters) internally, and that's why Safari normalizes files on upload. But Chrome/Firefox are not doing this. We could wait for the browsers to fix it, or we can fix it in WordPress.

IMO it shows that everything should be normalized, just not necessarily to NFC. There is no way Apple is going to adopt NFC, NFC is described by Unicode as for legacy systems. The future appears to be NFD.

That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.

This means that if all text in WordPress is normalized to NFC, any file comparisons with files on APFS that have multi-byte characters are going to fail.

I solved my problem today by writing a function that checks the existence of files using both forms, doubling the file I/Os in the process. Not exactly optimal.
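
A minimal sketch of such a dual-form check, assuming the intl extension is available (the helper name is hypothetical):

// Check for a file under both Unicode normalization forms.
function my_file_exists_any_form( $path ) {
	if ( file_exists( $path ) ) {
		return true;
	}
	if ( class_exists( 'Normalizer' ) ) {
		// Retry with the decomposed (NFD) spelling, as used by HFS+.
		$nfd = Normalizer::normalize( $path, Normalizer::FORM_D );
		if ( false !== $nfd && $nfd !== $path && file_exists( $nfd ) ) {
			return true;
		}
	}
	return false;
}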

Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text. Can you please share the link, so that I can check the source myself?

It was quoted as a source on the Wikipedia page for precomposed characters: http://msdn.microsoft.com/en-us/library/aa911606.aspx

Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:

The alternative would be NFD.

or we could normalize locale-specifically, because the biggest problem seems to be that some languages may have a problem with normalization:

That would be great if no one ever read a website outside of their own country

I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.

They said that about 4-digit years ;)

I could mention that search and sort become more flexible with NFD, because you can then choose to do those things with or without the combining characters. I don't see how proofreading is improved with NFC?

I still recommend getting this patch in and then seeing what breaks (if anything breaks).

I hope it all goes well; I don't have any skin in this game, I was merely flagging up one of the edge cases I actually hit today in case no one had thought of it. Apple not allowing NFC is going to cause issues for international macOS users when comparing source and destination data; it remains to be seen how big of an issue that will be, but I accept it's likely to be quite small.

#48 in reply to: ↑ 47 @zodiac1978
4 years ago

Replying to a8bit:

IMO it shows that everything should be normalized, just not necessarily to NFC. There is no way Apple is going to adopt NFC, NFC is described by Unicode as for legacy systems. The future appears to be NFD.

That is not true. NFC is not described as being for legacy systems. The linked document shows that NFD/NFC are just two different ways of doing it.


That's correct, because the filesystems themselves (HFS+ and APFS, for example) use NFD and not NFC.

This means that if all text in WordPress is normalized to NFC, any file comparisons with files on APFS that have multi-byte characters are going to fail.

No, it means IMHO we have to normalize every input to NFC (as this is the recommendation from the W3C) to have a common ground. This is exactly why normalization exists - to make comparisons work again in those cases.


I solved my problem today by writing a function that checks the existence of files using both forms, doubling the file I/Os in the process. Not exactly optimal.

If everything is NFC there is no need for this anymore.

Windows doesn't force decomposition, and I don't think you should do this. I can't find your source on MSDN when I google this text. Can you please share the link, so that I can check the source myself?

It was quoted as a source on the Wikipedia page for precomposed characters: http://msdn.microsoft.com/en-us/library/aa911606.aspx

This text is from 2010, is outdated, and applies to Windows Embedded CE.

Agreed, but what would be the alternative? We could check and warn the user, as this is recommended by the document. But as the module with the needed function is optional that wouldn't be very reliable:

The alternative would be NFD.

We would have the same things to do, because the problem exists in the same way in the other direction. Many other OSes use NFC.


or we could normalize locale-specific, because the biggest problem seems to be that other languages may have a problem with normalization:

That would be great if no one ever read a website outside of their own country

I think there are not many cases where you will really need NFD text. The advantages of working search, working proofreading, etc. outweigh any possible edge cases where NFD text is needed.

They said that about 4-digit years ;)

That's both very funny, but you haven't provided a solution to the problems mentioned.

I could mention that search and sort become more flexible with NFD, because you can then choose to do those things with or without the combining characters. I don't see how proofreading is improved with NFC?

It is broken with NFD. Please see my talk and watch the slides where I show all the problems:
https://wordpress.tv/2019/08/28/torsten-landsiedel-special-characters-and-where-to-find-them/

I still recommend getting this patch in and then seeing what breaks (if anything breaks).

I hope it all goes well; I don't have any skin in this game, I was merely flagging up one of the edge cases I actually hit today in case no one had thought of it. Apple not allowing NFC is going to cause issues for international macOS users when comparing source and destination data; it remains to be seen how big of an issue that will be, but I accept it's likely to be quite small.

macOS uses NFD *internally* and it knows that, so every native API normalizes text to NFC (as this is what comes from a keyboard in most cases). For example, Safari normalizes text to NFC on input and uploads. If you use only native APIs everything is fine, but if NFD gets through, we have a problem. Firefox and Chrome are NOT normalizing on input or upload, and that's what creates such problems.

We may also need to distinguish between normalizing URLs, normalizing filenames, and normalizing content. Maybe we will end up with a different approach for filenames, but looking at https://core.trac.wordpress.org/ticket/24661 it seems to be the best solution to normalize to NFC there too.

This ticket was mentioned in Slack in #core by david.baumwald. (4 years ago)

#50 @davidbaumwald
4 years ago

  • Keywords needs-refresh added
  • Milestone changed from 5.4 to Future Release

This ticket still needs a decision and a refreshed patch. With 5.4 Beta 1 approaching, this is being moved to Future Release. If any maintainer or committer feels this can be resolved in time or wishes to assume ownership during a specific cycle, feel free to update the milestone accordingly.

#51 follow-up: @zodiac1978
4 years ago

I will happily provide the refreshed patch if I get some feedback from a maintainer/committer on the existing patch.

Pinging @azaozz @ocean90 @swissspidy @SergeyBiryukov

#52 in reply to: ↑ 51 @zodiac1978
3 years ago

Replying to zodiac1978:

I will happily provide the refreshed patch if I get some feedback from a maintainer/committer on the existing patch.

Pinging @azaozz @ocean90 @swissspidy @SergeyBiryukov

No reply in 11 months. Does this mean you don't want to fix it? Should we close it?

#53 @alessandrolioce
23 months ago

Is there any news regarding the resolution of this ticket?
The problem is still present, and in some circumstances it still causes malfunctions.

Thanks!

#54 @azaozz
21 months ago

  • Keywords dev-feedback removed
  • Milestone changed from Future Release to 6.1

may also need to distinguish between normalizing URLs, normalizing filenames, and normalizing content.

Yes, the PR on https://core.trac.wordpress.org/ticket/24661 fixes the same issue using the same method when removing accents. I'm still unsure whether this has to be applied to all text submitted by users, everywhere, or whether it will need to be limited to text submitted from Macs (and perhaps iPhones).

Last edited 21 months ago by azaozz

#55 @audrasjb
20 months ago

In 53754:

Formatting: Normalize to Unicode NFC encoding before converting accent characters in remove_accents().

This changeset adds Unicode sequence normalization from NFD to NFC, via the normalizer_normalize() PHP function which is available with the recommended intl PHP extension.

This fixes an issue where NFD characters were not properly sanitized. It also provides a unit test for NFD sequences (alternate Unicode representations of the same characters).

Props NumidWasNotAvailable, targz, nacin, nunomorgadinho, p_enrique, gitlost, SergeyBiryukov, markoheijnen, mikeschroder, ocean90, pento, helen, rodrigosevero, zodiac1978, ironprogrammer, audrasjb, azaozz, laboiteare, nuryko, virgar, dxd5001, onnimonni, johnbillion.
Fixes #24661, #47763, #35951.
See #30130, #52654.
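
For context, the committed change boils down to a guard of roughly this shape inside remove_accents() (a paraphrased sketch, not the exact core diff):

// Normalize to NFC before the accent map runs, when intl is available.
if ( function_exists( 'normalizer_is_normalized' )
	&& function_exists( 'normalizer_normalize' )
) {
	if ( ! normalizer_is_normalized( $string ) ) {
		$string = normalizer_normalize( $string );
	}
}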

This ticket was mentioned in Slack in #core by jeffpaul. (18 months ago)

#57 @JeffPaul
18 months ago

  • Resolution set to fixed
  • Status changed from reviewing to closed

As discussed ahead of today's scheduled 6.1 Beta 1 release party (see the Slack link above), it was decided that this ticket could be closed as fixed with r53754. @zodiac1978 if your review shows otherwise, please re-open for whatever was missed... thanks!

#58 @zodiac1978
18 months ago

  • Keywords needs-refresh removed

Hey @JeffPaul @azaozz

I tried all the missing parts from https://core.trac.wordpress.org/ticket/30130#comment:28 and can confirm that everything is fixed.

Using the text from the PDF, copied from the Preview app, I can still see the bug when pasting on https://www.tiny.cloud, but pasting into pages, posts (content *and* title) and all the other input fields mentioned in the comment linked above no longer shows the bug.

I'm still unsure whether this has to be applied to all text submitted by users, everywhere, or whether it will need to be limited to text submitted from Macs (and perhaps iPhones).

As NFC is the recommended form (per the W3C), this should not be needed. From my point of view, this is the correct way to fix this.

Thanks @audrasjb for the commit!

#59 @audrasjb
18 months ago

Great! Thank you for the confirmation :)

#60 @datverse
17 months ago

Hi @gitlost,
Thanks for your plugin, which has been helpful for many years (UNFC Nörmalize: https://wordpress.org/plugins/unfc-normalize/). Can you check your plugin against the new WordPress 6.1 update? Do we still need your plugin?
Thanks

#61 @julianoe
13 months ago

Maybe I missed something, but this 6.1 fix only fixed the problem related to remove_accents, right?
I stumbled onto these issues while working on a very old (8+ years) large French website, and remove_accents works well now. So that's cool; thanks to everyone who contributed.

But macOS users are still submitting content that, for example, can't be found by others through search, because the excerpt/title uses UTF-8 NFD normalization and not NFC. Accents are not normalized in the fields listed at https://core.trac.wordpress.org/ticket/30130#comment:28 or in the editor.

As a test I created a post simply titled "L’Araignée" (you can see at https://www.fontspace.com/unicode/analyzer#e=ZcyB that it uses a decomposed/combining accent).
When I search for "araignée" using standard UTF-8 NFC accents, it's absent from the search results.
The W3C validator will still output warnings about "text run not in Unicode Normalization Form C".
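
The mismatch is easy to demonstrate at the string level (a minimal illustration; strpos() does a byte-level search, which is effectively what a naive text search does):

// "Araignée" with a precomposed é (U+00E9) vs. e + combining acute (U+0301):
$nfc = "Araign\xC3\xA9e";
$nfd = "Araigne\xCC\x81e";
var_dump( $nfc === $nfd );              // bool(false)
var_dump( strpos( $nfd, "\xC3\xA9" ) ); // bool(false): an NFC search term
                                        // never matches the stored NFD title.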

What should be the documented/recommended route for people having this issue (some of them might even stumble onto this ticket through searches, like I did)?

  • use @gitlost's plugin?
  • recommend that macOS users use Safari instead of Firefox? I can't test it; is Safari still the only browser converting the strings to NFC?
  • should we still think about a fix in WordPress in a separate ticket?