Opened 5 years ago

Last modified 4 weeks ago

#30130 new enhancement

Normalize characters with combining marks to precomposed characters

Reported by: zodiac1978
Owned by:
Milestone: Future Release
Priority: normal
Severity: normal
Version:
Component: Formatting
Keywords: dev-feedback has-patch
Focuses:
Cc:

Description

I ran into a slightly odd problem that I wanted to solve:

I have a PDF file with German umlauts (üöäÜÖÄ). If I copy & paste them into WordPress, I get the base vowel (uoaUOA) followed by a combining diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm) instead of a single precomposed character.

This results in some problems:

Solution: I made a proof-of-concept with the "content_save_pre" filter and it works. In this proof-of-concept I just replaced the two characters with the precomposed character:

$content = str_replace( "a\xCC\x88", "ä", $content );
$content = str_replace( "o\xCC\x88", "ö", $content );
$content = str_replace( "u\xCC\x88", "ü", $content );
$content = str_replace( "A\xCC\x88", "Ä", $content );
$content = str_replace( "O\xCC\x88", "Ö", $content );
$content = str_replace( "U\xCC\x88", "Ü", $content );

If we could (I know we can't, because WP is still supporting PHP 5.2) rely on PHP 5.3+ we could use a function for that:
http://php.net/manual/de/normalizer.normalize.php

So the above code (also used in the upcoming patch) would be just one line and much more general:
$content = normalizer_normalize($content, Normalizer::FORM_C );

Fun facts:
For me the problem only occurs on Mac OS X (Lion, 10.7.5); I couldn't reproduce it on Ubuntu 14.04 or Win 7.

Maybe this is an edge case and/or plugin territory.

Attachments (17)

copy-paste-test.pdf (4.5 KB) - added by zodiac1978 5 years ago.
PDF for testing purpose
patch.diff (709 bytes) - added by zodiac1978 5 years ago.
tinymce-front-page-screenshot-paste.png (12.3 KB) - added by zodiac1978 5 years ago.
Screenshot of the problem - pasting from PDF in textarea on http://www.tinymce.com/
from-osx-preview-to-visual.png (19.9 KB) - added by glueckpress 5 years ago.
from osx 10.10 preview to visual editor
from-osx-preview-to-visual-html.png (48.3 KB) - added by glueckpress 5 years ago.
from osx 10.10 preview to visual editor (HTML output)
from-acrobat10-to-visual.png (20.8 KB) - added by glueckpress 5 years ago.
from osx 10.10 Acrobat 10 to visual editor
from-acrobat10-to-visual-html.png (38.9 KB) - added by glueckpress 5 years ago.
from osx 10.10 Acrobat 10 to visual editor (HTML output)
Bildschirmfoto 2014-11-03 um 12.17.18.png (24.9 KB) - added by zodiac1978 5 years ago.
Correct transliteration if you enter the word directly
Bildschirmfoto 2014-11-03 um 12.16.59.png (24.0 KB) - added by zodiac1978 5 years ago.
Missing transliteration for copy/pasted word from PDF in Chrome and Firefox
30130.diff (1.7 KB) - added by zodiac1978 4 years ago.
Better approach using a PHP 5.3 function
30130.1.diff (1.7 KB) - added by zodiac1978 4 years ago.
Fixed tabs vs. spaces
Bildschirmfoto 2015-12-07 um 09.56.47.png (81.3 KB) - added by zodiac1978 4 years ago.
Still to do: Slug is not normalized before it is generated the first time
30130.2.diff (1.9 KB) - added by zodiac1978 3 years ago.
normalize-test-adrian-chrome.png (25.3 KB) - added by AdrianB 18 months ago.
Normalize test in Chrome
normalize-test-adrian-safari.png (25.0 KB) - added by AdrianB 18 months ago.
Normalize test in Safari
normalize-test-adrian-firefox.png (24.1 KB) - added by AdrianB 18 months ago.
Normalize test in Firefox
normalize-test-adrian-firefox-resave-with-normalize-plugin.png (24.7 KB) - added by AdrianB 18 months ago.
Normalize test in Firefox after re-saving the post with UNFC Nörmalizer plugin activated


Change History (56)

@zodiac1978
5 years ago

PDF for testing purpose

@zodiac1978
5 years ago

#1 @zodiac1978
5 years ago

  • Component changed from General to Formatting
  • Keywords has-patch dev-feedback added

#2 @zodiac1978
5 years ago

  • Keywords needs-testing added

It would be interesting to hear whether this is solved with different PDF creator or viewer software and/or in newer versions of Mac OS X.

#3 @miqrogroove
5 years ago

  • Component changed from Formatting to Editor

Copy/paste issues are usually in the Editor component. Can you reproduce the problem on the front page demo of http://www.tinymce.com/ ?

#4 @zodiac1978
5 years ago

Yes, on Mac OS X Lion 10.7.5.

It is browser-agnostic: Firefox shows two characters, so the problem is easy to see. Safari and Chrome render the two characters as one, but search, proofreading and validation still fail, because it should be a single precomposed character.
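
A minimal sketch of why search can fail even when the glyph looks right: the typed and the pasted word are different code-point sequences, so a plain string comparison misses until the pasted text is normalized. The sample word and variable names are made up for illustration; only `String.prototype.normalize()` is an actual API.

```javascript
// "für" typed directly: ü is the single precomposed code point U+00FC.
const typed = "f\u00FCr";
// "für" as pasted from the PDF: u followed by U+0308 COMBINING DIAERESIS.
const pasted = "fu\u0308r";

// Both may render identically, but they are different strings.
const looksEqual = typed === pasted;
const lengths = [typed.length, pasted.length];

// A naive search for the typed word does not find the pasted one.
const found = ("Text " + pasted).includes(typed);

// After NFC normalization the two compare equal.
const equalAfterNfc = typed === pasted.normalize("NFC");

console.log({ looksEqual, lengths, found, equalAfterNfc });
```

This only illustrates the comparison problem; how a particular browser's in-page search treats decomposed text may differ.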


@zodiac1978
5 years ago

Screenshot of the problem - pasting from PDF in textarea on http://www.tinymce.com/

#5 @miqrogroove
5 years ago

  • Component changed from Editor to TinyMCE

Patch seems reasonable, but you should also report the bug at http://www.tinymce.com/develop/bugtracker_bugs.php

#6 @zodiac1978
5 years ago

Done: http://www.tinymce.com/develop/bugtracker_view.php?id=7243

Reproduced the bug with Preview app on Mac OS X 10.9.5.
Adobe Reader and Acrobat CS5 do a better job. The test PDF contains three words, and only the first word has the problem. Weird.

This ticket was mentioned in Slack in #core by zodiac1978.

5 years ago

#8 follow-ups: @ocean90
5 years ago

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

#9 in reply to: ↑ 8 @zodiac1978
5 years ago

Replying to ocean90:

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

It could be an issue with all characters which can be combined from two characters, so I think "ß" is not an issue.
But all characters with accents or things like that:
http://www.fileformat.info/info/unicode/char/search.htm?q=combining&preview=entity

So the patch would have to be extended to cover all of these characters (which could amount to a huge number of lines ...)
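
As a rough illustration of the point above, here is a sketch showing that NFC normalization covers the other accented characters from the question without a per-character replacement table, while "ß" has no decomposed form and is left alone. The lookup object is invented for the example; the code points are standard Unicode.

```javascript
// Decomposed inputs (base letter + combining mark) mapped to the
// precomposed character NFC is expected to produce.
const decomposed = {
  "n\u0303": "\u00F1", // n + COMBINING TILDE        -> ñ
  "c\u0327": "\u00E7", // c + COMBINING CEDILLA      -> ç
  "e\u0308": "\u00EB", // e + COMBINING DIAERESIS    -> ë
  "u\u0302": "\u00FB", // u + COMBINING CIRCUMFLEX   -> û
  "e\u0301": "\u00E9", // e + COMBINING ACUTE ACCENT -> é
};

// Every decomposed sequence composes to its single-code-point form.
const allComposed = Object.entries(decomposed).every(
  ([input, expected]) => input.normalize("NFC") === expected
);

// "ß" (U+00DF) is not built from a base letter plus a mark, so NFC leaves it as-is.
const eszettUnchanged = "\u00DF".normalize("NFC") === "\u00DF";

console.log({ allComposed, eszettUnchanged });
```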

Just to be sure: in Chrome and Safari you don't see the problem from the screenshot, but the wrong character (or rather, characters) is still used, so searching for the word(s) doesn't work.

Is it really working on Mac OS X 10.10?

@glueckpress
5 years ago

from osx 10.10 preview to visual editor

@glueckpress
5 years ago

from osx 10.10 preview to visual editor (HTML output)

@glueckpress
5 years ago

from osx 10.10 Acrobat 10 to visual editor

@glueckpress
5 years ago

from osx 10.10 Acrobat 10 to visual editor (HTML output)

#10 @glueckpress
5 years ago

Those screenshots were all made in Chrome, btw.

#11 in reply to: ↑ 8 @zodiac1978
5 years ago

Replying to ocean90:

The attached PDF works for me on OSX 10.10 with Preview.app. Is this just an issue with umlauts? What about ßñçeëuûé?

Confirmed, it is an issue with nearly all characters you mentioned. The normal "ß", "e" or "u" is fine, but all other characters are affected. And there are of course more.

The screenshots from Caspar are showing that the problem is on Mac OS X 10.10 too.

With PHP 5.3 it would take just one line to fix this with the normalize function (which would fix every character that is not precomposed), but I'm afraid we have to do this manually for PHP 5.2, and there are many combinations we would have to check. :(

Furthermore, I forgot about the title. The patch only repairs the content, but this problem could appear in any input field anywhere.

Maybe this should be fixed by Apple and not by us?

#12 follow-up: @azaozz
5 years ago

To summarize:

  • Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.
  • Doesn't depend on where the copied content is pasted, textarea or contentEditable.
  • Doesn't happen when copying from Acrobat?

Perhaps some additional tests:

  • What if copying from the same PDF file that is viewed in the internal viewer in Chrome?
  • Does it happen for all PDF files?

If we decide to fix this, I'm thinking it should probably be fixed from JS on one of the events fired in the TinyMCE paste plugin. There we can run it only when pasting on MacOS, etc.

#13 in reply to: ↑ 12 @zodiac1978
5 years ago

Replying to azaozz:

  • Happens only when copying from a PDF file that is viewed in the Preview app on Mac OSX 10.7.5 and 10.9.5 (and possibly all versions in between). Works properly in 10.10.

It does not work properly in 10.10 either. It is not easy to catch, because in Chrome search works and the appearance in the text editor looks right, but try pasting into the visual editor or into Firefox and you can see the problem.

  • Doesn't happen when copying from Acrobat?

Acrobat and Adobe Reader also have this problem, but only with the first character (I don't know why; weird behavior).


Perhaps some additional tests:

  • What if copying from the same PDF file that is viewed in the internal viewer in Chrome?

The problem does not occur when pasting from the internal PDF viewer in Firefox, from the internal PDF viewer in Chrome, or from the Adobe Reader plugin in Safari.

  • Does it happen for all PDF files?

Well, I can't test every PDF file ... ;)
My test pdf is from LibreOffice 4.2 (PDF Version 1.4)

If we decide to fix this, I'm thinking it should probably be fixed from JS on one of the events fired in the TinyMCE paste plugin. There we can run it only when pasting on MacOS, etc.

I can't help with that (besides testing), but maybe this is the better way to solve this.

If someone wants to do more tests: If you turn on Permalinks and paste the words from the pdf into the title, then WordPress replaces "ü" with "ue", "ä" with "ae" and "ö" with "oe" for the permalink. This replacement doesn't work if the character isn't precombined, but a vowel followed by a diaeresis.

In the database the post_name should look something like this "fuenf-gaebe-schoen-direct-enter" - if you have the wrong characters you see something like this: "fu%cc%88nf-ga%cc%88be-scho%cc%88n-firefox".

%cc%88 is the diaeresis (http://www.fileformat.info/info/unicode/char/0308/index.htm)
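
To make the slug symptom concrete, here is a small sketch. The `transliterate` helper and its three-entry table are hypothetical stand-ins for the German replacement described above, not WordPress code; the point is only that a lookup keyed on precomposed characters never matches the decomposed sequence, so the raw bytes end up percent-encoded.

```javascript
const pasted = "fu\u0308nf"; // decomposed: u + U+0308 COMBINING DIAERESIS
const typed = "f\u00FCnf";   // precomposed: ü as U+00FC

// Hypothetical mini transliteration table keyed on precomposed characters.
const germanMap = { "\u00E4": "ae", "\u00F6": "oe", "\u00FC": "ue" };

function transliterate(text) {
  // Only precomposed ä, ö, ü are matched; a bare "u" plus a combining mark is not.
  return text.replace(/[\u00E4\u00F6\u00FC]/g, (ch) => germanMap[ch]);
}

// The combining mark survives and is percent-encoded as its UTF-8 bytes CC 88.
const brokenSlug = encodeURIComponent(transliterate(pasted)).toLowerCase();

// Normalizing to NFC first lets the table match.
const fixedSlug = encodeURIComponent(transliterate(pasted.normalize("NFC")));

// Directly typed text already works.
const typedSlug = encodeURIComponent(transliterate(typed));

console.log({ brokenSlug, fixedSlug, typedSlug });
```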

Seems to be broken in FF and Chrome. Preview app to Safari seems to be okay.

@zodiac1978
5 years ago

Correct transliteration if you enter the word directly

@zodiac1978
5 years ago

Missing transliteration for copy/pasted word from PDF in Chrome and Firefox

#14 @boonebgorges
5 years ago

  • Version trunk deleted

#15 follow-up: @Kainer
5 years ago

I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...

#16 in reply to: ↑ 15 @zodiac1978
5 years ago

Replying to Kainer:

I have a lot of "German" work to do which includes copy & paste from pdf-viewer "Preview" on a mac.
Huge amount of texts from many different pdf-files. Really annoying bug striking me hard. Might try another pdf viewer for this now...

You can try my plugin: https://github.com/Zodiac1978/tl-normalizer
For me it is a huge timesaver if I have to copy&paste a list of workshops from a PDF to an online calendar.

#17 @AdrianB
4 years ago

I've run into this several times; it would be nice if WP took care of it automatically. Huge thanks to @zodiac1978 for that plugin!

#18 follow-up: @iseulde
4 years ago

  • Component changed from TinyMCE to Formatting
  • Milestone changed from Awaiting Review to Future Release

We could use normalize in JavaScript, but with limited browser support.

This is not a TinyMCE problem though, it affects all input. A PHP solution might be better, but we could consider also cleaning this up for the editor on paste.

Moving back to formatting.

#19 in reply to: ↑ 18 @zodiac1978
4 years ago

Replying to iseulde:

We could use normalize in JavaScript, but with limited browser support.

This is not a TinyMCE problem though, it affects all input. A PHP solution might be better, but we could consider also cleaning this up for the editor on paste.

Moving back to formatting.

Yes, it affects all inputs.

My PHP 5.3+ solution (for titles, excerpt, content and comments) is now in the repo, too:
https://wordpress.org/plugins/normalizer/

#20 follow-up: @iseulde
4 years ago

  • Keywords needs-testing removed

So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?

#21 in reply to: ↑ 20 @zodiac1978
4 years ago

Replying to iseulde:

So can we do this for PHP 5.3+ in core? Needs someone else to give feedback, this is not my area.
Should we in addition also normalise on paste in TinyMCE in browsers that support it?

The filter precombines the characters just before save, so an additional normalization on paste in TinyMCE would be beneficial: until the post is saved, proofreading, search, etc. are broken.

I don't know if we should add this with a check for PHP 5.3 in core, because I don't know if this check for Normalization Form C is a performance problem, so I added the "needs testing" tag.

Maybe some more advanced PHP devs can have a look at this for the performance question. I mean, it solves a Mac-only problem which is just there if you copy&paste from a PDF. I wouldn't add a normalization just for this case if it would slow down all other sites without any reason.

#22 @iseulde
4 years ago

I wouldn't add a normalization just for this case if it would slow down all other sites without any reason.

Note that this would only run when writing to the database.


@zodiac1978
4 years ago

Better approach using a PHP 5.3 function

#23 @zodiac1978
4 years ago

The new patch uses the Normalizer function from PHP 5.3+ (http://php.net/manual/de/normalizer.normalize.php). Unfortunately it is only available if the intl extension is loaded, so we have to check whether the function exists.

@zodiac1978
4 years ago

Fixed tabs vs. spaces

@zodiac1978
4 years ago

Still to do: Slug is not normalized before it is generated the first time

#24 @zodiac1978
4 years ago

The slug is generated incorrectly, and the patch does not change that at the moment.

Additionally, there are many more input fields in WordPress which could need the function (e.g. Widgets).

#25 @swissspidy
4 years ago

  • Keywords needs-patch added; has-patch removed

As discussed during contributor day, it totally makes sense to normalize such strings on post save (and the patch does that well).

However, from a UX perspective we definitely need to do this in JavaScript as well. Otherwise you'll paste the messed up text and correct it by hand without knowing that WordPress will fix it for you. See comment 18.

Luckily, there's a new normalize() method in the ECMAScript 2015 (ES6) standard we can leverage. It's supported by Chrome, Firefox, Opera and IE (11+). See https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize for details.
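
A minimal sketch of what normalizing on paste could look like, assuming a plain textarea rather than TinyMCE (whose paste plugin has its own events, not shown here). The helper name `normalizePastedText` and the wiring are hypothetical; `normalize()`, `clipboardData.getData()` and `setRangeText()` are standard browser APIs.

```javascript
// Hypothetical helper: return the NFC form of the text where normalize() exists,
// otherwise hand the text back unchanged and leave it to a server-side fix.
function normalizePastedText(text) {
  if (typeof String.prototype.normalize !== "function") {
    return text;
  }
  return text.normalize("NFC");
}

// Hypothetical wiring for plain textareas; skipped entirely outside a browser.
if (typeof document !== "undefined") {
  document.addEventListener("paste", (event) => {
    const target = event.target;
    if (!(target instanceof HTMLTextAreaElement) || !event.clipboardData) {
      return;
    }
    const raw = event.clipboardData.getData("text/plain");
    const clean = normalizePastedText(raw);
    if (clean === raw) {
      return; // nothing to fix, let the default paste run
    }
    // Replace the current selection with the normalized text ourselves.
    event.preventDefault();
    target.setRangeText(clean, target.selectionStart, target.selectionEnd, "end");
  });
}

console.log(normalizePastedText("scho\u0308n") === "sch\u00F6n");
```

Keeping the normalization in a small pure function means the same helper could be reused from whichever paste hook ends up being chosen.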

#26 @tar.gz
4 years ago

Awesome work!

If normalization makes it into core, it should also be applied to filenames of uploaded files - see issue #22363.

#27 follow-up: @gitlost
3 years ago

There's a polyfill for the PHP Normalizer component in Symfony https://github.com/Zodiac1978/tl-normalizer, and also one for String.prototype.normalize at https://github.com/walling/unorm, as mentioned by @zodiac1978 in his tinymce ticket (now at https://github.com/tinymce/tinymce/issues/1971).

As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This would mean that there are now no restrictions on the PHP installation, or on browsers (such as IE9) that lack normalize(), for normalizing text on pasting into tinymce.

#28 in reply to: ↑ 27 @zodiac1978
3 years ago

  • Keywords has-patch added; needs-patch removed

Replying to gitlost:

As a demo I forked zodiac's tl-normalizer plugin (https://github.com/gitlost/tl-normalizer) to incorporate the polyfills. This would mean that there's no restrictions now on PHP installation, or on browsers (such as IE9) that lack normalize() to normalize text on pasting into tinymce.

This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.

I expanded my previous patch and added the normalize function to all sanitize filters, in addition to the save_pre filters.

Unfortunately there are still some fields not normalized:

  • Text Widget content
  • Website title and slug
  • Biographical info on profile page
  • Add New Tag/Category
  • ALT-text for images in media upload panel
  • Add Tags metabox

Any help is much appreciated. Ping @swissspidy

Maybe we can use this approach as a first step and leave the ticket open for the javascript solution.

@zodiac1978
3 years ago

#29 follow-up: @gitlost
3 years ago

This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.

The javascript stuff as uploaded was just for paste in TinyMCE, but can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea plus some media panels (attachment details and settings). I'd be interested if you could try it out to see if it works.

On the server side, that seems like an awful lot of filters! - and more required. Perhaps javascript is the way to go for most stuff?

#30 in reply to: ↑ 29 @swissspidy
3 years ago

Replying to gitlost:

This is great, but this would just work for TinyMCE editor, am I right? It doesn't help us with all the other input fields.

The javascript stuff as uploaded was just for paste in TinyMCE, but can be used on anything - I updated the fork with support for paste in any standard admin text input or textarea plus some media panels (attachment details and settings). I'd be interested if you could try it out to see if it works.

On the server side, that seems like an awful lot of filters! - and more required. Perhaps javascript is the way to go for most stuff?

Not everyone has JavaScript enabled, so it would definitely need some PHP equivalent where possible.

Exploring this in a plugin is a good idea. I'd actually suggest to go further down that route and turn it into more of a feature project/plugin.

#31 @gitlost
3 years ago

Not everyone has JavaScript enabled, so it would definitely need some PHP equivalent where possible.

True enough, I suppose it was just the amount of filters and that I haven't got my head around the various pre_XXX and XXX_save_pre and edit_XXX and XXX_edit_pre and sanitize_XXX etc and which do what when why...

Exploring this in a plugin is a good idea. I'd actually suggest to go further down that route and turn it into more of a feature project/plugin.

Excellent, I'd be up for that. I'll ping @zodiac1978 on Slack to discuss...

I updated the fork again - apart from a major bug where it was adding the filters too late, there was a dependency in the Symfony Normalizer on PCRE with UTF-8 support, which I've belatedly discovered can't be relied on to be available, so added a workaround ... if it's good it should be useful for tickets #22363 and #24661 also. The big things left to do are to work out which filters to use, customizer support, and unit tests.

This ticket was mentioned in Slack in #core by swissspidy.

3 years ago

#33 @ocean90
3 years ago

Related: #35951

#34 @gitlost
3 years ago

The fork mentioned is now available from the WP repository as UNFC Nörmalizer (thanks anonymous plugin reviewer!).

As to how or if to normalize input in core, I'm really not sure. Adding all those filters still doesn't seem right.

Also normalization may not always be desirable, eg in the case of CJK compatibility ideographs (they get mapped to unified ideographs under both NFC and NFD normalization), although it's hard to get a definite read on this as the Unicode Consortium seem to suggest it's not a problem - see eg Isn't it true that some Japanese can't write their own names in Unicode?, and I suppose it could be made locale dependent if necessary.
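
A short sketch of the compatibility-ideograph caveat, using U+F900 as one example (its canonical mapping to U+8C48 comes from the Unicode character database; the variable names are just for illustration). It shows that both NFC and NFD replace the compatibility code point and that the original cannot be recovered afterwards.

```javascript
const compat = "\uF900";  // CJK COMPATIBILITY IDEOGRAPH-F900
const unified = "\u8C48"; // CJK UNIFIED IDEOGRAPH-8C48

// Both normalization forms map the compatibility ideograph to the unified one.
const mappedUnderNfc = compat.normalize("NFC") === unified;
const mappedUnderNfd = compat.normalize("NFD") === unified;

// The change is one-way: normalizing again never brings U+F900 back.
const originalLost = compat.normalize("NFC").normalize("NFC") !== compat;

console.log({ mappedUnderNfc, mappedUnderNfd, originalLost });
```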

Anyway I'd lean towards a javascript only fix, only added for pasting in Chrome and Firefox under Mac OS X (and iOS?). These browsers support the normalize() method so no polyfill would be needed, and (presumably) encompass the vast bulk of use cases. I can extract and adapt the code used in the plugin as a patch if there's interest.

Another option - use Safari!

PS In case people have difficulty replicating this bug, note that the paste needs to be done in Chrome or Firefox. Ironically, Safari normalizes pastings (and upload filenames, but that's another story) to NFC, while the others just take what they're given. The copying from copy-paste-test.pdf can be done from Preview or from Adobe Reader, and as noted above (at least on the versions that come with Mountain Lion 10.8) Preview (Version 6) decomposes all the umlauted characters, while Adobe Reader (Version 11.0.13) decomposes only the u-umlaut. Think different.

PPS I do think the ability to normalize (via the Symfony polyfill) would be a worthwhile addition to core (eg for sanitize_file_name(), remove_accents()) and will open a ticket suggesting it.

PPPS Implementing the plugin threw up a number of issues, eg admin javascript in a lot of cases does not check and refresh its data based on what comes back from the server, and meta keys aren't sanitized - I'll (hopefully) open tickets for each of these.

#35 @zodiac1978
18 months ago

A processwire module:
https://github.com/justb3a/processwire-textformatter-normalizeutf8

This is using Patchwork as a compatibility layer
https://github.com/tchwork/utf8

Other CMSs, like concrete5, are using this too.

Here is an issue for TYPO3 on the same topic:
https://forge.typo3.org/issues/57695

After tweeting at other CMS projects and getting some answers, it seems this could be solved in the latest macOS, 10.13.x.

Can anyone with 10.13.x check this?

@AdrianB
18 months ago

Normalize test in Chrome

@AdrianB
18 months ago

Normalize test in Safari

@AdrianB
18 months ago

Normalize test in Firefox

@AdrianB
18 months ago

Normalize test in Firefox after re-saving the post with UNFC Nörmalizer plugin activated

#36 @AdrianB
18 months ago

(Oh, I didn't mean to spam this ticket with attachments, I thought I could attach the images and use them in one post without creating a post for each…)

I did a quick test using macOS 10.13.3. I opened the copy-paste-test.pdf file attached to this ticket in Preview 10.0, copied the text and saved in a new post using Firefox (59.0). The same issue still remains.

Chrome (65.0.3325.162) and Safari (11.0.3) render the text correctly but Firefox does not. See the above screenshots.

I then activated the UNFC Nörmalizer plugin and re-saved the post. The normalized text then looks fine in Firefox as well, see last screenshot above.

#37 @zodiac1978
18 months ago

Thanks for testing @AdrianB !

Did you try to search or proofread the strings in Chrome or Safari? Does this work correctly? (For me it is still broken in Chrome & Safari too)
