Make WordPress Core

Opened 11 years ago

Last modified 4 years ago

#26842 new defect (bug)

Contenteditable, multiple spaces, &nbsp, and U+00A0

Reported by: azaozz Owned by:
Milestone: Future Release Priority: normal
Severity: normal Version: 4.7
Component: Formatting Keywords: needs-unit-tests
Focuses: Cc:

Description

In contenteditable mode, when the user types multiple spaces (ASCII char 32, U+0020) they are preserved. The browsers insert &nbsp; as every other character, so the string becomes "&nbsp; &nbsp; &nbsp;" etc.

In WordPress TinyMCE is set to

'entities' => '38,amp,60,lt,62,gt',
'entity_encoding' => 'raw',

Anything other than the three basic "htmlspecialchars" characters &, < and > is output as UTF-8 when serializing the DOM. This outputs the (multiple) &nbsp; as U+00A0, which in PHP shows as 0xC2 0xA0 (reference).
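As a quick check of the byte values mentioned above, here is a minimal sketch in JavaScript (not PHP) that encodes U+00A0 to UTF-8 and prints the resulting bytes. The variable names are illustrative only.

```javascript
// U+00A0 (NO-BREAK SPACE) written as a JavaScript string literal.
const nbsp = '\u00A0';

// TextEncoder always encodes to UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(nbsp));

// Format each byte as 0xNN for display.
const hexBytes = utf8Bytes.map(
  (b) => '0x' + b.toString(16).toUpperCase().padStart(2, '0')
);

console.log(hexBytes.join(' ')); // 0xC2 0xA0
```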

A problem with 0xC2 0xA0 is that in PHP the regex \s matches 0xA0 in certain cases, fails to match the whole "white space", breaks the UTF-8 character, and sometimes leaves a stray 0xC2 byte behind. One example is wptexturize(), see #22692.
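The byte-splitting effect can be imitated in JavaScript. This is only a simulation, under the assumption that the matcher works byte by byte and treats 0xA0 as white space; it is not the PHP code path itself, and toByteString() is a hypothetical helper.

```javascript
// Simulation only: JavaScript strings are not byte strings, so each UTF-8
// byte is mapped to one character to imitate byte-wise matching.
function toByteString(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => String.fromCharCode(b)).join('');
}

const original = 'a\u00A0b';               // "a", U+00A0, "b"
const byteString = toByteString(original); // "a", 0xC2, 0xA0, "b"

// \s matches the lone 0xA0 "byte" (it equals U+00A0 here) but not 0xC2.
const collapsed = byteString.replace(/\s/g, ' ');

console.log(JSON.stringify(collapsed)); // "aÂ b"
```

The 0xC2 half of the character survives the replacement, which is the kind of leftover described above.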

Another problem is that the user is not aware there are multiple &nbsp; when looking in the Text editor or the HTML source, as U+00A0 characters are "invisible".

Change History (38)

#1 @azaozz
11 years ago

  • Keywords needs-unit-tests added

Changing the TinyMCE settings to

'entities' => '38,amp,60,lt,62,gt,160,nbsp',
'entity_encoding' => 'named', // the default

outputs all U+00A0 as &nbsp;. This also happens when using the defaults for these options. I have been testing it for a while, and nothing seems to be broken by having the HTML entities instead of the UTF-8 characters in the post content.
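To illustrate the visible result of that setting, here is a rough sketch that approximates named encoding for the four characters in the 'entities' list. It is not TinyMCE's serializer; NAMED and encodeNamed() are hypothetical names.

```javascript
// Hypothetical helper: approximates what "named" encoding does to the
// four characters listed in the 'entities' setting.
const NAMED = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '\u00A0': '&nbsp;' };

function encodeNamed(text) {
  return text.replace(/[&<>\u00A0]/g, (ch) => NAMED[ch]);
}

// A regular space followed by U+00A0, as typed in contenteditable mode.
const typed = 'two \u00A0spaces & more';
console.log(encodeNamed(typed)); // two &nbsp;spaces &amp; more
```

With the entity in place, the extra space becomes visible in the Text editor instead of looking like an ordinary space.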

#2 @miqrogroove
11 years ago

The nbsp entity used to show up for me. Was that caused by an older version of the editor or an older version of Chrome?

#4 follow-up: @nacin
11 years ago

  • Type changed from enhancement to defect (bug)

Let's fix this.

#5 @miqrogroove
11 years ago

azaozz describes part of the problem as having "invisible" special characters both in Visual and Text modes.

On that premise, I wondered if '&ensp;' would be a better solution than ' &nbsp;', where the latter has the additional problem of allowing line breaks directly between the space and the no-break space.

On the contrary, entering &ensp; in the Text tab between sentences simply causes an invisible character to get inserted, leaving no way to distinguish between a regular space and an en space in either tab. The output looks nice, but the editing process is too convoluted.

For this particular bug, I think it is important to ask if other whitespace characters should be editable as entities, and not only nbsp?

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


11 years ago

#7 @miqrogroove
11 years ago

IRC summary: We will focus on adopting both \xC2\xA0 and &nbsp; as synonyms for a space because they are already commonly used whether or not the author wants them.

To that end, I will refresh #22692 and then add unit tests for other functions that handle post content and spaces.
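Treating the two forms as synonyms for a space could look roughly like the following JavaScript sketch. The pattern and the splitWords() helper are assumptions made for illustration, not the patches attached to #22692.

```javascript
// Assumed pattern: a regular space, tab or newline, the U+00A0 character,
// or the literal &nbsp; entity.
const SPACE_SYNONYM = /[ \t\r\n]|\u00A0|&nbsp;/;

// Split text into words, treating every synonym as a separator.
function splitWords(text) {
  return text.split(SPACE_SYNONYM).filter((word) => word !== '');
}

console.log(splitWords('one two\u00A0three&nbsp;four'));
// [ 'one', 'two', 'three', 'four' ]
```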

Last edited 11 years ago by miqrogroove (previous) (diff)

#8 @miqrogroove
11 years ago

New patches added to #22692. They do not change TinyMCE, but I am testing them with the proposed configuration and they should make it fully compatible with this ticket.

#9 @miqrogroove
11 years ago

Possible problem or quirk in the editor that needs testing:

Changing the TinyMCE configuration affects the behavior of leading spaces in paragraphs. In the default configuration, any leading spaces disappear on tab switch. With &nbsp; present, the spaces stick.

#10 @miqrogroove
11 years ago

Next problem:

&nbsp; appears to have 2 distinct meanings in the editor. In either configuration, &nbsp; is used as a placeholder for blank lines. This doesn't seem to break anything in the editor, but we will have to step very carefully around the wpautop() logic and how it handles those entities.

This ticket was mentioned in IRC in #wordpress-dev by azaozz. View the logs.


11 years ago

This ticket was mentioned in IRC in #wordpress-dev by miqrogroove. View the logs.


11 years ago

#13 @azaozz
11 years ago

  • Milestone changed from 3.9 to Future Release

Changing the TinyMCE configuration affects the behavior of leading spaces in paragraphs.

Yes, since the charset in JS is UTF-8, \s matches U+00A0. Once we have the HTML entities there, they will be treated as string. Leading spaces in paragraphs are still user input.

Will need to add tests (both JS and PHP) and try to cover all places that will be affected by this change. Most notably the JS wpautop() and pre_wpautop() and also the PHP wpautop().
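The difference in leading-space behavior can be reproduced with a small sketch. stripLeading() is a hypothetical stand-in for whatever cleanup runs on tab switch, not the actual editor code.

```javascript
// Leading white space stripped with \s, as a tab-switch cleanup might do.
function stripLeading(text) {
  return text.replace(/^\s+/, '');
}

console.log(/\s/.test('\u00A0'));               // true
console.log(stripLeading('\u00A0\u00A0Hello')); // Hello
console.log(stripLeading('&nbsp;&nbsp;Hello')); // &nbsp;&nbsp;Hello
```

The raw characters are removed because \s matches them, while the entities are ordinary text and survive.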

#14 @miqrogroove
10 years ago

wp_spaces_regexp() has been implemented in trunk for wptexturize, smilies, and shortcodes.

The ticket for revising wpautop() appears to be #27733.

#15 @azaozz
9 years ago

#23778 was marked as a duplicate.

#16 @azaozz
9 years ago

#31297 was marked as a duplicate.

#17 @jeremyclarke
9 years ago

Glad to see movement on this! Our francophone authors are dying to have proper support for the true meaning of NBSP, which is absolutely necessary for correct French grammar which often has spaces between words and punctuation that need to never be split across a line break.

#18 @azaozz
9 years ago

#34520 was marked as a duplicate.

This ticket was mentioned in Slack in #core-editor by iseulde. View the logs.


9 years ago

#20 @azaozz
8 years ago

#31157 was marked as a duplicate.

#21 follow-up: @galbaras
8 years ago

Last change seems to be from 2 years ago, but I'm not seeing any changes in the editor yet.

Making non-breaking spaces visible in text mode would make them possible to track and fix, if necessary, so why not add them and be done with it? Is there even a downside?

#22 @raykaii
8 years ago

I need a fix for this too. Being French, we need this for proper spelling...

#23 in reply to: ↑ 21 ; follow-ups: @CK MacLeod
8 years ago

  • Version set to 4.7

Replying to galbaras:

Last change seems to be from 2 years ago, but I'm not seeing any changes in the editor yet.

Making non-breaking spaces visible in text mode would make them possible to track and fix, if necessary, so why not add them and be done with it? Is there even a downside?

I agree that would be a good change - showing all invisible characters in Text mode or at least providing the option to view the "real" source, even though in this instance, apparently, the problem is even deeper than HTML, and seems to have to do with how different browsers interpret the deeper-underlying unicode character.

Just to keep everyone current, the issue appears in WebKit-derived browsers - so far Chrome and Brave in my testing - but not in Firefox, Edge, Sleipnir, just to name three. I hesitate to attempt to work up a list, since I suppose the problem could evaporate tomorrow, even if it's been going on for years. Also, to be clear, it arises for users (at least for users like me and like other participants in this support thread https://wordpress.org/support/topic/unwanted-non-breaking-spaces-nbsp/) as a text-replacement problem mainly, not a "multiple spaces" problem: When the user highlights and replaces selected text (but not words including following spaces, default word-highlight behavior), following white space gets converted into a non-breaking space, screwing up word-wrapping.

Another possibility would be to eliminate all &nbsp;'s in post content prior to saving to the database. The following function works, but, like most of the other fixes proposed on other threads, or that you can find around the Web - some focusing on WP PHP functions, some on TinyMCE or other JavaScript methods - it doesn't allow users who want non-breaking spaces to keep them:

add_filter( 'content_save_pre', 'remove_buggy_nbsps', 99 );

function remove_buggy_nbsps( $content ) {

   return str_replace( '\xc2\xa0', ' ', $content); 

}

It could be improved upon in various ways, no doubt, but, like I said, it works. If this functionality were the default, then you could introduce a settings option to disable substitution globally. Alternatively, a metabox or editor button (probably from a plug-in) could supply an option to turn off nbsp-replacement selectively for individual posts (or conceivably for spans).

Finally, I'll also note that, if you use the TinyMCE Advanced plug-in, you can (dealing with a bit of hinkiness) highlight the spots in the Visual Editor where the invisible characters appear. I still haven't found an Editor extension that shows the code in the same way an inspection tool will, however: That would be a good function, and helpful for developers and other users apart from the nbsp bug.

In the meantime, the best I could do for a user dealing with this problem was to advise him to use a different browser for editing posts in WP, or to be more careful when performing edits on a WebKit browser (highlighting properly and always over-writing white spaces).

Last edited 8 years ago by CK MacLeod (previous) (diff)

#24 in reply to: ↑ 23 @CK MacLeod
8 years ago

Replying to CK MacLeod:

Just wanted to note I've corrected some misinformation in my prior comment (the code was wrong).

I've gone into more detail here - https://ckmacleod.com/2017/03/23/exterminating-non-breaking-space-bug/ - but, in short, the snippet that I gave previously just happened to work, more or less by coincidence, if you have Next Generation Gallery installed (long story). And I'm still not sure I've stated the underlying character-encoding problem correctly.

More at Exterminating the Non-Breaking Space Bug (https://wp.me/p4h0Xw-gGR), including ways to deal with already-archived posts and to exclude certain types of post.

I'll also note that the code used on another thread (at https://core.trac.wordpress.org/ticket/31157#comment:5) also works - though I don't see why it would be preferable:

tinymce.on('AddEditor', function(event) {
  var editor = event.editor;

  // Replace every &nbsp; entity with a regular space whenever content is read.
  editor.on('getContent', function (e) {
    var content = editor.getContent({format: "raw", no_events: 1});
    e.content = content.replace(/&nbsp;/ig, ' ');
  });
});

The above actually relates to the peculiar feature of the Next Generation Gallery plug-in (buried deep in its code) that served to enable the function that worked by coincidence. My main purpose here is just not to leave my mistake from months ago uncorrected, since I know that people like me looking for solutions often arrive at threads like this one looking for answers.

UPDATE:

One other thing: I mentioned the long story about Next Generation Gallery. It's very widely in use (1 million installations I believe). It turns out that, because of the peculiar way that it amends the TinyMCE editor, it converts, but does not eliminate, the problematic characters - \xc2\xa0, in UTF-8 - into the HTML &nbsp;. So, I think NGG users will have to substitute the latter for the former. Or a more comprehensive fix would have to fix both. I'm going to stop here, though, before I discover some new complication...

Last edited 8 years ago by CK MacLeod (previous) (diff)

#25 in reply to: ↑ 23 ; follow-up: @y0uri
7 years ago

Replying to CK MacLeod:

The following function works, but, like most of the other fixes proposed on other threads, or that you can find around the Web - some focusing on WP php functions, some on tinyMCE or other Javascript methods - it doesn't allow users who want non-breaking spaces to keep them:

add_filter( 'content_save_pre', 'remove_buggy_nbsps', 99 );

function remove_buggy_nbsps( $content ) {

   return str_replace( '\xc2\xa0', ' ', $content); 

}

It could be improved upon in various ways, no doubt, but, like I said, it works.

It only works if you replace these single quotes by double quotes, ie.:

return str_replace( "\xc2\xa0", ' ', $content);

Thanks for this fix btw. The nbsp's were messing things up in another system that depends on a WP feed, when editing there would sometimes appear nbsp's in the content, instead of regular spaces as were there before. And only when editing in Webkit-based browsers. Your filter function solves this problem.

Last edited 7 years ago by y0uri (previous) (diff)

#26 follow-up: @galbaras
7 years ago

Now, all we have to do is put this into core. Any takers?

Although, truthfully, the best approach is to find where double spaces are converted into a space and a non-breaking space and make the change there. Maybe Ephox can help with this?

#27 in reply to: ↑ 25 @CK MacLeod
7 years ago

I wonder where the double quote problem came in - didn't come up in my experiments.

I ended up with a hybrid approach for my own site, as the post I linked above indicates.

Replying to y0uri:

Replying to CK MacLeod:

The following function works, but, like most of the other fixes proposed on other threads, or that you can find around the Web - some focusing on WP php functions, some on tinyMCE or other Javascript methods - it doesn't allow users who want non-breaking spaces to keep them:

add_filter( 'content_save_pre', 'remove_buggy_nbsps', 99 );

function remove_buggy_nbsps( $content ) {

   return str_replace( '\xc2\xa0', ' ', $content); 

}

It could be improved upon in various ways, no doubt, but, like I said, it works.

It only works if you replace these single quotes by double quotes, ie.:

return str_replace( "\xc2\xa0", ' ', $content);

Thanks for this fix btw. The nbsp's were messing things up in another system that depends on a WP feed, when editing there would sometimes appear nbsp's in the content, instead of regular spaces as were there before. And only when editing in Webkit-based browsers. Your filter function solves this problem.

#28 in reply to: ↑ 26 @CK MacLeod
7 years ago

Replying to galbaras:

Now, all we have to do is put this into core. Any takers?

Although, truthfully, the best approach is to find where double spaces are converted into a space and a non-breaking space and make the change there. Maybe Ephox can help with this?

Not sure about the last part, but am wondering if/how Gutenberg will handle this, and will check the latest Beta if someone hasn't done so already.

#29 follow-up: @galbaras
7 years ago

I've emailed someone from Ephox. Hopefully, they'll help, but otherwise, the filter should be just fine.

#30 @afraithe
7 years ago

Andrew Ozz answered this well in another ticket.

https://core.trac.wordpress.org/ticket/31157

We can't know what is supposed to be there or not. Your browser is trying to compensate with &nbsp; in order to show spaces, and this can happen in various strange, unexpected operations: if you select a bit strangely and copy & paste, there is a possibility of an &nbsp; being generated, etc.

#31 in reply to: ↑ 29 ; follow-up: @y0uri
7 years ago

Replying to galbaras:

[...] but otherwise, the filter should be just fine.

I agree, just so long as you mean the filter to be manually implemented when needed (functions.php or plugin), not in core WP. Nbsp's are rarely a real problem, which this filter can take care of, and they are used plenty on purpose. As @afraithe mentions, you can't know if it was intended or not.

Looks like the browsers that are causing this issue need to be fixed (not holding my breath).

#32 @CK MacLeod
7 years ago

Was going to write a comment advocating filterable default automatic removal of non-breaking spaces, but, just now, on Chrome 63 (using WP 4.9.2, Twenty Seventeen), I was unable to re-produce the problem.

#33 in reply to: ↑ 31 ; follow-up: @galbaras
7 years ago

Replying to y0uri:

I agree, just so long as you mean the filter to be manually implemented when needed (functions.php or plugin), not in core WP. Nbsp's are rarely a real problem, which this filter can take care of, and they are used plenty on purpose. As @afraithe mentions, you can't know if it was intended or not.

Looks like the browsers that are causing this issue need to be fixed (not holding my breath).

It's not a "real problem". What's the harm in an extra white space, after all. Still, it is a "real world problem" that WordPress should be able to handle, including for those who can't even find their functions.php with a flashlight :)

#34 in reply to: ↑ 33 @y0uri
7 years ago

Replying to galbaras:

Replying to y0uri:

I agree, just so long as you mean the filter to be manually implemented when needed (functions.php or plugin), not in core WP. Nbsp's are rarely a real problem, which this filter can take care of, and they are used plenty on purpose. As @afraithe mentions, you can't know if it was intended or not.

Looks like the browsers that are causing this issue need to be fixed (not holding my breath).

It's not a "real problem". What's the harm in an extra white space, after all.

You'd think so, but the randomly appearing nbsp's were a real problem for a client. As far as I can tell, it goes like this: They have a WooCommerce product feed to an aggregator which I guess moderates product descriptions (post content). Then when my client edited e.g. product prices, nbsp's would appear in the descriptions and those products would in the next feed update appear in the aggregator's moderation queue again, without there being any "real" changes. They didn't like that. I'd rather have the aggregator change their code to regex replace nbsp's w/ regular spaces, makes sense to me for their use, but they made it my problem. The aforementioned filter fixed it, looks like.

PS. I did say "rarely" :)
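The aggregator-side idea mentioned above (replace nbsp's with regular spaces before deciding whether a description changed) might be sketched like this. Both function names are hypothetical, and nothing here reflects the aggregator's real code.

```javascript
// Hypothetical normaliser: map U+00A0 and &nbsp; to a regular space.
function normalizeNbsp(text) {
  return text.replace(/\u00A0|&nbsp;/g, ' ');
}

// Two descriptions count as unchanged if they differ only in nbsp form.
function isUnchanged(before, after) {
  return normalizeNbsp(before) === normalizeNbsp(after);
}

console.log(isUnchanged('Red shoes', 'Red\u00A0shoes')); // true
console.log(isUnchanged('Red shoes', 'Blue shoes'));     // false
```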

Last edited 7 years ago by y0uri (previous) (diff)

#35 in reply to: ↑ 4 @jerclarke
6 years ago

Replying to nacin:

Let's fix this.

6 years ago this was 7 years old 😕

#36 @azaozz
5 years ago

  • Component changed from TinyMCE to Formatting

The block editor doesn't use contentEditable any more, the invisible U+00A0 chars are not added automatically by the browsers. This fixes the main problem in this ticket, but thinking we still may need to look at handling edge cases of 0xC2 0xA0 better in PHP. Moving this to Formatting.

#37 @azaozz
5 years ago

#31156 was marked as a duplicate.

#38 @desrosj
4 years ago

#33448 was marked as a duplicate.
