Make WordPress Core

Opened 10 years ago

Closed 9 years ago

Last modified 9 years ago

#30966 closed defect (bug) (fixed)

Let's make word count better (master ticket)

Reported by: iseulde's profile iseulde Owned by: iseulde's profile iseulde
Milestone: 4.3 Priority: normal
Severity: normal Version:
Component: Editor Keywords: has-patch
Focuses: javascript Cc:

Description (last modified by iseulde)

  1. Make the code readable (don't use short variable names such as tx, w, tc, t, wc, c, which have no meaning).
  2. Make the word count function more abstract so that it can be used elsewhere. So, it should only count the words in a given text, nothing else.
  3. Maybe move to wp-includes.
  4. The word count should update at the right times. At the moment it updates on enter, delete and backspace. It should at least be updated on space too. Consider things like cut and paste, and look at TinyMCE events that change the content.
  5. Exclude more patterns like known shortcodes and ellipsis (#27386, #27391).
  6. Add some good unit tests for all of this.
  7. Make sure it's as fast as it can be (e.g. don't needlessly manipulate the DOM).

This could be a small improvement that makes people less annoyed with WordPress. :)

More inspiration, although not ideal either:
https://github.com/tinymce/tinymce/blob/master/js/tinymce/plugins/wordcount/plugin.js

Depends on:

#20738
Improve Javascript Word Counting for TinyMCE
#26620
Word Count is one number short
#27386
Editor: Word count should exclude shortcodes
#27391
Editor: Word count should exclude ellipsis


Attachments (17)

30966.patch (10.6 KB) - added by iseulde 10 years ago.
30966.2.patch (11.0 KB) - added by iseulde 10 years ago.
30966.3.patch (11.0 KB) - added by iseulde 10 years ago.
30966.4.patch (9.1 KB) - added by iseulde 10 years ago.
30966.5.patch (11.2 KB) - added by iseulde 9 years ago.
30966.6.patch (7.7 KB) - added by iseulde 9 years ago.
30966.7.patch (9.1 KB) - added by iseulde 9 years ago.
30966.8.patch (2.0 KB) - added by iseulde 9 years ago.
30966.9.patch (810 bytes) - added by iseulde 9 years ago.
30966.10.patch (901 bytes) - added by iseulde 9 years ago.
30966.11.patch (2.5 KB) - added by iseulde 9 years ago.
30966.12.patch (2.4 KB) - added by iseulde 9 years ago.
30966.13.patch (2.5 KB) - added by azaozz 9 years ago.
30966.14.patch (2.5 KB) - added by iseulde 9 years ago.
30966.15.patch (5.4 KB) - added by iseulde 9 years ago.
30966.16.patch (6.2 KB) - added by iseulde 9 years ago.
30966-translator-comment.diff (2.1 KB) - added by ocean90 9 years ago.
https://translate.wordpress.org/projects/wp/dev/ja/default?filters[status]=either&filters[original_id]=533157&filters[translation_id]=21527772

Download all attachments as: .zip

Change History (85)

#1 @iseulde
10 years ago

  • Description modified (diff)

#2 @wonderboymusic
10 years ago

yes, do this - I had to rewrite it for a NYT project because it didn't work as advertised. I only modified the existing code, would love to see what you can come up with.

#3 @iseulde
10 years ago

  • Description modified (diff)

#4 @iseulde
10 years ago

If someone could elaborate on #20738, we could handle this here too.

#5 @iseulde
10 years ago

  • Owner set to avryl
  • Status changed from new to accepted

Going to take a closer look at this.

@iseulde
10 years ago

#6 @iseulde
10 years ago

  • Keywords has-patch added; needs-patch removed
  • Milestone changed from Future Release to 4.2

In patch above:

Word count will update at various input events, and is debounced with 500 ms. Need to make sure this isn't too intensive. We can even debounce with 1 s; the word count doesn't really need to update until after you stopped typing.

It now excludes a lot more... Common punctuation and symbols are excluded by unicode blocks. I might add a few more blocks later. Previously a single dash or € mark or other common symbol somewhere would be counted as a word. We obviously can't exclude all the punctuation/symbols from all alphabets (so ranges in different blocks), but we can exclude the most common ones and blocks that are not alphabets. Maybe we could localise a supplement? So a translation could exclude more unicode ranges. Just a thought.

Added a small selection of characters that are "contracting": the apostrophe and hyphen. Maybe I missed some. All other characters to be excluded are "expanding". Previously word/word or word—word would be counted as one word. For most it doesn't really matter which category they're in, but from what I see there are only a few "contracting".

Shortcodes are now also excluded.

wordCounter now only does one things, and that's count words. :) Now you can easily use it somewhere else.

I added a few really simple tests.

#7 @iseulde
10 years ago

Should URLs be excluded from the word count? Or should it count as one word? Imo it shouldn't be counted as it's not a word.

@iseulde
10 years ago

#8 @iseulde
10 years ago

.1: Exclude anything inside code, form, script and noscript.

@iseulde
10 years ago

#9 @iseulde
10 years ago

  • Description modified (diff)

#10 @jorbin
10 years ago

This ticket has sadly not seen any activity in two months. While fixing up the word count is something I would love to see (one of the bugs linked in the descriptions is one I reported), I wonder if it makes sense to continue to include this in the 4.2 milestone at this stage.

#11 @iseulde
10 years ago

  • Milestone changed from 4.2 to Future Release

Too late now.

@iseulde
10 years ago

#12 @iseulde
10 years ago

Refreshed.

This ticket was mentioned in Slack in #core-editor by iseulde. View the logs.


10 years ago

This ticket was mentioned in Slack in #core-editor by chriscct7. View the logs.


10 years ago

#16 @azaozz
10 years ago

The most important "feature" of the word count script is to be super fast and unobtrusive. The moment it starts interfering with the typing even on the slowest device it becomes redundant, or even a bug.

In that terms thinking we should explore some other options:

  • Update the count only on Enter. This still can slow down typing but at least is at the beginning of new paragraph where slowness is less noticeable.
  • Make it run only on demand. Many text editors, including desktop/computer apps, have word count, character count and other text stats, but none has a "dynamic" word count as far as I've seen. Perhaps we should match that and make the word count run only on click?

#17 @krogsgard
10 years ago

In my experience, I treat word count as a thing I glance at while not actually writing, but during the writing experience. If it's updating with every keystroke it could be a distraction. But some interval updates would be great. Also, +1 to @azaozz's note on speed implications.

I have no problem with it updating on enter only. I do, however, hope it doesn't get moved to a click event. It's a valuable feature that I really enjoy directly in the edit window.

#18 @iseulde
10 years ago

  • Milestone changed from Future Release to 4.3

@krogsgard The current patch debounces the counting with half a second, which we could increase to a second or even a few. Then it would only update once you stop typing. Even if the counter is really slow, it doesn't matter that much as it won't be triggered a lot.

#19 @RDall
10 years ago

I personally like use and advise my clients to use the Word Count that is included in WordPress. It helps me and those I advise write content to a certain amount. Which has benefits to SEO. It small and unobtrusive if it's not improved it should still be kept for these obvious advantages.

#20 follow-up: @azaozz
9 years ago

Thinking we may want to consider other stats too. For example total number of characters or letters (is that useful?), number of paragraphs, images, links, etc.

Also, another consideration is what to do for Japanese, Chinese, Korean, etc. languages. I'm not sure now word count works there, or even if it makes sense. As far as I know a word can be a single glyph or several glyphs long, each glyph being a syllable.

#21 @helen
9 years ago

Chinese should be counted as one word for each glyph, I think Japanese and Korean would be the same for their various writing systems as far as I understand them but would be better to have confirmation from people who actually read/write in those. I don't know about anything past CJK, though.

#22 @Nao
9 years ago

We count Japanese text by characters/letters/glyphs, not words (i.e. the indicator should show "100文字 (=100 characters")"), just like Twitter.

#23 in reply to: ↑ 20 ; follow-up: @chriscct7
9 years ago

Replying to azaozz:

Thinking we may want to consider other stats too. For example total number of characters or letters (is that useful?), number of paragraphs, images, links, etc.

Also, another consideration is what to do for Japanese, Chinese, Korean, etc. languages. I'm not sure now word count works there, or even if it makes sense. As far as I know a word can be a single glyph or several glyphs long, each glyph being a syllable.

#20738 implements multibyte character counting

#24 in reply to: ↑ 23 @azaozz
9 years ago

Replying to chriscct7:

Yep, we can count multibyte chars (as "words") easily. The problem is which multibyte chars to exclude (emoji, etc.) and how to make it as fast as possible. A nested loop doesn't seem that fast, but some speed tests would be needed anyway :)

Also, only certain multibyte char ranges should be counted as words. This will make it possible to count "mixed" content, for example Chinese + English in the same post.

#25 @iseulde
9 years ago

Let's start with this and iterate on speed and multibyte chars.

@iseulde
9 years ago

#26 @iseulde
9 years ago

Refreshed the patch.

#27 follow-up: @iseulde
9 years ago

@helen, @Nao, so would certain characters be excluded from this count? Or would it just count every character? What about spaces? If it should work just like Twitter, it would include all that.

#28 @azaozz
9 years ago

According to the comments and patches on #8759, the character count works differently for Japanese and Chinese, one counts spaces, the other doesn't.

There is also some discussion whether to count Latin/ASCII chars as words in these languages, starting at https://core.trac.wordpress.org/ticket/8759#comment:15.

@iseulde
9 years ago

@iseulde
9 years ago

#29 @iseulde
9 years ago

In 32856:

Editor: restructure word count

  • The WordCounter should only do one thing: count words. This makes it also easier to test.
  • Add some really basic unit tests.
  • Instead of only refreshing the count on enter and delete, refresh the count when the user stops typing. Also look at paste and content changes in TinyMCE.
  • Use match instead of replace when it is appropriate.
  • More readable code.

See #30966. Fixes #26620.

#30 @iseulde
9 years ago

We should try to exclude everything that is not a letter. Right now this is done for some characters by removing them from the string. Ideally we should look at which ranges and blocks can be excluded. However, some characters can connect words, such as /, -- and . These should replaced with spaces. And then - will usually connect two part of a word, so it should be removed.

#31 @azaozz
9 years ago

Yeah, when counting words we can replace all punctuation with spaces. Will make things a bit easier.

Another "speed-up" thing to try is using editor.getBody().textContent instead of editor.getContent{'raw'). That will give us only the text, with tags and entities already removed. If that works well, we will need to pass the context to the counting function as we will still need to strip tags for the Text editor (the tag stripping regex needs to catch html comments too).

@iseulde
9 years ago

#32 @dd32
9 years ago

The current delay appears to be set to 2000ms, that feels way too slow.. It gives me enough time to hit enter, click outside of the area, etc, to try and get it to trigger.

At a minimum I feel that it should continue to trap the enter key to trigger a recount, and possibly decrease the delay to be something far smaller.

#33 @iseulde
9 years ago

I don't think anyone would be looking at the word count while typing though (like you do when you're testing it). But yes, maybe lowering it to 1s is better. There was even discussion about setting it to 3s, but I thought it was too much. We can also refresh on enter, but I'm not sure if there's more expectation for it to change then. What if you just leave the cursor at the end, or you're at the end of your post?

#34 @obenland
9 years ago

#20738 was marked as a duplicate.

#35 @iseulde
9 years ago

In 33060:

Editor: refresh word count quicker

2 seconds is a bit slow. Debouncing with 1 second means it can only run maximum once per second. In reality it won't run us much. Even people who type slow will usually type faster than 1 character per second.

See #30966.

#36 @iseulde
9 years ago

Another "speed-up" thing to try is using editor.getBody().textContent instead of editor.getContent{'raw'). That will give us only the text, with tags and entities already removed. If that works well, we will need to pass the context to the counting function as we will still need to strip tags for the Text editor (the tag stripping regex needs to catch html comments too).

Should we exclude code and form?

#37 @iseulde
9 years ago

Or at least <pre><code>.

@iseulde
9 years ago

#38 @obenland
9 years ago

@iseulde, could you get this one over the finish line before beta3?

#39 @iseulde
9 years ago

@obenland Sure!

#40 in reply to: ↑ 27 @tenpura
9 years ago

Replying to iseulde:

@helen, @Nao, so would certain characters be excluded from this count? Or would it just count every character? What about spaces? If it should work just like Twitter, it would include all that.

For Japanese, every character matches to [\S \u00A0\u3000] shoud basically be counted. Please check out the WP Multibyte Patch's word-count.js for more details. There are some exceptions.

This ticket was mentioned in Slack in #core by iseulde. View the logs.


9 years ago

#42 follow-up: @iseulde
9 years ago

@tenpura So you count everything, including spaces? Do tabs count? Maybe [^\n\r] would do?

#43 in reply to: ↑ 42 @tenpura
9 years ago

Replying to iseulde:

Maybe [^\n\r] would do?

No. That would pick up unwanted characters.

#44 @iseulde
9 years ago

@tenpura And which are those characters?

@iseulde
9 years ago

#45 @iseulde
9 years ago

We can add a new type that counts characters, but includes spaces. The regular expression might need to be adjusted a bit if more characters need to be excludes (maybe tabs?). There are some other characters in \s that are not really spaces, but they wouldn't be used in WordPress.

#46 @iseulde
9 years ago

In 33290:

Editor: word count: add all type

This type will count all characters including spaces.

See #30966.

#47 @iseulde
9 years ago

In 33291:

Editor: word count: update translator message

See #30966.

@iseulde
9 years ago

#48 @iseulde
9 years ago

In 33292:

Editor: word count: exclude more characters

Also only exclude these characters for the words type. They should be counted for other types.
Add the ASCIIOnly option to the uglify config to preserve escaped unicode characters.

See #30966. Fixes #27391.

#49 @iseulde
9 years ago

To do:

  • Exclude astral unicode blocks that contain punctuation and symbols. The most important one is emoji. (word count only)
  • DONE. Exclude shortcodes.
  • Exclude HTML comments.
  • Exclude text inside certain HTML tags like <pre><code> (or <code> in general?) and maybe <form>? <script> can also be use by admins and should also be excluded.
  • Any HTML entities that should be replaced?
  • DONE. Make sure astrals are counted as one character. In JavaScript the length of an astral character is two.
  • Figure out how a mix of languages should be counted (but most likely wontfix).
  • Should words separated by / count as two? What do we do with URLs?
Version 5, edited 9 years ago by iseulde (previous) (next) (diff)

#50 @iseulde
9 years ago

In 33295:

Editor: word count: remove redundant script enqueuing

It is now a dependency of post.

Part props johnjamesjacoby.
Fixes #31766. See #30966.

This ticket was mentioned in Slack in #polyglots by azaozz. View the logs.


9 years ago

@iseulde
9 years ago

@azaozz
9 years ago

#52 @iseulde
9 years ago

In 33299:

Editor: word count: exclude shortcodes

Props desaiuditd, adamsilverstein, azaozz and iseulde.
Fixes #27386. See #30966.

#53 @iseulde
9 years ago

In 33320:

Editor: word count: count astrals as one character

This makes sure an emoji, for example, is counted as one character.

See #30966.

#54 @iseulde
9 years ago

However, emoji that are replaced with images in browsers that don't support them are not counted.

#55 @obenland
9 years ago

@iseulde, do you think this can be closed before beta4?

@iseulde
9 years ago

#57 @iseulde
9 years ago

Last patch for this ticket above. Removes HTML comments and fixes HTML entities.
Let's move the other points in to a new ticket. https://core.trac.wordpress.org/ticket/30966#comment:49

#58 @iseulde
9 years ago

The patch removes all entities for word count (not character). In the case that the entity is attached to a word this is no problem. In the case that it's on its own, there's a tiny chance it might need to be counted as a word. But considering that we don't create them, and even convert them back when saving or switching to visual, and that, when they're added by the user, they're likely to be a special character, I don't think this is a problem.

#59 @adrian.pop
9 years ago

Word count does not work if language is set to Romanian. It shows all the time 1. In other language I tested - EN, DE, FR - is working fine.

#60 @iseulde
9 years ago

@adrian.pop That's a translation issue.

https://translate.wordpress.org/projects/wp/dev/ro/default?filters[status]=either&filters[original_id]=256099&filters[translation_id]=16871789

I suggested a new one for you.
I wonder if we should filter out any incorrect translations, like it did before.

#61 @adrian.pop
9 years ago

Thanks. I'll check tomorrow. All the best!

#62 @iseulde
9 years ago

In 33344:

Editor: word count: exclude HTML comments and entities

Also make sure type is one of the three allowed types and remove unnecessary regex flags.

See #30966.

#63 @iseulde
9 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

Let's close this for 4.3 and open new tickets for other problems.

#64 @iseulde
9 years ago

Here's what's left:

  • Exclude astral unicode blocks that contain punctuation and symbols. The most important one is emoji. (word count only)
  • Exclude text inside certain HTML tags like <pre><code> (or <code> in general?) and maybe <form>? <script> can also be use by admins and should also be excluded.
  • Figure out how a mix of languages should be counted (but most likely wontfix).
  • Should words separated by / count as two? What do we do with URLs?

One ticket per issue? One ticket for all?

#65 @obenland
9 years ago

Are those for a future release? If yes, I'd separate them out.

@iseulde
9 years ago

#66 @iseulde
9 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

Reopening until we figure out if we really need the all type. It seems we can just include spaces for the character type.

If we keep all, we also need to fix https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L2821

@iseulde
9 years ago

#67 @iseulde
9 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

In 33440:

Editor: word count: better names for types.

Also fix it in wp_trim_words().

Fixes #30966.

#68 @ocean90
9 years ago

In 33517:

Editor: word count: Remove indentation from the translator comment.

Avoids a duplicate comment in the POT file.

see #30966.

Note: See TracTickets for help on using tickets.