#30966 closed defect (bug) (fixed)
Let's make word count better (master ticket)
Reported by: | iseulde | Owned by: | iseulde |
---|---|---|---|
Milestone: | 4.3 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Editor | Keywords: | has-patch |
Focuses: | javascript | Cc: |
Description (last modified by )
- Make the code readable (don't use short variable names such as
tx
,w
,tc
,t
,wc
,c
, which have no meaning). - Make the word count function more abstract so that it can be used elsewhere. So, it should only count the words in a given text, nothing else.
- Maybe move to
wp-includes
. - The word count should update at the right times. At the moment it updates on
enter
,delete
andbackspace
. It should at least be updated onspace
too. Consider things like cut and paste, and look at TinyMCE events that change the content. - Exclude more patterns like known shortcodes and ellipsis (#27386, #27391).
- Add some good unit tests for all of this.
- Make sure it's as fast as it can be (e.g. don't needlessly manipulate the DOM).
This could be a small improvement that makes people less annoyed with WordPress. :)
More inspiration, although not ideal either:
https://github.com/tinymce/tinymce/blob/master/js/tinymce/plugins/wordcount/plugin.js
Depends on:
Attachments (17)
Change History (85)
#5
@
10 years ago
- Owner set to avryl
- Status changed from new to accepted
Going to take a closer look at this.
#6
@
10 years ago
- Keywords has-patch added; needs-patch removed
- Milestone changed from Future Release to 4.2
In patch above:
Word count will update at various input events, and is debounced with 500 ms. Need to make sure this isn't too intensive. We can even debounce with 1 s; the word count doesn't really need to update until after you stopped typing.
It now excludes a lot more... Common punctuation and symbols are excluded by unicode blocks. I might add a few more blocks later. Previously a single dash or € mark or other common symbol somewhere would be counted as a word. We obviously can't exclude all the punctuation/symbols from all alphabets (so ranges in different blocks), but we can exclude the most common ones and blocks that are not alphabets. Maybe we could localise a supplement? So a translation could exclude more unicode ranges. Just a thought.
Added a small selection of characters that are "contracting": the apostrophe and hyphen. Maybe I missed some. All other characters to be excluded are "expanding". Previously word/word or word—word would be counted as one word. For most it doesn't really matter which category they're in, but from what I see there are only a few "contracting".
Shortcodes are now also excluded.
wordCounter now only does one things, and that's count words. :) Now you can easily use it somewhere else.
I added a few really simple tests.
#7
@
10 years ago
Should URLs be excluded from the word count? Or should it count as one word? Imo it shouldn't be counted as it's not a word.
#10
@
10 years ago
This ticket has sadly not seen any activity in two months. While fixing up the word count is something I would love to see (one of the bugs linked in the descriptions is one I reported), I wonder if it makes sense to continue to include this in the 4.2 milestone at this stage.
This ticket was mentioned in Slack in #core-editor by iseulde. View the logs.
10 years ago
This ticket was mentioned in Slack in #core-editor by chriscct7. View the logs.
10 years ago
#16
@
10 years ago
The most important "feature" of the word count script is to be super fast and unobtrusive. The moment it starts interfering with the typing even on the slowest device it becomes redundant, or even a bug.
In that terms thinking we should explore some other options:
- Update the count only on Enter. This still can slow down typing but at least is at the beginning of new paragraph where slowness is less noticeable.
- Make it run only on demand. Many text editors, including desktop/computer apps, have word count, character count and other text stats, but none has a "dynamic" word count as far as I've seen. Perhaps we should match that and make the word count run only on click?
#17
@
10 years ago
In my experience, I treat word count as a thing I glance at while not actually writing, but during the writing experience. If it's updating with every keystroke it could be a distraction. But some interval updates would be great. Also, +1 to @azaozz's note on speed implications.
I have no problem with it updating on enter only. I do, however, hope it doesn't get moved to a click event. It's a valuable feature that I really enjoy directly in the edit window.
#18
@
10 years ago
- Milestone changed from Future Release to 4.3
@krogsgard The current patch debounces the counting with half a second, which we could increase to a second or even a few. Then it would only update once you stop typing. Even if the counter is really slow, it doesn't matter that much as it won't be triggered a lot.
#19
@
10 years ago
I personally like use and advise my clients to use the Word Count that is included in WordPress. It helps me and those I advise write content to a certain amount. Which has benefits to SEO. It small and unobtrusive if it's not improved it should still be kept for these obvious advantages.
#20
follow-up:
↓ 23
@
9 years ago
Thinking we may want to consider other stats too. For example total number of characters or letters (is that useful?), number of paragraphs, images, links, etc.
Also, another consideration is what to do for Japanese, Chinese, Korean, etc. languages. I'm not sure now word count works there, or even if it makes sense. As far as I know a word can be a single glyph or several glyphs long, each glyph being a syllable.
#21
@
9 years ago
Chinese should be counted as one word for each glyph, I think Japanese and Korean would be the same for their various writing systems as far as I understand them but would be better to have confirmation from people who actually read/write in those. I don't know about anything past CJK, though.
#22
@
9 years ago
We count Japanese text by characters/letters/glyphs, not words (i.e. the indicator should show "100文字 (=100 characters")"), just like Twitter.
#23
in reply to:
↑ 20
;
follow-up:
↓ 24
@
9 years ago
Replying to azaozz:
Thinking we may want to consider other stats too. For example total number of characters or letters (is that useful?), number of paragraphs, images, links, etc.
Also, another consideration is what to do for Japanese, Chinese, Korean, etc. languages. I'm not sure now word count works there, or even if it makes sense. As far as I know a word can be a single glyph or several glyphs long, each glyph being a syllable.
#20738 implements multibyte character counting
#24
in reply to:
↑ 23
@
9 years ago
Replying to chriscct7:
Yep, we can count multibyte chars (as "words") easily. The problem is which multibyte chars to exclude (emoji, etc.) and how to make it as fast as possible. A nested loop doesn't seem that fast, but some speed tests would be needed anyway :)
Also, only certain multibyte char ranges should be counted as words. This will make it possible to count "mixed" content, for example Chinese + English in the same post.
#27
follow-up:
↓ 40
@
9 years ago
@helen, @Nao, so would certain characters be excluded from this count? Or would it just count every character? What about spaces? If it should work just like Twitter, it would include all that.
#28
@
9 years ago
According to the comments and patches on #8759, the character count works differently for Japanese and Chinese, one counts spaces, the other doesn't.
There is also some discussion whether to count Latin/ASCII chars as words in these languages, starting at https://core.trac.wordpress.org/ticket/8759#comment:15.
#30
@
9 years ago
We should try to exclude everything that is not a letter. Right now this is done for some characters by removing them from the string. Ideally we should look at which ranges and blocks can be excluded. However, some characters can connect words, such as /
, --
and —
. These should replaced with spaces. And then -
will usually connect two part of a word, so it should be removed.
#31
@
9 years ago
Yeah, when counting words we can replace all punctuation with spaces. Will make things a bit easier.
Another "speed-up" thing to try is using editor.getBody().textContent
instead of editor.getContent{'raw')
. That will give us only the text, with tags and entities already removed. If that works well, we will need to pass the context to the counting function as we will still need to strip tags for the Text editor (the tag stripping regex needs to catch html comments too).
#32
@
9 years ago
The current delay appears to be set to 2000ms, that feels way too slow.. It gives me enough time to hit enter, click outside of the area, etc, to try and get it to trigger.
At a minimum I feel that it should continue to trap the enter key to trigger a recount, and possibly decrease the delay to be something far smaller.
#33
@
9 years ago
I don't think anyone would be looking at the word count while typing though (like you do when you're testing it). But yes, maybe lowering it to 1s is better. There was even discussion about setting it to 3s, but I thought it was too much. We can also refresh on enter, but I'm not sure if there's more expectation for it to change then. What if you just leave the cursor at the end, or you're at the end of your post?
#36
@
9 years ago
Another "speed-up" thing to try is using editor.getBody().textContent instead of editor.getContent{'raw'). That will give us only the text, with tags and entities already removed. If that works well, we will need to pass the context to the counting function as we will still need to strip tags for the Text editor (the tag stripping regex needs to catch html comments too).
Should we exclude code
and form
?
#40
in reply to:
↑ 27
@
9 years ago
Replying to iseulde:
@helen, @Nao, so would certain characters be excluded from this count? Or would it just count every character? What about spaces? If it should work just like Twitter, it would include all that.
For Japanese, every character matches to [\S \u00A0\u3000]
shoud basically be counted. Please check out the WP Multibyte Patch's word-count.js for more details. There are some exceptions.
This ticket was mentioned in Slack in #core by iseulde. View the logs.
9 years ago
#42
follow-up:
↓ 43
@
9 years ago
@tenpura So you count everything, including spaces? Do tabs count? Maybe [^\n\r]
would do?
#45
@
9 years ago
We can add a new type
that counts characters, but includes spaces. The regular expression might need to be adjusted a bit if more characters need to be excludes (maybe tabs?). There are some other characters in \s
that are not really spaces, but they wouldn't be used in WordPress.
#49
@
9 years ago
To do:
- Exclude astral unicode blocks that contain punctuation and symbols. The most important one is emoji. (word count only)
- DONE. Exclude shortcodes.
- Exclude HTML comments.
- Exclude text inside certain HTML tags like
<pre><code>
(or<code>
in general?) and maybe<form>
?<script>
can also be use by admins and should also be excluded. - Any HTML entities that should be replaced?
- DONE. Make sure astrals are counted as one character. In JavaScript the
length
of an astral character is two. - Figure out how a mix of languages should be counted (but most likely wontfix).
- Should words separated by
/
count as two? What do we do with URLs?
This ticket was mentioned in Slack in #polyglots by azaozz. View the logs.
9 years ago
#54
@
9 years ago
However, emoji that are replaced with images in browsers that don't support them are not counted.
#57
@
9 years ago
Last patch for this ticket above. Removes HTML comments and fixes HTML entities.
Let's move the other points in to a new ticket. https://core.trac.wordpress.org/ticket/30966#comment:49
#58
@
9 years ago
The patch removes all entities for word count (not character). In the case that the entity is attached to a word this is no problem. In the case that it's on its own, there's a tiny chance it might need to be counted as a word. But considering that we don't create them, and even convert them back when saving or switching to visual, and that, when they're added by the user, they're likely to be a special character, I don't think this is a problem.
#59
@
9 years ago
Word count does not work if language is set to Romanian. It shows all the time 1. In other language I tested - EN, DE, FR - is working fine.
#60
@
9 years ago
@adrian.pop That's a translation issue.
I suggested a new one for you.
I wonder if we should filter out any incorrect translations, like it did before.
#63
@
9 years ago
- Resolution set to fixed
- Status changed from accepted to closed
Let's close this for 4.3 and open new tickets for other problems.
#64
@
9 years ago
Here's what's left:
- Exclude astral unicode blocks that contain punctuation and symbols. The most important one is emoji. (word count only)
- Exclude text inside certain HTML tags like
<pre><code>
(or<code>
in general?) and maybe<form>
?<script>
can also be use by admins and should also be excluded. - Figure out how a mix of languages should be counted (but most likely wontfix).
- Should words separated by
/
count as two? What do we do with URLs?
One ticket per issue? One ticket for all?
#66
@
9 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
Reopening until we figure out if we really need the all
type. It seems we can just include spaces for the character
type.
If we keep all
, we also need to fix https://core.trac.wordpress.org/browser/trunk/src/wp-includes/formatting.php#L2821
yes, do this - I had to rewrite it for a NYT project because it didn't work as advertised. I only modified the existing code, would love to see what you can come up with.