Opened 14 months ago

Last modified 14 months ago

#40759 new defect (bug)

Word Count Discrepancies

Reported by: pento
Owned by:
Milestone: Awaiting Review
Priority: normal
Severity: normal
Version:
Component: Editor
Keywords:
Focuses: administration
Cc:

Description (last modified by pento)

I've noticed several discrepancies between how WordPress, Pages, Google Docs, and Word count words. Given the following text, all four count things quite differently.

a 1
foo-bar
e.g.
jack & jill
5 @ $4.99
.
fuzz@baz.blog

| Category | WordPress | Word | Pages | Docs |
|---|---|---|---|---|
| Individual Words (a, jack, jill) | 3 | 3 | 3 | 3 |
| Individual Numbers (1, 5) | 0 | 2 | 2 | 2 |
| Hyphenated Words (foo-bar) | 1 | 1 | 2 | 1 |
| Abbreviations (e.g.) | 1 | 1 | 2 | 2 |
| Punctuation that translates to a word (&) | 0 | 1 | 0 | 0 |
| Punctuation that translates to a word in this usage (@) | 0 | 1 | 0 | 0 |
| Punctuation that doesn't translate to a word (.) | 0 | 1 | 0 | 0 |
| Compound number ($4.99) | 0 | 1 | 1 | 2 |
| Email address (fuzz@baz.blog) | 1 | 1 | 3 | 3 |

I tend to fall in the camp of "what would a reasonable native speaker count as a word", which is probably closest to Word's definition, minus the punctuation that doesn't translate to a word.

Attachments (1)

40759.patch (883 bytes) - added by iseuldebot 14 months ago.


Change History (8)

#1 follow-up: @pento
14 months ago

Side note: the Word test was done with Word Online, which appears to just split on whitespace to determine words. Word for Android appears to do the same. I don't know if the desktop versions have a more complex word count algorithm.
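As a rough illustration of the whitespace-splitting behaviour described above, here is a minimal Python sketch. It is an assumption about how such a counter might work, not Word's actual implementation; `SAMPLE` is the test text from the description and `whitespace_word_count` is a hypothetical helper name.

```python
# The sample text from the ticket description.
SAMPLE = """a 1
foo-bar
e.g.
jack & jill
5 @ $4.99
.
fuzz@baz.blog"""


def whitespace_word_count(text: str) -> int:
    """Count every run of non-whitespace characters as one word."""
    # str.split() with no arguments splits on any whitespace run
    # and drops empty strings.
    return len(text.split())


if __name__ == "__main__":
    print(whitespace_word_count(SAMPLE))
```

Every token counts under this rule, including the standalone `&`, `@` and `.`, which is consistent with the Word column above: each of its nine rows is non-zero and they sum to 12, the same total this sketch gives for the sample.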

#2 @pento
14 months ago

  • Description modified (diff)

Added Google Docs counts to original description.

#3 @pento
14 months ago

  • Description modified (diff)

#4 in reply to: ↑ 1 @voldemortensen
14 months ago

Replying to pento:

> Side note: the Word test was done with Word Online, which appears to just split on whitespace to determine words. Word for Android appears to do the same. I don't know if the desktop versions have a more complex word count algorithm.

Word 2011 for Mac also seems to just split on whitespace.

#5 @azaozz
14 months ago

Yeah, the WordPress word counter doesn't count numbers as words (for Cyrillic, Greek and Latin alphabets). That was the outcome of the research back then.

I know there are some differences for different locales, but making the word count more precise would involve a lot more filtering/regex that has to be locale specific. This is impractical if the word counting runs all the time. Even now it gets pretty slow for larger posts. The main requirement for it is to be "undetectable" and never slow typing in the editor. I mean, what's good about a word count that interferes with the user being able to type or edit the text :)

Is there another app that shows "dynamic" count that is always visible and updated? Don't think any of the above apps do. Perhaps we should make the WP word counting similar to them?

As far as I remember the app version of Word had a big "stats" dialog showing quite a few statistics besides word count: char count, sentences, images/media, etc. but you had to open it from a submenu.

#6 @pento
14 months ago

Yah, #8068 mentions "Numbers are not considered words." :-P That seems to differ from the majority behaviour, though.

Pages and Word Online both show the word count at all times, but they both appear to wait until there's a break in typing to recount. For Pages, it just waits until no keys are currently being pressed. Word Online waits about a second from the last key press. The actual count in both is very fast - pasting a 5000 word block gets a count in a fraction of a second.
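The "wait for a pause in typing" behaviour can be sketched as a simple debounce. This is only an illustration of the idea, not how Pages or Word Online are implemented; the class `DebouncedCounter`, its methods, and the one-second default are assumptions, and time is passed in explicitly so the logic is easy to follow and test.

```python
class DebouncedCounter:
    """Recount words only after `delay` seconds without a key press."""

    def __init__(self, delay: float = 1.0) -> None:
        self.delay = delay
        self.text = ""
        self.last_key_time = None  # None means no recount is pending
        self.count = 0
        self.recounts = 0  # how many times the (costly) count actually ran

    def key_press(self, text: str, now: float) -> None:
        # Record the latest text and restart the quiet period.
        self.text = text
        self.last_key_time = now

    def tick(self, now: float) -> int:
        # Called periodically; recount once the quiet period has elapsed.
        pending = self.last_key_time is not None
        if pending and now - self.last_key_time >= self.delay:
            self.count = len(self.text.split())
            self.recounts += 1
            self.last_key_time = None
        return self.count


if __name__ == "__main__":
    counter = DebouncedCounter(delay=1.0)
    counter.key_press("hello", now=0.0)
    counter.key_press("hello world", now=0.6)
    print(counter.tick(now=1.2))  # still inside the quiet period
    print(counter.tick(now=1.7))  # quiet period over, count is refreshed
```

With this shape the count runs once per pause rather than once per keystroke, so a slower but more precise counting routine would not compete with typing.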

#7 @iseulde
14 months ago

#30966 was a big change in word count, but it looks like we did not discuss or change counting numbers. I'm fine with counting numbers. To change it, split \u0021-\u0040 into \u0021-\u002F and \u003A-\u0040, which leaves out the digits (\u0030-\u0039).
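To illustrate the range split, here is a small Python sketch. It is a simplified stand-in, not the WordPress counter itself: it assumes that characters in the given range are stripped before the remaining text is split on whitespace, and the names `OLD_REMOVE`, `NEW_REMOVE` and `count_words` are hypothetical.

```python
import re

# One range from '!' to '@', which also covers the digits U+0030-U+0039.
OLD_REMOVE = re.compile(r"[\u0021-\u0040]")
# The same range split in two so that the digits are skipped.
NEW_REMOVE = re.compile(r"[\u0021-\u002F\u003A-\u0040]")


def count_words(text: str, remove_re: re.Pattern) -> int:
    """Strip the matched characters, then count whitespace-separated runs."""
    stripped = remove_re.sub("", text)
    return len(stripped.split())


if __name__ == "__main__":
    for sample in ["a 1", "5 @ $4.99"]:
        print(sample, count_words(sample, OLD_REMOVE), count_words(sample, NEW_REMOVE))
```

Under these assumptions, "a 1" goes from one word to two and "5 @ $4.99" from zero to two, because the digits survive the stripping step while `@`, `$` and `.` are still removed.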

We only have word count and character count, which is a setting that can be localised. In the case of numbers, it doesn't make a difference for character count, as all characters are counted.

For the other issues:

  • It looks like foo-bar is right.
  • e.g.: Try to move . to connectorRegExp, which will be replaced with spaces? Sounds right to me. Maybe more characters need to move here. Any counter examples? One in your examples: fuzz@baz.blog.
  • I have no strong opinion on &.
  • I'm not sure about @, which comes closer to symbols like % and +. These also translate to a word?
  • Standalone .: looks right to me.
  • $4.99: Would be solved with including numbers.
  • fuzz@baz.blog: Also looks right to me.
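The trade-off in moving `.` from "removed" to "replaced with a space" can be sketched as follows. This is a simplified model under stated assumptions, not the actual WordPress settings: digits are assumed to be kept already, the connector replacement is assumed to run before the removal step, and `REMOVE_INCLUDING_DOT`, `REMOVE_EXCLUDING_DOT`, `DOT_CONNECTOR` and `count_words` are hypothetical names.

```python
import re

# Punctuation stripped outright; digits (U+0030-U+0039) are kept in both variants.
REMOVE_INCLUDING_DOT = re.compile(r"[\u0021-\u002F\u003A-\u0040]")
# Same set minus '.' (U+002E), which is handled as a connector instead.
REMOVE_EXCLUDING_DOT = re.compile(r"[\u0021-\u002D\u002F\u003A-\u0040]")
DOT_CONNECTOR = re.compile(r"\.")


def count_words(text, remove_re, connector_re=None):
    """Replace connectors with spaces, strip punctuation, count the runs."""
    if connector_re is not None:
        text = connector_re.sub(" ", text)
    text = remove_re.sub("", text)
    return len(text.split())


if __name__ == "__main__":
    for sample in ["e.g.", "fuzz@baz.blog", "$4.99"]:
        removed = count_words(sample, REMOVE_INCLUDING_DOT)
        connected = count_words(sample, REMOVE_EXCLUDING_DOT, DOT_CONNECTOR)
        print(sample, removed, connected)
```

In this model, treating `.` as a connector turns "e.g." into two words, and it also splits "fuzz@baz.blog" and "$4.99" into two each, which is the kind of counter-example raised in the list above.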

Try URLs. 😉

