Ticket #8759 (accepted task (blessed))

Opened 3 years ago

Last modified 4 days ago

Word count function doesn't work in several languages

Reported by: jim912
Owned by: nacin
Priority: low
Milestone: 3.4
Component: I18N
Version: 2.7
Severity: normal
Keywords: has-patch
Cc: sirzooro

Description

In multibyte languages such as Chinese and Japanese, the word count function introduced in version 2.6 doesn't work.

I implemented a character count function to use instead of the word count for these languages, and added it to the admin page.

I propose this function so that WordPress may bring convenience to more users.

Attachments

character_count.diff Download (5.7 KB) - added by jim912 3 years ago.
character_count.2.diff Download (4.0 KB) - added by Nano8Blazex 15 months ago.
patch attempt -> Google Code-in!
8759.diff Download (2.0 KB) - added by nacin 4 days ago.

Change History

jim912 3 years ago

  • Owner changed from anonymous to nbachiyski
  • Component changed from Administration to i18n

Looks good, except for the extra option in Settings -> General. It isn't worth showing it in the interface for all users if it applies only to a few.

A plugin can add the checkbox, if needed.

Alternative approach: we'd add the option and make it default to word; the PHP files that ship with localized versions of WP would then filter its value using option_count_method or something.

In many cases users just change the locale/add some translation files and don't get the locale.php files.

That's why in this case I'm ok with locale pattern matching. Otherwise the feature can be totally unusable.

  • Keywords needs-patch added; has-patch removed
  • Milestone changed from 2.8 to Future Release

broken patch

Patch needs to be updated to do this automatically based on the locale set currently.

No option needed.

patch attempt -> Google Code-in!

  • Keywords has-patch added; needs-patch removed
  • Keywords 3.2-early gci added
  • Owner changed from nbachiyski to westi
  • Status changed from new to assigned

Related, excerpt handling: #16079.

  • Keywords 3.2-early removed
  • Priority changed from normal to low
  • Milestone changed from Future Release to 3.2
  • Milestone changed from 3.2 to Future Release

Punting due to lack of activity before freeze.

There's a fix for this in the WP Multibyte Patch plugin, mentioned on WP Polyglots.

  • Milestone changed from Future Release to 3.4
  • Type changed from enhancement to task (blessed)
  • Version changed from 2.7 to 3.3

I suggest modifying the current zh_CN "algorithm", which will make life a lot easier.

Here's what our current zh_CN-word-count.js does: it removes HTML tags first, then English punctuation marks and Chinese punctuation marks. It then counts all "non-ASCII" characters, so at this point tc is the number of non-English characters. After that, it uses the original word-count.js method to count English words. At the end of the process, tc is the number of Chinese characters plus the number of English words.

The Chinese word-count.js file can be found at: http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
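The steps described above can be sketched roughly as follows. This is an illustration only, not the contents of zh_CN-word-count.js: the function name, the punctuation lists, and the exact regular expressions are all assumptions.

```javascript
// Sketch of the mixed counting described in this comment.
// mixedCount and both punctuation lists are hypothetical and incomplete.
function mixedCount(html) {
  const text = html
    .replace(/<[^>]*>/g, ' ')                  // 1. remove HTML tags
    .replace(/[!-\/:-@\[-\x60{-~]/g, '')       // 2. remove ASCII punctuation
    .replace(/[、。，！？：；「」『』（）]/g, ''); // 3. remove a sample of CJK punctuation

  // 4. Every remaining non-ASCII character counts as one unit.
  const nonAscii = (text.match(/[^\u0000-\u007F]/g) || []).length;

  // 5. Drop the non-ASCII characters and count whitespace-separated words.
  const asciiOnly = text.replace(/[^\u0000-\u007F]/g, '');
  const words = (asciiOnly.match(/\S+/g) || []).length;

  return nonAscii + words;
}

console.log(mixedCount('欢迎使用 WordPress。'));
```

Under these assumptions, accented Latin letters are counted once as non-ASCII characters and then dropped from their words, which is why a French sentence is over-counted.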

Points worth mentioning:

  • Please consider removing punctuation marks in other languages because counting them doesn't make sense.
  • As for the "Word count: %d" string, I suggest not changing wp-includes/script-loader.php, because translators can simply translate this string to the appropriate meaning for their language -- adding a translators' note will do the trick.
  • Our current zh_CN way doesn't handle languages that should be counted by words but use characters outside the ASCII set, like French (see examples below).
  • Naming: the names of variables in the current zh_CN-word-count.js are not accurate (e.g. settingsWestern and settingsAsian) -- I suggest re-naming them.

Some testing:

I tested the zh_CN script with test strings in multiple languages, and it turns out the zh_CN script can handle most cases. I therefore suggest basing the changes on the zh_CN JS file.

Chinese + English: PASS

欢迎使用 WordPress。

tc = 5: 4 Chinese chars + 1 English word. The 1 Chinese punctuation mark is not counted.

English only: PASS

Who says programmers don't have a sense of humor.

tc = 9.

Japanese + English: PASS

ログイン/ログアウト、管理、フィードと WordPress のリンク

tc = 21: 20 Japanese chars + 1 English word. The 1 English punctuation mark (slash) and 2 Japanese punctuation marks are not counted.

Burmese + English: ???

ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊ သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊ အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !

tc = 304 -- I need someone from Myanmar to help me out...

French: FAIL

Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.

tc = 15, but there are actually only 13 French words. The problem is probably caused by the ç and é characters, which are not in ASCII and are therefore counted as single characters. What we need to do is change t.SettingsAsian.count, currently:

/[^\u0000-\u007F]/g

to suit languages like French.
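One possible direction, sketched here under assumptions, is to match only CJK scripts instead of everything outside ASCII, so that accented Latin letters stay inside their words. The Unicode ranges below (Hiragana, Katakana, CJK Unified Ideographs and Extension A) are illustrative and incomplete, and cjkChar and countCjkAndWords are hypothetical names.

```javascript
// Hypothetical replacement for /[^\u0000-\u007F]/g: match CJK scripts only.
const cjkChar = /[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]/g;

function countCjkAndWords(text) {
  // Remove ASCII punctuation plus two common CJK punctuation marks (sample only).
  const stripped = text.replace(/[!-\/:-@\[-\x60{-~、。]/g, '');

  // Each CJK character counts as one unit.
  const cjk = (stripped.match(cjkChar) || []).length;

  // Everything else is counted as whitespace-separated words.
  const words = (stripped.replace(cjkChar, ' ').match(/\S+/g) || []).length;

  return cjk + words;
}

console.log(countCjkAndWords(
  'Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.'
));
```

With this pattern the French sentence is counted as 13 words, while the Chinese + English sample still gives 5.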

comment:16 follow-up: ↓ 18   SergeyBiryukov 8 weeks ago

  • Version changed from 3.3 to 2.7

Version number indicates when the bug was initially introduced/reported.

Nice!

We need to be very careful about speed here.

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

comment:18 in reply to: ↑ 16   jiehanzheng 8 weeks ago

Replying to SergeyBiryukov:

Version number indicates when the bug was initially introduced/reported.

Thanks, man!

Replying to nacin:

Nice!

We need to be very careful about speed here.

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

I guess the whitelist wouldn't get much longer, because some Asian languages share the same set of punctuation marks with Chinese, and more use the same ones as English. We may consider asking l10n maintainers about their own special punctuation marks.


Perhaps best would be to have several sets of regex there and choose the right one by looking at the lang setting (or body class/JS var). Thinking we can ask the localization teams to come up with the best regex for their language.

Another option would be to make the regex "pluggable". That would give a bit more flexibility to localizations but would require outputting a JS object with the replacement regex. As of 3.3 this can be done by using wp_localize_script() for word-count.js.
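Both ideas, choosing a regex by locale and letting a localization plug in its own, could look something like the sketch below. The table contents, the fallback rule, and the names localePatterns, patternFor and countForLocale are assumptions; how the override object would reach the script from PHP is not shown.

```javascript
// Hypothetical table of counting patterns keyed by locale or language code.
const localePatterns = new Map([
  ['default', /\S+/g], // word-based: runs of non-whitespace
  ['ja', /\S/g],       // character-based
  ['zh_CN', /\S/g],    // character-based
]);

// Pick a pattern: an explicit override wins, then the full locale,
// then the bare language code, then the default.
function patternFor(locale, override) {
  if (override instanceof RegExp) return override;
  const lang = locale.split('_')[0];
  return localePatterns.get(locale) || localePatterns.get(lang) || localePatterns.get('default');
}

function countForLocale(text, locale, override) {
  return (text.match(patternFor(locale, override)) || []).length;
}

console.log(countForLocale('two words', 'en_US'));        // word-based
console.log(countForLocale('two words', 'en_US', /\S/g)); // overridden to characters
```

The override parameter stands in for whatever a localization would supply; a plugin that wants a different counting rule would pass its own pattern the same way.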

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

Given the little support in PHP for determining these things, we could always add a small subset which supports most languages, followed by a translatable string for translators to specify additional characters which should be considered punctuation?
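A minimal sketch of that idea, written in JavaScript for illustration: a small built-in punctuation set is combined with an extra string that translators could provide. The base set and the name buildPunctuationRegex are hypothetical, and the way the translated string would be delivered to the script is left out.

```javascript
// Hypothetical base set that covers punctuation shared by many languages.
const basePunctuation = '.,;:!?()"\'';

// Build one character-class regex from the base set plus translator-supplied extras.
function buildPunctuationRegex(extra = '') {
  // Escape the characters that are special inside a character class.
  const escapeForClass = (ch) => ch.replace(/[\\\]\[^-]/g, '\\$&');
  const chars = Array.from(new Set(basePunctuation + extra))
    .map(escapeForClass)
    .join('');
  return new RegExp('[' + chars + ']', 'g');
}

// A Chinese translation might add its own marks:
const zhPunctuation = buildPunctuationRegex('。、，');
console.log('欢迎使用 WordPress。'.replace(zhPunctuation, ''));
```

Keeping the extras in a plain string means translators only list characters and never have to write a regular expression themselves.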

followed by a translatable string for translators to specify additional characters which should be considered punctuation?

Bingo.

jiehanzheng's method of summing up number of multibyte characters and English words sounds odd to me. I'm -1 on removing punctuation, too. Why don't we simply count the character length?

ログイン/ログアウト、管理、フィードと WordPress のリンク

should be counted as 34 characters. It looks more natural at least for Japanese usage.
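For reference, a plain character count along these lines can be sketched as below. The function name is hypothetical, and the sketch assumes that spaces and punctuation are included in the total, which is what gives 34 for the sample string.

```javascript
// Count every character after stripping HTML tags and trimming.
// Array.from splits by code point, so characters outside the BMP count once.
function characterCount(html) {
  const text = html.replace(/<[^>]*>/g, '').trim();
  return Array.from(text).length;
}

console.log(characterCount('ログイン/ログアウト、管理、フィードと WordPress のリンク'));
```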


comment:23 follow-up: ↓ 24   dd32 8 weeks ago

Why don't we simply count the character length?

Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.

comment:24 in reply to: ↑ 23   tenpura 8 weeks ago

Replying to dd32:

Why don't we simply count the character length?

Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.

I know. I meant that for locales that count by characters. The right regex should be chosen by lang setting like azaozz said (or as a user option).

comment:25 follow-up: ↓ 26   sirzooro 8 weeks ago

  • Cc sirzooro added

+1 for this. Please make it pluggable - I would like to count chars in different way (simple strlen() after stripping HTML tags and trimming).


comment:26 in reply to: ↑ 25   SergeyBiryukov 2 weeks ago

Replying to sirzooro:

Please make it pluggable - I would like to count chars in different way

It seems to be possible to deregister the bundled word-count script and register your own:
 http://i18n.trac.wordpress.org/browser/zh_CN/tags/3.3.1/dist/wp-content/languages/zh_CN.php#L276

@SergeyBiryukov: thanks, I will check this link.

  • Keywords gci removed
  • Owner changed from westi to nacin
  • Status changed from assigned to accepted

We essentially need to bridge the gap between the Chinese approach and the Japanese approach, then merge them into core's word-count.js and make that file understand whether the language requires character counting rather than word counting.

Let's have a discussion on this. I will reach out to both tenpura and jiehanzheng and attempt to schedule an IRC chat in #wordpress-polyglots.

comment:29 follow-up: ↓ 30   jiehanzheng 5 days ago

Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).

However, for the English part, we still prefer counting number of words instead of number of letters.

comment:30 in reply to: ↑ 29   nacin 5 days ago

Replying to jiehanzheng:

Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).

However, for the English part, we still prefer counting number of words instead of number of letters.

That does make things easier. I do agree that we should try to count the number of English words, rather than total characters, but I wouldn't call it a blocker if we don't. Anyone want to take a stab at a patch?

nacin 4 days ago

8759.diff started with the Japanese word-count.dev.js and merged it into core. It counts individual English characters, for now. How does it test?

In order to force character-based counting, set 'type' to 'c' in script-loader.php, or use:

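// Force character-based counting by translating the context string.
// Note: the closure syntax requires PHP 5.3 or later.
add_filter( 'gettext_with_context', function( $translated, $text, $context ) {
	if ( 'words' === $text && 'word count: words or characters?' === $context ) {
		return 'characters';
	}
	return $translated;
}, 10, 3 );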
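
On the JavaScript side, the effect of the 'type' flag can be pictured roughly as follows. This is a sketch under assumptions, not the code in 8759.diff: the patterns and the name countByType are hypothetical, and only the 'w' and 'c' values come from the comment above.

```javascript
// Hypothetical patterns keyed by the 'type' flag: 'w' for words, 'c' for characters.
const typePatterns = {
  w: /\S+/g, // runs of non-whitespace
  c: /\S/g,  // every non-whitespace character
};

function countByType(text, type = 'w') {
  const pattern = typePatterns[type] || typePatterns.w;
  return (text.match(pattern) || []).length;
}

console.log(countByType('管理 WordPress', 'w'));
console.log(countByType('管理 WordPress', 'c'));
```

In 'c' mode each Latin letter is counted individually, which matches the note that English characters are counted one by one for now.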