Make WordPress Core

Opened 15 years ago

Closed 12 years ago

#8759 closed task (blessed) (fixed)

Word count function doesn't work in several languages

Reported by: jim912 Owned by: nacin
Milestone: 3.4 Priority: low
Severity: normal Version: 2.7
Component: I18N Keywords: has-patch needs-testing
Focuses: Cc:

Description

In multibyte languages such as Chinese and Japanese, the word count function introduced in version 2.6 doesn't work.

I implemented a character count function to use instead of the word count for these languages, and added it to the admin page.

I propose this function so that WordPress can be convenient for more users.

Attachments (7)

character_count.diff (5.7 KB) - added by jim912 15 years ago.
character_count.2.diff (4.0 KB) - added by Nano8Blazex 13 years ago.
patch attempt -> Google Code-in!
8759.diff (2.0 KB) - added by nacin 12 years ago.
8759.2.diff (1.3 KB) - added by tenpura 12 years ago.
counting spaces, too?
8759.3.diff (5.1 KB) - added by jiehanzheng 12 years ago.
Enables translators to control "Count Latin by", "Count Latin spaces", "Count East Asia punctuation marks" separately from their pomo translations. With this patch, requirements of Japanese and Chinese team are met and flexibility is ensured.
8759.4.diff (1.7 KB) - added by jiehanzheng 12 years ago.
Patch that supports only Chinese. The file itself to follow.
word-count.dev.js (1.3 KB) - added by jiehanzheng 12 years ago.
Only supports Chinese

Change History (55)

#1 @Denis-de-Bernardy
15 years ago

  • Component changed from Administration to i18n
  • Owner changed from anonymous to nbachiyski

#2 @nbachiyski
15 years ago

Looks good, except for the extra option in Settings -> General. It isn't worth showing it in the interface for all users if it applies only to a few.

A plugin can add the checkbox, if needed.

#3 @Denis-de-Bernardy
15 years ago

Alternative approach: we'd add the option, make it default to word, and the PHP files bundled with localized versions of WP would then filter its result using option_count_method or something.

#4 @nbachiyski
15 years ago

In many cases users just change the locale/add some translation files and don't get the locale.php files.

That's why in this case I'm ok with locale pattern matching. Otherwise the feature can be totally unusable.

#5 @Denis-de-Bernardy
15 years ago

  • Keywords needs-patch added; has-patch removed
  • Milestone changed from 2.8 to Future Release

broken patch

#6 @westi
13 years ago

Patch needs to be updated to do this automatically based on the locale set currently.

No option needed.

@Nano8Blazex
13 years ago

patch attempt -> Google Code-in!

#7 @Nano8Blazex
13 years ago

  • Keywords has-patch added; needs-patch removed

#8 @westi
13 years ago

  • Keywords 3.2-early gci added
  • Owner changed from nbachiyski to westi
  • Status changed from new to assigned

#9 @nacin
13 years ago

Related, excerpt handling: #16079.

#10 @westi
13 years ago

  • Keywords 3.2-early removed
  • Milestone changed from Future Release to 3.2
  • Priority changed from normal to low

#11 @jane
13 years ago

  • Milestone changed from 3.2 to Future Release

Punting due to lack of activity before freeze.

#12 @SergeyBiryukov
12 years ago

There's a fix for this in WP Multibyte Patch plugin mentioned on WP Polyglots.

#13 @nacin
12 years ago

  • Milestone changed from Future Release to 3.4

#14 @nacin
12 years ago

  • Type changed from enhancement to task (blessed)

#15 @jiehanzheng
12 years ago

  • Version changed from 2.7 to 3.3

I suggest modifying the current zh_CN "algorithm", which will make life a lot easier.

Here's what our current zh_CN-word-count.js does: it first removes HTML tags, then English punctuation marks and Chinese punctuation marks. It then counts all "non-ASCII" characters; at this point the value of tc is the number of non-English characters. After that, it uses the original word-count.js method to count English words and adds them. At the end of the process, tc is the number of Chinese characters plus the number of English words.

The Chinese word-count.js file can be found at:
http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
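
As a rough illustration of the steps described above, here is a minimal sketch in JavaScript. It is not the actual zh_CN-word-count.js: the function name `countMixed` and the punctuation ranges are assumptions chosen for the example.

```javascript
// Hypothetical sketch of the approach described above; not the real zh_CN script.
function countMixed(html) {
  const stripped = html
    .replace(/<\/?[a-z][^>]*>/gi, '')          // 1. remove HTML tags
    .replace(/[.(),;:!?%#$'"_+=\\\/-]+/g, '')  // 2. remove ASCII punctuation
    // 3. remove CJK punctuation (the CJK Symbols and Punctuation block plus a
    //    few full-width forms; these ranges are an assumption)
    .replace(/[\u3000-\u303F\uFF01-\uFF0F\uFF1A-\uFF1F]/g, '');

  // 4. every remaining non-ASCII character counts as one
  const nonAscii = stripped.match(/[^\u0000-\u007F]/g) || [];

  // 5. drop those characters and count what is left as whitespace-separated words
  const words = stripped.replace(/[^\u0000-\u007F]/g, '').match(/\S+/g) || [];

  return nonAscii.length + words.length;
}

console.log(countMixed('欢迎使用 WordPress。'));
```

Under these assumptions the sketch gives the same totals as the tests listed in this comment, including the inflated count for the French sentence.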

Points worth mentioning:

  • Please consider removing punctuation marks in other languages because counting them doesn't make sense.
  • As for the "Word count: %d" string, I suggest not changing wp-includes/script-loader.php, because translators can simply translate this string to the appropriate meaning for their language -- adding a translators' note will do the trick.
  • Our current zh_CN approach doesn't handle languages that should be counted by words but use characters outside the ASCII set, like French (see examples below).
  • Naming: the names of variables in the current zh_CN-word-count.js are not accurate (e.g. settingsWestern and settingsAsian) -- I suggest re-naming them.

Some testing:

I tested the zh_CN script with test strings in multiple languages, and it turns out the zh_CN script can handle most cases. Therefore I suggest basing the changes on the zh_CN js file.

Chinese + English: PASS

欢迎使用 WordPress。

tc = 5: 4 Chinese chars + 1 English word (the 1 Chinese punctuation mark is not counted).

English only: PASS

Who says programmers don't have a sense of humor.

tc = 9.

Japanese + English: PASS

ログイン/ログアウト、管理、フィードと WordPress のリンク

tc = 21: 20 Japanese chars + 1 English word (the 1 English punctuation mark, a slash, and the 2 Japanese punctuation marks are not counted).

Burmese + English: ???

ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊ သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊ အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !

tc = 304 -- I need someone from Myanmar to help me out...

French: FAIL

Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.

tc = 15, but there are actually only 13 French words. The problem is probably caused by the ç and é characters, which are not part of ASCII and are therefore counted as single characters. What we need to do is change t.SettingsAsian.count, currently:

/[^\u0000-\u007F]/g

to suit languages like French.
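
One possible adjustment, offered only as a sketch: widen the excluded range so that accented Latin letters are treated as parts of words. The upper bound U+024F (end of Latin Extended-B) and the helper name `countWith` are assumptions, not something proposed in this ticket.

```javascript
const asciiOnly = /[^\u0000-\u007F]/g;  // current pattern: ç and é count as characters
const latinAware = /[^\u0000-\u024F]/g; // assumed alternative: accented Latin stays in words

// Hypothetical helper: count characters matched by `re`, plus the remaining words.
function countWith(re, text) {
  const cleaned = text.replace(/[.,;:!?]/g, ''); // drop basic ASCII punctuation
  const chars = cleaned.match(re) || [];
  const words = cleaned.replace(re, '').match(/\S+/g) || [];
  return chars.length + words.length;
}

const french = 'Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.';
console.log(countWith(asciiOnly, french));
console.log(countWith(latinAware, french));
```

The wider range also covers a few non-letters from Latin-1 (such as × and «), so a real pattern would likely need finer tuning.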

#16 follow-up: @SergeyBiryukov
12 years ago

  • Version changed from 3.3 to 2.7

Version number indicates when the bug was initially introduced/reported.

#17 @nacin
12 years ago

Nice!

We need to be very careful about speed here.

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

#18 in reply to: ↑ 16 @jiehanzheng
12 years ago

Replying to SergeyBiryukov:

Version number indicates when the bug was initially introduced/reported.

Thanks, man!

Replying to nacin:

Nice!

We need to be very careful about speed here.

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

I guess the whitelist wouldn't get much longer, because some Asian languages share the same set of punctuation marks with Chinese, and more use the same marks as English. We may consider asking l10n maintainers about their own special punctuation marks.

#19 @azaozz
12 years ago

Perhaps best would be to have several sets of regex there and choose the right one by looking at the lang setting (or body class/JS var). Thinking we can ask the localization teams to come up with the best regex for their language.

Another option would be to make the regex "pluggable". That would give a bit more flexibility to localizations but would require outputting a JS object with the replacement regex. As of 3.3 this can be done by using wp_localize_script() for word-count.js.
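
A minimal sketch of how the two options could fit together: a built-in table of patterns keyed by language, plus an override object of the kind a localization could output. All names here (`countSettings`, `settingsFor`, the `pattern` key) are hypothetical and not part of word-count.js.

```javascript
// Hypothetical built-in patterns, keyed by language code.
const countSettings = {
  default: { pattern: '\\S+' }, // whitespace-separated words
  ja: { pattern: '\\S' },       // every non-space character
  zh: { pattern: '\\S' },
};

// Pick settings for a locale such as "ja" or "zh_CN", then apply any overrides.
function settingsFor(locale, overrides = {}) {
  const lang = locale.split(/[-_]/)[0];
  const base = countSettings[locale] || countSettings[lang] || countSettings.default;
  return { ...base, ...overrides };
}

function count(text, locale, overrides) {
  const { pattern } = settingsFor(locale, overrides);
  return (text.match(new RegExp(pattern, 'g')) || []).length;
}

console.log(count('ab cd', 'en_US'));                    // counted by words
console.log(count('ab cd', 'en_US', { pattern: '\\S' })); // overridden to characters
```

Passing the pattern as a string rather than a RegExp object keeps it serializable, which matters if the override is emitted from PHP as JSON.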

#20 @dd32
12 years ago

Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?

Given the little support in PHP for determining these things, we could always add a small subset which supports most languages, followed by a translatable string for translators to specify additional characters which should be considered punctuation?
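
To illustrate the idea, here is a sketch that builds one punctuation pattern from a small base set plus a translator-supplied string. The base set, the function names, and the sample extra characters are all assumptions made for the example.

```javascript
// Hypothetical base set shared by most languages.
const BASE_PUNCTUATION = '.,;:!?\'"()-';

// Escape the characters that are special inside a regex character class.
function escapeForCharClass(chars) {
  return chars.replace(/[\\\]\[^-]/g, '\\$&');
}

// `extra` stands in for the translated string of additional punctuation marks.
function buildPunctuationRegex(extra = '') {
  return new RegExp('[' + escapeForCharClass(BASE_PUNCTUATION + extra) + ']', 'g');
}

const zhPunctuation = buildPunctuationRegex('。、,!?');
console.log('欢迎使用 WordPress。'.replace(zhPunctuation, ''));
```

Because the pattern is compiled once per page load, the cost of a longer character class should be small, though that has not been measured here.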

#21 @nacin
12 years ago

followed by a translatable string for translators to specify additional characters which should be considered punctuation?

Bingo.

#22 @tenpura
12 years ago

jiehanzheng's method of summing up number of multibyte characters and English words sounds odd to me. I'm -1 on removing punctuation, too. Why don't we simply count the character length?

ログイン/ログアウト、管理、フィードと WordPress のリンク

should be counted as 34 characters. It looks more natural at least for Japanese usage.

#23 follow-up: @dd32
12 years ago

Why don't we simply count the character length?

Because in some languages character counts are useless (for example, English), just as English "word counts" may be useless in Japanese.

#24 in reply to: ↑ 23 @tenpura
12 years ago

Replying to dd32:

Why don't we simply count the character length?

Because in some languages character counts are useless (for example, English), just as English "word counts" may be useless in Japanese.

I know. I meant that for locales that count by characters. The right regex should be chosen by the lang setting, as azaozz said (or as a user option).

#25 follow-up: @sirzooro
12 years ago

  • Cc sirzooro added

+1 for this. Please make it pluggable - I would like to count chars in a different way (simple strlen() after stripping HTML tags and trimming).

#26 in reply to: ↑ 25 @SergeyBiryukov
12 years ago

Replying to sirzooro:

Please make it pluggable - I would like to count chars in a different way

It seems to be possible to deregister the bundled word-count script and register your own:
http://i18n.trac.wordpress.org/browser/zh_CN/tags/3.3.1/dist/wp-content/languages/zh_CN.php#L276

#27 @sirzooro
12 years ago

@SergeyBiryukov: thanks, I will check this link.

#28 @nacin
12 years ago

  • Keywords gci removed
  • Owner changed from westi to nacin
  • Status changed from assigned to accepted

We essentially need to bridge the gap between the Chinese approach and the Japanese approach, then merge them into core's word-count.js, and make that file understand whether the language requires character counting rather than word counting.

Let's have a discussion on this. I will reach out to both tenpura and jiehanzheng and attempt to schedule an IRC chat in #wordpress-polyglots.

#29 follow-up: @jiehanzheng
12 years ago

Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).

However, for the English part, we still prefer counting number of words instead of number of letters.

#30 in reply to: ↑ 29 @nacin
12 years ago

Replying to jiehanzheng:

Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).

However, for the English part, we still prefer counting number of words instead of number of letters.

That does make things easier. I do agree that we should try to count the number of English words, rather than total characters, but I wouldn't call it a blocker if we don't. Anyone want to take a stab at a patch?

@nacin
12 years ago

#31 @nacin
12 years ago

8759.diff started with the Japanese word-count.dev.js and merged it into core. It counts individual English characters, for now. How does it test?

#32 @nacin
12 years ago

In order to force character-based counting, set 'type' to 'c' in script-loader.php, or use:

// Force character-based counting by overriding the translated count type.
add_filter( 'gettext_with_context', function( $translated, $text, $context ) {
	if ( 'words' === $text && 'word count: words or characters?' === $context ) {
		return 'characters';
	}
	return $translated;
}, 10, 3 );

#33 @nacin
12 years ago

Trying this out in core. Let's see what feedback it generates.

#34 @nacin
12 years ago

In [19966]:

Allow counting by characters in lieu of a word count, for East Asian languages. First pass. see #8759.

@tenpura
12 years ago

counting spaces, too?

#35 follow-up: @tenpura
12 years ago

In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary. It adds trimming, removal of redundant spaces, and a regexp for counting non-breaking/double-byte space characters.

#36 @Nao
12 years ago

tenpura's patch 8759.2.diff correctly worked for me in Japanese. Tested with 3.4 beta 1.

@jiehanzheng
12 years ago

Enables translators to control "Count Latin by", "Count Latin spaces", "Count East Asia punctuation marks" separately from their pomo translations. With this patch, requirements of Japanese and Chinese team are met and flexibility is ensured.

#37 @jiehanzheng
12 years ago

In 8759.3.diff, I made the options "Count Latin by", "Count Latin spaces", "Count East Asia punctuation marks" separately-controllable, which gives more control to translators. For example, the Chinese team could translate the control strings like this:

"Count Latin by" --> "words"
"Count Latin spaces" --> "no"
"Count East Asia punctuation marks" --> "yes"

And the Japanese team could set the strings like this:

"Count Latin by" --> "characters"
"Count Latin spaces" --> "yes"
"Count East Asia punctuation marks" --> "yes"
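
The following sketch shows how three such switches could drive a single counter. It is not the code in 8759.3.diff: the option names (`latinBy`, `countLatinSpaces`, `countEastAsianPunctuation`), the function name, and the character ranges are assumptions, and in this sketch `countLatinSpaces` only has an effect when the Latin part is counted by characters.

```javascript
// Rough character classes; the ranges are assumptions and cover the BMP only.
const EAST_ASIAN = /[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]/; // kana and CJK ideographs
const EAST_ASIAN_PUNCT = /[\u3001-\u303F\uFF01-\uFF0F\uFF1A-\uFF1F]/;
const ASCII_PUNCT = /[!-\/:-@\[-`{-~]/;

function countText(text, opts) {
  const byWords = opts.latinBy === 'words';
  let total = 0;
  let inWord = false; // only used when counting the Latin part by words

  for (const ch of text) {
    if (/\s/.test(ch)) {
      if (!byWords && opts.countLatinSpaces) total += 1;
      inWord = false;
    } else if (EAST_ASIAN.test(ch)) {
      total += 1;
      inWord = false;
    } else if (EAST_ASIAN_PUNCT.test(ch)) {
      if (opts.countEastAsianPunctuation) total += 1;
      inWord = false;
    } else if (byWords) {
      // ASCII punctuation is skipped; any other character starts or continues a word
      if (!ASCII_PUNCT.test(ch) && !inWord) {
        total += 1;
        inWord = true;
      }
    } else {
      total += 1; // character mode: every remaining character counts
    }
  }
  return total;
}

const zhOptions = { latinBy: 'words', countLatinSpaces: false, countEastAsianPunctuation: true };
const jaOptions = { latinBy: 'characters', countLatinSpaces: true, countEastAsianPunctuation: true };
console.log(countText('欢迎使用 WordPress。', zhOptions));
console.log(countText('ログイン/ログアウト、管理、フィードと WordPress のリンク', jaOptions));
```

Classifying one character at a time keeps the three switches independent of each other, at the cost of being slower than a few global regex replacements.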

Known issue:

  • I am unable to match Unicode chars which have five hex digits, maybe due to JavaScript RegExp restrictions? Does anyone have solutions? Although those characters that have five hex digits are used very rarely, and should only affect Chinese characters.
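
For reference, a sketch of two ways to match such characters in JavaScript, using CJK Unified Ideographs Extension B (U+20000 to U+2A6DF) as the example range. The variable names are illustrative. The `u` flag was added in ES2015, so only the surrogate-pair form works in older engines.

```javascript
// Without the `u` flag a regex sees UTF-16 code units, so code points above
// U+FFFF have to be written as surrogate pairs.
// U+20000 is \uD840\uDC00 and U+2A6DF is \uD869\uDEDF.
const extBSurrogates = /[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF]/g;

// With the `u` flag, code points above U+FFFF can be written directly.
const extBCodePoints = /[\u{20000}-\u{2A6DF}]/gu;

// One BMP ideograph (中) surrounded by the first and last Extension B code points.
const sample = 'a' + String.fromCodePoint(0x20000) + '中' + String.fromCodePoint(0x2A6DF);
console.log(sample.match(extBSurrogates).length);
console.log(sample.match(extBCodePoints).length);
```

Both patterns find the same two characters in the sample and skip the BMP ideograph.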

Notes: The Unicode ranges for East Asia characters listed in settingsEastAsia.count are copied from http://www.unicode.org/charts/.

#38 @jiehanzheng
12 years ago

  • Keywords needs-testing added

#39 in reply to: ↑ 35 ; follow-ups: @nacin
12 years ago

Replying to tenpura:

In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary.

Could you explain that?

Replying to jiehanzheng:

With this patch, requirements of Japanese and Chinese team are met and flexibility is ensured.

This seems like it is presenting a solution, rather than solving a particular problem.

As of right now, 3.4 is going to be released with [19966]. What in particular needs to be done that would make this okay for 3.4 for both the Chinese and Japanese teams? By that, I mean, not any worse than 3.3. Best I can tell, the current script is pretty close to what was in the localized builds.

#40 in reply to: ↑ 39 ; follow-up: @jiehanzheng
12 years ago

Replying to nacin:

This seems like it is presenting a solution, rather than solving a particular problem.

As of right now, 3.4 is going to be released with [19966]. What in particular needs to be done that would make this okay for 3.4 for both the Chinese and Japanese teams? By that, I mean, not any worse than 3.3. Best I can tell, the current script is pretty close to what was in the localized builds.

Please understand that I am trying to solve a particular problem: the patch 8759.diff does not work for Chinese.

Needless to say, what needs to be done for the Chinese team is to make word-count.js count Latin part by words, not characters. 8759.diff and 8759.2.diff indeed solve the problem the Japanese team is facing, but they do not work for Chinese.

#41 in reply to: ↑ 40 ; follow-up: @nacin
12 years ago

Replying to jiehanzheng:

what needs to be done for the Chinese team is to make word-count.js count Latin part by words, not characters.

That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js

#42 in reply to: ↑ 41 ; follow-up: @jiehanzheng
12 years ago

Replying to nacin:

That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js

Yes you could say that, except that Chinese punctuation marks do not need to be removed anymore. Do you want me to submit a new version (counting Chinese punctuation marks and also better variable naming) of this, without Japanese support?

As for what exactly we want, please see this comment. Thanks.

#43 in reply to: ↑ 42 @nacin
12 years ago

Replying to jiehanzheng:

Replying to nacin:

That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js

Yes you could say that, except that Chinese punctuation marks do not need to be removed anymore. Do you want me to submit a new version (counting Chinese punctuation marks and also better variable naming) of this, without Japanese support?

Sure, that could be helpful.

As for what exactly we want, please see this comment. Thanks.

Okay. I was asking from the code perspective.

@jiehanzheng
12 years ago

Patch that supports only Chinese. The file itself to follow.

@jiehanzheng
12 years ago

Only supports Chinese

#44 in reply to: ↑ 39 @tenpura
12 years ago

Replying to nacin:

Replying to tenpura:

In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary.

Could you explain that?

What we (Japanese) need is simple mb_strlen()-like character counting (with the ability to remove spaces that web browsers ignore when rendering HTML). Thus 8759.2.diff counts "a b c " as 5 characters. In contrast, 8759.diff counts "a b c " as 3 characters, which is far from what we expected and useless to us.

I noticed that sirzooro's needs seem exactly the same as ours (his language is not East Asian, is it?). For that reason, 8759.2.diff's character counting method should be called something like "general character counting" rather than "East Asian" to avoid confusion.
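
A minimal sketch of the "general character counting" described above, written independently of 8759.2.diff. The function names are hypothetical. Non-breaking and double-byte spaces are deliberately left in place so that they count as characters.

```javascript
// Hypothetical helper: mb_strlen()-like count of the text a browser would render.
function countCharacters(html) {
  const text = html
    .replace(/<[^>]*>/g, '')       // strip HTML tags
    .replace(/[ \t\r\n\f]+/g, ' ') // collapse runs of ASCII whitespace
    .replace(/^ | $/g, '');        // trim the leading and trailing space
  // Array.from splits by code point, so characters outside the BMP count once.
  return Array.from(text).length;
}

// For contrast: a count that ignores all whitespace.
function countNonSpace(html) {
  return Array.from(html.replace(/<[^>]*>/g, '').replace(/\s/g, '')).length;
}

console.log(countCharacters('a b c '));
console.log(countNonSpace('a b c '));
```

With these definitions "a b c " gives 5 and 3 respectively, the two figures quoted in the comment above.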

#45 @westi
12 years ago

We should close this for 3.4 and raise a new ticket for 3.5 for further enhancements.

What we have in 3.4 is better than 3.3.

#46 @westi
12 years ago

  • Resolution set to fixed
  • Status changed from accepted to closed

Follow on ticket for future enhancements - #20738

Closing this as done for 3.4

#47 follow-up: @jiehanzheng
12 years ago

  • Resolution fixed deleted
  • Status changed from closed to reopened

This should NOT be marked "fixed" since the statement "Word count function doesn't work in several languages" still holds true.

#48 in reply to: ↑ 47 @westi
12 years ago

  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to jiehanzheng:

This should NOT be marked "fixed" since the statement "Word count function doesn't work in several languages" still holds true.

This is marked fixed because this is as fixed as it is going to be for 3.4

We have a new ticket for future enhancements as detailed above.
