Ticket #8759 (accepted task (blessed))
Word count function doesn't work in several languages
| Reported by: | | Owned by: | |
|---|---|---|---|
| Priority: | low | Milestone: | 3.4 |
| Component: | I18N | Version: | 2.7 |
| Severity: | normal | Keywords: | has-patch |
| Cc: | sirzooro |
Description
In multibyte languages such as Chinese and Japanese, the word count function introduced in version 2.6 doesn't work.
I implemented character count functions to replace the word count for these languages and added them to the admin page.
I propose this change so that WordPress is convenient for more users.
Attachments
Change History
- Owner changed from anonymous to nbachiyski
- Component changed from Administration to i18n
comment:2
nbachiyski — 3 years ago
Looks good, except for the extra option in Settings -> General. It isn't worth showing it in the interface to all users if it applies only to a few.
A plugin can add the checkbox, if needed.
Alternative approach: we'd add the option, make it default to word, and the PHP files for localized versions of WP would then filter its result using option_count_method or something.
comment:4
nbachiyski — 3 years ago
In many cases users just change the locale/add some translation files and don't get the locale.php files.
That's why in this case I'm ok with locale pattern matching. Otherwise the feature can be totally unusable.
- Keywords needs-patch added; has-patch removed
- Milestone changed from 2.8 to Future Release
broken patch
Patch needs to be updated to do this automatically based on the locale set currently.
No option needed.
Nano8Blazex — 15 months ago
- attachment character_count.2.diff added
patch attempt -> Google Code-in!
- Keywords 3.2-early gci added
- Owner changed from nbachiyski to westi
- Status changed from new to assigned
comment:10
westi — 11 months ago
- Keywords 3.2-early removed
- Priority changed from normal to low
- Milestone changed from Future Release to 3.2
comment:11
jane — 10 months ago
- Milestone changed from 3.2 to Future Release
Punting due to lack of activity before freeze.
comment:12
SergeyBiryukov — 4 months ago
There's a fix for this in WP Multibyte Patch plugin mentioned on WP Polyglots.
comment:15
jiehanzheng — 2 months ago
- Version changed from 2.7 to 3.3
I suggest modifying the current zh_CN "algorithm", which will make life a lot easier.
Here's what our current zh_CN-word-count.js does: it removes HTML tags first, then English punctuation marks and Chinese punctuation marks. It then counts all "non-ASCII" characters, so at this point the value of tc is the number of non-English characters. After that, we use the original word-count.js method to count English words. At the end of the process, tc is the number of Chinese characters plus the number of English words.
The Chinese word-count.js file can be found at: http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
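The steps described above can be sketched roughly as follows. This is an illustration only, not the actual zh_CN-word-count.js: the function name, the constant names and the exact punctuation ranges are assumptions. Non-ASCII characters are deleted (not replaced by a space) before the English word pass, which is what makes the French example in the tests below come out as 15.

```javascript
// Hypothetical sketch of the mixed counting described in this comment.
// All names and character ranges here are assumptions for illustration.
const STRIP_TAGS = /<\/?[a-z][^>]*>/gi;
// ASCII punctuation and symbols (digits and letters are kept).
const ASCII_PUNCT = /[!-\/:-@\[-\x60{-~]/g;
// General punctuation, CJK symbols and punctuation, fullwidth punctuation.
const CJK_PUNCT = /[\u2000-\u206F\u3000-\u303F\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65]/g;
const NON_ASCII = /[^\u0000-\u007F]/g;

function countMixedSketch(text) {
  const cleaned = text
    .replace(STRIP_TAGS, ' ')   // 1. remove HTML tags
    .replace(ASCII_PUNCT, '')   // 2. remove English punctuation
    .replace(CJK_PUNCT, ' ');   // 3. remove Chinese/Japanese punctuation

  // 4. every remaining non-ASCII character counts as one unit
  const chars = (cleaned.match(NON_ASCII) || []).length;

  // 5. drop the non-ASCII characters and count what is left as words
  const words = cleaned
    .replace(NON_ASCII, '')
    .split(/\s+/)
    .filter(Boolean).length;

  return chars + words;
}
```

Note that the match in step 4 counts UTF-16 code units, so characters outside the Basic Multilingual Plane would be counted twice in this sketch.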
Points worth mentioning:
- Please consider removing punctuation marks in other languages because counting them doesn't make sense.
- As for the "Word count: %d" string, I suggest not to make changes to wp-includes/script-loader.php, because translators can simply translate this string to their corresponding meanings when translating -- simply adding a translators' note will do the trick.
- Our current zh_CN approach doesn't handle languages that should be counted by words but use characters outside the ASCII set, like French (see examples below).
- Naming: the names of variables in the current zh_CN-word-count.js are not accurate (e.g. settingsWestern and settingsAsian) -- I suggest re-naming them.
Some testing:
I tested the zh_CN script with some test strings in multiple languages, and it turns out zh_CN script can handle most cases. Therefore I suggest modifying based on the zh_CN js file.
Chinese + English: PASS
欢迎使用 WordPress。
tc = 5: 4 Chinese chars and 1 English word; the 1 Chinese punctuation mark is not counted.
English only: PASS
Who says programmers don't have a sense of humor.
tc = 9.
Japanese + English: PASS
ログイン/ログアウト、管理、フィードと WordPress のリンク
tc = 21: 20 Japanese chars and 1 English word; the 1 English punctuation mark (slash) and 2 Japanese punctuation marks are not counted.
Burmese + English: ???
ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊ သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊ အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !
tc = 304 -- I need someone from Myanmar to help me out...
French: FAIL
Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.
tc = 15, but there are actually only 13 French words. The problem is probably caused by the ç and é characters, which are outside ASCII and are therefore counted as single characters. What we need to do is change t.SettingsAsian.count, currently:
/[^\u0000-\u007F]/g
to suit languages like French.
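One possible direction, sketched under assumptions: instead of counting every non-ASCII character, count only characters from explicit CJK ranges, so that accented Latin letters stay inside their words. The ranges and names below are illustrative choices, not a proposal from the ticket, and they do not cover every script that needs character counting.

```javascript
// Hypothetical alternative to /[^\u0000-\u007F]/g: count only Hiragana,
// Katakana and common Han ranges as single characters (assumed ranges).
const CJK_LETTER = /[\u3040-\u30FF\u3400-\u4DBF\u4E00-\u9FFF]/g;
const CJK_MARKS = /[\u3000-\u303F\uFF00-\uFF0F]/g;

function countCjkAware(text) {
  const chars = (text.match(CJK_LETTER) || []).length;
  const words = text
    .replace(CJK_LETTER, ' ')       // counted characters are not words
    .replace(CJK_MARKS, ' ')        // CJK punctuation is not counted
    .replace(/[.,;:!?\/]/g, '')     // a small ASCII punctuation subset
    .split(/\s+/)
    .filter(Boolean).length;
  return chars + words;
}
```

With this sketch the French sentence above yields 13, while the Chinese and Japanese examples keep their values of 5 and 21.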
comment:16
follow-up:
↓ 18
SergeyBiryukov — 2 months ago
- Version changed from 3.3 to 2.7
Version number indicates when the bug was initially introduced/reported.
comment:17
nacin — 2 months ago
Nice!
We need to be very careful about speed here.
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
comment:18
in reply to:
↑ 16
jiehanzheng — 2 months ago
Replying to SergeyBiryukov:
Version number indicates when the bug was initially introduced/reported.
Thanks, man!
Replying to nacin:
Nice!
We need to be very careful about speed here.
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
I guess the whitelist wouldn't get too much longer because some Asian languages share the same set of punctuation marks with Chinese, and more use the same as English. We may consider asking l10n maintainers about their own special punctuation marks.
comment:19
azaozz — 2 months ago
Perhaps best would be to have several sets of regex there and choose the right one by looking at the lang setting (or body class/JS var). Thinking we can ask the localization teams to come up with the best regex for their language.
Another option would be to make the regex "pluggable". That would give a bit more flexibility to localizations but would require outputting a JS object with the replacement regex. As of 3.3 this can be done by using wp_localize_script() for word-count.js.
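Both options can be sketched together: a built-in table of regexes keyed by language, plus an optional overrides object that a localization could output. Everything here is hypothetical, including the table contents, the language keys and the function names; it only illustrates the lookup-then-override idea.

```javascript
// Hypothetical per-language settings; real entries would come from the
// localization teams. 'unit' matches one counted unit (word or character).
const REGEX_BY_LANG = {
  default: { type: 'w', unit: /\S+/g },
  ja: { type: 'c', unit: /\S/g },
  zh: { type: 'c', unit: /\S/g },
};

// Look up the exact lang, then its primary subtag, then the default,
// and let a localization-supplied object override individual fields.
function settingsFor(lang, overrides = {}) {
  const base =
    REGEX_BY_LANG[lang] ||
    REGEX_BY_LANG[lang.split('-')[0]] ||
    REGEX_BY_LANG.default;
  return Object.assign({}, base, overrides[lang] || {});
}

function countUnits(text, lang, overrides) {
  const settings = settingsFor(lang, overrides);
  return (text.match(settings.unit) || []).length;
}
```

For example, `countUnits('ab cd', 'en')` counts words, while passing `{ en: { unit: /\S/g } }` as the third argument switches the same call to character counting.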
comment:20
dd32 — 2 months ago
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
Given the little support in PHP for determining these things, we could always add a small subset which supports most languages, followed by a translatable string for translators to specify additional characters which should be considered punctuation?
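A rough sketch of that idea on the JavaScript side, assuming the translated string of extra punctuation has already been passed to the script: the base subset and the function names are made up for illustration. The extra characters are escaped so that a translator entering `-` or `]` cannot break the character class.

```javascript
// Hypothetical base subset shared by most languages.
const BASE_PUNCTUATION = '.,;:!?()"\'';

// Escape the characters that are special inside a regex character class.
function escapeForClass(chars) {
  return chars.replace(/[\\\]\[^-]/g, '\\$&');
}

// 'extra' stands for the translator-supplied string of additional marks.
function buildPunctuationRegex(extra) {
  return new RegExp('[' + escapeForClass(BASE_PUNCTUATION + (extra || '')) + ']', 'g');
}
```

A Chinese translation could then supply a string such as `。、` and the counter would strip those marks along with the base set.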
comment:21
nacin — 2 months ago
followed by a translatable string for translators to specify additional characters which should be considered punctuation?
Bingo.
comment:22
tenpura — 2 months ago
jiehanzheng's method of summing up number of multibyte characters and English words sounds odd to me. I'm -1 on removing punctuation, too. Why don't we simply count the character length?
ログイン/ログアウト、管理、フィードと WordPress のリンク
should be counted as 34 characters. It looks more natural at least for Japanese usage.
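The figure of 34 corresponds to a plain length count that includes punctuation and spaces. A minimal sketch, assuming tags are stripped and the text is trimmed first (the function and constant names are hypothetical):

```javascript
// Hypothetical plain character count: strip tags, trim, count code points.
const TAG_PATTERN = /<\/?[a-z][^>]*>/gi;

function countCharacters(html) {
  const text = html.replace(TAG_PATTERN, '').trim();
  // Array.from iterates by code point, so surrogate pairs count once.
  return Array.from(text).length;
}
```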
comment:23
follow-up:
↓ 24
dd32 — 2 months ago
Why don't we simply count the character length?
Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.
comment:24
in reply to:
↑ 23
tenpura — 2 months ago
Replying to dd32:
Why don't we simply count the character length?
Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.
I know. I meant that for locales that count by characters. The right regex should be chosen by lang setting like azaozz said (or as a user option).
comment:25
follow-up:
↓ 26
sirzooro — 2 months ago
- Cc sirzooro added
+1 for this. Please make it pluggable - I would like to count chars in a different way (a simple strlen() after stripping HTML tags and trimming).
comment:26
in reply to:
↑ 25
SergeyBiryukov — 4 weeks ago
Replying to sirzooro:
Please make it pluggable - I would like to count chars in a different way
It seems to be possible to deregister the bundled word-count script and register your own:
http://i18n.trac.wordpress.org/browser/zh_CN/tags/3.3.1/dist/wp-content/languages/zh_CN.php#L276
comment:27
sirzooro — 3 weeks ago
@SergeyBiryukov: thanks, I will check this link.
comment:28
nacin — 2 weeks ago
- Keywords gci removed
- Owner changed from westi to nacin
- Status changed from assigned to accepted
We essentially need to bridge the gap between the Chinese approach and the Japanese approach, then merge them into core's word-count.js, and make that file understand whether the language requires character counting rather than word counting.
Let's have a discussion on this. I will reach out to both tenpura and jiehanzheng and attempt to schedule an IRC chat in #wordpress-polyglots.
comment:29
follow-up:
↓ 30
jiehanzheng — 2 weeks ago
Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).
However, for the English part, we still prefer counting number of words instead of number of letters.
comment:30
in reply to:
↑ 29
nacin — 2 weeks ago
Replying to jiehanzheng:
Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).
However, for the English part, we still prefer counting number of words instead of number of letters.
That does make things easier. I do agree that we should try to count the number of English words, rather than total characters, but I wouldn't call it a blocker if we don't. Anyone want to take a stab at a patch?
comment:31
nacin — 2 weeks ago
8759.diff started with the Japanese word-count.dev.js and merged it into core. It counts individual English characters, for now. How does it test?
comment:32
nacin — 2 weeks ago
In order to force character-based counting, set 'type' to 'c' in script-loader.php, or use:
add_filter( 'gettext_with_context', function( $translated, $text, $context ) {
    // Translate 'words' as 'characters' only for the word-count context.
    if ( $text == 'words' && $context == 'word count: words or characters?' ) {
        return 'characters';
    }
    return $translated;
}, 10, 3 );
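On the script side, a counter that switches on such a type value could look roughly like this. It is a sketch under assumptions, not the code from 8759.diff: the function name and regexes are illustrative, and only the 'c' value for character counting is taken from the comment above.

```javascript
// Hypothetical counter: type 'c' counts non-whitespace characters,
// any other value counts whitespace-separated words.
function countByType(text, type) {
  const stripped = text.replace(/<[a-zA-Z\/][^<>]*>/g, ' ');
  if (type === 'c') {
    return (stripped.match(/\S/g) || []).length;
  }
  return (stripped.match(/\S+/g) || []).length;
}
```

In character mode English letters are counted individually, which matches the "counts individual English characters, for now" behaviour mentioned in comment:31.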
comment:33
nacin — 40 hours ago
Trying this out in core. Let's see what feedback it generates.
comment:34
nacin — 40 hours ago
In [19966]:
