Opened 16 years ago
Closed 12 years ago
#8759 closed task (blessed) (fixed)
Word count function doesn't work in several languages
Reported by: | jim912 | Owned by: | nacin |
---|---|---|---|
Milestone: | 3.4 | Priority: | low |
Severity: | normal | Version: | 2.7 |
Component: | I18N | Keywords: | has-patch needs-testing |
Focuses: | Cc: |
Description
In multibyte languages such as Chinese and Japanese, the word count function introduced in version 2.6 doesn't work.
I implemented a character count function to use instead of the word count for these languages, and added it to the admin page.
I propose this function so that WordPress can be convenient for more users.
Attachments (7)
Change History (55)
#1
@
15 years ago
- Component changed from Administration to i18n
- Owner changed from anonymous to nbachiyski
#3
@
15 years ago
alternative approach: we'd add the option, make it default to word, and the php files that are related to localized versions of WP should then filter its result by using option_count_method or something.
#4
@
15 years ago
In many cases users just change the locale/add some translation files and don't get the locale.php files.
That's why in this case I'm ok with locale pattern matching. Otherwise the feature can be totally unusable.
#5
@
15 years ago
- Keywords needs-patch added; has-patch removed
- Milestone changed from 2.8 to Future Release
broken patch
#6
@
14 years ago
Patch needs to be updated to do this automatically based on the locale set currently.
No option needed.
#8
@
14 years ago
- Keywords 3.2-early gci added
- Owner changed from nbachiyski to westi
- Status changed from new to assigned
#10
@
14 years ago
- Keywords 3.2-early removed
- Milestone changed from Future Release to 3.2
- Priority changed from normal to low
#11
@
13 years ago
- Milestone changed from 3.2 to Future Release
Punting due to lack of activity before freeze.
#12
@
13 years ago
There's a fix for this in WP Multibyte Patch plugin mentioned on WP Polyglots.
#15
@
13 years ago
- Version changed from 2.7 to 3.3
I suggest modifying the current zh_CN "algorithm", which will make life a lot easier.
Here's what our current zh_CN-word-count.js does: it removes HTML tags first, then English punctuation marks and Chinese punctuation marks. Then it counts all "non-ASCII" characters; at this point the value of tc should be the number of non-English characters. After that, we use the original word-count.js method to count English words. After the entire process, tc is the number of Chinese characters plus the number of English words.
The Chinese word-count.js file can be found at:
http://i18n.svn.wordpress.org/zh_CN/tags/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
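The steps described above can be sketched roughly as follows. This is not the shipped zh_CN-word-count.js: the function name, the variable names, and the exact punctuation ranges are assumptions made for illustration, and the Chinese punctuation set is only a small subset.

```javascript
// Hypothetical names and ranges; an illustration of the described steps only.
var STRIP_TAGS = /<[a-zA-Z\/][^<>]*>/g;            // HTML tags
var ASCII_PUNCT = /[0-9.(),;:!?%#$'"_+=\\\/-]+/g;  // digits and English punctuation
// Assumed subset of CJK and fullwidth punctuation marks.
var CJK_PUNCT = /[\u3001-\u3003\u3008-\u3011\u3014-\u301F\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65]/g;
var NON_ASCII = /[^\u0000-\u007F]/g;               // pattern quoted later in this comment

function countChineseAndEnglish(html) {
    // 1. Remove tags, then English and Chinese punctuation.
    var text = html
        .replace(STRIP_TAGS, ' ')
        .replace(ASCII_PUNCT, '')
        .replace(CJK_PUNCT, '');
    // 2. Every remaining non-ASCII character counts as one.
    var nonAsciiChars = (text.match(NON_ASCII) || []).length;
    // 3. What is left is counted as whitespace-separated words.
    var englishWords = (text.replace(NON_ASCII, '').match(/\S+/g) || []).length;
    return nonAsciiChars + englishWords;
}
```

Under these assumptions the sketch treats "ç" and "é" as single characters in addition to the words that contain them, which is the over-count reported for French in the tests below.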
Points worth mentioning:
- Please consider removing punctuation marks in other languages because counting them doesn't make sense.
- As for the "Word count: %d" string, I suggest not to make changes to wp-includes/script-loader.php, because translators can simply translate this string to their corresponding meanings when translating -- simply adding a translators' note will do the trick.
- Our current zh_CN way doesn't consider some particular languages which should be counted as words but are not included in the ASCII set, like French (see examples below).
- Naming: the names of variables in the current zh_CN-word-count.js are not accurate (e.g. settingsWestern and settingsAsian) -- I suggest re-naming them.
Some testing:
I tested the zh_CN script with some test strings in multiple languages, and it turns out zh_CN script can handle most cases. Therefore I suggest modifying based on the zh_CN js file.
Chinese + English: PASS
欢迎使用 WordPress。
tc = 5, 4 Chinese chars, 1 English word, 1 Chinese punctuation.
English only: PASS
Who says programmers don't have a sense of humor.
tc = 9.
Japanese + English: PASS
ログイン/ログアウト、管理、フィードと WordPress のリンク
tc = 21, 20 Japanese chars, 1 English punctuation mark (slash), 2 Japanese punctuation marks, 1 English word.
Burmese + English: ???
ကူညီပံ့ပိုးမှု ဖိုရမ်များသို့ ေမးခွန်းများ/အေြဖများ/အြကံြပုချက်များ ေရးြခင်းနှင့် လမ်းညွှန်ချက်စာတမ်းများေရးြခင်း၊ ဘာသာြပန်ြခင်း၊ သံုးသူအြမင်ပိုင်းဆိုင်ရာ ဒီဇိုင်းြပုလုပ်ြခင်း၊ ဘီတာများကို စမ်းသပ်ြခင်း၊ အမှားများကိုြပင်ြခင်း၊ အမှားများကို သတင်းပို့ြခင်းတို့အတွက် WordPress မှ လူများ ပိုမိုလိုအပ်လျှက်ရှိပါသည်။ ပါဝင်ေဆာင်ရွက်လိုက်ပါ !
tc = 304 -- I need someone from Myanmar to help me out...
French: FAIL
Le français est une langue romane parlée sur plusieurs continents, principalement en Afrique.
tc = 15, but there are actually only 13 French words. The problem is probably caused by the ç and é characters, which are not in ASCII and are therefore counted as single characters. What we need to do is change t.SettingsAsian.count, currently:
/[^\u0000-\u007F]/g
to suit languages like French.
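One possible adjustment, shown only as a sketch: widen the excluded range so that accented Latin letters are no longer counted one by one. The upper bound U+024F (end of Latin Extended-B) is an assumption chosen for this example, and the helper name countWith is hypothetical. It does not address scripts such as Cyrillic or Greek.

```javascript
var NON_ASCII = /[^\u0000-\u007F]/g; // the current pattern quoted above
// Assumption: also exclude Latin-1 Supplement and Latin Extended-A/B.
var NON_LATIN = /[^\u0000-\u024F]/g;

// Hypothetical helper: characters matching perCharacter count one each,
// everything else is counted as whitespace-separated words.
function countWith(perCharacter, text) {
    var singles = (text.match(perCharacter) || []).length;
    var words = (text.replace(perCharacter, '').match(/\S+/g) || []).length;
    return singles + words;
}

var french = 'Le fran\u00E7ais est une langue romane parl\u00E9e sur ' +
    'plusieurs continents, principalement en Afrique.';
var withCurrentPattern = countWith(NON_ASCII, french); // over-counts accented letters
var withWiderPattern = countWith(NON_LATIN, french);   // counts words only
```

With the wider range, Chinese characters still fall outside it and keep being counted individually.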
#16
follow-up:
↓ 18
@
13 years ago
- Version changed from 3.3 to 2.7
Version number indicates when the bug was initially introduced/reported.
#17
@
13 years ago
Nice!
We need to be very careful about speed here.
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
#18
in reply to:
↑ 16
@
13 years ago
Replying to SergeyBiryukov:
Version number indicates when the bug was initially introduced/reported.
Thanks, man!
Replying to nacin:
Nice!
We need to be very careful about speed here.
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
I guess the whitelist wouldn't get much longer, because some Asian languages share the same set of punctuation marks with Chinese, and more use the same ones as English. We may consider asking l10n maintainers about their own special punctuation marks.
#19
@
13 years ago
Perhaps best would be to have several sets of regex there and choose the right one by looking at the lang setting (or body class/JS var). Thinking we can ask the localization teams to come up with the best regex for their language.
Another option would be to make the regex "pluggable". That would give a bit more flexibility to localizations but would require outputting a JS object with the replacement regex. As of 3.3 this can be done by using wp_localize_script() for word-count.js.
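Both ideas can be sketched together: a built-in table keyed by language, plus an optional override object standing in for whatever a localization might output. All names here (COUNTERS, pickCounter, countFor) and the placeholder regexes are hypothetical; in a browser the language could come from document.documentElement.lang, which is left out so the sketch runs anywhere.

```javascript
// Hypothetical per-language table; the regexes are placeholders that a
// localization team would replace with ones suited to its language.
var COUNTERS = {
    'default': { pattern: /\S+/g }, // runs of non-whitespace = words
    'ja': { pattern: /\S/g },       // every non-whitespace character
    // each non-ASCII character, or a run of printable ASCII
    'zh': { pattern: /[^\u0000-\u007F]|[\u0021-\u007E]+/g }
};

// Pick a counter by primary language subtag ("zh_CN" and "zh-CN" both give "zh").
// Entries in `overrides` win over the built-in table.
function pickCounter(lang, overrides) {
    var table = overrides || {};
    var primary = String(lang || '').toLowerCase().split(/[-_]/)[0];
    return table[primary] || COUNTERS[primary] || COUNTERS['default'];
}

function countFor(lang, text, overrides) {
    var matches = text.match(pickCounter(lang, overrides).pattern);
    return matches ? matches.length : 0;
}
```

For example, countFor('en-US', 'Hello brave new world') counts four words, while countFor('zh_CN', '欢迎使用 WordPress') counts four characters plus one word.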
#20
@
13 years ago
Also, if we decide to whitelist punctuation, we'll need to do it for all languages that require character counting. How crazy does that get?
Given the little support in PHP for determining these things, we could always add a small subset which supports most languages, followed by a translatable string for translators to specify additional characters which should be considered punctuation?
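A minimal sketch of that idea on the JavaScript side, assuming the translator-supplied characters arrive as a plain string. BASE_PUNCTUATION, the helper names, and the contents of the base set are all made up for illustration.

```javascript
// Assumed small default set; a real list would need review by translators.
var BASE_PUNCTUATION = '.,;:!?()[]"\'-/';

// Escape the characters that are special inside a regex character class.
function escapeForCharClass(chars) {
    return chars.replace(/[\\\]\[^-]/g, '\\$&');
}

// Build one regex from the base set plus any translator-supplied extras.
function buildPunctuationRegex(extraChars) {
    var all = BASE_PUNCTUATION + (extraChars || '');
    return new RegExp('[' + escapeForCharClass(all) + ']', 'g');
}

function stripPunctuation(text, extraChars) {
    return text.replace(buildPunctuationRegex(extraChars), '');
}
```

A Chinese translation could then pass something like '。、' as the extra characters without any change to the shared code.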
#21
@
13 years ago
followed by a translatable string for translators to specify additional characters which should be considered punctuation?
Bingo.
#22
@
13 years ago
jiehanzheng's method of summing up number of multibyte characters and English words sounds odd to me. I'm -1 on removing punctuation, too. Why don't we simply count the character length?
ログイン/ログアウト、管理、フィードと WordPress のリンク
should be counted as 34 characters. It looks more natural at least for Japanese usage.
#23
follow-up:
↓ 24
@
13 years ago
Why don't we simply count the character length?
Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.
#24
in reply to:
↑ 23
@
13 years ago
Replying to dd32:
Why don't we simply count the character length?
Because in some languages, Character counts are useless (For example, English) - Just as English "Word Counts" in Japanese may be useless.
I know. I meant that for locales that count by characters. The right regex should be chosen by lang setting like azaozz said (or as a user option).
#25
follow-up:
↓ 26
@
13 years ago
- Cc sirzooro added
+1 for this. Please make it pluggable - I would like to count chars in a different way (simple strlen() after stripping HTML tags and trimming).
#26
in reply to:
↑ 25
@
13 years ago
Replying to sirzooro:
Please make it pluggable - I would like to count chars in different way
It seems to be possible to deregister the bundled word-count script and register your own:
http://i18n.trac.wordpress.org/browser/zh_CN/tags/3.3.1/dist/wp-content/languages/zh_CN.php#L276
#28
@
13 years ago
- Keywords gci removed
- Owner changed from westi to nacin
- Status changed from assigned to accepted
We essentially need to bridge the gap between the Chinese approach and the Japanese approach, then merge them into core's word-count.js, and make that file understand whether the language requires character counting rather than word counting.
Let's have a discussion on this. I will reach out to both tenpura and jiehanzheng and attempt to schedule an IRC chat in #wordpress-polyglots.
#29
follow-up:
↓ 30
@
13 years ago
Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).
However, for the English part, we still prefer counting number of words instead of number of letters.
#30
in reply to:
↑ 29
@
13 years ago
Replying to jiehanzheng:
Sorry I have to make a little change to the Chinese approach here after my quick research. It seems that our users prefer INCLUDING Chinese punctuation marks in the total word count, which will make our work easier (no longer need to remove Chinese punctuation marks).
However, for the English part, we still prefer counting number of words instead of number of letters.
That does make things easier. I do agree that we should try to count the number of English words, rather than total characters, but I wouldn't call it a blocker if we don't. Anyone want to take a stab at a patch?
#32
@
13 years ago
In order to force character-based counting, set 'type' to 'c' in script-loader.php, or use:
add_filter( 'gettext_with_context', function( $translated, $text, $context ) { if ( $text == 'words' && $context == 'word count: words or characters?' ) return 'characters'; return $translated; }, 10, 3 );
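On the script side, a type flag of this kind might be consumed along the following lines. This is a simplified sketch rather than the code committed to core: the SETTINGS object, the count function, and the two regexes are assumptions, with 'w' standing for words and 'c' for characters.

```javascript
// Hypothetical settings; only the 'w' / 'c' distinction comes from the comment above.
var SETTINGS = {
    strip: /<[a-zA-Z\/][^<>]*>/g, // HTML tags
    w: /\S+/g,                    // words: runs of non-whitespace
    c: /\S/g                      // characters: each non-whitespace character
};

function count(text, type) {
    var pattern = type === 'c' ? SETTINGS.c : SETTINGS.w;
    var plain = text.replace(SETTINGS.strip, ' ');
    return (plain.match(pattern) || []).length;
}
```

With this sketch, the same input yields a word total for type 'w' and a non-whitespace character total for type 'c'.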
#35
follow-up:
↓ 39
@
13 years ago
In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary. It contains trimming, redundant space removing and non-breaking/double-byte space character counting regexp.
#36
@
12 years ago
tenpura's patch 8759.2.diff correctly worked for me in Japanese. Tested with 3.4 beta 1.
@
12 years ago
Enables translators to control "Count Latin by", "Count Latin spaces", "Count East Asia punctuation marks" separately from their pomo translations. With this patch, requirements of Japanese and Chinese team are met and flexibility is ensured.
#37
@
12 years ago
In 8759.3.diff, I made the options "Count Latin by", "Count Latin spaces", "Count East Asia punctuation marks" separately-controllable, which gives more control to translators. For example, the Chinese team could translate the control strings like this:
- "Count Latin by" --> "words"
- "Count Latin spaces" --> "no"
- "Count East Asia punctuation marks" --> "yes"
And the Japanese team could set the strings like this:
- "Count Latin by" --> "characters"
- "Count Latin spaces" --> "yes"
- "Count East Asia punctuation marks" --> "yes"
Known issue:
- I am unable to match Unicode chars which have five hex digits, maybe due to JavaScript RegExp restrictions? Does anyone have solutions? Although those characters that have five hex digits are used very rarely, and should only affect Chinese characters.
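One workaround, sketched here under the assumption that the script must run without Unicode-aware regex support: JavaScript strings are UTF-16, so a code point above U+FFFF is stored as a surrogate pair and can be matched as two \uXXXX units. The example covers only CJK Unified Ideographs Extension B (U+20000 to U+2A6DF); the function name countHan is hypothetical.

```javascript
// U+20000..U+2A6DF expressed as UTF-16 surrogate pairs:
// high surrogates D840..D868 with any low surrogate, plus D869 with DC00..DEDF.
var CJK_EXT_B = /[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDEDF]/g;
var CJK_BMP = /[\u4E00-\u9FFF]/g; // CJK Unified Ideographs in the BMP

function countHan(text) {
    var astral = (text.match(CJK_EXT_B) || []).length;
    var bmp = (text.match(CJK_BMP) || []).length;
    // Surrogates lie outside U+4E00..U+9FFF, so nothing is counted twice.
    return astral + bmp;
}
```

Engines that implement the ES2015 `u` flag can write the same range directly as `/[\u{20000}-\u{2A6DF}]/gu`.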
Notes: The Unicode ranges for East Asia characters listed in settingsEastAsia.count are copied from http://www.unicode.org/charts/.
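To make the three control strings from this comment concrete, here is a rough sketch of a counter driven by them. It is not the code in 8759.3.diff: the option names, the function name, and the (deliberately coarse) East Asian ranges are assumptions for illustration.

```javascript
// Coarse, assumed ranges: CJK radicals through CJK Unified Ideographs,
// compatibility ideographs, and halfwidth/fullwidth forms.
var EAST_ASIAN_CHAR = /[\u2E80-\u9FFF\uF900-\uFAFF\uFF00-\uFFEF]/g;
// Assumed subset of East Asian punctuation marks.
var EAST_ASIAN_PUNCT = /[\u3001-\u3003\u3008-\u3011\u3014-\u301F\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65]/g;

// options.latinBy: 'words' or 'characters'
// options.countLatinSpaces, options.countEastAsianPunctuation: booleans
function countWithOptions(text, options) {
    // Collapse runs of whitespace and trim both ends.
    var rest = text.replace(/\s+/g, ' ').replace(/^ | $/g, '');
    if (!options.countEastAsianPunctuation) {
        rest = rest.replace(EAST_ASIAN_PUNCT, '');
    }
    // Every East Asian character counts as one.
    var total = (rest.match(EAST_ASIAN_CHAR) || []).length;
    if (options.latinBy === 'words') {
        // Replace with a space so Latin runs on either side stay separate words.
        total += (rest.replace(EAST_ASIAN_CHAR, ' ').match(/\S+/g) || []).length;
    } else {
        var latin = rest.replace(EAST_ASIAN_CHAR, '');
        var unit = options.countLatinSpaces ? /[\s\S]/g : /\S/g;
        total += (latin.match(unit) || []).length;
    }
    return total;
}

var chineseStyle = { latinBy: 'words', countLatinSpaces: false, countEastAsianPunctuation: true };
var japaneseStyle = { latinBy: 'characters', countLatinSpaces: true, countEastAsianPunctuation: true };
```

With chineseStyle, Latin text contributes one per word; with japaneseStyle, the total for BMP-only text equals the length of the trimmed, whitespace-collapsed string.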
#39
in reply to:
↑ 35
;
follow-ups:
↓ 40
↓ 44
@
12 years ago
Replying to tenpura:
In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary.
Could you explain that?
Replying to jiehanzheng:
With this patch, requirements of Japanese and Chinese team are met and flexibility is ensured.
This seems like it is presenting a solution, rather than solving a particular problem.
As of right now, 3.4 is going to be released with [19966]. What in particular needs to be done that would make this okay for 3.4 for both the Chinese and Japanese teams? By that, I mean, not any worse than 3.3. Best I can tell, the current script is pretty close to what was in the localized builds.
#40
in reply to:
↑ 39
;
follow-up:
↓ 41
@
12 years ago
Replying to nacin:
This seems like it is presenting a solution, rather than solving a particular problem.
As of right now, 3.4 is going to be released with [19966]. What in particular needs to be done that would make this okay for 3.4 for both the Chinese and Japanese teams? By that, I mean, not any worse than 3.3. Best I can tell, the current script is pretty close to what was in the localized builds.
Please understand that I am trying to solve a particular problem: the patch 8759.diff does not work for Chinese.
Needless to say, what needs to be done for the Chinese team is to make word-count.js count Latin part by words, not characters. 8759.diff and 8759.2.diff indeed solve the problem the Japanese team is facing, but they do not work for Chinese.
#41
in reply to:
↑ 40
;
follow-up:
↓ 42
@
12 years ago
Replying to jiehanzheng:
what needs to be done for the Chinese team is to make word-count.js count Latin part by words, not characters.
That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
#42
in reply to:
↑ 41
;
follow-up:
↓ 43
@
12 years ago
Replying to nacin:
That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
Yes you could say that, except that Chinese punctuation marks do not need to be removed anymore. Do you want me to submit a new version (counting Chinese punctuation marks and also better variable naming) of this, without Japanese support?
As for what exactly we want, please see this comment. Thanks.
#43
in reply to:
↑ 42
@
12 years ago
Replying to jiehanzheng:
Replying to nacin:
That is understandable. Is the ideal JS for you guys whatever you bundled in 3.3? http://i18n.svn.wordpress.org/zh_CN/branches/3.3/dist/wp-content/languages/zh_CN-word-count.dev.js
Yes you could say that, except that Chinese punctuation marks do not need to be removed anymore. Do you want me to submit a new version (counting Chinese punctuation marks and also better variable naming) of this, without Japanese support?
Sure, that could be helpful.
As for what exactly we want, please see this comment. Thanks.
Okay. I was asking from the code perspective.
#44
in reply to:
↑ 39
@
12 years ago
Replying to nacin:
Replying to tenpura:
In 8759.2.diff, I added space character counting capability to word-count.dev.js because it seems necessary.
Could you explain that?
What we (Japanese) need is simple mb_strlen() like character counting (with the functionality to remove spaces that web browsers ignore when rendering HTML). Thus 8759.2.diff counts "a b c " as 5 characters. In contrast, 8759.diff counts "a b c " as 3 characters which is far different from what we expected and useless.
I noticed that sirzooro's needs seem exactly the same as ours (his language is not East Asian, is it?). For that reason, 8759.2.diff's character counting method should be called something like "general character counting" rather than "East Asian" to avoid confusion.
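The mb_strlen()-like behaviour described here could look roughly like this. It is a sketch, not the contents of 8759.2.diff: the function name is hypothetical, and treating &nbsp; like an ordinary space is an assumption.

```javascript
// Count characters the way a browser would render them: tags removed,
// runs of whitespace collapsed to one space, leading/trailing space dropped.
function countVisibleCharacters(html) {
    var text = html
        .replace(/<[a-zA-Z\/][^<>]*>/g, ' ') // strip HTML tags
        .replace(/&nbsp;|&#160;/gi, ' ')     // assumption: treat nbsp entities as spaces
        .replace(/\s+/g, ' ')                // collapse whitespace runs
        .replace(/^ | $/g, '');              // trim
    // Note: .length counts UTF-16 code units, so characters above U+FFFF count as two.
    return text.length;
}
```

With this sketch "a b c " comes out as 5, matching the expectation stated above.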
#45
@
12 years ago
We should close this for 3.4 and raise a new ticket for 3.5 for further enhancements.
What we have in 3.4 is better than 3.3.
#46
@
12 years ago
- Resolution set to fixed
- Status changed from accepted to closed
Follow on ticket for future enhancements - #20738
Closing this as done for 3.4
#47
follow-up:
↓ 48
@
12 years ago
- Resolution fixed deleted
- Status changed from closed to reopened
This should NOT be marked "fixed" since the statement "Word count function doesn't work in several languages" still holds true.
#48
in reply to:
↑ 47
@
12 years ago
- Resolution set to fixed
- Status changed from reopened to closed
Replying to jiehanzheng:
This should NOT be marked "fixed" since the statement "Word count function doesn't work in several languages" still holds true.
This is marked fixed because this is as fixed as it is going to be for 3.4
We have a new ticket for future enhancements as detailed above.
Looks good, except for the extra option in Settings -> General. It isn't worth showing it in the interface for all users if it applies only to a few.
A plugin can add the checkbox, if needed.