Make WordPress Core

Opened 13 years ago

Closed 12 years ago

#16079 closed defect (bug) (fixed)

Automatic excerpts don't work well with Chinese txt (word counting)

Reported by: houshuang
Owned by: nacin
Milestone: 3.4
Priority: normal
Severity: normal
Version: 3.0.4
Component: I18N
Keywords: has-patch needs-testing commit
Focuses:
Cc:

Description

I use the twentyten template on my Chinese blog (http://reganmian.net/boke). On search and category pages, the automated extracts show unpredictable amounts of text, which seems to be due to the way "words" are counted. For example, setting the number of words to 3 (by adding a filter in the template's functions.php) causes two different posts to display extracts of widely varying lengths. I believe this is because the way the_excerpt() counts words does not work well with Chinese, which is written without spaces. Perhaps offer an option to fall back to counting (Unicode) characters in this case.
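To illustrate the report, here is a minimal Python sketch (not WordPress code) of whitespace-based word counting. The sample strings and the helper name `first_n_words` are made up for illustration; the point is only that text written without spaces collapses into a single "word", so a limit of 3 words can return an entire paragraph.

```python
def first_n_words(text, n):
    """Keep the first n whitespace-separated chunks, as a naive excerpt would."""
    return " ".join(text.split()[:n])

# Hypothetical sample posts, invented for this sketch.
english_post = "The quick brown fox jumps over the lazy dog"
chinese_post = "我喜欢用中文写博客文章因为这样比较方便"

english_words = english_post.split()
chinese_words = chinese_post.split()

english_excerpt = first_n_words(english_post, 3)
chinese_excerpt = first_n_words(chinese_post, 3)

print(len(english_words), repr(english_excerpt))
print(len(chinese_words), repr(chinese_excerpt))
```

With a limit of 3, the English excerpt is three words long, while the Chinese excerpt is the whole sentence, because the splitter sees it as one word.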

Attachments (6)

Screen shot 2011-01-02 at 3.15.46 PM.png (117.3 KB) - added by houshuang 13 years ago.
Example of different lengths of extracts
Screen shot 2011-01-02 at 3.16.00 PM.png (213.6 KB) - added by houshuang 13 years ago.
Zoomed in
16079.diff (1.3 KB) - added by nacin 12 years ago.
16079.2.diff (1.7 KB) - added by tenpura 12 years ago.
16079.3.diff (4.8 KB) - added by jiehanzheng 12 years ago.
Translators could configure how trimming works through their pomo translations. Supports all the requirements mentioned in this ticket + Chinese. Needs testing.
16079.3.2.diff (5.0 KB) - added by jiehanzheng 12 years ago.
Provides French and Spanish support based on 16079.3.diff, fixes default setting.


Change History (28)

@houshuang
13 years ago

Example of different lengths of extracts

#1 @nacin
13 years ago

Related: #8759.

#2 @markjaquith
13 years ago

  • Milestone changed from Awaiting Review to Future Release

#3 @westi
13 years ago

  • Owner set to westi
  • Status changed from new to assigned

#4 @westi
13 years ago

  • Summary changed from Automatic extracts don't work well with Chinese txt (word counting) to Automatic exceprts don't work well with Chinese txt (word counting)

#5 @andrewspittle
13 years ago

  • Summary changed from Automatic exceprts don't work well with Chinese txt (word counting) to Automatic excerpts don't work well with Chinese txt (word counting)

#6 @nacin
12 years ago

  • Component changed from Template to I18N

#7 @nacin
12 years ago

  • Milestone changed from Future Release to 3.4

@nacin
12 years ago

#8 @nacin
12 years ago

  • Keywords has-patch needs-testing added

#9 @westi
12 years ago

Cross-referencing the other related ticket: #8759.

#10 @westi
12 years ago

  • Owner changed from westi to nacin

@tenpura
12 years ago

#11 @tenpura
12 years ago

16079.2.diff mainly fixes two things in 16079.diff:

  1. preg_match_all() without the u (PCRE_UTF8) modifier destroys UTF-8 multibyte characters.
  2. implode() with a ' ' separator rejoins character-split text with a space between every character, producing strings like 'm e a t'.
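The two problems can be sketched in Python as an analogy (this is not the PHP from either patch). It assumes that, without the u modifier, the pattern matches individual bytes rather than characters, so a truncation can land in the middle of a multibyte character; the sample strings are invented.

```python
text = "日本語"                      # three characters, nine bytes in UTF-8
raw = text.encode("utf-8")

# Problem 1: treating the text as bytes and cutting at a byte count.
# Four bytes is one full character plus the first byte of the next one.
cut_bytes = raw[:4].decode("utf-8", errors="replace")

# Splitting by characters instead keeps every character intact.
cut_chars = "".join(list(text)[:2])

# Problem 2: rejoining character-level pieces with a space separator.
pieces = list("meat")
joined_with_space = " ".join(pieces)
joined_without_space = "".join(pieces)

print(repr(cut_bytes), repr(cut_chars))
print(repr(joined_with_space), repr(joined_without_space))
```

The byte-level cut ends in a replacement character (U+FFFD), while the character-level cut does not; and when the units are single characters, only an empty separator reproduces the original text.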

#12 @Nao
12 years ago

I tested tenpura's patch 16079.2.diff against 3.4 beta 1 Japanese.
With Twenty Eleven theme enabled, the search result page correctly showed trimmed Japanese text at 40 characters.

#13 @jiehanzheng
12 years ago

16079.3.2.diff gives translators control over every aspect of how trimming works. Translators can configure this using the same method described in my other comment.

With this patch, translators may decide:

  • whether or not to count Latin part by characters.
  • whether or not to break Latin words apart to fit word count.
  • whether or not to count East Asian punctuation marks.
  • whether or not to count spaces.

This should meet the needs of general English usage (the options default to English usage if pomo translations are not present), the Japanese usage mentioned in this ticket, and Chinese conventions.
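A rough Python sketch of how four such switches could interact is below. It is not the code from 16079.3.2.diff: the option names, the `trim` function, the regular expressions, and the rule that trailing uncounted punctuation is kept are all assumptions made for illustration.

```python
import re
import unicodedata
from dataclasses import dataclass

@dataclass
class TrimOptions:
    # Hypothetical option names; defaults mimic English-style word counting.
    latin_by_chars: bool = False      # count Latin text per character
    break_latin_words: bool = False   # allow cutting a Latin word to fit
    count_punctuation: bool = False   # count non-ASCII punctuation marks
    count_spaces: bool = False        # count whitespace

# A "Latin run" is assumed to be printable ASCII plus common accented letters.
LATIN_RUN = re.compile(r"[\x21-\x7e\u00c0-\u024f]+")
TOKEN = re.compile(r"[\x21-\x7e\u00c0-\u024f]+|\s+|.", re.S)

def trim(text, limit, opts=None):
    opts = opts or TrimOptions()
    out, used = [], 0
    for tok in TOKEN.findall(text):
        is_latin = bool(LATIN_RUN.fullmatch(tok))
        if tok.isspace():
            cost = len(tok) if opts.count_spaces else 0
        elif is_latin:
            cost = len(tok) if opts.latin_by_chars else 1
        elif unicodedata.category(tok).startswith("P"):
            cost = 1 if opts.count_punctuation else 0
        else:
            cost = 1                  # e.g. a single CJK character
        if used + cost > limit:
            if is_latin and opts.latin_by_chars and opts.break_latin_words:
                out.append(tok[: limit - used])  # cut the word to fit
            break
        out.append(tok)
        used += cost
    return "".join(out).rstrip()

print(trim("hello brave new world", 2))
print(trim("你好，世界！", 3))
print(trim("你好，世界！", 3, TrimOptions(count_punctuation=True)))
```

With the defaults, English text is cut after two words, while each Chinese character counts as one unit; turning on `count_punctuation` makes the full-width comma consume one unit of the budget.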

Last edited 12 years ago by jiehanzheng

@jiehanzheng
12 years ago

Translators could configure how trimming works through their pomo translations. Supports all the requirements mentioned in this ticket + Chinese. Needs testing.

@jiehanzheng
12 years ago

Provides French and Spanish support based on 16079.3.diff, fixes default setting.

#14 follow-ups: @nacin
12 years ago

  • Keywords commit added

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

#15 in reply to: ↑ 14 ; follow-up: @jiehanzheng
12 years ago

Replying to nacin:

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

Because there is no way we can configure 16079.2.diff to make it work for Chinese usage conventions. I might integrate 16079.3.2.diff into our $locale.php and keep that file for Chinese for now.

#16 in reply to: ↑ 15 ; follow-up: @nacin
12 years ago

Replying to jiehanzheng:

Because there is no way we can configure 16079.2.diff to make it work for Chinese usage conventions. I might integrate 16079.3.2.diff into our $locale.php and keep that file for Chinese for now.

Ideally, we avoid $locale.php for most locales in 3.4.

Ideally, what would it look like for Chinese? Not the configure-any-piece patch, but what specific pieces of code are appropriate for Chinese? And did you have any code in 3.3 to support this at all?

#17 in reply to: ↑ 16 @jiehanzheng
12 years ago

Replying to nacin:

Ideally, we avoid $locale.php for most locales in 3.4.

Ideally, what would it look like for Chinese? Not the configure-any-piece patch, but what specific pieces of code are appropriate for Chinese? And did you have any code in 3.3 to support this at all?

Yes, I understand, and I support the idea that we should avoid using $locale.php. I will try my best to make this happen in core without extra files, but if this is really hard for the dev team and the community, we will have to use $locale.php.

We want to have the automatic excerpt yield the same result as our word-count.js proposal:
http://core.trac.wordpress.org/ticket/8759#comment:30

We did not have automatic excerpt support before. However, now that I've finished it, I might include it in zh_CN.php for now if the updated formatting.php in 3.4 does not have Chinese excerpt capabilities. Thanks.

#18 @sirzooro
12 years ago

  • Cc sirzooro added

#19 @jane
12 years ago

Is this really a blocker for 3.4? Seems like this could be dealt with in 3.5.

#20 in reply to: ↑ 14 @westi
12 years ago

Replying to nacin:

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

We will go with this for 3.4 and should raise a new ticket for 3.5 to cover improving this further.

#21 @westi
12 years ago

In [20859]:

i18n: Update the word splitting we use when trimming strings to build excerpts so that it has support for a character-based mode for locales where character splitting is more appropriate, like Japanese.

See #16079 props tenpura.
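As a rough illustration of the idea in this commit message, here is a Python sketch of a trimmer with a word mode and a character mode. It is not the committed PHP; the function name `trim_words`, its parameters, and the `mode` argument (standing in for a per-locale setting) are assumptions.

```python
def trim_words(text, num_words=55, more="…", mode="words"):
    """Trim text to num_words units; units are words or single characters."""
    normalized = " ".join(text.split())   # collapse runs of whitespace
    if mode == "characters":
        units = list(normalized)          # one unit per character
        sep = ""
    else:
        units = normalized.split(" ") if normalized else []
        sep = " "
    if len(units) > num_words:
        return sep.join(units[:num_words]) + more
    return sep.join(units)

print(trim_words("The quick brown fox jumps", 3, "..."))
print(trim_words("吾輩は猫である名前はまだ無い", 5, "…", mode="characters"))
```

The separator follows the unit type: words are rejoined with a space and characters with an empty string, which avoids the 'm e a t' problem noted earlier in the ticket.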

#22 @westi
12 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

This is done for 3.4. Raised #20739 for future enhancements.
