Opened 3 years ago

Closed 2 years ago

#16079 closed defect (bug) (fixed)

Automatic excerpts don't work well with Chinese txt (word counting)

Reported by: houshuang
Owned by: nacin
Milestone: 3.4
Priority: normal
Severity: normal
Version: 3.0.4
Component: I18N
Keywords: has-patch needs-testing commit
Focuses:
Cc:

Description

I use the twentyten template on my Chinese blog (http://reganmian.net/boke). On search and category pages, the automated extracts vary unpredictably in length, and it seems to me that this is due to the way "words" are counted. For example, setting the number of words to 3 (by adding a filter in the template's functions.php) causes two different posts to display extracts of widely varying lengths. I believe this is because the way the the_excerpt function counts words does not work well with Chinese, which is written without spaces. Perhaps there could be an option to count (Unicode) characters instead in this case.
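
The effect described here can be reproduced outside WordPress. The following is a minimal sketch, assuming excerpts are limited by splitting on whitespace; limit_by_whitespace() is a hypothetical helper written for illustration, not a WordPress function, and the sample strings are made up.

```php
<?php
// Hypothetical helper: keep at most $limit whitespace-separated "words".
function limit_by_whitespace(string $text, int $limit): string
{
    // Ask for one extra chunk so we can tell whether anything was left over.
    $words = preg_split('/[\n\r\t ]+/', $text, $limit + 1, PREG_SPLIT_NO_EMPTY);
    if (count($words) > $limit) {
        array_pop($words); // drop the leftover chunk
    }
    return implode(' ', $words);
}

$english = 'The quick brown fox jumps over the lazy dog';
// Chinese sample (assumed, UTF-8 source file): only one space in the whole string.
$chinese = '我今天在北京的图书馆里读了一本很有意思的书 然后回家';

echo limit_by_whitespace($english, 3), "\n"; // The quick brown
echo limit_by_whitespace($chinese, 3), "\n"; // the whole string, untrimmed
```

Because the Chinese sample contains only two whitespace-separated chunks, a limit of 3 "words" never triggers, so the excerpt length depends entirely on where the author happened to type a space.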

Attachments (6)

Screen shot 2011-01-02 at 3.15.46 PM.png (117.3 KB) - added by houshuang 3 years ago.
Example of different lengths of extracts
Screen shot 2011-01-02 at 3.16.00 PM.png (213.6 KB) - added by houshuang 3 years ago.
Zoomed in
16079.diff (1.3 KB) - added by nacin 2 years ago.
16079.2.diff (1.7 KB) - added by tenpura 2 years ago.
16079.3.diff (4.8 KB) - added by jiehanzheng 2 years ago.
Translators could configure how trimming works through their pomo translations. Supports all the requirements mentioned in this ticket + Chinese. Needs testing.
16079.3.2.diff (5.0 KB) - added by jiehanzheng 2 years ago.
Provides French and Spanish support based on 16079.3.diff, fixes default setting.

Change History (28)

houshuang 3 years ago

Example of different lengths of extracts

comment:1 nacin 3 years ago

Related: #8759.

comment:2 markjaquith 3 years ago

  • Milestone changed from Awaiting Review to Future Release

comment:3 westi 3 years ago

  • Owner set to westi
  • Status changed from new to assigned

comment:4 westi 3 years ago

  • Summary changed from Automatic extracts don't work well with Chinese txt (word counting) to Automatic exceprts don't work well with Chinese txt (word counting)

comment:5 andrewspittle 3 years ago

  • Summary changed from Automatic exceprts don't work well with Chinese txt (word counting) to Automatic excerpts don't work well with Chinese txt (word counting)

comment:6 nacin 2 years ago

  • Component changed from Template to I18N

comment:7 nacin 2 years ago

  • Milestone changed from Future Release to 3.4

nacin 2 years ago

comment:8 nacin 2 years ago

  • Keywords has-patch needs-testing added

comment:9 westi 2 years ago

X-Referencing the other related ticket - #8759

comment:10 westi 2 years ago

  • Owner changed from westi to nacin

tenpura 2 years ago

comment:11 tenpura 2 years ago

16079.2.diff fixes two main problems in 16079.diff:

  1. preg_match_all() without the u (PCRE_UTF8) modifier destroys UTF-8 multibyte characters.
  2. implode() with a ' ' separator puts a space between every character, turning 'meat' into 'm e a t'.
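
Both points can be checked in plain PHP. This is a small illustrative sketch, not code taken from either patch; it assumes the file is saved as UTF-8, and the sample strings are made up.

```php
<?php
$text = '日本語'; // three characters, three bytes each in UTF-8

// Without the u modifier, "." matches a single byte.
preg_match_all('/./', $text, $bytes);
// With the u modifier, "." matches one whole UTF-8 character.
preg_match_all('/./u', $text, $chars);

echo count($bytes[0]), "\n"; // 9
echo count($chars[0]), "\n"; // 3

// Trimming the byte list to 4 entries cuts the second character in half,
// leaving a string that is no longer valid UTF-8.
$broken = implode('', array_slice($bytes[0], 0, 4));
var_dump(preg_match('//u', $broken) === 1); // bool(false)

// In character mode the separator matters: ' ' spreads the letters apart.
preg_match_all('/./u', 'meat', $letters);
$spaced = implode(' ', $letters[0]);
$joined = implode('', $letters[0]);
echo $spaced, "\n"; // m e a t
echo $joined, "\n"; // meat
```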

comment:12 Nao 2 years ago

I tested tenpura's patch 16079.2.diff against 3.4 beta 1 Japanese.
With the Twenty Eleven theme enabled, the search results page correctly showed Japanese text trimmed to 40 characters.

comment:13 jiehanzheng 2 years ago

16079.3.2.diff gives translators control over every aspect of how trimming works. Translators may configure this using the same method described in my other comment.

With this patch, translators may decide:

  • whether or not to count Latin text by characters.
  • whether or not to break Latin words apart to fit the word count.
  • whether or not to count East Asian punctuation marks.
  • whether or not to count spaces.

This should meet the needs of general English usage (the options default to English usage if pomo translations are not present), the Japanese usage mentioned in this ticket, and Chinese conventions.
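
To make the kind of switches listed above concrete, here is a rough sketch of a counter with three such options. It is not taken from 16079.3.2.diff: count_excerpt_units() and its option keys are hypothetical names, the "break Latin words apart" option is left out, and the tokenizing rules are assumptions chosen for brevity.

```php
<?php
// Hypothetical counter; function name and option keys are illustrative only.
function count_excerpt_units(string $text, array $options = []): int
{
    $options += [
        'latin_by_character' => false, // count each Latin letter instead of each word
        'count_punctuation'  => false, // count punctuation marks
        'count_spaces'       => false, // count whitespace characters
    ];

    // Tokens: a Latin/digit run, a whitespace run, or any other single character.
    preg_match_all('/[A-Za-z0-9]+|\s+|./us', $text, $matches);

    $total = 0;
    foreach ($matches[0] as $token) {
        if (preg_match('/^[A-Za-z0-9]+$/', $token)) {
            $total += $options['latin_by_character'] ? strlen($token) : 1;
        } elseif (preg_match('/^\s+$/u', $token)) {
            $total += $options['count_spaces'] ? preg_match_all('/./us', $token) : 0;
        } elseif (preg_match('/^\p{P}$/u', $token)) {
            $total += $options['count_punctuation'] ? 1 : 0;
        } else {
            $total += 1; // any other single character, e.g. one Chinese character
        }
    }
    return $total;
}

$sample = '我喜欢 WordPress,很好。'; // made-up sample, UTF-8 source assumed

echo count_excerpt_units($sample), "\n";                                // 6
echo count_excerpt_units($sample, ['count_punctuation' => true]), "\n"; // 8
```

With the defaults, the five Chinese characters count as one unit each and "WordPress" counts as a single word, while the space and the two punctuation marks are ignored.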

Last edited 2 years ago by jiehanzheng

jiehanzheng 2 years ago

Translators could configure how trimming works through their pomo translations. Supports all the requirements mentioned in this ticket + Chinese. Needs testing.

jiehanzheng 2 years ago

Provides French and Spanish support based on 16079.3.diff, fixes default setting.

comment:14 nacin 2 years ago

  • Keywords commit added

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

comment:15 in reply to: ↑ 14 jiehanzheng 2 years ago

Replying to nacin:

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

Because there is no way we can configure 16079.2.diff to make it work for Chinese usage conventions. I might integrate 16079.3.2.diff into our $locale.php and keep that file for Chinese for now.

comment:16 in reply to: ↑ 15 nacin 2 years ago

Replying to jiehanzheng:

Because there is no way we can configure 16079.2.diff to make it work for Chinese usage conventions. I might integrate 16079.3.2.diff into our $locale.php and keep that file for Chinese for now.

Ideally, we avoid $locale.php for most locales in 3.4.

Ideally, what would it look like for Chinese? Not the configure-any-piece patch, but what specific pieces of code are appropriate for Chinese? And did you have any code in 3.3 to support this at all?

comment:17 in reply to: ↑ 16 jiehanzheng 2 years ago

Replying to nacin:

Ideally, we avoid $locale.php for most locales in 3.4.

Ideally, what would it look like for Chinese? Not the configure-any-piece patch, but what specific pieces of code are appropriate for Chinese? And did you have any code in 3.3 to support this at all?

Yes, I understand, and I support the idea that we should avoid using $locale.php. I will try my best to make this happen in core without extra files, but if this is really hard for the dev team and the community, we will have to use $locale.php.

We want to have the automatic excerpt yield the same result as our word-count.js proposal:
http://core.trac.wordpress.org/ticket/8759#comment:30

We did not have automatic excerpt support before. However, now that I've finished it, I might include it in zh_CN.php for now if the updated formatting.php in 3.4 does not have Chinese excerpt capabilities. Thanks.

comment:18 sirzooro 2 years ago

  • Cc sirzooro added

comment:19 jane 2 years ago

Is this really a blocker for 3.4? Seems like this could be dealt with in 3.5.

comment:20 in reply to: ↑ 14 westi 2 years ago

Replying to nacin:

There is a lot of good stuff here, but man, that is a lot. What is wrong with 16079.2.diff for 3.4? It is better than what we have for all locales, yes?

We will go with this for 3.4 and should raise a new ticket for 3.5 to cover improving this further.

comment:21 westi 2 years ago

In [20859]:

i18n: Update the word splitting we use when trimming strings to build excerpts so that it supports a character-based mode for locales where character splitting is more appropriate, such as Japanese.

See #16079 props tenpura.
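
For readers who want a feel for what a word/character dual mode involves, here is a minimal standalone sketch. It is not the code committed in [20859]: trim_units(), its $mode parameter, and the default '...' suffix are hypothetical and chosen for illustration, and how a locale would select the mode is not shown.

```php
<?php
// Hypothetical sketch of a trimmer with a word mode and a character mode.
function trim_units(string $text, int $limit, string $mode = 'words', string $more = '...'): string
{
    if ($mode === 'characters') {
        // Collapse whitespace runs, then split into whole UTF-8 characters.
        $text = trim(preg_replace('/[\n\r\t ]+/', ' ', $text), ' ');
        preg_match_all('/./u', $text, $matches);
        $units = array_slice($matches[0], 0, $limit + 1);
        $sep = ''; // characters are rejoined without added spaces
    } else {
        $units = preg_split('/[\n\r\t ]+/', $text, $limit + 1, PREG_SPLIT_NO_EMPTY);
        $sep = ' ';
    }

    // One extra unit was kept on purpose: its presence means text was cut off.
    if (count($units) > $limit) {
        array_pop($units);
        return implode($sep, $units) . $more;
    }
    return implode($sep, $units);
}

echo trim_units('吾輩は猫である。名前はまだ無い。', 5, 'characters'), "\n"; // 吾輩は猫で...
echo trim_units('one two three four', 2), "\n";                             // one two...
echo trim_units('short text', 5), "\n";                                     // short text
```

Keeping both modes behind one function means callers pass the same limit either way, and only the unit being counted changes.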

comment:22 westi 2 years ago

  • Resolution set to fixed
  • Status changed from assigned to closed

This is done for 3.4; raised #20739 for future enhancements.
