Make WordPress Core

Opened 3 years ago

Closed 5 weeks ago

#57987 closed enhancement (maybelater)

Introduce new `wp_word_count` function

Reported by: wildworks's profile wildworks Owned by: pbearne's profile pbearne
Milestone: Priority: normal
Severity: normal Version:
Component: Editor Keywords: has-patch has-unit-tests needs-testing
Focuses: Cc:

Description

In the WordPress core, there is a JavaScript-based WordCounter(https://core.trac.wordpress.org/browser/trunk/src/js/_enqueues/wp/utils/word-count.js) class to count the number of characters appropriately based on language type. My understanding is that this logic is incorporated into the @wordpress/wordcount(https://github.com/WordPress/gutenberg/tree/trunk/packages/wordcount) package in Gutenberg. This utility is used in the document panel of the block editor to calculate the number of characters.

Also, in Gutenberg PR(https://github.com/WordPress/gutenberg/pull/43403), a new "Post Time To Read" block was added. This block shows minutes required to finish reading the post. This block uses the wordcount package to calculate the number of characters in the content from which the time is calculated. However, the exact same PHP-based logic was needed to render this block server-side on the front-end side. Therefore, we added a new wp_word_count() PHP function at the same time we added this block.

Since this block is experimental, the block and this function will not be bundled into the core. However, given the possibility that this block may become less experimental and be incorporated into the core in the future, and given the recent addition of the wp_get_word_count_type() function, we may consider implementing this function in the core.

This PHP function is a direct transfer of the logic of the JavaScript-based WordCounter.count. Unit tests have also been implemented.
https://github.com/WordPress/gutenberg/pull/43403/files#diff-796959a4b78efdd6a2af2ec364be6a4af2566a57041673f2fabbd153bc743cf9

Change History (28)

#1 @sabernhardt
3 years ago

  • Component changed from General to Editor

This ticket was mentioned in PR #4430 on WordPress/wordpress-develop by @wildworks.


3 years ago
#2

  • Keywords has-patch has-unit-tests added

#3 @wildworks
3 years ago

I have attached the patch. This patch includes the function itself and unit tests.

The unit test is based on this Qunit test. The test cases and expected results are exactly the same as for QUnit.:
https://github.com/WordPress/wordpress-develop/blob/24c4d56d517adca38a51ef8ea03bd9886cf6f2d6/tests/qunit/wp-admin/js/word-count.js

The function is based on this JS file. While there are differences in the way regular expressions are written due to language, I believe the logic is the same.: https://github.com/WordPress/wordpress-develop/blob/24c4d56d517adca38a51ef8ea03bd9886cf6f2d6/src/js/_enqueues/wp/utils/word-count.js

If there is anything I am missing, please let me know.

#4 @wildworks
2 years ago

The Post Time To Read block that exists on the Gutenberg plugin depends on this function. I'm hoping this block will ship in WP6.4 and would be happy to move this ticket forward.

#5 @wildworks
18 months ago

As for stabilizing the Post Time Read block and shipping it to the core, it doesn't seem positive from an accessibility standpoint (See this issue: https://github.com/WordPress/gutenberg/issues/53776).

This function is necessary for this block, but if the function itself is not very useful, we might want to close this ticket for now.

This ticket was mentioned in PR #9562 on WordPress/wordpress-develop by @pbearne.


3 months ago
#6

#7 @pbearne
3 months ago

  • Milestone changed from Awaiting Review to 6.9

refreshed the patch so we can merge this for the https://github.com/WordPress/gutenberg/issues/53776 issue

#8 @pbearne
3 months ago

  • Owner set to pbearne
  • Status changed from new to assigned

@wildworks commented on PR #9562:


3 months ago
#9

Closing as a duplicate of #4430

#10 @wildworks
3 months ago

@pbearne, Please do not submit a patch that is exactly the same as a patch that has already been submitted.

#11 @peterwilsoncc
3 months ago

@wildworks It looks like Paul was opening it to resolve the merge conflict and ensure the tests pass. The logs from the original PR are long expired.

#12 @oglekler
3 months ago

  • Keywords needs-testing added

Adding needs-testing.

Function wp_word_count() can be used to count the words in the

$content = apply_filters( 'the_content', get_the_content() );

The logic of this counting can be observed and checked in the PR from Data provider for Unit tests.

#13 @wildworks
3 months ago

Thanks to everyone who contributed to the ticket.

As I commented, the wp_word_count function was needed for the Time to Read block. However, the block has not been decided to ship to the core. Therefore, the wp_word_count function is no longer necessary.

I think we need to audit whether the function itself is useful or not first. Personally, I suggest closing this patch as "Maybe later".

#14 @oglekler
3 months ago

  • Keywords needs-testing removed

#15 @pbearne
3 months ago

The time to read block is planned to be added in the next release https://github.com/WordPress/gutenberg/issues/71026

So please let's add it

#16 @wildworks
3 months ago

The time to read block is planned to be added in the next release
https://github.com/WordPress/gutenberg/issues/71026

I don't think so. See the last paragraph of that issue.

(This list doesn't imply these blocks are ready, we should continue the conversation on each so that we produce the best versions.)

As you can see, I don't think the issue means we'll be shipping the block in the next release.

Please visit GB 53776 and see the accessibility issues that need to be resolved.

#17 @wildworks
6 weeks ago

  • Keywords needs-testing added

Update: We are about to stabilize the Time to Read block and plan to ship to 6.9: https://github.com/WordPress/gutenberg/pull/71588

So I hope to be able to commit this patch in time for the 6.9 beta release, and I look forward to any additional feedback!

The PR is up to date with the conflicts resolved.

cc @mikachan @scruffian

#18 @wildworks
5 weeks ago

Update: We have stabilized the Time to Read block. This block will ship in 6.9, so this ticket should also be addressed in the 6.9 release.

We welcome your feedback on the submitted PR. https://github.com/WordPress/wordpress-develop/pull/4430/

@wildworks commented on PR #4430:


5 weeks ago
#19

@dmsnell Thanks for the feedback! First, let me explain why this function is implemented the way it is.

The Time to Read block needs to dynamically measure the content length on both the client side and the server side. The values ​​must match exactly. In other words, the wp_word_count function is the PHP version of the `count` function 😅 To be more precise, the wp_word_count function is the PHP version of wp.utils.wordcounter.

Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

@dmsnell commented on PR #4430:


5 weeks ago
#20

The values ​​must match exactly.

Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

Is it possible to completely mimic the logic of the wordCount function using the HTML API? Or is it possible to do the opposite, to incorporate HTML API logic into the wordCount function?

I doubt it would be pragmatic to get a 100% match; I have my doubts that this currently performs a 100% match. In fact, I _know_ that based on the implementation as-is they don’t match, and pointed out one area where they differ.

If we want a 100% match then it would probably be best to request it from the server via API call. JavaScript strings and RegExp are so different than PHP strings and preg_ functions that fully harmonizing is almost impossible without something like a shared module compiled to WASM.

There are efforts to bring the HTML API into JavaScript, but the harder part is going to be lower-level string behaviors.

JavaScript has `Intl.Segmenter` which should give the same word-breaks as IntlBreakIterator::createWordInstance(), however, it’s likely that as new Unicode versions are released, if the word-breaking rules change, PHP will likely lag behind the browsers.

At some point I thought I suggested moving the JS word count function to rely on Intl.Segmenter — it’s so fraught to attempt to count words manually, especially for languages which are segmented on spaces.

It’s my aspiration to build something like .innerText using the HTML API because that’s the best property to lean into in JS for counting words. The problem is that .innerText incorporates CSS styling and gets really complicated.

---

That being said, if we take .innerText in the browser and a reasonable facsimile from the HTML API on the server, then lean on the Intl tools (which are all using the same underlying ICU4C library) then we’ll have counts that should normally agree and also, more importantly, far more closely match the actual word counts from the content than with expensive and complicated regex-based approaches.

One thing I love about relying on the standard functions is that however good or bad it is, it should agree with what you’d get from a browser, from an operating system, from a word processor.

---

Carry on as you wish. I’m happy to share more about these if you are curious, but I don’t need to hold up your good work. Just a personal side-mission of mine to let people know we can have realistic word counts if we want them 😄

@wildworks commented on PR #4430:


5 weeks ago
#21

Is it universally recognized that this is the most important aspect? Why is it more important to give the same count than to give a reasonable count?

It feels odd to see the front-end and editor display differently, even though the displayed content is the same. My guess is that users will see it as a bug.

As Beta1 approaches, we need to decide what direction to go in. What do you suggest is the best way to go about it? Would it be better to refactor the wp_word_count function to use the HTML API before the Beta 1 release?

@dmsnell commented on PR #4430:


5 weeks ago
#22

users will see it as a bug.

It’s already buggy and strange. For long posts though it’s still a reasonable estimate. Though it could be interesting to compare the counts using a word counter and the counts using a regex.

What do you suggest is the best way to go about it?

If you want to pursue something closer to the stronger word counts I’ll support you, but if you want to get this in now I support your call there as well. These things can be made better in the future; I just get an allergic reaction when I see new HTML parsers being introduced 🙃.

@wildworks commented on PR #4430:


5 weeks ago
#23

We've decided to ship the Time to Read and Word Count blocks in 6.9. These blocks require the wp_word_count() function to work.

It would be nice to incorporate the HTML API in the 6.9 release, but to resolve client-side inconsistencies, we need a new REST API endpoint to calculate the word count from the current editor content.

I'm not very familiar with the HTML API, so I'd like to know what the optimal solution is for the 6.9 release 👀

@wildworks commented on PR #4430:


5 weeks ago
#24

Here's what I think is the most prudent approach for the 6.9 release: What do you think?

  • Close this PR.
  • Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.
  • Continually improve that logic using the HTML API.
  • Once the logic is stable and robust, add it to the core as the wp_word_count function.

@dmsnell commented on PR #4430:


5 weeks ago
#25

Sounds good @t-hamano — I don’t know if you need to close this, unless you meant by merging. We can iterate on it after you have something out there working for what you need.

@wildworks commented on PR #4430:


5 weeks ago
#26

  • Incorporate logic equivalent to the current wp_word_count function into the render callback function of the Time To Read block.

I've submitted a Gutenberg PR on this: https://github.com/WordPress/gutenberg/pull/72091

Let's close this PR _without_ committing to core.

#27 @wildworks
5 weeks ago

For now, it's probably best NOT to add the wp_word_count() function to core, and maybe we can close this ticket too.

#28 @wildworks
5 weeks ago

  • Milestone 6.9 deleted
  • Resolution set to maybelater
  • Status changed from assigned to closed

For now, let's close it as maybelater. We can reopen it in the future when we're ready to make this function public.

Note: See TracTickets for help on using tickets.