Make WordPress Core

Opened 12 years ago

Closed 12 years ago

Last modified 7 years ago

#23625 closed feature request (fixed)

Add function to extract or parse chat as structured data

Reported by: wonderboymusic's profile wonderboymusic Owned by:
Milestone: 3.6 Priority: normal
Severity: normal Version:
Component: Post Formats Keywords: has-patch needs-codex
Focuses: Cc:

Description

Scenarios

  1. You have a blob of content representing a chat transcript and optionally some commentary and you want to extract the chat as structured data, and optionally, remove it from the content
  2. You want to extract the chat from a post's content to theme it, with or without the optional commentary, outside of the_content()

Introduce get_the_chat( $id = 0 ) and get_content_chat( &$content, $remove = false )

Here is an example to test:

$chat =<<<CHAT
Scott: Hey, let's chat!
	This is more text
Helen: No.

Nacin: Cool.

Scott: Hey, let's chat!
Helen: No.

Scott: Hey, let's chat!
Helen: No.

Scott: Hey, let's chat!
Helen: No.

This is the commentary on the chat.
CHAT;

print_r( get_content_chat( $chat, true ) );
echo $chat;
exit();

Here is the format of the structured chat data:

$stanzas = array(
      array(
          array(
              'author' => 'Scott',
              'messsage' => "Hey, let's chat!"
          ),
          array(
              'author' => 'Helen',
              'message' => 'No.'
          )
      )
  )

Attachments (17)

23625.diff (2.9 KB) - added by wonderboymusic 12 years ago.
23625.2.diff (2.9 KB) - added by wonderboymusic 12 years ago.
23625.3.diff (2.9 KB) - added by wonderboymusic 12 years ago.
23625.4.diff (3.0 KB) - added by wonderboymusic 12 years ago.
23625.5.diff (3.2 KB) - added by wonderboymusic 12 years ago.
23625.6.diff (3.4 KB) - added by wonderboymusic 12 years ago.
23625-tests.diff (12.2 KB) - added by wonderboymusic 12 years ago.
23625.7.diff (3.7 KB) - added by wonderboymusic 12 years ago.
23625.8.diff (5.2 KB) - added by wonderboymusic 12 years ago.
23625-tests.2.diff (9.2 KB) - added by wonderboymusic 12 years ago.
23625.9.diff (6.1 KB) - added by lancewillett 12 years ago.
23625.10.diff (6.5 KB) - added by wonderboymusic 12 years ago.
23625.11.diff (4.5 KB) - added by obenland 12 years ago.
23625.12.diff (966 bytes) - added by SergeyBiryukov 12 years ago.
23625.13.diff (416 bytes) - added by SergeyBiryukov 12 years ago.
23625.14.diff (1.8 KB) - added by wonderboymusic 12 years ago.
23625.15.diff (2.7 KB) - added by wonderboymusic 12 years ago.

Download all attachments as: .zip

Change History (71)

#1 @wonderboymusic
12 years ago

  • Component changed from General to Post Formats

#2 @wonderboymusic
12 years ago

Refreshed against test scenarios. There is no post_format_compat for posts or themes that explicitly support the chat post format. get_the_chat( $id = 0 ) or get_content_chat( &$content, $remove = false ) can be used by themes that support structured-post-formats instead of calling the_content(). There are at least 8 scenarios that those functions have to consider:

  • One Stanza, no messages are multiline
  • One Stanza, no messages are multiline, followed by commentary
  • One Stanza, contains one of more multiline messages with or without :s that are not usernames
  • One Stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames
  • Multi-stanza, no messages are multiline
  • Multi-stanze, no messages are multiline, followed by commentary
  • Multi-stanza, contains one of more multiline messages with or without :s that are not usernames
  • Multi-stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames

The resulting array() is a list of stanzas which each contain a list of messages, each is an associative array that contains author and message data

#3 @obenland
12 years ago

I was looking at your second diff:

Testing with a copy/paste from Skype was initially unsuccessful, due to the whitespace check in author names. May I ask why they shouldn't contain whitespaces?

I also noticed that Skype prepends every line with a timestamp like so: [3/5/13 11:41:02 AM]. So I added a preg replace before the initial trim of each line to remove that:

$line = preg_replace( '/\[.+\]\s/', '', $line ); // Skype

My regex-skills are lacking at best, so I'm sure this could be optimized. Is this something you'd consider adding in the function?

I found the naming of get_the_chat() confusing, initially I expected a single chat.

This is how I would use it as it currently stands:

<?php
        $chats = get_the_chat();
        $chat = array_shift( $chats );
        foreach ( $chat as $row ) :
                $message = make_clickable( $row['message'] );
?>
        <div class="row">
                <strong class="author"><?php echo $row['author']; ?>:</strong>
                <span class="message"><?php echo $message; ?></span>
        </div>
<?php endforeach; ?>

I hope you find the feedback constructive. Overall I am super psyched about the way you implemented this! Chats come finally to life for me! :)

#4 @wonderboymusic
12 years ago

yeah, the main issue is the delimiter for the start of a new author line, I arbitrarily decided it was "mynamewithnocolonsorwhitespace: ", so I need some help defining that for sure.

If you do array_shift() on the whole chat, you are assuming there is only one stanza, here is what multi-stanza looks like:

Me: Hey.
You: Hi!

Me: Hey.
You: Hi!

Let's keep the use cases coming, for sure

#5 @nacin
12 years ago

  • Keywords needs-unit-tests added

I would use a simple check for names:

^(.*?):

Putting it another way, explode( ':', $line, 2 ).

That's it. If it has a timestamp, then let it be part of the meta info for that line.

This would break in one situation: Gmail's chat output does not repeat the person's name in subsequent lines. If they use a colon, then the first phrase will be caught as the "name". I don't mind that. Note, this is exactly how Tumblr also happens to parse it.

I like the idea of make_clickable() being inside the core function. Let's consider running that once over the entire chat content, rather than on each line, if we can get away with it.

Really needs unit tests!

#6 @wonderboymusic
12 years ago

Patch updated - here are the test cases, Unit Tests are forthcoming:

One Stanza, no messages are multiline

Scott: Hey.
Nacin: Go away.

One Stanza, no messages are multiline, followed by commentary

Scott: Hey.
Nacin: Go away.

Nacin is mean to me.

One Stanza, contains one of more multiline messages with or without :s that are not usernames

Scott: Hey.
I have a question.
Nacin: Go away.

OR

Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.

One Stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames

Scott: Hey.
I have a question.
Nacin: Go away.

Nacin hates color.

OR

Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.

Nacin hates color.

Multi-stanza, no messages are multiline

Scott: Hey.
Nacin: Go away.

Helen: Hey.
Nacin: Go away.

Multi-stanza, no messages are multiline, followed by commentary

Scott: Hey.
Nacin: Go away.

Helen: Hey.
Nacin: Go away.

Nacin appears busy today.

Multi-stanza, contains one of more multiline messages with or without :s that are not usernames

Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.

Helen: Hey.
I have a question: what is your favorite pizza topping?
Nacin: Go away.

Multi-stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames

Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.

Helen: Hey.
I have a question: what is your favorite pizza topping? Or go here http://www.justinbieber.com
Nacin: Go away.

Nacin hates color and pizza.

#7 @wonderboymusic
12 years ago

Patch updated to support optional leading Timestamps a la Skype - made revisions to code based on UNIT TESTS (ding ding ding) which are attached as well

#8 @wonderboymusic
12 years ago

Note: I need to add the Twenty Thirteen chats to the Unit Tests, I saw one that I don't think would pass

#9 @rmccue
12 years ago

Any thoughts on supporting IRC style?

<rmccue> Hi!
<nacin> You suck.

And/or with timestamps.

[08:15:04] <rmccue> Hi!
[08:15:16] <nacin> You suck.

#10 @wonderboymusic
12 years ago

Updated patch matches @rmccue's IRC style chats, Skype chats, stanza-based script-like chats, and the chat used in TwentyThirteen. Also updated Unit Tests, there are 500 lines of them

#11 @MikeHansenMe
12 years ago

  • Cc mdhansen@… added

I copied some code in from #wordpress irc channel and used print_r( get_the_chat( get_the_ID() ) ) to view the chat array. It seems to have broke though. See last message.

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [time] => 
                    [author] => gmorrison
                    [message] => just a sec
                )

            [1] => Array
                (
                    [time] => 
                    [author] => gmorrison
                    [message] => ok
                )

            [2] => Array
                (
                    [time] => 
                    [author] => gmorrison
                    [message] => changing uploads to 777 let it create the folder for that
                )

            [3] => Array
                (
                    [time] => 
                    [author] => gmorrison
                    [message] => so yes it can * daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress <snowfox_ben> here is my code <a href="https://gist.github.com/benschaaf/91179440d1996b48e382" rel="nofollow">https://gist.github.com/benschaaf/91179440d1996b48e382</a>
                )

        )

)
Last edited 12 years ago by MikeHansenMe (previous) (diff)

#12 @wonderboymusic
12 years ago

can you paste the original text so I can add it to Unit Tests? it's the <([^>]+)> - needs to be <([^>\s]+)>

#13 @MikeHansenMe
12 years ago

Yea, meant to include that. Sorry.

<gmorrison> just a sec
<gmorrison> ok
<gmorrison> changing uploads to 777 let it create the folder for that
<gmorrison> so yes it can
* daveluke has quit (Quit: daveluke)
* robmiller (~robmiller@141.0.147.143) has joined #wordpress
<snowfox_ben> here is my code https://gist.github.com/benschaaf/91179440d1996b48e382

#14 @wonderboymusic
12 years ago

Patch updated / Unit Tests updated - 16 tests all pass now

#15 @MikeHansenMe
12 years ago

Looking at the unit test for the example. I am not sure the expected var is correct. Let me elaborate a bit.

'message' => 'so yes it can * daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress' 

in the above message we have the message "so yes it can" then we have notices added as well.

* daveluke has quit (Quit: daveluke)

* robmiller (~robmiller@141.0.147.143) has joined #wordpress

are notices, I am not sure what should be expected. Should it be removed, added as its own message or the users responsibility to remove before publishing. Either way I do not think it should be included as part of the previous message.

#16 @helen
12 years ago

I think we are going a little too far with what we support in core. I think each line/paragraph as a message, with anything before the first colon being a label of sorts for that line (author, whatever you want to call it), is quite enough. This should be extensible to add support for other styles, but in the interest of getting this in and keeping it predictable for users, who tend to adjust their content accordingly once they figure it out or read some docs, let's choose one style as the core style.

#17 follow-up: @alex-ye
12 years ago

I think you go far with all of this :D , Personally i am not like the new functions get_content_chat() and get_the_chat() functions , you just add some junk to the core :)

I love WordPress to more clean , more simple .

#18 in reply to: ↑ 17 ; follow-up: @helen
12 years ago

Replying to alex-ye:

you just add some junk to the core :)

Feel free to propose a junk-free method of consistently handling the chat post format.

#19 follow-up: @MikeHansenMe
12 years ago

I think the new functions are necessary to make post formats useful rather than just labels. I do however think IRC chats may be plugin territory. Because of things like the notices.

#20 in reply to: ↑ 18 @alex-ye
12 years ago

Replying to helen:

Replying to alex-ye:

you just add some junk to the core :)

Feel free to propose a junk-free method of consistently handling the chat post format.

I think we should not at all !! , there are many many cases of chat styles the can broke the previous patches ... I think the user when he use the chat format he only want to bookmark his post as "chat" only ! , no parse content , no complicated staff !!

If some user want to make advance chat format , he will use a plugin or theme that support that very well , and maybe better than yours ..

As I said , I love WordPress to be clean as possible , post-formats is really good feature you make it better with 3.6 ... but don't make it very complicated :)

after all , maybe many people disagree with me , but I like to say my opinion .

#21 @alex-ye
12 years ago

In my opinion , if you want to add some inputs to chat format , you have many options like :

1 - Chat people names ( or what you want )
2 - Chat date
3 - Chat place

... etc

#22 in reply to: ↑ 19 @alex-ye
12 years ago

Replying to MikeHansenMe:

I think the new functions are necessary to make post formats useful rather than just labels. I do however think IRC chats may be plugin territory. Because of things like the notices.

I can't see parsing chat is necessary , Give the user the abillity to write the chat content as the style he want , don't make him confuse about the required style to make get_content_chat() works !!

if you want to make "chat" format is just no label , you can add some simple staff , and the rest let it to the theme or plugin developers :)

The idea of 3.6 as I know is to make editing more easier and give the theme and plugin developers more ability to customize the content .

#23 follow-up: @wonderboymusic
12 years ago

no one has to use the function, it's just there if you want to get the structured data, which apparently you're a huge fan of - thanks for the alternate patch and additional unit tests

Last edited 12 years ago by wonderboymusic (previous) (diff)

#24 in reply to: ↑ 23 @alex-ye
12 years ago

Replying to wonderboymusic:

no one has to use the function, it's just there if you want to get the structured data, which apparently you're a huge fan of

Thank you for reply , So you said that you want to add "unused" functions to core , I am not with that too , In my opinion this thing should be a plugin , we can make it more professional ... etc

WordPress 3.5 removes the bookmark functionality by default , and maybe in future releases we will remove more "unused" stuffs .

So can we stop adding "unused" functionality at least !! , so we will don't have a ticket with title "Remove support for ..." :)

#25 follow-up: @wonderboymusic
12 years ago

theme support

add_theme_support( 'structured-post-formats' );

theme that supports 'structured-post-formats'

$data = get_the_chat();
// do a bunch of stuff

theme that doesn't

the_content();

Come troll in dev chat later if this is ruining your life

#26 in reply to: ↑ 25 @alex-ye
12 years ago

Replying to wonderboymusic:

Come troll in dev chat later if this is ruining your life

Yes it running my life , but the dev chat time doesn't fit me :(

Anyway I just said my opinion feel free to agree or disagree .
and Thank you for your time :)

#27 @helen
12 years ago

Thinking that we should have an actual HTML output function for this as well. Could be another ticket, but just throwing that thought on here while I'm awake.

#28 @markjaquith
12 years ago

We decided in the dev chat to simplify our regex support, and to make it pluggable. Think something along the lines of:

add_chat_detection_formar( $regex_to_detect_this_type, $regex_to_parse_this_type );

Do detection based on the first line that matches one, then have the rest use that type. Have our core regexes use this system so we're on even footing. Someone can add some crazy "parse all the chats" plugin, but we don't need that level of crazy support in core.

Version 0, edited 12 years ago by markjaquith (next)

#29 @wonderboymusic
12 years ago

Updated patch and unit tests

In this patch:

  • Introduce add_chat_detection_format( $name, $newline_regex, $delimiter_regex )
  • Introduce get_content_chat( &$content, $remove = false )
  • Introduce get_the_chat( $id = 0 )

Add chat support for IM and Skype:

add_chat_detection_format( 'IM', '#^([^:]+):#', '#[:]#' );
add_chat_detection_format( 'Skype', '#^(\[.+?\])\s([^:]+):#', '#[:]#' );

#30 @markjaquith
12 years ago

In 23804:

Extract chats as structured data.

  • add_chat_detection_format() — to add a chat regex pattern
  • get_content_chat() — to grab a chat from content
  • get_the_chat() — grab the chat from the current (or passed) post
  • the_chat() — output the chat in formatted HTML
  • paginate_content() — puts the <!--nextpage--> splitting stuff into a function
  • get_paged_content() — grabs a page of raw content, needed to paginate chats properly

see #23625. props wonderboymusic, lancewillett.

#31 @DrewAPicture
12 years ago

  • Keywords needs-codex added

@obenland
12 years ago

#32 @obenland
12 years ago

.11 does some clean up. A few things to explain:

  • Added periods to some comments (might be worth it to standardize that across post-format.php)
  • Moved similar variable declarations together
  • Sets up $newline_regex, $delimiter_regex in regex checker loop.
  • Changed chat class to just 'chat' to be consistent with proposed quote fallback
  • Sanitized author with dashes to avoid cases like class="chat-author-lance willet"
  • Message should not be html escaped, so make_clickable() works.

#33 @Jayjdk
12 years ago

  • Cc kontakt@… added

What about the markup generated by the_chat()? According to HTML5Doctor is the use of <dl> for marking up dialogue appropriate for HTML4 but not for HTML5 (http://html5doctor.com/the-dl-element/)

In the IRC, Mark Jaquith linked to this page:
http://www.w3.org/html/wg/drafts/html/master/common-idioms.html#conversations

"[...] authors are encouraged to mark up conversations using p elements and punctuation. Authors who need to mark the speaker for styling purposes are encouraged to use span or b."

#34 @lancewillett
12 years ago

I'm open to discussing the markup; a definition list still seems the most semantic and simplest method. I'm not taking HTML5Doctor as gospel. :)

Regarding .11 patch from Obenland, the better class value and slug sanitizing make sense -- +1

#35 follow-up: @helen
12 years ago

Managed to get:

Notice: Undefined offset: 1 in .../trunk/wp-includes/post-formats.php on line 534

Trace says:

get_content_chat( $content = 'Otto 12:51 AM: yeah. Oh well. I have pasta! It\'s delicious! :)\r\n\r\nHelen Hou-Sandi 12:51 AM: NOM.\r\ni ate 8 cookies and am ever-so-slightly regretting it', $remove = ??? )

#36 @markjaquith
12 years ago

In 23876:

Chat post format code cleanup.

see #23625. props obenland.

#37 @SergeyBiryukov
12 years ago

23625.12.diff fixes an issue with some UTF-8 characters (previously [12499], [19866], [21862], and [22306]).

Steps to reproduce:

  1. Create a chat post, paste this into content:
    Scott: Hey, let's chat!
    Роман: No.
    
  2. View the post. The second message will not be recognized properly:
    Scott:
    Hey, let's chat! Роман: No.
    

#38 in reply to: ↑ 35 @SergeyBiryukov
12 years ago

Replying to helen:

Managed to get:

Notice: Undefined offset: 1 in .../trunk/wp-includes/post-formats.php on line 534

Confirmed with your example. Appears to be caused by an empty line between the messages.

In current trunk, the notice comes from line 527:
http://core.trac.wordpress.org/browser/trunk/wp-includes/post-formats.php?rev=23876#L527

23625.13.diff fixes this for me.

#39 @markjaquith
12 years ago

In 23880:

Avoid PHP notice on blank chat lines. props SergeyBiryukov. see #23625

#40 @SergeyBiryukov
12 years ago

In 23914:

Check for [\r\n\t ] instead of \s in get_content_chat() to avoid UTF-8 issues. see #23625.

#41 follow-up: @lancewillett
12 years ago

In r23876 we removed esc_html() from the chat message output, but I think it should go back.

Test with:

[3/21/13 9:35:04 AM] Fancy: Hey Lance!
[3/21/13 9:41:23 AM] Pants: This is why I don't chat with you
[3/21/13 9:42:13 AM] Lance: </chat>
[3/21/13 9:42:19 AM] Pants: Hey, I'm not done

The </chat> bit should be escaped in the HTML output, and it's not.

#42 in reply to: ↑ 41 @lancewillett
12 years ago

Replying to lancewillett:

In r23876 we removed esc_html() from the chat message output, but I think it should go back.

Noting that make_clickable() was the reason to remove it.

#43 @Ipstenu
12 years ago

Related to #23947 - currently this is broken in instances where chat names have spaces.

http://test.ipstenu.org/testing-1.2013/

That has the post content as follows:

Nigel Tufnel: The numbers all go to eleven. Look, right across the board, eleven, eleven, eleven and…

Marti DiBergi: Oh, I see. And most amps go up to ten?

Nigel Tufnel: Exactly.

Marti DiBergi: Does that mean it’s louder? Is it any louder?

Nigel Tufnel: Well, it’s one louder, isn’t it? It’s not ten. You see, most blokes, you know, will be playing at ten. You’re on ten here, all the way up, all the way up, all the way up, you’re on ten on your guitar. Where can you go from there? Where?

Marti DiBergi: I don’t know.

Nigel Tufnel: Nowhere. Exactly. What we do is, if we need that extra push over the cliff, you know what we do?

Marti DiBergi: Put it up to eleven.

Nigel Tufnel: Eleven. Exactly. One louder.

Marti DiBergi: Why don’t you just make ten louder and make ten be the top number and make that a little louder?

Nigel Tufnel: These go to eleven.

But nothing shows.

#44 @MikeHansenMe
12 years ago

I have been having a bit of trouble with it this morning as well.
The transcript I used (google chat):

3:54 PM me: http://i.imgur.com/qjjlkGm.gif
3:55 PM Jason: I am so confused
  I will have to watch it a few times to see how they get the extra piece :)
3:56 PM me: me too cant figure it out probably watched it 20 times now
 Jason: but now I know I can steal a square out of my wife's candy bar and get away with it :D
 me: yes!
3:57 PM Jason: now I am going to have to buy her one to try it out

The result:

3:
    54 PM me: http://i.imgur.com/qjjlkGm.gif
3:
    55 PM Jason: I am so confused I will have to watch it a few times to see how they get the extra piece :)
3:
    56 PM me: me too cant figure it out probably watched it 20 times now
Jason:
    but now I know I can steal a square out of my wife's candy bar and get away with it :D
me:
    yes!
3:
    57 PM Jason: now I am going to have to buy her one to try it out 

I know this may be because Google chat is not supported but I think "IM" needs to be more clear as to what it includes. I also ran into a situation like @ipstenu where the output just went blank. I think if it gets hung up or cannot recognize a chat style it should output everything not nothing. I also got an undefined index on line 529 in post formats.

$author_match = empty( $matches[2] ) ? $matches[1] : $matches[2];

#45 @Jayjdk
12 years ago

There's also an error if you have text before the chat, like an introduction.

Introduction to a Skype chat.

[28-03-2013 15:54:03] Jesper: Lorem ipsum dolor sit amet
[28-03-2013 16:07:00] Jakob: Consectetuer adipiscing elit
[28-03-2013 16:07:18] Jesper: Maecenas porttitor congue massa
[28-03-2013 16:07:28] Jakob: Fusce posuere, magna sed.

It gives a notice and no content is displayed. The notice is:

Notice: Undefined offset: 1 in [...]\wp-includes\post-formats.php on line 529

It's the same line as MikeHansenMe in comment:44.

#46 @wonderboymusic
12 years ago

.14.diff fixes Skype parsing when the chat occurs after that start of the post content. I wouldn't complain too loudly if someone committed 23625-tests.2.diff​ to Unit Tests, por favor

#47 @Jayjdk
12 years ago

attachment:23625.14.diff fixes the notice and the content is displayed again. But it removes any content outside of the chat, I don't know if that is intentional?

#48 @wonderboymusic
12 years ago

In 1258/tests:

Add chat Unit Tests. See #23625.

#49 @wonderboymusic
12 years ago

  • Keywords needs-unit-tests removed

.15.diff covers more cases and will parse the chat / remove from content, regardless of where it is in the content or if there is text before / after it

#50 @wonderboymusic
12 years ago

#23946 was marked as a duplicate.

#51 @markjaquith
12 years ago

In 23978:

Fix some Skype chat parsing issues.

props wonderboymusic. see #23625.

#52 @wonderboymusic
12 years ago

  • Resolution set to fixed
  • Status changed from new to closed

New tickets, please

#53 @nacin
11 years ago

In 24598:

Remove wp_parse_post_content(), get_paged_content(), paginate_content() from 3.6, and remove the new $id parameters for get_the_content() and the_content().

The content parsing functions are good abstractions, but are no longer needed by core and are too closely tied to legacy globals, rather than paving a new path.

For get_the_content() and the_content(), this only worsens the function prototype. It muddies theme-specific display (more links, etc) with filtered content. apply_filters( 'the_content', $post->post_content ) is sufficient practice for now.

see #24330, [24301]. see #23625, [23804].

This ticket was mentioned in Slack in #core-editor by swissspidy. View the logs.


7 years ago

Note: See TracTickets for help on using tickets.