#23625 closed feature request (fixed)
Add function to extract or parse chat as structured data
Reported by: | wonderboymusic | Owned by: | |
---|---|---|---|
Milestone: | 3.6 | Priority: | normal |
Severity: | normal | Version: | |
Component: | Post Formats | Keywords: | has-patch needs-codex |
Focuses: | Cc: |
Description
Scenarios
- You have a blob of content representing a chat transcript and optionally some commentary and you want to extract the chat as structured data, and optionally, remove it from the content
- You want to extract the chat from a post's content to theme it, with or without the optional commentary, outside of
the_content()
Introduce get_the_chat( $id = 0 )
and get_content_chat( &$content, $remove = false )
Here is an example to test:
$chat =<<<CHAT Scott: Hey, let's chat! This is more text Helen: No. Nacin: Cool. Scott: Hey, let's chat! Helen: No. Scott: Hey, let's chat! Helen: No. Scott: Hey, let's chat! Helen: No. This is the commentary on the chat. CHAT; print_r( get_content_chat( $chat, true ) ); echo $chat; exit();
Here is the format of the structured chat data:
$stanzas = array( array( array( 'author' => 'Scott', 'messsage' => "Hey, let's chat!" ), array( 'author' => 'Helen', 'message' => 'No.' ) ) )
Attachments (17)
Change History (71)
#3
@
12 years ago
I was looking at your second diff:
Testing with a copy/paste from Skype was initially unsuccessful, due to the whitespace check in author names. May I ask why they shouldn't contain whitespaces?
I also noticed that Skype prepends every line with a timestamp like so: [3/5/13 11:41:02 AM]
. So I added a preg replace before the initial trim of each line to remove that:
$line = preg_replace( '/\[.+\]\s/', '', $line ); // Skype
My regex-skills are lacking at best, so I'm sure this could be optimized. Is this something you'd consider adding in the function?
I found the naming of get_the_chat()
confusing, initially I expected a single chat.
This is how I would use it as it currently stands:
<?php $chats = get_the_chat(); $chat = array_shift( $chats ); foreach ( $chat as $row ) : $message = make_clickable( $row['message'] ); ?> <div class="row"> <strong class="author"><?php echo $row['author']; ?>:</strong> <span class="message"><?php echo $message; ?></span> </div> <?php endforeach; ?>
I hope you find the feedback constructive. Overall I am super psyched about the way you implemented this! Chats come finally to life for me! :)
#4
@
12 years ago
yeah, the main issue is the delimiter for the start of a new author line, I arbitrarily decided it was "mynamewithnocolonsorwhitespace: ", so I need some help defining that for sure.
If you do array_shift()
on the whole chat, you are assuming there is only one stanza, here is what multi-stanza looks like:
Me: Hey. You: Hi! Me: Hey. You: Hi!
Let's keep the use cases coming, for sure
#5
@
12 years ago
- Keywords needs-unit-tests added
I would use a simple check for names:
^(.*?):
Putting it another way, explode( ':', $line, 2 )
.
That's it. If it has a timestamp, then let it be part of the meta info for that line.
This would break in one situation: Gmail's chat output does not repeat the person's name in subsequent lines. If they use a colon, then the first phrase will be caught as the "name". I don't mind that. Note, this is exactly how Tumblr also happens to parse it.
I like the idea of make_clickable() being inside the core function. Let's consider running that once over the entire chat content, rather than on each line, if we can get away with it.
Really needs unit tests!
#6
@
12 years ago
Patch updated - here are the test cases, Unit Tests are forthcoming:
One Stanza, no messages are multiline
Scott: Hey.
Nacin: Go away.
One Stanza, no messages are multiline, followed by commentary
Scott: Hey.
Nacin: Go away.
Nacin is mean to me.
One Stanza, contains one of more multiline messages with or without :s that are not usernames
Scott: Hey.
I have a question.
Nacin: Go away.
OR
Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.
One Stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames
Scott: Hey.
I have a question.
Nacin: Go away.
Nacin hates color.
OR
Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.
Nacin hates color.
Multi-stanza, no messages are multiline
Scott: Hey.
Nacin: Go away.
Helen: Hey.
Nacin: Go away.
Multi-stanza, no messages are multiline, followed by commentary
Scott: Hey.
Nacin: Go away.
Helen: Hey.
Nacin: Go away.
Nacin appears busy today.
Multi-stanza, contains one of more multiline messages with or without :s that are not usernames
Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.
Helen: Hey.
I have a question: what is your favorite pizza topping?
Nacin: Go away.
Multi-stanza, contains one of more multiline messages with or without :s that are not usernames, followed by commentary with or without :s that are not usernames
Scott: Hey.
I have a question: what is your favorite color?
Nacin: Go away.
Helen: Hey.
I have a question: what is your favorite pizza topping? Or go here http://www.justinbieber.com
Nacin: Go away.
Nacin hates color and pizza.
#7
@
12 years ago
Patch updated to support optional leading Timestamps a la Skype - made revisions to code based on UNIT TESTS (ding ding ding) which are attached as well
#8
@
12 years ago
Note: I need to add the Twenty Thirteen chats to the Unit Tests, I saw one that I don't think would pass
#9
@
12 years ago
Any thoughts on supporting IRC style?
<rmccue> Hi! <nacin> You suck.
And/or with timestamps.
[08:15:04] <rmccue> Hi! [08:15:16] <nacin> You suck.
#10
@
12 years ago
Updated patch matches @rmccue's IRC style chats, Skype chats, stanza-based script-like chats, and the chat used in TwentyThirteen. Also updated Unit Tests, there are 500 lines of them
#11
@
12 years ago
- Cc mdhansen@… added
I copied some code in from #wordpress irc channel and used print_r( get_the_chat( get_the_ID() ) ) to view the chat array. It seems to have broke though. See last message.
Array ( [0] => Array ( [0] => Array ( [time] => [author] => gmorrison [message] => just a sec ) [1] => Array ( [time] => [author] => gmorrison [message] => ok ) [2] => Array ( [time] => [author] => gmorrison [message] => changing uploads to 777 let it create the folder for that ) [3] => Array ( [time] => [author] => gmorrison [message] => so yes it can * daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress <snowfox_ben> here is my code <a href="https://gist.github.com/benschaaf/91179440d1996b48e382" rel="nofollow">https://gist.github.com/benschaaf/91179440d1996b48e382</a> ) ) )
#12
@
12 years ago
can you paste the original text so I can add it to Unit Tests? it's the <([^>]+)>
- needs to be <([^>\s]+)>
#13
@
12 years ago
Yea, meant to include that. Sorry.
<gmorrison> just a sec <gmorrison> ok <gmorrison> changing uploads to 777 let it create the folder for that <gmorrison> so yes it can * daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress <snowfox_ben> here is my code https://gist.github.com/benschaaf/91179440d1996b48e382
#15
@
12 years ago
Looking at the unit test for the example. I am not sure the expected var is correct. Let me elaborate a bit.
'message' => 'so yes it can * daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress'
in the above message we have the message "so yes it can" then we have notices added as well.
* daveluke has quit (Quit: daveluke) * robmiller (~robmiller@141.0.147.143) has joined #wordpress
are notices, I am not sure what should be expected. Should it be removed, added as its own message or the users responsibility to remove before publishing. Either way I do not think it should be included as part of the previous message.
#16
@
12 years ago
I think we are going a little too far with what we support in core. I think each line/paragraph as a message, with anything before the first colon being a label of sorts for that line (author, whatever you want to call it), is quite enough. This should be extensible to add support for other styles, but in the interest of getting this in and keeping it predictable for users, who tend to adjust their content accordingly once they figure it out or read some docs, let's choose one style as the core style.
#17
follow-up:
↓ 18
@
12 years ago
I think you go far with all of this :D , Personally i am not like the new functions get_content_chat() and get_the_chat() functions , you just add some junk to the core :)
I love WordPress to more clean , more simple .
#18
in reply to:
↑ 17
;
follow-up:
↓ 20
@
12 years ago
Replying to alex-ye:
you just add some junk to the core :)
Feel free to propose a junk-free method of consistently handling the chat post format.
#19
follow-up:
↓ 22
@
12 years ago
I think the new functions are necessary to make post formats useful rather than just labels. I do however think IRC chats may be plugin territory. Because of things like the notices.
#20
in reply to:
↑ 18
@
12 years ago
Replying to helen:
Replying to alex-ye:
you just add some junk to the core :)
Feel free to propose a junk-free method of consistently handling the chat post format.
I think we should not at all !! , there are many many cases of chat styles the can broke the previous patches ... I think the user when he use the chat format he only want to bookmark his post as "chat" only ! , no parse content , no complicated staff !!
If some user want to make advance chat format , he will use a plugin or theme that support that very well , and maybe better than yours ..
As I said , I love WordPress to be clean as possible , post-formats is really good feature you make it better with 3.6 ... but don't make it very complicated :)
after all , maybe many people disagree with me , but I like to say my opinion .
#21
@
12 years ago
In my opinion , if you want to add some inputs to chat format , you have many options like :
1 - Chat people names ( or what you want )
2 - Chat date
3 - Chat place
... etc
#22
in reply to:
↑ 19
@
12 years ago
Replying to MikeHansenMe:
I think the new functions are necessary to make post formats useful rather than just labels. I do however think IRC chats may be plugin territory. Because of things like the notices.
I can't see parsing chat is necessary , Give the user the abillity to write the chat content as the style he want , don't make him confuse about the required style to make get_content_chat() works !!
if you want to make "chat" format is just no label , you can add some simple staff , and the rest let it to the theme or plugin developers :)
The idea of 3.6 as I know is to make editing more easier and give the theme and plugin developers more ability to customize the content .
#23
follow-up:
↓ 24
@
12 years ago
no one has to use the function, it's just there if you want to get the structured data, which apparently you're a huge fan of - thanks for the alternate patch and additional unit tests
#24
in reply to:
↑ 23
@
12 years ago
Replying to wonderboymusic:
no one has to use the function, it's just there if you want to get the structured data, which apparently you're a huge fan of
Thank you for reply , So you said that you want to add "unused" functions to core , I am not with that too , In my opinion this thing should be a plugin , we can make it more professional ... etc
WordPress 3.5 removes the bookmark functionality by default , and maybe in future releases we will remove more "unused" stuffs .
So can we stop adding "unused" functionality at least !! , so we will don't have a ticket with title "Remove support for ..." :)
#25
follow-up:
↓ 26
@
12 years ago
theme support
add_theme_support( 'structured-post-formats' );
theme that supports 'structured-post-formats'
$data = get_the_chat(); // do a bunch of stuff
theme that doesn't
the_content();
Come troll in dev chat later if this is ruining your life
#26
in reply to:
↑ 25
@
12 years ago
Replying to wonderboymusic:
Come troll in dev chat later if this is ruining your life
Yes it running my life , but the dev chat time doesn't fit me :(
Anyway I just said my opinion feel free to agree or disagree .
and Thank you for your time :)
#27
@
12 years ago
Thinking that we should have an actual HTML output function for this as well. Could be another ticket, but just throwing that thought on here while I'm awake.
#28
@
12 years ago
We decided in the dev chat to simplify our regex support, and to make it pluggable. Think something along the lines of:
add_chat_detection_format( $regex_to_detect_this_format, $regex_to_parse_this_format );
Do detection based on the first line that matches one, then have the rest use that type. Have our core regexes use this system so we're on even footing. Someone can add some crazy "parse all the chats" plugin, but we don't need that level of crazy support in core.
#29
@
12 years ago
Updated patch and unit tests
In this patch:
- Introduce
add_chat_detection_format( $name, $newline_regex, $delimiter_regex )
- Introduce
get_content_chat( &$content, $remove = false )
- Introduce
get_the_chat( $id = 0 )
Add chat support for IM and Skype:
add_chat_detection_format( 'IM', '#^([^:]+):#', '#[:]#' ); add_chat_detection_format( 'Skype', '#^(\[.+?\])\s([^:]+):#', '#[:]#' );
#32
@
12 years ago
.11 does some clean up. A few things to explain:
- Added periods to some comments (might be worth it to standardize that across post-format.php)
- Moved similar variable declarations together
- Sets up
$newline_regex
,$delimiter_regex
in regex checker loop. - Changed chat class to just 'chat' to be consistent with proposed quote fallback
- Sanitized author with dashes to avoid cases like
class="chat-author-lance willet"
- Message should not be html escaped, so make_clickable() works.
#33
@
12 years ago
- Cc kontakt@… added
What about the markup generated by the_chat()
? According to HTML5Doctor is the use of <dl>
for marking up dialogue appropriate for HTML4 but not for HTML5 (http://html5doctor.com/the-dl-element/)
In the IRC, Mark Jaquith linked to this page:
http://www.w3.org/html/wg/drafts/html/master/common-idioms.html#conversations
"[...] authors are encouraged to mark up conversations using
p
elements and punctuation. Authors who need to mark the speaker for styling purposes are encouraged to usespan
orb
."
#34
@
12 years ago
I'm open to discussing the markup; a definition list still seems the most semantic and simplest method. I'm not taking HTML5Doctor as gospel. :)
Regarding .11 patch from Obenland, the better class value and slug sanitizing make sense -- +1
#35
follow-up:
↓ 38
@
12 years ago
Managed to get:
Notice: Undefined offset: 1 in .../trunk/wp-includes/post-formats.php on line 534
Trace says:
get_content_chat( $content = 'Otto 12:51 AM: yeah. Oh well. I have pasta! It\'s delicious! :)\r\n\r\nHelen Hou-Sandi 12:51 AM: NOM.\r\ni ate 8 cookies and am ever-so-slightly regretting it', $remove = ??? )
#37
@
12 years ago
23625.12.diff fixes an issue with some UTF-8 characters (previously [12499], [19866], [21862], and [22306]).
Steps to reproduce:
- Create a chat post, paste this into content:
Scott: Hey, let's chat! Роман: No.
- View the post. The second message will not be recognized properly:
Scott: Hey, let's chat! Роман: No.
#38
in reply to:
↑ 35
@
12 years ago
Replying to helen:
Managed to get:
Notice: Undefined offset: 1 in .../trunk/wp-includes/post-formats.php on line 534
Confirmed with your example. Appears to be caused by an empty line between the messages.
In current trunk, the notice comes from line 527:
http://core.trac.wordpress.org/browser/trunk/wp-includes/post-formats.php?rev=23876#L527
23625.13.diff fixes this for me.
#41
follow-up:
↓ 42
@
12 years ago
In r23876 we removed esc_html()
from the chat message output, but I think it should go back.
Test with:
[3/21/13 9:35:04 AM] Fancy: Hey Lance! [3/21/13 9:41:23 AM] Pants: This is why I don't chat with you [3/21/13 9:42:13 AM] Lance: </chat> [3/21/13 9:42:19 AM] Pants: Hey, I'm not done
The </chat>
bit should be escaped in the HTML output, and it's not.
- Visual output is empty: http://cl.ly/image/0P112c2D3R1b
- The text is not escaped in the HTML output: http://cl.ly/image/3x0A2X3L051n
#42
in reply to:
↑ 41
@
12 years ago
Replying to lancewillett:
In r23876 we removed
esc_html()
from the chat message output, but I think it should go back.
Noting that make_clickable()
was the reason to remove it.
#43
@
12 years ago
Related to #23947 - currently this is broken in instances where chat names have spaces.
http://test.ipstenu.org/testing-1.2013/
That has the post content as follows:
Nigel Tufnel: The numbers all go to eleven. Look, right across the board, eleven, eleven, eleven and… Marti DiBergi: Oh, I see. And most amps go up to ten? Nigel Tufnel: Exactly. Marti DiBergi: Does that mean it’s louder? Is it any louder? Nigel Tufnel: Well, it’s one louder, isn’t it? It’s not ten. You see, most blokes, you know, will be playing at ten. You’re on ten here, all the way up, all the way up, all the way up, you’re on ten on your guitar. Where can you go from there? Where? Marti DiBergi: I don’t know. Nigel Tufnel: Nowhere. Exactly. What we do is, if we need that extra push over the cliff, you know what we do? Marti DiBergi: Put it up to eleven. Nigel Tufnel: Eleven. Exactly. One louder. Marti DiBergi: Why don’t you just make ten louder and make ten be the top number and make that a little louder? Nigel Tufnel: These go to eleven.
But nothing shows.
#44
@
12 years ago
I have been having a bit of trouble with it this morning as well.
The transcript I used (google chat):
3:54 PM me: http://i.imgur.com/qjjlkGm.gif 3:55 PM Jason: I am so confused I will have to watch it a few times to see how they get the extra piece :) 3:56 PM me: me too cant figure it out probably watched it 20 times now Jason: but now I know I can steal a square out of my wife's candy bar and get away with it :D me: yes! 3:57 PM Jason: now I am going to have to buy her one to try it out
The result:
3: 54 PM me: http://i.imgur.com/qjjlkGm.gif 3: 55 PM Jason: I am so confused I will have to watch it a few times to see how they get the extra piece :) 3: 56 PM me: me too cant figure it out probably watched it 20 times now Jason: but now I know I can steal a square out of my wife's candy bar and get away with it :D me: yes! 3: 57 PM Jason: now I am going to have to buy her one to try it out
I know this may be because Google chat is not supported but I think "IM" needs to be more clear as to what it includes. I also ran into a situation like @ipstenu where the output just went blank. I think if it gets hung up or cannot recognize a chat style it should output everything not nothing. I also got an undefined index on line 529 in post formats.
$author_match = empty( $matches[2] ) ? $matches[1] : $matches[2];
#45
@
12 years ago
There's also an error if you have text before the chat, like an introduction.
Introduction to a Skype chat. [28-03-2013 15:54:03] Jesper: Lorem ipsum dolor sit amet [28-03-2013 16:07:00] Jakob: Consectetuer adipiscing elit [28-03-2013 16:07:18] Jesper: Maecenas porttitor congue massa [28-03-2013 16:07:28] Jakob: Fusce posuere, magna sed.
It gives a notice and no content is displayed. The notice is:
Notice: Undefined offset: 1 in [...]\wp-includes\post-formats.php on line 529
It's the same line as MikeHansenMe in comment:44.
#46
@
12 years ago
.14.diff fixes Skype parsing when the chat occurs after that start of the post content. I wouldn't complain too loudly if someone committed 23625-tests.2.diff to Unit Tests, por favor
#47
@
12 years ago
attachment:23625.14.diff fixes the notice and the content is displayed again. But it removes any content outside of the chat, I don't know if that is intentional?
#48
@
12 years ago
In 1258/tests:
Refreshed against test scenarios. There is no
post_format_compat
for posts or themes that explicitly support thechat
post format.get_the_chat( $id = 0 )
orget_content_chat( &$content, $remove = false )
can be used by themes that supportstructured-post-formats
instead of callingthe_content()
. There are at least 8 scenarios that those functions have to consider::
s that are not usernames:
s that are not usernames, followed by commentary with or without:
s that are not usernames:
s that are not usernames:
s that are not usernames, followed by commentary with or without:
s that are not usernamesThe resulting
array()
is a list of stanzas which each contain a list of messages, each is an associative array that containsauthor
andmessage
data