I have a pandas DataFrame containing instances of web chat between two people (a customer and a service-desk operator).
The customer's name is always announced in the first line of the web chat, when the customer enters the session.
Example 1:
In: df['log'][0]
Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks[14:44:38] James has exited the session.
Example 2:
In: df['log'][1]
Out: [09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: I\'m asking on behalf of partner whether she is able to still claim warranty on a coffee machine she purchased a year and a half ago? [09:02:00] Jamie: Thank you for contacting us. Could I start by asking for the type of coffee machine purchased please, and whether she still has a receipt?[09:02:23] Roy Andrews: BRX0403, she no longer has a receipt.[09:05:30] Jamie: Thank you, my interpretation is that she would not be able to claim as she is no longer under warranty. [09:08:46] Jamie: for more information on our product warranty policy please see www.brandx.com/warranty-policy/information[09:09:13] Roy Andrews: Thanks for the links, I will let her know.[09:09:15] Roy Andrews has exited the session.
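Because different customers use the web chat service, the names in the chats always vary.
A customer can enter a chat with a name made up of one or more words, for example: James, Ravi, Roy Andrews.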
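Requirement:
I would like to separate all instances of the customer's chat entries (e.g. the entries made by James and Roy Andrews) from the df['log'] column into a new column, df['text_analysis'].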
For Example 1 above, this would look like:
In: df['text_analysis'][0]
Out: [14:40:48] You are joining a chat with James[14:40:48] James: Hello, I\'m looking to find out more about the services and products you offer.[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:44:12] James: Thanks
Edit:
The best solution would extract the substring as in the example above and omit the final timestamped entry, [14:44:38] James has exited the session.
What I have tried so far:
I have extracted the customer names from the df['log'] column into a new column called df['names'], using:
df['names'] = df['log'].apply(lambda x: x.split(' ')[7].split('[')[0])
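Note that splitting on spaces keeps only the first word of a multi-word name: for the second chat it yields Roy rather than Roy Andrews. A minimal sketch of a regex-based alternative follows; the two-row frame is a shortened, made-up stand-in for the logs above, and the pattern assumes every log opens with the "You are joining a chat with" greeting followed directly by the next timestamp.

```python
import pandas as pd

# Hypothetical two-row frame mirroring the opening of the chats above.
df = pd.DataFrame({
    "log": [
        "[14:40:48] You are joining a chat with James[14:40:48] James: Hello",
        "[09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: Hi",
    ]
})

# Capture everything between the greeting and the next "[" (the next timestamp).
pattern = r"You are joining a chat with ([^\[]+)\["
df["names"] = df["log"].str.extract(pattern, expand=False).str.strip()

print(df["names"].tolist())
```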
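I would like to use the names in the df['names'] column in the pandas str.split() function, something like:
df['log'].str.split(df['names'])
However, this does not work, and even if a split did happen in this scenario, I don't think it would correctly separate the customer's chat entries from the service operator's.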
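I have also tried incorporating the names into a regex-type solution:
df['log'].str.extract('([^.]*{}[^.]*)').format(df['log']))
But this does not work either (I am guessing because .extract() does not support format).
Any help would be appreciated.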
Answer 0 (score: 0)
Using regex, where longs holds the first passage (the log from Example 1):
import re
re.match(r'^.*(?=\[)', longs).group()
Result:
"[14:40:48] You are joining a chat with James[14:40:48] James: Hello, I'm looking to find out more about the services and products you offer.[14:41:05] Greg: Thank you for contacting us. [17:41:14] Greg: Could I start by asking what services lines or products you are interested in knowing more about, please?[14:41:23] James: I would like to know more about your gardening and guttering service.[14:43:20] James: hello?[14:43:32] Greg: thank you, for more information on those please visit www.example.com/more_examples.[14:44:12] James: Thanks"
You can apply this regex to your DataFrame:
df['text_analysis'] = df['log'].apply(lambda x: re.match(r'^.*(?=\[)', x).group())
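One caveat with the one-liner: re.match returns None when a log contains no "[" at all, so calling .group() on it would raise an AttributeError, and by default . does not match newlines. A guarded sketch follows; the helper name drop_last_entry and the one-row frame are hypothetical, and the re.DOTALL flag is only needed if the logs can contain line breaks.

```python
import re
import pandas as pd

# Made-up one-row frame; the real df['log'] column holds the full chats.
df = pd.DataFrame({
    "log": ["[14:44:12] James: Thanks[14:44:38] James has exited the session."]
})

def drop_last_entry(log):
    # Greedy .* runs to the end, then backtracks to just before the last "[".
    match = re.match(r'^.*(?=\[)', log, flags=re.DOTALL)
    # Fall back to the unchanged log when there is no "[" to stop at.
    return match.group() if match else log

df['text_analysis'] = df['log'].apply(drop_last_entry)
print(df['text_analysis'][0])
```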
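Explanation: the regex string '^.*(?=\[)' means: starting at the beginning (^), match any number of any characters (.*), ending just before a [ without including it ((?=\[)). Because the regex matches the longest possible string, the match runs from the beginning up to the last [, and does not include that [.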
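The greedy behaviour described above can be checked on a short made-up string. This sketch contrasts the greedy quantifier with a lazy one (.+?), which stops before the first later "[" instead of the last; the string s_demo is purely illustrative.

```python
import re

s_demo = "[1] a[2] b[3] c"

# Greedy: .* consumes as much as possible, so the match ends before the last "[".
greedy = re.match(r'^.*(?=\[)', s_demo).group()

# Lazy: .+? consumes as little as possible (at least one character),
# so the match ends before the first "[" that follows.
lazy = re.match(r'^.+?(?=\[)', s_demo).group()

print(greedy)
print(lazy)
```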
The individual lines can be extracted as follows, where s is the chat log string:
import re
customerspeak = re.findall(r'(?<=\[(?:\d{2}:){2}\d{2}\]) James:[^\[]*', s)
Output:
[" James: Hello, I'm looking to find out more about the services and products you offer.",
 ' James: I would like to know more about your gardening and guttering service.',
 ' James: hello?',
 ' James: Thanks']
If you want them on a single line, you can join the list elements into one string.
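To get the format asked for in the question (timestamps kept, opening line kept, operator entries and the exit notice dropped) for every row, the name can be read from each log rather than hard-coded. The following is a sketch under stated assumptions, not a definitive implementation: the helper name customer_lines and the shortened logs are hypothetical, timestamps are assumed to always look like [hh:mm:ss], and a message that itself contains "[" would be cut short at that character.

```python
import re
import pandas as pd

TIMESTAMP = r"\[\d{2}:\d{2}:\d{2}\]"

def customer_lines(log):
    # Read the customer's name from the opening line; return "" if it is missing.
    opening = re.match(TIMESTAMP + r" You are joining a chat with ([^\[]+)", log)
    if opening is None:
        return ""
    name = opening.group(1).strip()
    # Match every "[hh:mm:ss] <name>: ..." entry up to the next "[".
    # re.escape guards against names containing regex metacharacters.
    entry = TIMESTAMP + " " + re.escape(name) + r":[^\[]*"
    # "<name> has exited the session." has no colon after the name, so it is skipped.
    return opening.group(0) + "".join(re.findall(entry, log))

# Shortened, made-up versions of the two example logs.
df = pd.DataFrame({
    "log": [
        "[14:40:48] You are joining a chat with James[14:40:48] James: Hello"
        "[14:41:05] Greg: Hi[14:44:12] James: Thanks"
        "[14:44:38] James has exited the session.",
        "[09:01:25] You are joining a chat with Roy Andrews[09:01:25] Roy Andrews: Hi"
        "[09:02:00] Jamie: Hello[09:09:15] Roy Andrews has exited the session.",
    ]
})

df["text_analysis"] = df["log"].apply(customer_lines)
print(df["text_analysis"].tolist())
```

Building the pattern per row inside apply sidesteps the problem of passing a whole column of names to a single str.split or str.extract call.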