I have a large (~30,000 KB) dataset of chat conversations between experts and users, stored as a pandas DataFrame.
Each row is one message sent by either the expert or the user. I want to measure the time between the user's second message and the expert's second message.
(Note that sometimes the expert or the user types and sends a run of consecutive messages; these should be treated as one big message. Also note that some data is missing, e.g. message 0 in sessionId 111.)
For example: in sessionId 222, I want to measure the time between index 3 and index 4 (22 minutes in this case).
Here is the data as a list of rows:
import pandas as pd

df = pd.DataFrame(
    [[222.0, 0.0, 'user', '12/6/2017 20:12'],
     [222.0, 1.0, 'user', '12/6/2017 20:41'],
     [222.0, 2.0, 'expert', '12/6/2017 21:15'],
     [222.0, 3.0, 'user', '12/6/2017 21:45'],
     [222.0, 4.0, 'expert', '12/6/2017 22:07'],
     [222.0, 5.0, 'expert', '12/6/2017 23:36'],
     [443.0, 0.0, 'user', '12/6/2017 20:41'],
     [443.0, 1.0, 'expert', '12/6/2017 21:15'],
     [443.0, 2.0, 'user', '12/6/2017 21:45'],
     [111.0, 1.0, 'expert', '12/6/2017 21:45'],
     [111.0, 2.0, 'user', '12/6/2017 22:07'],
     [243.0, 0.0, 'user', '12/6/2017 20:12'],
     [243.0, 1.0, 'expert', '12/6/2017 20:41'],
     [243.0, 2.0, 'user', '12/6/2017 21:15'],
     [243.0, 3.0, 'expert', '12/6/2017 21:45'],
     [243.0, 4.0, 'user', '12/6/2017 22:07'],
     [243.0, 5.0, 'expert', '12/6/2017 23:36'],
     [243.0, 6.0, 'user', '12/7/2017 0:05'],
     [243.0, 7.0, 'user', '12/7/2017 0:58'],
     [243.0, 8.0, 'user', '12/7/2017 0:58']],
    columns=['sessionId', 'interaction', 'userType', 'timestamp'])
What I have tried:
a. Use pd.groupby on "sessionId" and "interaction".
b. Create a new column with userType shifted down one row.
c. Compare the original userType to the shifted userType to find the mismatches.
d. At every third mismatch, find the time between the mismatched message and the previous one (interaction).
Can you show me a working example of this approach, or a different one?
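In code, steps a-d above might look roughly like this (a sketch on a subset of the sample data; the prevType and responseTime column names are mine, not from the original):

```python
import pandas as pd

# Subset of the sample data for session 222.
df = pd.DataFrame(
    [[222.0, 0.0, 'user', '12/6/2017 20:12'],
     [222.0, 1.0, 'user', '12/6/2017 20:41'],
     [222.0, 2.0, 'expert', '12/6/2017 21:15'],
     [222.0, 3.0, 'user', '12/6/2017 21:45'],
     [222.0, 4.0, 'expert', '12/6/2017 22:07'],
     [222.0, 5.0, 'expert', '12/6/2017 23:36']],
    columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)

# b. userType shifted down one row within each session
df['prevType'] = df.groupby('sessionId').userType.shift(1)
# c. a mismatch marks the first message after the sender changed
mismatch = df.prevType.notna() & (df.userType != df.prevType)
# d. time between each mismatched message and the previous message
df['responseTime'] = df.timestamp - df.groupby('sessionId').timestamp.shift(1)
print(df.loc[mismatch, ['sessionId', 'interaction', 'responseTime']])
```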
Answer (score 0):
Convert the timestamps (if you haven't already):
df['timestamp'] = pd.to_datetime(df.timestamp)
# Order by session and time before taking differences.
df.sort_values(['sessionId', 'timestamp'], inplace=True)
df['delta_time'] = df.groupby('sessionId').timestamp.diff()
Result:
sessionId interaction userType timestamp delta_time
9 111.0 1.0 expert 2017-12-06 21:45:00
10 111.0 2.0 user 2017-12-06 22:07:00 0 days 00:22:00.000000000
0 222.0 0.0 user 2017-12-06 20:12:00
1 222.0 1.0 user 2017-12-06 20:41:00 0 days 00:29:00.000000000
2 222.0 2.0 expert 2017-12-06 21:15:00 0 days 00:34:00.000000000
3 222.0 3.0 user 2017-12-06 21:45:00 0 days 00:30:00.000000000
4 222.0 4.0 expert 2017-12-06 22:07:00 0 days 00:22:00.000000000
5 222.0 5.0 expert 2017-12-06 23:36:00 0 days 01:29:00.000000000
11 243.0 0.0 user 2017-12-06 20:12:00
12 243.0 1.0 expert 2017-12-06 20:41:00 0 days 00:29:00.000000000
13 243.0 2.0 user 2017-12-06 21:15:00 0 days 00:34:00.000000000
14 243.0 3.0 expert 2017-12-06 21:45:00 0 days 00:30:00.000000000
15 243.0 4.0 user 2017-12-06 22:07:00 0 days 00:22:00.000000000
16 243.0 5.0 expert 2017-12-06 23:36:00 0 days 01:29:00.000000000
17 243.0 6.0 user 2017-12-07 00:05:00 0 days 00:29:00.000000000
18 243.0 7.0 user 2017-12-07 00:58:00 0 days 00:53:00.000000000
19 243.0 8.0 user 2017-12-07 00:58:00 0 days 00:00:00.000000000
6 443.0 0.0 user 2017-12-06 20:41:00
7 443.0 1.0 expert 2017-12-06 21:15:00 0 days 00:34:00.000000000
8 443.0 2.0 user 2017-12-06 21:45:00 0 days 00:30:00.000000000
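Note that this does not yet merge consecutive messages from the same sender into one big message, as the question asks. One possible way to do that, sketched on session 222 only (the block, start, end and replyTime names are invented helpers, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame(
    [[222.0, 0.0, 'user', '12/6/2017 20:12'],
     [222.0, 1.0, 'user', '12/6/2017 20:41'],
     [222.0, 2.0, 'expert', '12/6/2017 21:15'],
     [222.0, 3.0, 'user', '12/6/2017 21:45'],
     [222.0, 4.0, 'expert', '12/6/2017 22:07'],
     [222.0, 5.0, 'expert', '12/6/2017 23:36']],
    columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)
df.sort_values(['sessionId', 'timestamp'], inplace=True)

# Number each run of consecutive messages by the same sender within a session.
changed = df.userType != df.groupby('sessionId').userType.shift(1)
df['block'] = changed.groupby(df.sessionId).cumsum()

# Collapse each run to one row, keeping its first and last timestamp.
blocks = (df.groupby(['sessionId', 'block'])
            .agg(userType=('userType', 'first'),
                 start=('timestamp', 'first'),
                 end=('timestamp', 'last'))
            .reset_index())
# Reply time: first message of this run minus last message of the previous run.
blocks['replyTime'] = blocks.start - blocks.groupby('sessionId').end.shift(1)
print(blocks)
```

With this grouping, the expert's reply at index 4 is measured from the user's message at index 3, giving the 22 minutes asked for in the question.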
Another approach is to create a new column holding the next message's timestamp, like this:
df['nextMessage'] = df.groupby('sessionId').timestamp.shift(-1)
df['deltaTime'] = df.nextMessage- df.timestamp
Result:
   sessionId interaction userType           timestamp         nextMessage                deltaTime
9 111.0 1.0 expert 2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
10 111.0 2.0 user 2017-12-06 22:07:00
0 222.0 0.0 user 2017-12-06 20:12:00 2017-12-06 20:41:00 0 days 00:29:00.000000000
1 222.0 1.0 user 2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
2 222.0 2.0 expert 2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
3 222.0 3.0 user 2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
4 222.0 4.0 expert 2017-12-06 22:07:00 2017-12-06 23:36:00 0 days 01:29:00.000000000
5 222.0 5.0 expert 2017-12-06 23:36:00
11 243.0 0.0 user 2017-12-06 20:12:00 2017-12-06 20:41:00 0 days 00:29:00.000000000
12 243.0 1.0 expert 2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
13 243.0 2.0 user 2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
14 243.0 3.0 expert 2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
15 243.0 4.0 user 2017-12-06 22:07:00 2017-12-06 23:36:00 0 days 01:29:00.000000000
16 243.0 5.0 expert 2017-12-06 23:36:00 2017-12-07 00:05:00 0 days 00:29:00.000000000
17 243.0 6.0 user 2017-12-07 00:05:00 2017-12-07 00:58:00 0 days 00:53:00.000000000
18 243.0 7.0 user 2017-12-07 00:58:00 2017-12-07 00:58:00 0 days 00:00:00.000000000
19 243.0 8.0 user 2017-12-07 00:58:00
6 443.0 0.0 user 2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
7 443.0 1.0 expert 2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
8 443.0 2.0 user 2017-12-06 21:45:00
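If only the expert's reply times are of interest (e.g. the 22 minutes between index 3 and index 4 in session 222), either result can then be filtered to rows where an expert message directly follows a user message. A sketch on a subset of the data, using the delta_time column from the first variant:

```python
import pandas as pd

df = pd.DataFrame(
    [[222.0, 3.0, 'user', '12/6/2017 21:45'],
     [222.0, 4.0, 'expert', '12/6/2017 22:07'],
     [222.0, 5.0, 'expert', '12/6/2017 23:36'],
     [443.0, 0.0, 'user', '12/6/2017 20:41'],
     [443.0, 1.0, 'expert', '12/6/2017 21:15']],
    columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)
df.sort_values(['sessionId', 'timestamp'], inplace=True)
df['delta_time'] = df.groupby('sessionId').timestamp.diff()

# Keep rows where an expert message directly follows a user message.
prev_type = df.groupby('sessionId').userType.shift(1)
replies = df[(df.userType == 'expert') & (prev_type == 'user')]
print(replies[['sessionId', 'interaction', 'delta_time']])
```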