Comparing timestamps of subsequent records with pandas

Time: 2018-01-04 16:45:37

Tags: python database pandas dataframe pandas-groupby

I have a large dataset (30,000 KB, stored as a pandas DataFrame) of chat conversations between experts and users.

Each row represents a message sent by either the expert or the user. I want to measure the time between the user's second message and the expert's second message.

(Note that sometimes the expert and the user type and send a run of consecutive messages, which should be treated as one large message, and note that some data is missing, e.g. message 0 in sessionId 111.)

For example: in sessionId 222, I want to measure the time between index 3 and index 4 (22 minutes in this case).


Here is the data, presented as a list:

import datetime
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = [[222.0, 0.0, 'user', '12/6/2017 20:12'],
        [222.0, 1.0, 'user', '12/6/2017 20:41'],
        [222.0, 2.0, 'expert', '12/6/2017 21:15'],
        [222.0, 3.0, 'user', '12/6/2017 21:45'],
        [222.0, 4.0, 'expert', '12/6/2017 22:07'],
        [222.0, 5.0, 'expert', '12/6/2017 23:36'],
        [443.0, 0.0, 'user', '12/6/2017 20:41'],
        [443.0, 1.0, 'expert', '12/6/2017 21:15'],
        [443.0, 2.0, 'user', '12/6/2017 21:45'],
        [111.0, 1.0, 'expert', '12/6/2017 21:45'],
        [111.0, 2.0, 'user', '12/6/2017 22:07'],
        [243.0, 0.0, 'user', '12/6/2017 20:12'],
        [243.0, 1.0, 'expert', '12/6/2017 20:41'],
        [243.0, 2.0, 'user', '12/6/2017 21:15'],
        [243.0, 3.0, 'expert', '12/6/2017 21:45'],
        [243.0, 4.0, 'user', '12/6/2017 22:07'],
        [243.0, 5.0, 'expert', '12/6/2017 23:36'],
        [243.0, 6.0, 'user', '12/7/2017 0:05'],
        [243.0, 7.0, 'user', '12/7/2017 0:58'],
        [243.0, 8.0, 'user', '12/7/2017 0:58']]

df = pd.DataFrame(data, columns=['sessionId', 'interaction', 'userType', 'timestamp'])

What I have tried:

a. Use pd.groupby on "sessionId" and "interaction"

b. Create a new column with userType shifted down by 1 row

c. Compare the original userType with the shifted userType to find the mismatches

d. At every third mismatch, find the time (interaction) between the mismatched message and the previous message.
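Steps b and c above could be sketched roughly like this, on a slice of the example data (a minimal sketch; the column names are assumed from the question):

```python
import pandas as pd

data = [[222.0, 0.0, 'user', '12/6/2017 20:12'],
        [222.0, 1.0, 'user', '12/6/2017 20:41'],
        [222.0, 2.0, 'expert', '12/6/2017 21:15'],
        [222.0, 3.0, 'user', '12/6/2017 21:45']]
df = pd.DataFrame(data, columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)

# b. Shift userType down one row within each session.
df['prevType'] = df.groupby('sessionId').userType.shift(1)

# c. A mismatch marks the first message of a new "turn";
#    consecutive messages from the same sender compare equal.
df['turnStart'] = df.userType != df.prevType

print(df[['userType', 'prevType', 'turnStart']])
```

Because consecutive messages from the same sender produce no mismatch, this comparison also collapses a run of messages into a single turn, as the question requires.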

Could you show me a working example of this approach, or suggest another method?

1 Answer:

Answer 0 (score: 0)

Convert the timestamps (if you haven't already):

df['timestamp'] = pd.to_datetime(df.timestamp)

# Order by session and time before taking differences.
df.sort_values(['sessionId', 'timestamp'], inplace=True)

df['delta_time'] = df.groupby('sessionId').timestamp.diff()

Result:

    sessionId   interaction userType    timestamp   delta_time
9   111.0   1.0 expert  2017-12-06 21:45:00 
10  111.0   2.0 user    2017-12-06 22:07:00 0 days 00:22:00.000000000
0   222.0   0.0 user    2017-12-06 20:12:00 
1   222.0   1.0 user    2017-12-06 20:41:00 0 days 00:29:00.000000000
2   222.0   2.0 expert  2017-12-06 21:15:00 0 days 00:34:00.000000000
3   222.0   3.0 user    2017-12-06 21:45:00 0 days 00:30:00.000000000
4   222.0   4.0 expert  2017-12-06 22:07:00 0 days 00:22:00.000000000
5   222.0   5.0 expert  2017-12-06 23:36:00 0 days 01:29:00.000000000
11  243.0   0.0 user    2017-12-06 20:12:00 
12  243.0   1.0 expert  2017-12-06 20:41:00 0 days 00:29:00.000000000
13  243.0   2.0 user    2017-12-06 21:15:00 0 days 00:34:00.000000000
14  243.0   3.0 expert  2017-12-06 21:45:00 0 days 00:30:00.000000000
15  243.0   4.0 user    2017-12-06 22:07:00 0 days 00:22:00.000000000
16  243.0   5.0 expert  2017-12-06 23:36:00 0 days 01:29:00.000000000
17  243.0   6.0 user    2017-12-07 00:05:00 0 days 00:29:00.000000000
18  243.0   7.0 user    2017-12-07 00:58:00 0 days 00:53:00.000000000
19  243.0   8.0 user    2017-12-07 00:58:00 0 days 00:00:00.000000000
6   443.0   0.0 user    2017-12-06 20:41:00 
7   443.0   1.0 expert  2017-12-06 21:15:00 0 days 00:34:00.000000000
8   443.0   2.0 user    2017-12-06 21:45:00 0 days 00:30:00.000000000
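If you specifically want the response times (the question's 22-minute example, index 3 to index 4 in sessionId 222), you can keep only the rows where the sender changed from the previous message, so the diff becomes the time a turn waited for a reply. A possible sketch, assuming the column names above:

```python
import pandas as pd

data = [[222.0, 0.0, 'user', '12/6/2017 20:12'],
        [222.0, 1.0, 'user', '12/6/2017 20:41'],
        [222.0, 2.0, 'expert', '12/6/2017 21:15'],
        [222.0, 3.0, 'user', '12/6/2017 21:45'],
        [222.0, 4.0, 'expert', '12/6/2017 22:07'],
        [222.0, 5.0, 'expert', '12/6/2017 23:36']]
df = pd.DataFrame(data, columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)
df.sort_values(['sessionId', 'timestamp'], inplace=True)
df['delta_time'] = df.groupby('sessionId').timestamp.diff()

# Keep only rows where the sender changed within the session:
# delta_time is then how long that sender took to respond.
changed = df.userType != df.groupby('sessionId').userType.shift(1)
responses = df[changed & df.delta_time.notna()]
expert_responses = responses[responses.userType == 'expert']
print(expert_responses[['sessionId', 'interaction', 'delta_time']])
```

For sessionId 222 this yields the expert replies at interactions 2 and 4, the latter being the 22-minute gap from the question.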

Another approach is to create a new column holding the timestamp of the next message, like this:

df['nextMessage'] = df.groupby('sessionId').timestamp.shift(-1)
df['deltaTime'] = df.nextMessage - df.timestamp

Result:

sessionId   interaction userType    timestamp   nextMessage deltaTime
9   111.0   1.0 expert  2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
10  111.0   2.0 user    2017-12-06 22:07:00     
0   222.0   0.0 user    2017-12-06 20:12:00 2017-12-06 20:41:00 0 days 00:29:00.000000000
1   222.0   1.0 user    2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
2   222.0   2.0 expert  2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
3   222.0   3.0 user    2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
4   222.0   4.0 expert  2017-12-06 22:07:00 2017-12-06 23:36:00 0 days 01:29:00.000000000
5   222.0   5.0 expert  2017-12-06 23:36:00     
11  243.0   0.0 user    2017-12-06 20:12:00 2017-12-06 20:41:00 0 days 00:29:00.000000000
12  243.0   1.0 expert  2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
13  243.0   2.0 user    2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
14  243.0   3.0 expert  2017-12-06 21:45:00 2017-12-06 22:07:00 0 days 00:22:00.000000000
15  243.0   4.0 user    2017-12-06 22:07:00 2017-12-06 23:36:00 0 days 01:29:00.000000000
16  243.0   5.0 expert  2017-12-06 23:36:00 2017-12-07 00:05:00 0 days 00:29:00.000000000
17  243.0   6.0 user    2017-12-07 00:05:00 2017-12-07 00:58:00 0 days 00:53:00.000000000
18  243.0   7.0 user    2017-12-07 00:58:00 2017-12-07 00:58:00 0 days 00:00:00.000000000
19  243.0   8.0 user    2017-12-07 00:58:00     
6   443.0   0.0 user    2017-12-06 20:41:00 2017-12-06 21:15:00 0 days 00:34:00.000000000
7   443.0   1.0 expert  2017-12-06 21:15:00 2017-12-06 21:45:00 0 days 00:30:00.000000000
8   443.0   2.0 user    2017-12-06 21:45:00
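The nextMessage variant also lends itself to the question's goal: shifting userType by -1 as well picks out each user message that is answered by an expert, and deltaTime is then the wait for that reply. A sketch under the same assumed column names:

```python
import pandas as pd

data = [[222.0, 0.0, 'user', '12/6/2017 20:12'],
        [222.0, 1.0, 'user', '12/6/2017 20:41'],
        [222.0, 2.0, 'expert', '12/6/2017 21:15'],
        [222.0, 3.0, 'user', '12/6/2017 21:45'],
        [222.0, 4.0, 'expert', '12/6/2017 22:07'],
        [222.0, 5.0, 'expert', '12/6/2017 23:36']]
df = pd.DataFrame(data, columns=['sessionId', 'interaction', 'userType', 'timestamp'])
df['timestamp'] = pd.to_datetime(df.timestamp)
df.sort_values(['sessionId', 'timestamp'], inplace=True)

df['nextMessage'] = df.groupby('sessionId').timestamp.shift(-1)
df['nextType'] = df.groupby('sessionId').userType.shift(-1)

# A user message directly followed by an expert message:
# deltaTime is the wait until the expert's reply.
waits = df[(df.userType == 'user') & (df.nextType == 'expert')].copy()
waits['deltaTime'] = waits.nextMessage - waits.timestamp
print(waits[['sessionId', 'interaction', 'deltaTime']])
```

Since the filter selects the last user message before an expert reply, a run of consecutive user messages is collapsed automatically, matching the "treat a run as one big message" requirement.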