Calculating a time difference in pandas based on column values

Time: 2020-02-05 22:10:39

Tags: python sql pandas

I have a DataFrame with three columns: ID, Name (either 'Start' or 'End'), and Time.

The desired result is one row per ID, holding the time difference between that ID's Start and End rows.

I tried a simple groupby and subtraction, but it does not work:

df[df['Name'] == 'Start'].groupby('ID')['Time']-\
df[df['Name'] == 'End'].groupby('ID')['Time']

How can I do this in pandas? Thanks!

2 Answers:

Answer 0: (score: 1)

One possible solution is to join the table with itself, like this:

df_start = df[df['Name'] == 'Start']
df_end = df[df['Name'] == 'End']
# The key column is named 'id' in this answer's data; the question's frame would need on='ID'
df_merge = df_start.merge(df_end, on='id', suffixes=('_start', '_end'))
df_merge['diff'] = df_merge['Time_end'] - df_merge['Time_start']
print(df_merge.to_string())

Output:

   id Name_start          Time_start Name_end            Time_end            diff
0   1      Start 2017-11-02 12:00:14      End 2017-11-07 22:45:13 5 days 10:44:59
1   2      Start 2018-01-28 06:53:09      End 2018-02-05 13:31:14 8 days 06:38:05
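For anyone who wants to run the self-join end to end, here is a minimal sketch. The sample frame is an assumption rebuilt from the timestamps in the output above, and the validate="one_to_one" argument is an addition that is not part of the answer: it makes merge raise an error if any id has more than one Start or End row, instead of silently producing extra row combinations.

```python
import pandas as pd

# Assumed sample data, rebuilt from the timestamps shown in the output above.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "Name": ["Start", "End", "Start", "End"],
    "Time": pd.to_datetime([
        "2017-11-02 12:00:14", "2017-11-07 22:45:13",
        "2018-01-28 06:53:09", "2018-02-05 13:31:14",
    ]),
})

df_start = df[df["Name"] == "Start"]
df_end = df[df["Name"] == "End"]

# validate="one_to_one" (an addition, not in the answer) raises MergeError
# if an id appears more than once on either side of the join.
df_merge = df_start.merge(
    df_end, on="id", suffixes=("_start", "_end"), validate="one_to_one"
)
df_merge["diff"] = df_merge["Time_end"] - df_merge["Time_start"]
print(df_merge[["id", "diff"]])
```

If the real data can contain several Start/End pairs per id, the one-to-one check will fail, and the pairs would need an extra key before merging.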

Answer 1: (score: 1)

Here you go.

Generate the data:

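import numpy as np
import pandas as pd

# pd.datetime was deprecated in pandas 1.0 and later removed; pd.Timestamp is the supported equivalent
df = pd.DataFrame({'ID': [1, 1, 2, 2],
    'Name': ['Start', 'End', 'Start', 'End'],
    'Time': [pd.Timestamp(2020, 1, 1, 0, 1, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0),
             pd.Timestamp(2020, 1, 1, 0, 0, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0)]})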

Get the TimeDelta:

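# Start rows, renumbered 0..n-1
df_agg = df[df['Name'] == 'Start'].reset_index()[['ID', 'Time']]
df_agg = df_agg.rename(columns={"Time": "Start"})
# End rows are matched by row position, so they must appear in the same ID order as the Start rows
df_agg['End'] = df[df['Name'] == 'End'].reset_index()['Time']
df_agg['TimeDelta'] = df_agg['End'] - df_agg['Start']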
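The assignment above pairs Start and End rows by their position after reset_index, which only works when both halves list the IDs in the same order. As a hedged alternative (not the answer's own method), a sketch that indexes both halves by ID so the subtraction aligns on the ID label instead. The shuffled sample frame is an assumption made up for illustration.

```python
import pandas as pd

# Assumed data: End rows are deliberately in a different ID order
# than the Start rows, to show that label alignment still pairs them.
df = pd.DataFrame({
    "ID": [1, 2, 2, 1],
    "Name": ["Start", "Start", "End", "End"],
    "Time": pd.to_datetime([
        "2020-01-01 00:01:00", "2020-01-01 00:00:00",
        "2020-01-03 00:00:00", "2020-01-02 00:00:00",
    ]),
})

# Index each half by ID; Series subtraction then matches rows by ID label.
start = df[df["Name"] == "Start"].set_index("ID")["Time"]
end = df[df["Name"] == "End"].set_index("ID")["Time"]

delta = (end - start).rename("TimeDelta")
df_agg = delta.rename_axis("ID").reset_index()
print(df_agg)
```

This variant assumes each ID has exactly one Start and one End row. An ID that is missing one of the two would come out as NaT rather than raising an error.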

Get the time difference in days as a decimal value, as in your example:

df_agg['TimeDiff_days'] = df_agg['TimeDelta'] / np.timedelta64(1,'D')
df_agg

Result:

   ID               Start        End       TimeDelta  TimeDiff_days
0   1 2020-01-01 00:01:00 2020-01-02 0 days 23:59:00       0.999306
1   2 2020-01-01 00:00:00 2020-01-02 1 days 00:00:00       1.000000
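The decimal value in the first row can be checked by hand: 23:59:00 is 86340 seconds, and 86340 / 86400 is roughly 0.999306. The sketch below compares the answer's division by np.timedelta64(1, 'D') with two equivalent spellings. The standalone delta series is an assumption that stands in for the TimeDelta column.

```python
import numpy as np
import pandas as pd

# Stand-in for the TimeDelta column shown in the result above.
delta = pd.Series(pd.to_timedelta(["0 days 23:59:00", "1 days 00:00:00"]))

days_np = delta / np.timedelta64(1, "D")     # the division used in the answer
days_pd = delta / pd.Timedelta(days=1)       # pandas-only spelling
days_sec = delta.dt.total_seconds() / 86400  # 86400 seconds in one day

print(pd.DataFrame({"np": days_np, "pd": days_pd, "sec": days_sec}))
```

All three produce the same float column, so the choice is mostly a matter of which reads best in the surrounding code.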