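I have a DataFrame like this:
The desired result is one row per ID with the time difference between its Start and End, like this:
I tried a simple groupby and subtraction, but it does not work: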
df[df['Name'] == 'Start'].groupby('ID')['Time']-\
df[df['Name'] == 'End'].groupby('ID')['Time']
How can I do this in pandas? Thanks!
Answer 0 (score: 1)
One possible solution is to join the table with itself, like this:
df_start = df[df['Name'] == 'Start']
df_end = df[df['Name'] == 'End']
df_merge = df_start.merge(df_end, on='id', suffixes=('_start', '_end'))
df_merge['diff'] = df_merge['Time_end'] - df_merge['Time_start']
print(df_merge.to_string())
Output:
   id Name_start          Time_start Name_end            Time_end            diff
0   1      Start 2017-11-02 12:00:14      End 2017-11-07 22:45:13 5 days 10:44:59
1   2      Start 2018-01-28 06:53:09      End 2018-02-05 13:31:14 8 days 06:38:05
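For anyone who wants to run the self-join end to end, here is a minimal sketch. The sample frame is an assumption rebuilt from the timestamps in the output above (the question's original data is not shown here), and `validate="one_to_one"` is an optional extra that was not part of the answer; it makes the merge raise if an id has more than one Start or End row.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output printed in this answer
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "Name": ["Start", "End", "Start", "End"],
    "Time": pd.to_datetime([
        "2017-11-02 12:00:14", "2017-11-07 22:45:13",
        "2018-01-28 06:53:09", "2018-02-05 13:31:14",
    ]),
})

# Split into Start rows and End rows, then join them back together on id
df_start = df[df["Name"] == "Start"]
df_end = df[df["Name"] == "End"]
df_merge = df_start.merge(
    df_end,
    on="id",
    suffixes=("_start", "_end"),
    validate="one_to_one",  # optional: fail loudly on duplicate Start/End rows
)

# Subtracting two datetime columns yields a timedelta column
df_merge["diff"] = df_merge["Time_end"] - df_merge["Time_start"]
print(df_merge[["id", "diff"]])
```

Because the merge matches rows by id rather than by position, the result does not depend on the order of rows in the original frame.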
Answer 1 (score: 1)
Here you go.
Generate the data:
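import numpy as np
import pandas as pd

# pd.datetime was removed in pandas 2.0; pd.Timestamp is the supported equivalent
df = pd.DataFrame({'ID': [1, 1, 2, 2],
                   'Name': ['Start', 'End', 'Start', 'End'],
                   'Time': [pd.Timestamp(2020, 1, 1, 0, 1, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0),
                            pd.Timestamp(2020, 1, 1, 0, 0, 0), pd.Timestamp(2020, 1, 2, 0, 0, 0)]})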
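Compute the TimeDelta:
# Start rows, renumbered 0..n-1 so they can be matched with the End rows by position
df_agg = df[df['Name'] == 'Start'].reset_index()[['ID', 'Time']]
df_agg = df_agg.rename(columns={"Time": "Start"})
# Assumes every ID has exactly one Start and one End, and that the IDs appear in the same order
df_agg['End'] = df[df['Name'] == 'End'].reset_index()['Time']
df_agg['TimeDelta'] = df_agg['End'] - df_agg['Start']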
To get the time difference in days as a decimal value, as in your example:
df_agg['TimeDiff_days'] = df_agg['TimeDelta'] / np.timedelta64(1,'D')
df_agg
Result:
   ID               Start        End       TimeDelta  TimeDiff_days
0   1 2020-01-01 00:01:00 2020-01-02 0 days 23:59:00       0.999306
1   2 2020-01-01 00:00:00 2020-01-02 1 days 00:00:00       1.000000
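The approach above pairs Start and End rows by position, so it relies on the rows being in a consistent order. As an alternative sketch (not part of the original answer), `DataFrame.pivot` can pair them by ID instead. The sample frame below reuses the same values but is deliberately shuffled to show that order no longer matters; the name `wide` is just illustrative.

```python
import pandas as pd

# Same values as the generated data above, with the rows shuffled on purpose
df = pd.DataFrame({
    "ID": [2, 1, 1, 2],
    "Name": ["End", "Start", "End", "Start"],
    "Time": pd.to_datetime([
        "2020-01-02 00:00:00", "2020-01-01 00:01:00",
        "2020-01-02 00:00:00", "2020-01-01 00:00:00",
    ]),
})

# One row per ID, one column per Name value ("Start" and "End")
wide = df.pivot(index="ID", columns="Name", values="Time")

# Timedelta per ID, plus the same value expressed in fractional days
wide["TimeDelta"] = wide["End"] - wide["Start"]
wide["TimeDiff_days"] = wide["TimeDelta"] / pd.Timedelta(days=1)

result = wide.reset_index()
print(result[["ID", "TimeDelta", "TimeDiff_days"]])
```

Note that `pivot` raises a `ValueError` if any ID has more than one Start or more than one End row, so duplicates would have to be resolved first (for example by aggregating before pivoting).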