I have a simple data frame:
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
4 15:05:00 15:15:00
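What I want to do is merge the last two rows into one, because they belong to the same ID (ID=4) and the time range of the last row is contained within that of the second-to-last row.
The output I want is: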
ID Stime Etime
1 13:00:00 13:15:00
1 14:00:00 14:15:00
2 15:00:00 15:42:00
3 13:00:00 13:25:00
4 15:00:00 15:15:00
Answer 0 (score: 1)
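def setup(df):
    # Gap between each row's start and the previous row's end (within one ID).
    td = df.Stime - df.Etime.shift()
    # A gap of more than one second starts a new group.
    td = td.apply(lambda x: x.total_seconds() > 1)
    # The first row of each ID always starts a new group.
    td.iloc[0] = True
    return td.cumsum()

def collapse(df):
    # Copy the first row so the assignments below do not modify the original frame.
    df_ = df.iloc[0, :].copy()
    df_.loc['Stime'] = df.Stime.min()
    df_.loc['Etime'] = df.Etime.max()
    return df_

# Assumes Stime and Etime are datetime columns and rows are sorted by ID, then Stime.
df['group id'] = df.groupby('ID').apply(setup).values
gbcols = ['ID', 'group id']
fcols = ['ID', 'Stime', 'Etime']
print(df.groupby(gbcols)[fcols].apply(collapse).reset_index(drop=True))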
   ID               Stime               Etime
0   1 2016-05-30 13:00:00 2016-05-30 13:15:00
1   1 2016-05-30 14:00:00 2016-05-30 14:15:00
2   2 2016-05-30 15:00:00 2016-05-30 15:42:00
3   3 2016-05-30 13:00:00 2016-05-30 13:25:00
4   4 2016-05-30 15:00:00 2016-05-30 15:15:00
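
For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same idea written without apply. It is not the code above but a variation, and it rests on a few assumptions of mine: the helper name merge_intervals is made up, the date 2016-05-30 is arbitrary (it is only there so the strings parse as datetimes), and a row is treated as a new interval only when it starts strictly after the latest end time seen so far for its ID. That last rule differs slightly from the one-second tolerance used above, so rows separated by a gap of one second or less would stay separate here.

```python
import pandas as pd

# Sample data from the question; the date is arbitrary and only makes the times parseable.
df = pd.DataFrame({
    "ID": [1, 1, 2, 3, 4, 4],
    "Stime": ["13:00:00", "14:00:00", "15:00:00", "13:00:00", "15:00:00", "15:05:00"],
    "Etime": ["13:15:00", "14:15:00", "15:42:00", "13:25:00", "15:15:00", "15:15:00"],
})
for col in ["Stime", "Etime"]:
    df[col] = pd.to_datetime("2016-05-30 " + df[col])


def merge_intervals(df):
    """Collapse overlapping [Stime, Etime] rows within each ID (hypothetical helper)."""
    df = df.sort_values(["ID", "Stime"]).reset_index(drop=True)
    # Latest end time among the earlier rows of the same ID (NaT for the first row).
    prev_end = df.groupby("ID")["Etime"].transform(lambda s: s.cummax().shift())
    # A row opens a new interval if it is the first of its ID
    # or starts after every earlier row of that ID has ended.
    new_block = prev_end.isna() | (df["Stime"] > prev_end)
    block = new_block.cumsum()
    return (
        df.groupby(block)
          .agg(ID=("ID", "first"), Stime=("Stime", "min"), Etime=("Etime", "max"))
          .reset_index(drop=True)
    )


result = merge_intervals(df)
print(result)
```

Using the running maximum of Etime rather than only the previous row's Etime should also handle a short interval nested inside a long one that is followed by a third overlapping row, a case the sample data does not exercise.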