Filter pandas DataFrame rows within a time range

Time: 2019-12-02 08:07:46

Tags: python pandas dataframe

I have a DataFrame that looks like this:

Date                ID          Delta
2019-10-16 16:43:46 BA9565P     0 days 00:00:00
2019-10-17 05:28:36 BA9565P     0 days 12:44:50
2019-10-16 16:43:13 BA9565X     0 days 00:00:00
2019-10-17 03:26:52 BA9565X     0 days 10:43:39
2019-10-10 19:17:17 BABRGNR     0 days 00:00:00
2019-10-12 19:43:56 BABRGNR     2 days 00:26:39
2019-10-31 00:48:52 BABRGR8     0 days 00:00:00
2019-11-01 14:33:41 BABRGR8     1 days 13:44:49

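If the records for the same ID are within 3 days of each other, I only need the most recent one. However, if the records for the same ID are more than 3 days apart, I want to keep both. This is what I have so far: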

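import datetime
import pandas as pd

df2 = df[df.duplicated(['ID'], keep=False)][['Date', 'ID']].copy()
df2["Date"] = pd.to_datetime(df2["Date"])
# time since the previous record of the same ID (NaT for the first one)
df2["Delta"] = df2.groupby(['ID'])['Date'].diff()
df2["Delta"] = df2["Delta"].fillna(datetime.timedelta(seconds=0))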

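But I am not sure how to continue from here. I tried:

df2["Delta2"] = df2["Delta"] < datetime.timedelta(days=3)

but this condition is True for the first element of every group, regardless of whether the group's records are within 3 days of each other.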

df2.groupby(['ID']).filter(lambda x: ((x["Delta"]<datetime.timedelta(days=3)) & \
                                             (x["Delta"] != datetime.timedelta(seconds=0))).any())

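This has a similar problem, again because .diff() always returns NaT for the first element of a group. Is there a way to access the last element in a group? Or is there a better approach than groupby().diff()?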
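A minimal reproduction of the first-row issue, assuming a made-up group of two records that are 5 days apart (the ID `X` and the dates are invented for this illustration):

```python
import pandas as pd

# Hypothetical group: two records for the same ID, 5 days apart.
demo = pd.DataFrame({
    "Date": pd.to_datetime(["2019-10-01", "2019-10-06"]),
    "ID": ["X", "X"],
})
demo["Delta"] = demo.groupby("ID")["Date"].diff().fillna(pd.Timedelta(0))
# The first row's NaT was replaced by 0, so it passes the "< 3 days" test
# even though the two records are 5 days apart.
demo["Delta2"] = demo["Delta"] < pd.Timedelta(days=3)
print(demo["Delta2"].tolist())  # [True, False]
```

The first row is flagged only because its missing difference was filled with zero, not because of any real gap.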

1 Answer:

Answer 0: (score: 2)

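This solution keeps all rows of a group if at least one difference within that group is 3 days or more; for all other groups it keeps only the last row: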

print (df)
                 Date       ID            Delta
0 2019-10-16 16:43:46  BA9565P  0 days 00:00:00
1 2019-10-17 05:28:36  BA9565P  0 days 12:44:50
2 2019-10-16 16:43:13  BA9565X  0 days 00:00:00
3 2019-10-20 03:26:52  BA9565X  0 days 10:43:39 <- changed sample date to 2019-10-20
4 2019-10-10 19:17:17  BABRGNR  0 days 00:00:00
5 2019-10-12 19:43:56  BABRGNR  2 days 00:26:39
6 2019-10-31 00:48:52  BABRGR8  0 days 00:00:00
7 2019-11-01 14:33:41  BABRGR8  1 days 13:44:49

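# sort first if the dates are not already ordered within each ID
# df = df.sort_values(['ID', 'Date'])
# keep only IDs that occur more than once; copy to avoid SettingWithCopyWarning
df2 = df[df.duplicated(['ID'], keep=False)].copy()
# difference to the previous row of the same ID (0 for the first row)
df2["Delta"] = df2.groupby(['ID'])['Date'].diff().fillna(pd.Timedelta(0))
# True where the gap is less than 3 days
mask = df2["Delta"] < pd.Timedelta(days=3)
# True for every row of a group in which all gaps are less than 3 days
mask1 = mask.groupby(df2['ID']).transform('all')
# True for the last row of each ID
mask2 = ~df2["ID"].duplicated(keep='last')

# keep whole groups with a gap of 3 days or more, otherwise only the last row
df2 = df2[~mask1 | mask2]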
print (df2)
                 Date       ID           Delta
1 2019-10-17 05:28:36  BA9565P 0 days 12:44:50
2 2019-10-16 16:43:13  BA9565X 0 days 00:00:00
3 2019-10-20 03:26:52  BA9565X 3 days 10:43:39
5 2019-10-12 19:43:56  BABRGNR 2 days 00:26:39
7 2019-11-01 14:33:41  BABRGR8 1 days 13:44:49
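
For anyone who wants to run this end to end, here is a self-contained sketch of the same idea. The helper name `keep_latest_or_far_apart` and its `max_gap` parameter are made up for illustration and are not part of pandas. The sample data is the modified frame from above, without the stale `Delta` column, which is recomputed. Unlike the code above, this sketch does not drop IDs that occur only once; their single row is simply kept as the "last" row.

```python
import pandas as pd

def keep_latest_or_far_apart(df, max_gap=pd.Timedelta(days=3)):
    """Hypothetical helper: for IDs whose consecutive records are all less
    than max_gap apart, keep only the latest row; otherwise keep every row."""
    out = df.sort_values(["ID", "Date"]).copy()
    # Gap to the previous record of the same ID (NaT -> 0 for the first row).
    out["Delta"] = out.groupby("ID")["Date"].diff().fillna(pd.Timedelta(0))
    # True for every row of a group that contains at least one gap >= max_gap.
    far_apart = (out["Delta"] >= max_gap).groupby(out["ID"]).transform("any")
    # True for the last (most recent) row of each ID.
    is_last = ~out["ID"].duplicated(keep="last")
    return out[far_apart | is_last]

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-10-16 16:43:46", "2019-10-17 05:28:36",
        "2019-10-16 16:43:13", "2019-10-20 03:26:52",
        "2019-10-10 19:17:17", "2019-10-12 19:43:56",
        "2019-10-31 00:48:52", "2019-11-01 14:33:41",
    ]),
    "ID": ["BA9565P", "BA9565P", "BA9565X", "BA9565X",
           "BABRGNR", "BABRGNR", "BABRGR8", "BABRGR8"],
})

result = keep_latest_or_far_apart(df)
print(result)
```

Testing for "any gap of at least `max_gap`" is the same condition as `~mask1` above, just written without the negation.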