Suppose I have a dataset like this:
id_police id_sinistre datesurv
0 p123 s120 01/01/2018
1 p123 s121 03/01/2018
2 p123 s122 05/05/2018
3 p222 s123 04/05/2018
4 p222 s124 02/12/2018
5 p433 s125 07/08/2018
6 p433 s126 08/09/2018
7 p433 s127 10/10/2018
My goal is to count, for each row, the earlier occurrences of the same id_police within the preceding 6 months:
id_police id_sinistre datesurv occ
0 p123 s120 01/01/2018 0
1 p123 s121 03/01/2018 1
2 p123 s122 05/05/2018 2
3 p222 s123 04/05/2018 0
4 p222 s124 02/12/2018 0
5 p433 s125 07/08/2018 0
6 p433 s126 08/09/2018 1
7 p433 s127 10/10/2018 2
I think I will need .duplicated or .groupby, but I'm not sure how to use them. Thanks in advance for your help!
Answer 0 (score: 3)
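If 6 months can be approximated as 6 * 30 days, use a custom lambda function with diff, compare each gap against that value, and finish with a cumulative sum: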
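import pandas as pd
df['datesurv'] = pd.to_datetime(df['datesurv'], dayfirst=True)
df = df.sort_values(['id_police','datesurv'])
# True where the gap to the previous row of the same policy is under 180 days
f = lambda x: (x.diff().dt.days < 30 * 6).cumsum()
# group_keys=False keeps df's index on the result (newer pandas versions otherwise prepend the group key)
df['occ'] = df.groupby('id_police', group_keys=False)['datesurv'].apply(f)
print(df)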
id_police id_sinistre datesurv occ
0 p123 s120 2018-01-01 0
1 p123 s121 2018-01-03 1
2 p123 s122 2018-05-05 2
3 p222 s123 2018-05-04 0
4 p222 s124 2018-12-02 0
5 p433 s125 2018-08-07 0
6 p433 s126 2018-09-08 1
7 p433 s127 2018-10-10 2
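To make the diff / compare / cumsum chain easier to follow, here is a self-contained sketch that rebuilds the sample data from the question and names each intermediate step. The names WINDOW, running_close_count and gappy are made up for illustration, and 6 months is assumed to mean 180 days. The last part uses a hypothetical policy (not in the question's data) to show a limitation: the running total never resets, so after a gap longer than 6 months it keeps counting from where it left off.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id_police": ["p123", "p123", "p123", "p222", "p222", "p433", "p433", "p433"],
    "id_sinistre": ["s120", "s121", "s122", "s123", "s124", "s125", "s126", "s127"],
    "datesurv": ["01/01/2018", "03/01/2018", "05/05/2018", "04/05/2018",
                 "02/12/2018", "07/08/2018", "08/09/2018", "10/10/2018"],
})
df["datesurv"] = pd.to_datetime(df["datesurv"], format="%d/%m/%Y")
df = df.sort_values(["id_police", "datesurv"])

# Assumption: 6 months approximated as 6 * 30 days.
WINDOW = pd.Timedelta(days=30 * 6)

def running_close_count(dates: pd.Series) -> pd.Series:
    """Running total of rows that follow the previous row by less than WINDOW."""
    gap = dates.diff()        # time since the previous claim (NaT for the first row)
    is_close = gap < WINDOW   # NaT compares as False, so the first row adds 0
    return is_close.cumsum()  # cumulative sum over the whole policy, never reset

df["occ"] = df.groupby("id_police")["datesurv"].transform(running_close_count)
print(df["occ"].tolist())     # [0, 1, 2, 0, 0, 0, 1, 2]

# Hypothetical policy with a 10-month gap in the middle.
gappy = pd.Series(pd.to_datetime(["2018-01-01", "2018-02-01", "2018-12-01", "2018-12-02"]))
gappy_occ = running_close_count(gappy)
print(gappy_occ.tolist())     # [0, 1, 1, 2]
```

On the question's data this reproduces the expected occ column. For the hypothetical policy, a strict "claims in the previous 6 months" count would be 0, 1, 0, 1, so this approach is only safe when no policy has a long gap followed by further claims.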
Answer 1 (score: 3)
Another option is to group by id_police together with a pd.Grouper on datesurv that creates 6-month groups, then take the cumcount:
df.datesurv = pd.to_datetime(df.datesurv, format='%d/%m/%Y')
g = pd.Grouper(key='datesurv', freq='6MS')
df.assign(occ=df.groupby(['id_police', g]).cumcount())
id_police id_sinistre datesurv occ
0 p123 s120 2018-01-01 0
1 p123 s121 2018-01-03 1
2 p123 s122 2018-05-05 2
3 p222 s123 2018-05-04 0
4 p222 s124 2018-12-02 0
5 p433 s125 2018-08-07 0
6 p433 s126 2018-09-08 1
7 p433 s127 2018-10-10 2
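Note that pd.Grouper with freq='6MS' puts rows into fixed, non-overlapping 6-month bins rather than a window that trails each row, so two claims a few days apart can land in different bins. If a true trailing window is what is needed, one possible sketch is below. It is not taken from either answer: the helper name prior_claims_in_window is made up, it uses exact calendar months via pd.DateOffset, and it treats a claim exactly 6 months earlier as still inside the window.

```python
import numpy as np
import pandas as pd

def prior_claims_in_window(dates: pd.Series, months: int = 6) -> pd.Series:
    """For each date, count the earlier rows within the trailing `months` window.

    Assumes `dates` holds one policy's claims, already sorted ascending.
    """
    values = dates.to_numpy()
    window_start = (dates - pd.DateOffset(months=months)).to_numpy()
    # Position of the first row dated on or after the start of each row's window.
    first_in_window = np.searchsorted(values, window_start, side="left")
    # Rows between that position and the current row are the prior claims in the window.
    return pd.Series(np.arange(len(values)) - first_in_window, index=dates.index)

# Sample data copied from the question.
df = pd.DataFrame({
    "id_police": ["p123", "p123", "p123", "p222", "p222", "p433", "p433", "p433"],
    "id_sinistre": ["s120", "s121", "s122", "s123", "s124", "s125", "s126", "s127"],
    "datesurv": ["01/01/2018", "03/01/2018", "05/05/2018", "04/05/2018",
                 "02/12/2018", "07/08/2018", "08/09/2018", "10/10/2018"],
})
df["datesurv"] = pd.to_datetime(df["datesurv"], format="%d/%m/%Y")
df = df.sort_values(["id_police", "datesurv"])

df["occ"] = df.groupby("id_police")["datesurv"].transform(prior_claims_in_window)
print(df["occ"].tolist())  # [0, 1, 2, 0, 0, 0, 1, 2]

# Hypothetical policy with a 10-month gap: the count restarts after the gap.
gappy = pd.Series(pd.to_datetime(["2018-01-01", "2018-02-01", "2018-12-01", "2018-12-02"]))
print(prior_claims_in_window(gappy).tolist())  # [0, 1, 0, 1]
```

Because the dates are sorted within each policy, searchsorted finds each window's start with a binary search, which avoids comparing every pair of rows.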