Given a table like the following:
Name Date Score
James 02/2011 70
James 03/2011 72
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
Jacob 01/2011 70
Jacob 02/2011 70
Jacob 03/2011 60
Jacob 04/2011 80
Jacob 05/2011 70
Jacob 06/2011 70
Jacob 08/2011 70
Jacob 10/2011 60
Jacob 11/2011 60
Jacob 12/2011 70
Jacob 02/2012 80
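I want to apply the following rules per name:
1) If the difference between two consecutive dates is more than 6 months, delete the earlier of the two rows and every row before it.
So, for James, take the first three rows: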
James 02/2011 70
James 03/2011 72 (difference between 03/2011, 10/2011 is 7 months)
James 10/2011 60
Here we delete the first two rows, and the resulting rows are
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
But here,
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75 (difference between 03/2012, 11/2012 is 8 months)
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
So we get rid of
James 10/2011 60
James 12/2011 50
James 01/2012 40
James 02/2012 60
James 03/2012 75
and are left with only
James 11/2012 70
James 12/2012 70
James 01/2013 30
James 02/2013 20
James 04/2013 60
James 06/2013 80
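Rule 1 could be sketched in pandas roughly as follows. This is only an illustration under a few assumptions: the rows are already sorted by date within each name, the helper name `keep_after_last_gap` is hypothetical, and gaps are counted in whole calendar months.

```python
import pandas as pd

# James's rows from the table above
james = pd.DataFrame({
    "Name": "James",
    "Date": ["02/2011", "03/2011", "10/2011", "12/2011", "01/2012",
             "02/2012", "03/2012", "11/2012", "12/2012", "01/2013",
             "02/2013", "04/2013", "06/2013"],
    "Score": [70, 72, 60, 50, 40, 60, 75, 70, 70, 30, 20, 60, 80],
})

def keep_after_last_gap(df, max_gap=6):
    """Hypothetical helper: per Name, keep only the rows after the
    last gap of more than `max_gap` months."""
    dates = pd.to_datetime(df["Date"], format="%m/%Y")
    # Number every calendar month so that gaps are plain integer differences
    month_no = dates.dt.year * 12 + dates.dt.month
    gap = month_no.groupby(df["Name"]).diff()  # NaN on each name's first row
    # Running count of "big gaps" seen so far within each name
    segment = gap.gt(max_gap).astype(int).groupby(df["Name"]).cumsum()
    # Keep only the rows that belong to the last segment of their name
    last_segment = segment.groupby(df["Name"]).transform("max")
    return df[segment == last_segment]

result = keep_after_last_gap(james)
print(result)
```

Because the segment counter only increases, keeping the highest segment per name handles both gaps (03/2011 to 10/2011 and 03/2012 to 11/2012) in a single pass.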
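2) After applying 1) to every name, change the dates so that the last date stays the same and the difference between consecutive dates is always 1 month.
So, for Jacob, we originally have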
Jacob 01/2011 70
Jacob 02/2011 70
Jacob 03/2011 60
Jacob 04/2011 80
Jacob 05/2011 70
Jacob 06/2011 70
Jacob 08/2011 70
Jacob 10/2011 60
Jacob 11/2011 60
Jacob 12/2011 70
Jacob 02/2012 80
but the resulting rows are
Jacob 04/2011 70
Jacob 05/2011 70
Jacob 06/2011 60
Jacob 07/2011 80
Jacob 08/2011 70
Jacob 09/2011 70
Jacob 10/2011 70
Jacob 11/2011 60
Jacob 12/2011 60
Jacob 01/2012 70
Jacob 02/2012 80
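Rule 2 amounts to counting backwards from each name's last date, one month per row. A minimal sketch, assuming the rows are sorted by date within each name; the helper name `respace_monthly` is hypothetical.

```python
import pandas as pd

# Jacob's rows from the table above
jacob = pd.DataFrame({
    "Name": "Jacob",
    "Date": ["01/2011", "02/2011", "03/2011", "04/2011", "05/2011", "06/2011",
             "08/2011", "10/2011", "11/2011", "12/2011", "02/2012"],
    "Score": [70, 70, 60, 80, 70, 70, 70, 60, 60, 70, 80],
})

def respace_monthly(df):
    """Hypothetical helper: per Name, keep the last date fixed and
    space the earlier rows exactly one month apart."""
    dates = pd.to_datetime(df["Date"], format="%m/%Y")
    # Zero-based month number: n // 12 is the year, n % 12 + 1 the month
    month_no = dates.dt.year * 12 + (dates.dt.month - 1)
    last = month_no.groupby(df["Name"]).transform("max")
    # How many rows follow this one within the same name
    rows_after = df.groupby("Name").cumcount(ascending=False)
    new_no = (last - rows_after).tolist()
    out = df.copy()
    out["Date"] = [f"{n % 12 + 1:02d}/{n // 12}" for n in new_no]
    return out

respaced = respace_monthly(jacob)
print(respaced)
```

The scores keep their original order; only the `Date` column is rewritten.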
So the resulting table I want is:
Name Date Score
James 01/2013 70
James 02/2013 70
James 03/2013 30
James 04/2013 20
James 05/2013 60
James 06/2013 80
Jacob 04/2011 70
Jacob 05/2011 70
Jacob 06/2011 60
Jacob 07/2011 80
Jacob 08/2011 70
Jacob 09/2011 70
Jacob 10/2011 70
Jacob 11/2011 60
Jacob 12/2011 60
Jacob 01/2012 70
Jacob 02/2012 80
Any help is greatly appreciated.
Answer 0 (score: 1)
I broke it down into steps.
import numpy as np
import pandas as pd
# convert to datetime
df.Date = pd.to_datetime(df.Date, format='%m/%Y')
# create a new key for the groupby: a running count of gaps of more than 6 months within each name
# (month numbers are compared as integers, so every calendar month counts as exactly one)
months = df.Date.dt.year * 12 + df.Date.dt.month
df['Newkey'] = months.groupby(df.Name).diff().gt(6).astype(int).groupby(df.Name).cumsum()
# keep only the last subgroup for each name
df1 = df.loc[df.Newkey == df.groupby('Name').Newkey.transform('max')].copy()
# create the new dates from the length of each group and its max date
df1['NewDate'] = np.concatenate(df1.groupby('Name', sort=False).apply(lambda x: pd.date_range(end=x['Date'].max(), periods=len(x), freq='MS').strftime('%m/%Y')).values)
df1
Out[281]:
     Name       Date  Score  Newkey  NewDate
7   James 2012-11-01     70       2  01/2013
8   James 2012-12-01     70       2  02/2013
9   James 2013-01-01     30       2  03/2013
10  James 2013-02-01     20       2  04/2013
11  James 2013-04-01     60       2  05/2013
12  James 2013-06-01     80       2  06/2013
13  Jacob 2011-01-01     70       0  04/2011
14  Jacob 2011-02-01     70       0  05/2011
15  Jacob 2011-03-01     60       0  06/2011
16  Jacob 2011-04-01     80       0  07/2011
17  Jacob 2011-05-01     70       0  08/2011
18  Jacob 2011-06-01     70       0  09/2011
19  Jacob 2011-08-01     70       0  10/2011
20  Jacob 2011-10-01     60       0  11/2011
21  Jacob 2011-11-01     60       0  12/2011
22  Jacob 2011-12-01     70       0  01/2012
23  Jacob 2012-02-01     80       0  02/2012
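A note on measuring the 6-month gap: a duration in days and a count of calendar months can disagree, which matters right at the threshold. The sketch below uses two dates that are not in the question's data, and the average month length of 365.2425 / 12 days is an assumption about how a day-based approximation would define a month.

```python
import pandas as pd

# Two dates exactly six calendar months apart (illustrative, not from the table)
a = pd.Timestamp("2011-07-01")
b = pd.Timestamp("2012-01-01")

# Elapsed time in days
days = (b - a).days

# Difference counted in calendar months
calendar_months = (b.year - a.year) * 12 + (b.month - a.month)

# Six "average" months, assuming a month is 365.2425 / 12 days
avg_six_months = 6 * 365.2425 / 12

print(days, calendar_months, days > avg_six_months)
```

Here the calendar difference is exactly 6 months, yet the elapsed days exceed six average months, so a day-based test for "more than 6 months" would flag a gap that a month count would not. None of the gaps in the question's data sit on that boundary.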