Grouping by identical column values

Time: 2018-05-03 22:39:54

Tags: python pandas time-series

For a table of the following form:

Name       Date        Score
James      02/2011      70
James      03/2011      72   
James      10/2011      60
James      12/2011      50
James      01/2012      40
James      02/2012      60
James      03/2012      75
James      11/2012      70
James      12/2012      70
James      01/2013      30
James      02/2013      20
James      04/2013      60
James      06/2013      80
Jacob      01/2011      70
Jacob      02/2011      70
Jacob      03/2011      60
Jacob      04/2011      80
Jacob      05/2011      70
Jacob      06/2011      70
Jacob      08/2011      70
Jacob      10/2011      60
Jacob      11/2011      60
Jacob      12/2011      70
Jacob      02/2012      80

I want to apply the following rules per name:

1) If the difference between two consecutive dates is more than 6 months, drop the earlier of the two rows and every row before it.

So, for James, take the first three rows:

James      02/2011      70
James      03/2011      72 (difference between 03/2011, 10/2011 is 7 months)
James      10/2011      60     

Here we drop the first two rows, and the remaining rows are

James      10/2011      60
James      12/2011      50
James      01/2012      40
James      02/2012      60
James      03/2012      75
James      11/2012      70
James      12/2012      70
James      01/2013      30
James      02/2013      20
James      04/2013      60
James      06/2013      80

But here,

James      10/2011      60
James      12/2011      50
James      01/2012      40
James      02/2012      60
James      03/2012      75 (difference between 03/2012, 11/2012 is 8 months)
James      11/2012      70
James      12/2012      70
James      01/2013      30
James      02/2013      20
James      04/2013      60
James      06/2013      80

so we drop

James      10/2011      60
James      12/2011      50
James      01/2012      40
James      02/2012      60
James      03/2012      75

and are left with only

James      11/2012      70
James      12/2012      70
James      01/2013      30
James      02/2013      20
James      04/2013      60
James      06/2013      80 
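Rule 1 can be sketched for a single name as follows. This is only a minimal illustration, assuming the dates are already sorted in ascending order; the helper name `keep_after_last_gap` and the `max_gap` parameter are made up for the example. Dates are turned into a whole-month index (year * 12 + month) so that a gap is simply the difference between neighbouring rows.

```python
import pandas as pd

def keep_after_last_gap(dates, max_gap=6):
    """Boolean mask that keeps only the rows after the last gap of more than max_gap months.

    `dates` is a datetime Series for one name, sorted ascending (assumption).
    """
    months = dates.dt.year * 12 + dates.dt.month   # whole-month index
    segment = months.diff().gt(max_gap).cumsum()   # increases by 1 after each large gap
    return segment.eq(segment.max())               # True only for the last segment

james = pd.Series(pd.to_datetime(
    ["02/2011", "03/2011", "10/2011", "12/2011", "01/2012", "02/2012", "03/2012",
     "11/2012", "12/2012", "01/2013", "02/2013", "04/2013", "06/2013"],
    format="%m/%Y"))

kept = james[keep_after_last_gap(james)]
print(kept.dt.strftime("%m/%Y").tolist())
```

For the James rows this leaves the six dates from 11/2012 to 06/2013, the same rows as listed above.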

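2) After applying 1) to every name, change the dates so that the last date stays the same and the difference between consecutive dates is always 1 month.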

So, for Jacob, we originally have

Jacob      01/2011      70
Jacob      02/2011      70
Jacob      03/2011      60
Jacob      04/2011      80
Jacob      05/2011      70
Jacob      06/2011      70
Jacob      08/2011      70
Jacob      10/2011      60
Jacob      11/2011      60
Jacob      12/2011      70
Jacob      02/2012      80

but the resulting rows are

Jacob      04/2011      70
Jacob      05/2011      70
Jacob      06/2011      60
Jacob      07/2011      80
Jacob      08/2011      70
Jacob      09/2011      70
Jacob      10/2011      70
Jacob      11/2011      60
Jacob      12/2011      60
Jacob      01/2012      70
Jacob      02/2012      80
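Rule 2 amounts to counting back one month per row from the last date. A small sketch using `pd.date_range`, where the helper name `respace_monthly` is hypothetical and the scores are left out because only the dates change:

```python
import pandas as pd

def respace_monthly(dates):
    """Consecutive month-start dates that end at the last date, one per input row."""
    return pd.date_range(end=dates.max(), periods=len(dates), freq="MS")

jacob = pd.Series(pd.to_datetime(
    ["01/2011", "02/2011", "03/2011", "04/2011", "05/2011", "06/2011",
     "08/2011", "10/2011", "11/2011", "12/2011", "02/2012"],
    format="%m/%Y"))

new_dates = respace_monthly(jacob).strftime("%m/%Y").tolist()
print(new_dates[0], new_dates[-1])   # first and last of the re-spaced dates
```

With 11 rows ending at 02/2012, the first date moves to 04/2011, as in the rows above.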

So the result table I want is:

Name       Date        Score
James      01/2013      70
James      02/2013      70
James      03/2013      30
James      04/2013      20
James      05/2013      60
James      06/2013      80
Jacob      04/2011      70
Jacob      05/2011      70
Jacob      06/2011      60
Jacob      07/2011      80
Jacob      08/2011      70
Jacob      09/2011      70
Jacob      10/2011      70
Jacob      11/2011      60
Jacob      12/2011      60
Jacob      01/2012      70
Jacob      02/2012      80

Any help is much appreciated.

1 answer:

Answer 0: (score: 1)

I broke it down into steps.

import numpy as np
import pandas as pd

# convert to datetime
df.Date = pd.to_datetime(df.Date, format='%m/%Y')

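# create a new key for groupby, then we just need the last group per name
# (gaps are measured in whole months, since recent pandas rejects np.timedelta64(6, 'M') for this division)
months = df.Date.dt.year * 12 + df.Date.dt.month
gap = months.groupby(df.Name).diff().gt(6).astype(int)
df['Newkey'] = gap.groupby(df.Name).cumsum()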

# keep only the last subgroup for each name (copy so a new column can be added safely)
df1 = df.loc[df.Newkey == df.groupby('Name').Newkey.transform('max')].copy()

# create the new date by using the len of the group and the max date value from that group
df1['NewDate']=np.concatenate(df1.groupby('Name',sort=False).apply(lambda x : pd.date_range(end=x['Date'].max(),periods=len(x['Date']),freq='MS').strftime('%m/%Y')).values)
df1
Out[281]: 
     Name       Date  Score  Newkey  NewDate
7   James 2012-11-01     70       2  01/2013
8   James 2012-12-01     70       2  02/2013
9   James 2013-01-01     30       2  03/2013
10  James 2013-02-01     20       2  04/2013
11  James 2013-04-01     60       2  05/2013
12  James 2013-06-01     80       2  06/2013
13  Jacob 2011-01-01     70       0  04/2011
14  Jacob 2011-02-01     70       0  05/2011
15  Jacob 2011-03-01     60       0  06/2011
16  Jacob 2011-04-01     80       0  07/2011
17  Jacob 2011-05-01     70       0  08/2011
18  Jacob 2011-06-01     70       0  09/2011
19  Jacob 2011-08-01     70       0  10/2011
20  Jacob 2011-10-01     60       0  11/2011
21  Jacob 2011-11-01     60       0  12/2011
22  Jacob 2011-12-01     70       0  01/2012
23  Jacob 2012-02-01     80       0  02/2012
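To check the steps against the table from the question, they can be combined into one function. This is a sketch under stated assumptions rather than the answer's exact code: `trim_and_respace` and `max_gap` are hypothetical names, rows are assumed to be sorted by date within each name, and the new dates are built with plain month arithmetic instead of `pd.date_range`.

```python
import pandas as pd

def trim_and_respace(df, max_gap=6):
    """Apply rule 1 (keep rows after the last large gap) and rule 2 (monthly re-spacing)."""
    out = df.copy()
    out["Date"] = pd.to_datetime(out["Date"], format="%m/%Y")

    # Rule 1: number the segments per name, then keep the last segment only.
    months = out["Date"].dt.year * 12 + out["Date"].dt.month
    gap = months.groupby(out["Name"]).diff().gt(max_gap).astype(int)
    key = gap.groupby(out["Name"]).cumsum()
    out = out[key.eq(key.groupby(out["Name"]).transform("max"))].copy()

    # Rule 2: count back one month per row from the last date of each name.
    last = out.groupby("Name")["Date"].transform("max")
    back = out.groupby("Name").cumcount(ascending=False)
    m = last.dt.year * 12 + last.dt.month - 1 - back   # zero-based month index
    out["Date"] = (m % 12 + 1).astype(str).str.zfill(2) + "/" + (m // 12).astype(str)
    return out.reset_index(drop=True)

james_dates = ["02/2011", "03/2011", "10/2011", "12/2011", "01/2012", "02/2012", "03/2012",
               "11/2012", "12/2012", "01/2013", "02/2013", "04/2013", "06/2013"]
jacob_dates = ["01/2011", "02/2011", "03/2011", "04/2011", "05/2011", "06/2011",
               "08/2011", "10/2011", "11/2011", "12/2011", "02/2012"]
df = pd.DataFrame({
    "Name": ["James"] * 13 + ["Jacob"] * 11,
    "Date": james_dates + jacob_dates,
    "Score": [70, 72, 60, 50, 40, 60, 75, 70, 70, 30, 20, 60, 80,
              70, 70, 60, 80, 70, 70, 70, 60, 60, 70, 80],
})

result = trim_and_respace(df)
print(result)
```

On the sample data this keeps 6 rows for James (01/2013 to 06/2013) and all 11 rows for Jacob (04/2011 to 02/2012), which matches the NewDate column shown above.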