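I have a dataframe containing groups, two dates, and a value.
I want a subset of the dataframe that, for each GRP, keeps all rows with unique B_DATE values. If duplicate B_DATE values exist within a group, I want to keep the row with the largest A_DATE value.
So, if my initial dataframe is: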
GRP A_DATE B_DATE VALUE
A 12/31/2012 2/19/2014 546.2
A 12/31/2013 2/19/2014 543.7
A 3/31/2013 4/30/2014 473.3
A 3/31/2014 4/30/2014 472.5
A 6/30/2013 7/30/2014 528.7
A 6/30/2014 7/30/2014 531.5
A 9/30/2013 10/30/2014 529
A 9/30/2014 10/30/2014 546.7
A 12/31/2014 2/18/2015 573.5
A 3/31/2015 4/30/2015 458.7
A 6/30/2015 7/30/2015 519.5
B 3/31/2014 7/7/2015 1329
B 12/31/2014 7/7/2015 1683
B 3/31/2015 7/7/2015 1361
B 6/30/2014 8/13/2015 1452
B 6/30/2015 8/13/2015 1429
B 9/30/2014 10/29/2015 1488
B 9/30/2015 10/29/2015 1595
B 12/31/2015 2/16/2016 1763
B 3/31/2016 4/28/2016 1548
I would like the result to look like this:
GRP A_DATE B_DATE VALUE
A 12/31/2013 2/19/2014 543.7
A 3/31/2014 4/30/2014 472.5
A 6/30/2014 7/30/2014 531.5
A 9/30/2014 10/30/2014 546.7
A 12/31/2014 2/18/2015 573.5
A 3/31/2015 4/30/2015 458.7
A 6/30/2015 7/30/2015 519.5
B 3/31/2015 7/7/2015 1361
B 6/30/2015 8/13/2015 1429
B 9/30/2015 10/29/2015 1595
B 12/31/2015 2/16/2016 1763
B 3/31/2016 4/28/2016 1548
I know how to do this with a tedious loop and argmax(). However, I am wondering whether there is a "clean", efficient, Pythonic way to approach it.
Thanks in advance.
Answer 0: (score: 2)
Let's use sort_values and drop_duplicates:
df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
.drop_duplicates(subset=['GRP','B_DATE'])
Output:
GRP A_DATE B_DATE VALUE
7 A 9/30/2014 10/30/2014 546.7
10 A 6/30/2015 7/30/2015 519.5
5 A 6/30/2014 7/30/2014 531.5
9 A 3/31/2015 4/30/2015 458.7
3 A 3/31/2014 4/30/2014 472.5
8 A 12/31/2014 2/18/2015 573.5
1 A 12/31/2013 2/19/2014 543.7
17 B 9/30/2015 10/29/2015 1595.0
15 B 6/30/2015 8/13/2015 1429.0
19 B 3/31/2016 4/28/2016 1548.0
13 B 3/31/2015 7/7/2015 1361.0
18 B 12/31/2015 2/16/2016 1763.0
And add sort_index to get back the original order:
df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
.drop_duplicates(subset=['GRP','B_DATE']).sort_index()
GRP A_DATE B_DATE VALUE
1 A 12/31/2013 2/19/2014 543.7
3 A 3/31/2014 4/30/2014 472.5
5 A 6/30/2014 7/30/2014 531.5
7 A 9/30/2014 10/30/2014 546.7
8 A 12/31/2014 2/18/2015 573.5
9 A 3/31/2015 4/30/2015 458.7
10 A 6/30/2015 7/30/2015 519.5
13 B 3/31/2015 7/7/2015 1361.0
15 B 6/30/2015 8/13/2015 1429.0
17 B 9/30/2015 10/29/2015 1595.0
18 B 12/31/2015 2/16/2016 1763.0
19 B 3/31/2016 4/28/2016 1548.0
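One caveat worth checking: the output above suggests A_DATE was still stored as strings, in which case sort_values orders the dates lexicographically rather than chronologically. That happens to give the right rows for this data, but it may not in general (as strings, "9/30/2013" sorts after "12/31/2013"). A minimal sketch of the same approach, assuming the dates are in month/day/year form and converting them with pd.to_datetime first; the `raw` string and the `result` name are only for illustration:

```python
import io

import pandas as pd

# The question's data, read from whitespace-separated text.
raw = """GRP A_DATE B_DATE VALUE
A 12/31/2012 2/19/2014 546.2
A 12/31/2013 2/19/2014 543.7
A 3/31/2013 4/30/2014 473.3
A 3/31/2014 4/30/2014 472.5
A 6/30/2013 7/30/2014 528.7
A 6/30/2014 7/30/2014 531.5
A 9/30/2013 10/30/2014 529
A 9/30/2014 10/30/2014 546.7
A 12/31/2014 2/18/2015 573.5
A 3/31/2015 4/30/2015 458.7
A 6/30/2015 7/30/2015 519.5
B 3/31/2014 7/7/2015 1329
B 12/31/2014 7/7/2015 1683
B 3/31/2015 7/7/2015 1361
B 6/30/2014 8/13/2015 1452
B 6/30/2015 8/13/2015 1429
B 9/30/2014 10/29/2015 1488
B 9/30/2015 10/29/2015 1595
B 12/31/2015 2/16/2016 1763
B 3/31/2016 4/28/2016 1548
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Assumption: dates are month/day/year. Converting makes the sort chronological.
df["A_DATE"] = pd.to_datetime(df["A_DATE"], format="%m/%d/%Y")
df["B_DATE"] = pd.to_datetime(df["B_DATE"], format="%m/%d/%Y")

result = (
    df.sort_values(["GRP", "A_DATE"], ascending=[True, False])  # latest A_DATE first
      .drop_duplicates(subset=["GRP", "B_DATE"])                # keep first row per (GRP, B_DATE)
      .sort_index()                                             # restore original row order
)
print(result)
```

The surviving index labels should match the ones shown above; only the date columns print in ISO form because they are now datetimes.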
Answer 1: (score: 1)
I think you want to group by B_DATE and 'GRP' and aggregate with the last value, i.e.
df['A_DATE'] = pd.to_datetime(df['A_DATE'])
df['B_DATE'] = pd.to_datetime(df['B_DATE'])
ndf = df.groupby(['GRP',df['B_DATE']]).agg('last').reset_index()
   GRP     B_DATE     A_DATE   VALUE
0    A 2014-02-19 2013-12-31   543.7
1    A 2014-04-30 2014-03-31   472.5
2    A 2014-07-30 2014-06-30   531.5
3    A 2014-10-30 2014-09-30   546.7
4    A 2015-02-18 2014-12-31   573.5
5    A 2015-04-30 2015-03-31   458.7
6    A 2015-07-30 2015-06-30   519.5
7    B 2015-07-07 2015-03-31  1361.0
8    B 2015-08-13 2015-06-30  1429.0
9    B 2015-10-29 2015-09-30  1595.0
10   B 2016-02-16 2015-12-31  1763.0
11   B 2016-04-28 2016-03-31  1548.0
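Note that agg('last') returns the last row of each group in its current row order, so it matches the largest A_DATE only when the rows are already sorted by A_DATE within each group, as they are in the question. If that ordering cannot be assumed, one possible alternative is to select the row with the maximum A_DATE explicitly via groupby and idxmax. A minimal sketch, using a few of the question's group B rows deliberately reordered for illustration (the reordering and the `idx` and `result` names are assumptions, not part of the original answer):

```python
import pandas as pd

# A few of the question's group B rows, reordered so the latest A_DATE
# is NOT the last row of its group.
df = pd.DataFrame({
    "GRP": ["B", "B", "B", "B", "B"],
    "A_DATE": ["3/31/2015", "3/31/2014", "12/31/2014", "6/30/2015", "6/30/2014"],
    "B_DATE": ["7/7/2015", "7/7/2015", "7/7/2015", "8/13/2015", "8/13/2015"],
    "VALUE": [1361.0, 1329.0, 1683.0, 1429.0, 1452.0],
})
df["A_DATE"] = pd.to_datetime(df["A_DATE"], format="%m/%d/%Y")
df["B_DATE"] = pd.to_datetime(df["B_DATE"], format="%m/%d/%Y")

# For each (GRP, B_DATE) pair, find the index label of the row with the latest A_DATE.
idx = df.groupby(["GRP", "B_DATE"])["A_DATE"].idxmax()

# Select those rows and keep them in their original order.
result = df.loc[idx].sort_index()
print(result)
```

With this reordered input, the idxmax version keeps the rows with VALUE 1361.0 and 1429.0, whereas agg('last') would return the rows that merely come last (1683.0 and 1452.0).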