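I have a DataFrame with 5 different columns. My actual problem is to group the data, take max() of a particular field, and then return the rows that satisfy this condition.
Example (I've included the code and a screenshot of the DataFrame):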
from datetime import datetime

import pandas as pd

# Sample data; quote_date, expiration and strike are moved into a MultiIndex.
A = pd.DataFrame([[datetime(2005,1,1), datetime(2005,1,2), 1240, 1234, 12],
                  [datetime(2005,1,1), datetime(2005,1,2), 1250, 1235, 13],
                  [datetime(2005,1,1), datetime(2005,1,3), 1230, 1235, 12],
                  [datetime(2005,1,1), datetime(2005,1,3), 1240, 1235, 13],
                  [datetime(2005,1,1), datetime(2005,1,4), 1240, 1235, 12],
                  [datetime(2005,1,1), datetime(2005,1,5), 1240, 1235, 13],
                  [datetime(2005,1,1), datetime(2005,1,5), 1240, 1233, 11],
                  [datetime(2005,1,1), datetime(2005,1,6), 1240, 1235, 14]],
                 columns=['quote_date', 'expiration', 'strike', 'price', 'var']).set_index(['quote_date', 'expiration', 'strike'])
If I group by quote_date and expiration and take the max of strike, I only get quote_date, expiration and strike back; the price and var columns are lost:
A.reset_index().groupby(by = ['quote_date', 'expiration'])['strike'].max()
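The goal is to get a DataFrame that keeps the complete row (including price and var) for the maximum strike in each group.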
Answer 0 (score: 1)
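Use DataFrameGroupBy.idxmax, which returns the index label of the row holding the maximum in each group. It is meant to work with the default index here (and needs strike as a column), so the necessary first step is reset_index: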
A = A.reset_index()
df = A.loc[A.groupby(by = ['quote_date', 'expiration'])['strike'].idxmax()]
print (df)
  quote_date expiration  strike  price  var
1 2005-01-01 2005-01-02    1250   1235   13
3 2005-01-01 2005-01-03    1240   1235   13
4 2005-01-01 2005-01-04    1240   1235   12
5 2005-01-01 2005-01-05    1240   1235   13
7 2005-01-01 2005-01-06    1240   1235   14
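To make the two steps easier to follow, here is a small self-contained sketch that rebuilds the sample data with a default index and prints the intermediate labels returned by idxmax before passing them to loc. The names rows and idx are only illustrative and are not part of the answer above.

```python
from datetime import datetime

import pandas as pd

# Same sample data as in the question, kept with a default RangeIndex.
rows = [
    [datetime(2005, 1, 1), datetime(2005, 1, 2), 1240, 1234, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 2), 1250, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 3), 1230, 1235, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 3), 1240, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 4), 1240, 1235, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 5), 1240, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 5), 1240, 1233, 11],
    [datetime(2005, 1, 1), datetime(2005, 1, 6), 1240, 1235, 14],
]
A = pd.DataFrame(rows, columns=['quote_date', 'expiration', 'strike', 'price', 'var'])

# Step 1: per group, the index label of the first row with the largest strike.
idx = A.groupby(['quote_date', 'expiration'])['strike'].idxmax()
print(idx.tolist())  # [1, 3, 4, 5, 7]

# Step 2: .loc pulls the complete rows (price and var included) for those labels.
df = A.loc[idx]
print(df)
```

For the 2005-01-05 group, both rows have strike 1240, and idxmax returns the label of the first one (5).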
For a MultiIndex in the result, add set_index:
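# Start from the original MultiIndex A; skip this line if A was already reset above,
# otherwise a second reset_index adds an extra 'index' column.
A = A.reset_index()
df = (A.loc[A.groupby(by = ['quote_date', 'expiration'])['strike'].idxmax()]
        .set_index(['quote_date','expiration']))
print (df)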
                       strike  price  var
quote_date expiration
2005-01-01 2005-01-02    1250   1235   13
           2005-01-03    1240   1235   13
           2005-01-04    1240   1235   12
           2005-01-05    1240   1235   13
           2005-01-06    1240   1235   14
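Another solution, starting from the original MultiIndex A. Note that it sorts by var rather than strike; on this sample data both orderings pick the same rows, but to select by the largest strike in general you would sort by 'strike' instead: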
df = (A.sort_values('var', ascending=False)
.reset_index(level=['strike'])
.groupby(by = ['quote_date', 'expiration'])
.first()
)
print (df)
                       strike  price  var
quote_date expiration
2005-01-01 2005-01-02    1250   1235   13
           2005-01-03    1240   1235   13
           2005-01-04    1240   1235   12
           2005-01-05    1240   1235   13
           2005-01-06    1240   1235   14
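Both approaches above return exactly one row per group, even when several rows share the maximum strike (as the two 1240 rows for 2005-01-05 do). If you want to keep all tied rows instead, one possible variant, not part of the original answer, is to compare each strike with its group maximum via transform('max'). The names rows, group_max and tied are illustrative only.

```python
from datetime import datetime

import pandas as pd

# Same sample data as in the question, kept with a default RangeIndex.
rows = [
    [datetime(2005, 1, 1), datetime(2005, 1, 2), 1240, 1234, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 2), 1250, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 3), 1230, 1235, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 3), 1240, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 4), 1240, 1235, 12],
    [datetime(2005, 1, 1), datetime(2005, 1, 5), 1240, 1235, 13],
    [datetime(2005, 1, 1), datetime(2005, 1, 5), 1240, 1233, 11],
    [datetime(2005, 1, 1), datetime(2005, 1, 6), 1240, 1235, 14],
]
A = pd.DataFrame(rows, columns=['quote_date', 'expiration', 'strike', 'price', 'var'])

# transform('max') broadcasts each group's maximum strike back onto every row of that group.
group_max = A.groupby(['quote_date', 'expiration'])['strike'].transform('max')

# Keep every row whose strike equals its group's maximum, ties included.
tied = A[A['strike'] == group_max]
print(tied)
```

On the sample data this keeps six rows rather than five, because both 2005-01-05 rows have strike 1240.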