Group by a column's maximum and return the full rows

Time: 2018-08-20 08:04:08

Tags: python pandas

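I have a data frame with 5 different columns. My actual problem is to group by certain fields, take max() of a specific field, and then return the rows that satisfy that condition.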

Example (code below; the original post also included screenshots of the data frame):

import pandas as pd
from datetime import datetime

A = pd.DataFrame([[datetime(2005,1,1), datetime(2005,1,2), 1240, 1234, 12],
      [datetime(2005,1,1), datetime(2005,1,2), 1250, 1235, 13],
      [datetime(2005,1,1), datetime(2005,1,3), 1230, 1235, 12],
      [datetime(2005,1,1), datetime(2005,1,3), 1240, 1235, 13],
      [datetime(2005,1,1), datetime(2005,1,4), 1240, 1235, 12],
      [datetime(2005,1,1), datetime(2005,1,5), 1240, 1235, 13],
      [datetime(2005,1,1), datetime(2005,1,5), 1240, 1233, 11],
      [datetime(2005,1,1), datetime(2005,1,6), 1240, 1235, 14]],
     columns=['quote_date', 'expiration', 'strike', 'price', 'var']).set_index(['quote_date', 'expiration', 'strike'])


If I group and take the max of strike, I only get quote_date, expiration and strike back; the price and var columns are lost:

A.reset_index().groupby(by = ['quote_date', 'expiration'])['strike'].max()


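The goal is a data frame that keeps, for each quote_date/expiration group, the complete row (including price and var) that holds the maximum strike.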

1 answer:

Answer 0 (score: 1)

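Use DataFrameGroupBy.idxmax. It returns the index label of each group's maximum, and strike has to be a regular column to be selected after the groupby, so the necessary first step is reset_index, which restores the default index: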

A = A.reset_index()
df = A.loc[A.groupby(by = ['quote_date', 'expiration'])['strike'].idxmax()]
print (df)
  quote_date expiration  strike  price  var
1 2005-01-01 2005-01-02    1250   1235   13
3 2005-01-01 2005-01-03    1240   1235   13
4 2005-01-01 2005-01-04    1240   1235   12
5 2005-01-01 2005-01-05    1240   1235   13
7 2005-01-01 2005-01-06    1240   1235   14
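
To make the two steps explicit, here is a minimal sketch on hypothetical toy data (the `toy` frame and its `group` column are made up for illustration and are not the frame from the question). It assumes a default integer index, so the labels returned by idxmax identify rows unambiguously.

```python
import pandas as pd

# Hypothetical toy data, not the data frame from the question.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "strike": [1240, 1250, 1230, 1240],
    "price": [1234, 1235, 1235, 1236],
})

# Step 1: for each group, idxmax returns the index label of the row
# that holds the maximum strike (not the maximum value itself).
labels = toy.groupby("group")["strike"].idxmax()

# Step 2: .loc uses those labels to pull out the complete rows,
# so the other columns (here: price) come along.
rows = toy.loc[labels]

print(labels.tolist())
print(rows)
```

If the index contained duplicate labels, `.loc` could return more rows than intended, which is one reason to reset the index first.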

To get a MultiIndex back, add set_index:

A = A.reset_index()  # run this on the original MultiIndex frame; skip it if A was already reset above
df = (A.loc[A.groupby(by = ['quote_date', 'expiration'])['strike'].idxmax()]
       .set_index(['quote_date','expiration']))
print (df)
                       strike  price  var
quote_date expiration                    
2005-01-01 2005-01-02    1250   1235   13
           2005-01-03    1240   1235   13
           2005-01-04    1240   1235   12
           2005-01-05    1240   1235   13
           2005-01-06    1240   1235   14

Another solution:

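# Assumes the original A with the MultiIndex (quote_date, expiration, strike).
df = (A.reset_index(level=['strike'])
       # sort by strike, the column being maximized; mergesort is stable,
       # so tied rows keep their original order
       .sort_values('strike', ascending=False, kind='mergesort')
       .groupby(by = ['quote_date', 'expiration'])
       .first()
       )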
print (df)
                       strike  price  var
quote_date expiration                    
2005-01-01 2005-01-02    1250   1235   13
           2005-01-03    1240   1235   13
           2005-01-04    1240   1235   12
           2005-01-05    1240   1235   13
           2005-01-06    1240   1235   14
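
Both approaches above return exactly one row per group, even when several rows share the maximum strike (as happens for expiration 2005-01-05 in the question's data). If every tied row is wanted instead, a possible variation, which is not part of the original answer, is to compare each row against its group maximum with transform. The sketch below uses hypothetical toy data; the `toy` frame and its `group` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: group "b" has two rows tied at the maximum strike.
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "strike": [1240, 1250, 1240, 1240, 1230],
    "var": [12, 13, 13, 11, 12],
})

# transform("max") broadcasts each group's maximum back onto every row,
# so the result lines up with the original frame.
group_max = toy.groupby("group")["strike"].transform("max")

# Keep every row whose strike equals its group's maximum (ties included).
tied = toy[toy["strike"] == group_max]

# For comparison: idxmax keeps only the first row per group.
single = toy.loc[toy.groupby("group")["strike"].idxmax()]

print(tied)
print(single)
```

Which behavior is right depends on whether duplicates at the maximum are meaningful in the data; the question itself does not say.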