pandas dataframe select row by full multiindex entry and some column entry

Time: 2017-06-03 13:37:11

Tags: python pandas

I have a multi-indexed DataFrame in pandas and I would like to select a row by the following condition:

Let's say we have columns 'a', 'b', 'c' and index levels 'i1', 'i2':

print(df)

           a   b   c
i1  i2  
10 2.0    10  34   ..
   2.0    11  45   ..
   2.0    12  36   ..
20 2.0    10  15   ..
   2.0    18  34   ..
   2.0    16  46   ..

Now I would like to select, for each unique MultiIndex entry, the row where column 'a' is maximal. So I wrote

for entry in df.index.unique():
    max_a = df.loc[entry,'a'].max()

and now I would like to select that row and append it to another DataFrame, let's say dfout

dfout=dfout.append(df[(df.index.values == entry) & (df['a'] == max_a)])

This raises an invalid type comparison error, probably because I am trying to compare tuples, but I am not sure. Can anyone explain how to select exactly that row the correct way? Maybe there is even a much nicer way to select all these max('a') rows for all unique entries of df.index.

edit:

df.index.values is of type numpy.ndarray

entry is of type tuple

Maybe this helps in answering my question.

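3 answers:

Answer 0 (score: 1)

There is a problem with your data: you want to identify the row with the maximum 'a' for each unique index entry, but your index is not unique.

I would normally do this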

df.loc[df.groupby(['i1', 'i2']).a.idxmax()]

But look at the idxmax result

df.groupby(level=['i1', 'i2']).a.idxmax()

i1  i2 
10  2.0    (10, 2.0)
20  2.0    (20, 2.0)
Name: a, dtype: object

Because the index is not unique, the loc call just returns all rows again.

df.loc[df.groupby(level=['i1', 'i2']).a.idxmax()]

         a   b   c
i1 i2             
10 2.0  10  34  ..
   2.0  11  45  ..
   2.0  12  36  ..
20 2.0  10  15  ..
   2.0  18  34  ..
   2.0  16  46  ..

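So... we need to make the index unique for this technique to work.

Option 1
reset_index
I could assign the dataframe with the reset index to a new variable and use loc, but I know the new index is identical to the row positions, so I just use iloc.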

df.iloc[df.reset_index().groupby(['i1', 'i2']).a.idxmax()]

         a   b   c
i1 i2             
10 2.0  12  36  ..
20 2.0  18  34  ..
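
For readers who want to run Option 1 end to end, here is a minimal sketch. It rebuilds the question's sample frame; the values in column 'c' are invented for illustration, since the original shows only `..` there, and the names `positions` and `result` are my own.

```python
import pandas as pd

# Sample data mirroring the question; the 'c' values are made up for illustration.
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {
        "a": [10, 11, 12, 10, 18, 16],
        "b": [34, 45, 36, 15, 34, 46],
        "c": list("uvwxyz"),
    },
    index=index,
)

# After reset_index the row labels are 0..n-1, so idxmax returns row positions.
positions = df.reset_index().groupby(["i1", "i2"])["a"].idxmax()

# Apply those positions to the original frame so the MultiIndex is kept.
result = df.iloc[positions.to_numpy()]
print(result)
```

Note that idxmax reports only the first maximum in each group, so ties are not returned with this approach.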

Option 2
cumcount
Add another level to the existing index to make it unique.

d1 = df.set_index(df.groupby(level=['i1', 'i2']).cumcount(), append=True)
d1.loc[d1.groupby(level=['i1', 'i2']).a.idxmax()].reset_index(-1, drop=True)

         a   b   c
i1 i2             
10 2.0  12  36  ..
20 2.0  18  34  ..
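
The same steps for Option 2, split into named intermediate results so each stage can be inspected. This is a sketch on a rebuilt copy of the sample data; the 'c' values and the names `counter`, `labels` and `result` are assumptions of mine.

```python
import pandas as pd

# Sample data mirroring the question; the 'c' values are made up for illustration.
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {
        "a": [10, 11, 12, 10, 18, 16],
        "b": [34, 45, 36, 15, 34, 46],
        "c": list("uvwxyz"),
    },
    index=index,
)

# Number the rows within each (i1, i2) group: 0, 1, 2, ...
counter = df.groupby(level=["i1", "i2"]).cumcount()

# Append the counter as a third index level so every label is unique.
d1 = df.set_index(counter, append=True)

# idxmax now returns unique 3-tuples that loc resolves to single rows.
labels = d1.groupby(level=["i1", "i2"])["a"].idxmax()

# Select those rows, then drop the helper level again.
result = d1.loc[labels.tolist()].reset_index(level=-1, drop=True)
print(result)
```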

In my opinion, option 1 is prettier.

Answer 1 (score: 0)

I took the example dataframe from the pandas MultiIndex documentation page and did the following
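
A third possibility, not part of the two options above, avoids index labels entirely: broadcast each group's maximum back onto its rows with transform and compare. The sketch below assumes the same sample data (with invented 'c' values); unlike idxmax, it keeps every row that ties for the maximum.

```python
import pandas as pd

# Sample data mirroring the question; the 'c' values are made up for illustration.
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {
        "a": [10, 11, 12, 10, 18, 16],
        "b": [34, 45, 36, 15, 34, 46],
        "c": list("uvwxyz"),
    },
    index=index,
)

# Broadcast each group's maximum of 'a' back onto every row of that group.
group_max = df.groupby(level=["i1", "i2"])["a"].transform("max")

# Compare as plain arrays so the non-unique index never needs aligning.
mask = df["a"].to_numpy() == group_max.to_numpy()
result = df[mask]
print(result)
```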

import numpy as np
import pandas as pd

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product([mklbl('A',4),
                                       mklbl('B',2),
                                       mklbl('C',4),
                                       mklbl('D',2)])


micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'),
                                        ('b','foo'),('b','bah')],
                                       names=['lvl0', 'lvl1'])


dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))),
                     index=miindex,
                     columns=micolumns).sort_index().sort_index(axis=1)

dfmi.index = dfmi.index.droplevel(3)
dfmi_ = dfmi.reset_index()
dfmi_.columns = dfmi_.columns.droplevel(1)

Now the dataframe looks roughly like your example

>> dfmi_.head()
lvl0 level_0 level_1 level_2   a   a   b   b
0         A0      B0      C0   1   0   3   2
1         A0      B0      C0   5   4   7   6
2         A0      B0      C1   9   8  11  10
3         A0      B0      C1  13  12  15  14
4         A0      B0      C2  17  16  19  18

Now you can do a groupby with idxmax to get the index of the maximum in each group

>> idxmax = dfmi_.groupby('level_0')['a'].idxmax()
>> dfmi_.loc[idxmax]
lvl0 level_0 level_1 level_2    a    a    b    b
15        A0      B1      C3   61   60   63   62
15        A0      B1      C3   61   60   63   62
31        A1      B1      C3  125  124  127  126
31        A1      B1      C3  125  124  127  126
47        A2      B1      C3  189  188  191  190
47        A2      B1      C3  189  188  191  190
63        A3      B1      C3  253  252  255  254
63        A3      B1      C3  253  252  255  254

Answer 2 (score: 0)

This seems to work

import pandas as pd

parts = []  # collect the selected rows, then concatenate once at the end

for entry in df.index.unique():
    max_a = df.loc[entry,'a'].max()
    dftemp = df.loc[entry,:].copy() # not sure if the copy is necessary
    dftemp = dftemp[dftemp['a'] == max_a]
    parts.append(dftemp)

dfout = pd.concat(parts)  # DataFrame.append was removed in pandas 2.0

But this really feels more like a workaround than a solution to the problem of how to select a row with
# pseudo code:
multiindex entry == (1,2,...,n)

# pseudo code:
column 'keyword' entry == max_a
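
One way to express the two pseudo-code conditions directly is sketched below. It relies on MultiIndex.isin, which accepts a list of tuples and returns a boolean array; a plain `==` between `df.index.values` and a tuple is risky because NumPy is likely to treat the tuple as a sequence to broadcast rather than as a single element. The sample data, the 'c' values, and the names `in_entry`, `mask` and `row` are assumptions for illustration.

```python
import pandas as pd

# Sample data mirroring the question; the 'c' values are made up for illustration.
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {
        "a": [10, 11, 12, 10, 18, 16],
        "b": [34, 45, 36, 15, 34, 46],
        "c": list("uvwxyz"),
    },
    index=index,
)

entry = (10, 2.0)  # one full index entry, as yielded by df.index.unique()

# Condition 1: multiindex entry == entry, as a boolean array over the rows.
in_entry = df.index.isin([entry])

# Maximum of 'a' among the rows that share this index entry.
max_a = df.loc[in_entry, "a"].max()

# Condition 2: column 'a' == max_a, combined with condition 1.
mask = in_entry & (df["a"] == max_a).to_numpy()
row = df[mask]
print(row)
```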