I have a multi-indexed DataFrame in pandas and I would like to select a row by the following condition.
Let's say we have columns 'a', 'b', 'c' and index levels 'i1', 'i2':
print(df)
        a   b   c
i1 i2
10 2.0  10  34  ..
   2.0  11  45  ..
   2.0  12  36  ..
20 2.0  10  15  ..
   2.0  18  34  ..
   2.0  16  46  ..
Now, for each unique MultiIndex entry, I would like to select the row where column 'a' is maximal. So I wrote:
for entry in df.index.unique():
    max_a = df.loc[entry, 'a'].max()
and now I would like to select that row and append it to another DataFrame, let's say dfout:
    dfout = dfout.append(df[(df.index.values == entry) & (df['a'] == max_a)])
This raises "invalid type comparison", probably because I am trying to compare tuples, but I am not sure.
Can anyone explain how I can select exactly that row the correct way? Maybe there is even a much nicer way to select the max('a') rows for all unique entries of df.index.
Edit: df.index.values is of type numpy.ndarray and entry is of type tuple. Maybe this helps in answering my question.
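One way to avoid comparing the tuple against df.index.values directly is to let pandas build the index mask. The following is a minimal sketch, not a definitive fix: the frame is hypothetical data shaped like the example above (column 'c' is left out because the original only shows '..'), and it uses MultiIndex.isin to mark the rows that belong to each entry.

```python
import pandas as pd

# Hypothetical data mirroring the example (only columns 'a' and 'b')
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {"a": [10, 11, 12, 10, 18, 16], "b": [34, 45, 36, 15, 34, 46]},
    index=index,
)

selected = []
for entry in df.index.unique():
    # MultiIndex.isin accepts a list of tuples and returns a boolean array
    in_group = df.index.isin([entry])
    max_a = df.loc[in_group, "a"].max()
    # Combine both conditions as plain boolean arrays (no index alignment needed)
    mask = in_group & (df["a"] == max_a).to_numpy()
    selected.append(df[mask])

# Concatenate the per-entry pieces once at the end
dfout = pd.concat(selected)
print(dfout)
```

If several rows of one entry share the maximum, this keeps all of them.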
Answer 0 (score: 1)
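Your data is problematic: you want to identify the row with the maximal a for each unique index entry, but your index is not unique.
I would normally do this: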
df.loc[df.groupby(['i1', 'i2']).a.idxmax()]
But look at the result of idxmax:
df.groupby(level=['i1', 'i2']).a.idxmax()
i1  i2
10  2.0    (10, 2.0)
20  2.0    (20, 2.0)
Name: a, dtype: object
Because the index is not unique, the loc call just returns every row again.
df.loc[df.groupby(level=['i1', 'i2']).a.idxmax()]
        a   b   c
i1 i2
10 2.0  10  34  ..
   2.0  11  45  ..
   2.0  12  36  ..
20 2.0  10  15  ..
   2.0  18  34  ..
   2.0  16  46  ..
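Before relying on idxmax labels, it can help to check whether the index is unique at all. A small sketch, assuming hypothetical data shaped like the question's example (columns 'a' and 'b' only):

```python
import pandas as pd

# Hypothetical data mirroring the question: every (i1, i2) pair occurs three times
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {"a": [10, 11, 12, 10, 18, 16], "b": [34, 45, 36, 15, 34, 46]},
    index=index,
)

# True only if no index label occurs more than once
index_is_unique = df.index.is_unique

# Boolean array marking every row whose label appears more than once
duplicated_rows = df.index.duplicated(keep=False)

print(index_is_unique)
print(duplicated_rows.sum(), "of", len(df), "rows share their label with another row")
```

When is_unique is False, a label returned by idxmax cannot single out one row, which matches the behaviour shown above.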
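So we need to make the index unique for this technique to work.
Option 1: reset_index
I could assign the DataFrame with the reset index to a new variable and use loc, but I know the new index equals the row positions, so I just use iloc.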
df.iloc[df.reset_index().groupby(['i1', 'i2']).a.idxmax()]
        a   b   c
i1 i2
10 2.0  12  36  ..
20 2.0  18  34  ..
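For readers who want to run option 1 end to end, here is a self-contained sketch on hypothetical data shaped like the question's example. The call to to_numpy() is an addition of mine so that iloc receives a plain array of integer positions.

```python
import pandas as pd

# Hypothetical data mirroring the question (columns 'a' and 'b' only)
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {"a": [10, 11, 12, 10, 18, 16], "b": [34, 45, 36, 15, 34, 46]},
    index=index,
)

# After reset_index the row labels are 0..n-1, i.e. identical to row positions
flat = df.reset_index()

# idxmax now returns one unique integer label per (i1, i2) group
positions = flat.groupby(["i1", "i2"])["a"].idxmax().to_numpy()

# Use those positions on the original frame so the MultiIndex is preserved
result = df.iloc[positions]
print(result)
```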
Option 2: cumcount
Add another level to the existing index to make it unique.
d1 = df.set_index(df.groupby(level=['i1', 'i2']).cumcount(), append=True)
d1.loc[d1.groupby(level=['i1', 'i2']).a.idxmax()].reset_index(-1, drop=True)
        a   b   c
i1 i2
10 2.0  12  36  ..
20 2.0  18  34  ..
In my opinion, option 1 is prettier.
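A third possibility, which is not one of the two options above but a different technique, is to compare each row with its group maximum via transform. This is only a sketch on hypothetical data shaped like the question's example; unlike idxmax, it keeps every row that ties for the maximum.

```python
import pandas as pd

# Hypothetical data mirroring the question (columns 'a' and 'b' only)
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {"a": [10, 11, 12, 10, 18, 16], "b": [34, 45, 36, 15, 34, 46]},
    index=index,
)

# transform('max') broadcasts each group's maximum back onto every row of the group
group_max = df.groupby(level=["i1", "i2"])["a"].transform("max")

# Keep the rows whose 'a' equals the maximum of their own group
is_max = (df["a"] == group_max).to_numpy()
result = df[is_max]
print(result)
```

Because the mask is positional, this works without making the index unique first.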
Answer 1 (score: 0)
I took the example DataFrame from the pandas MultiIndex documentation page and did the following:
import numpy as np
import pandas as pd

def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]

miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                      mklbl('B', 2),
                                      mklbl('C', 4),
                                      mklbl('D', 2)])
micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                       ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])
dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns)).reshape((len(miindex), len(micolumns))),
                    index=miindex,
                    columns=micolumns).sort_index().sort_index(axis=1)
# Drop the innermost index level so the remaining index is no longer unique
dfmi.index = dfmi.index.droplevel(3)
# Move the index levels into columns, then drop the second column level
dfmi_ = dfmi.reset_index()
dfmi_.columns = dfmi_.columns.droplevel(1)
Now the DataFrame looks roughly like your example:
>> dfmi_.head()
lvl0 level_0 level_1 level_2   a   a   b   b
0         A0      B0      C0   1   0   3   2
1         A0      B0      C0   5   4   7   6
2         A0      B0      C1   9   8  11  10
3         A0      B0      C1  13  12  15  14
4         A0      B0      C2  17  16  19  18
Now you can use groupby and idxmax to get the index of the maximum for each group:
>> idxmax = dfmi_.groupby('level_0')['a'].idxmax()
>> dfmi_.loc[idxmax]
lvl0 level_0 level_1 level_2    a    a    b    b
15        A0      B1      C3   61   60   63   62
15        A0      B1      C3   61   60   63   62
31        A1      B1      C3  125  124  127  126
31        A1      B1      C3  125  124  127  126
47        A2      B1      C3  189  188  191  190
47        A2      B1      C3  189  188  191  190
63        A3      B1      C3  253  252  255  254
63        A3      B1      C3  253  252  255  254
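The output above lists each row twice, which appears to be because two columns share the label 'a' once the second column level is dropped. The sketch below shows the same reset_index, groupby, and idxmax idea on a smaller hypothetical frame with unique column names (a_foo, b_foo, and the level names l0, l1, l2 are made up for illustration), so each group yields exactly one row.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: three index levels and two uniquely named columns
index = pd.MultiIndex.from_product(
    [["A0", "A1"], ["B0", "B1"], ["C0", "C1"]], names=["l0", "l1", "l2"]
)
frame = pd.DataFrame(
    {"a_foo": np.arange(8), "b_foo": np.arange(8)[::-1]}, index=index
)

# Move the index levels into ordinary columns; rows are now labelled 0..7
flat = frame.reset_index()

# One row label per value of 'l0', pointing at the largest 'a_foo' in that group
idx = flat.groupby("l0")["a_foo"].idxmax()

# Select exactly those rows
best = flat.loc[idx]
print(best)
```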
Answer 2 (score: 0)
This seems to work:
import pandas as pd

pieces = []
for entry in df.index.unique():
    max_a = df.loc[entry, 'a'].max()
    dftemp = df.loc[entry, :].copy()  # not sure if the copy is necessary
    pieces.append(dftemp[dftemp['a'] == max_a])
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
dfout = pd.concat(pieces)
But this really feels more like a workaround than an answer to how to select a row with
# pseudo code:
multiindex entry == (1,2,...,n)
and
# pseudo code:
column 'keyword' entry == max_a
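The two pseudo-code conditions can be written as one boolean mask by comparing each index level separately. This is a sketch under assumptions: the data is hypothetical and shaped like the question's example, and entry is fixed to (10, 2.0) for illustration.

```python
import pandas as pd

# Hypothetical data mirroring the question (columns 'a' and 'b' only)
index = pd.MultiIndex.from_tuples(
    [(10, 2.0)] * 3 + [(20, 2.0)] * 3, names=["i1", "i2"]
)
df = pd.DataFrame(
    {"a": [10, 11, 12, 10, 18, 16], "b": [34, 45, 36, 15, 34, 46]},
    index=index,
)

entry = (10, 2.0)  # example MultiIndex entry to look up

# "multiindex entry == (1,2,...,n)": compare level by level, then combine
in_entry = (df.index.get_level_values("i1") == entry[0]) & (
    df.index.get_level_values("i2") == entry[1]
)

# "column 'keyword' entry == max_a": restrict the maximum to the matching rows
max_a = df.loc[in_entry, "a"].max()
row = df[in_entry & (df["a"] == max_a).to_numpy()]
print(row)
```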