Get the top n values in each row of a DataFrame, and the names of the columns they occur in

Time: 2016-11-05 00:48:52

Tags: python pandas dataframe top-n

I have a DataFrame like this:

import pandas as pd

df = pd.DataFrame({'a':[1,2,1],'b':[4,6,0],'c':[0,4,8]})
+---+---+---+
| a | b | c |
+---+---+---+
| 1 | 4 | 0 |
+---+---+---+
| 2 | 6 | 4 |
+---+---+---+
| 1 | 0 | 8 |
+---+---+---+

For each row, I need both the 'n' highest values (two, in this case) and their corresponding column names, in descending order of value:

row 1: 'b':4,'a':1
row 2: 'b':6,'c':4
row 3: 'c':8,'a':1

1 answer:

Answer 0 (score: 1)

Both of the approaches below are adapted from @unutbu's answer to Find names of top-n highest-value columns in each pandas dataframe row

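1) Use Python's Decorate-Sort-Undecorate idiom on each row with .apply(lambda ...): attach the column names to the values, sort by value, keep the top n, then reformat the result. (I think this is the cleaner of the two.)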

import numpy as np

nlargest = 2

# First we apply Decorate-Sort row-wise to our df
# (result_type='expand' spreads each sorted list across columns)...
tmp = df.apply(lambda row: sorted(zip(df.columns, row), key=lambda cv: -cv[1]),
               axis=1, result_type='expand')
#         0       1       2
# 0  (b, 4)  (a, 1)  (c, 0)
# 1  (b, 6)  (c, 4)  (a, 2)
# 2  (c, 8)  (a, 1)  (b, 0)

# Slice the top-n columns within each row (.ix has been removed from pandas; use .iloc)...
tmp = tmp.iloc[:, 0:nlargest]

# then your result (as a NumPy array) is...
np.array(tmp)
# array([[('b', 4), ('a', 1)],
#        [('b', 6), ('c', 4)],
#        [('c', 8), ('a', 1)]], dtype=object)
# ... or as a list of rows is
tmp.values.tolist()
# ... and you can insert the row-indices 0,1,2 with
list(zip(tmp.index, tmp.values.tolist()))
# [(0, [('b', 4), ('a', 1)]), (1, [('b', 6), ('c', 4)]), (2, [('c', 8), ('a', 1)])]
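
The same Decorate-Sort-Undecorate idea can also be written as a plain loop, which avoids depending on how .apply reshapes list results. This is only a minimal sketch: the helper name top_n_per_row is hypothetical (it is not part of pandas or of the original answer), and it assumes the values in each row are mutually comparable.

```python
import pandas as pd


def top_n_per_row(df, n):
    """Hypothetical helper: map each row label to its n largest (column, value) pairs."""
    result = {}
    for label, row in df.iterrows():
        # Decorate each value with its column name, then sort by value, largest first.
        pairs = sorted(zip(df.columns, row), key=lambda cv: cv[1], reverse=True)
        # Keep only the top n pairs for this row.
        result[label] = pairs[:n]
    return result


df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})
top2 = top_n_per_row(df, 2)
for label, pairs in top2.items():
    print(label, pairs)
```

Because Python's sort is stable, columns with equal values keep their original left-to-right order.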

2) Get a matrix of top-n column positions, topnlocs, as shown below, then use it to index into both df.columns and df.values, and combine the two outputs.

import numpy as np

nlargest = 2
topnlocs = np.argsort(-df.values, axis=1)[:, 0:nlargest]
# ... now you can use topnlocs to reindex both into df.columns, and df.values, then reformat/combine them somehow
# however it's painful trying to apply that NumPy array of indices back to df or df.values,

See How to get away with a multidimensional index in pandas.
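
One possible way to finish approach 2 is to stay in NumPy for the indexing step: fancy-index the column labels with topnlocs, and pick the matching values with np.take_along_axis (available in NumPy 1.15 and later). This is a sketch under those assumptions, not part of the original answer; the names top_cols, top_vals and result are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [4, 6, 0], 'c': [0, 4, 8]})
nlargest = 2

# Column positions of the n largest values in each row, largest first.
topnlocs = np.argsort(-df.values, axis=1)[:, :nlargest]

# Fancy-index the column labels with the position matrix: shape (rows, n).
top_cols = df.columns.to_numpy()[topnlocs]
# Pick the matching values row by row.
top_vals = np.take_along_axis(df.to_numpy(), topnlocs, axis=1)

# Combine into one list of (column, value) pairs per row.
result = [list(zip(cols, vals.tolist())) for cols, vals in zip(top_cols, top_vals)]
for label, pairs in zip(df.index, result):
    print(label, pairs)
```

Negating the values to get a descending argsort only works for numeric data; for other dtypes the row-wise sort in approach 1 is the safer choice.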