如何扫描pandas数据帧中所有大于某个值的值并返回与该值对应的行号和列号?

时间:2016-07-24 12:31:18

标签: python pandas dataframe

我有一个问题,我有如下所示的巨大数据集(Correl Coef矩阵)

  A           B           C           D           E
A 1,          0.413454352,0.615350574,0.479720098,0.34261232
B 0.413454352,1,          0.568124328,0.316543449,0.361164436
C 0.615350574,0.568124328,1,          0.633182519,0.790921334
D 0.479720098,0.316543449,0.633182519,1,          0.450248008
E 0.34261232, 0.361164436,0.790921334,0.450248008,1

我想获取此数据框中的所有值,其中单元格值大于0.6,它应该与行名称和列名一起使用,如下所示

   row_name col_name value
1  A        C        0.61
2  C        A        0.61
3  C        D        0.63
3  C        E        0.79
4  D        C        0.63
5  E        C        0.79

如果我们也可以忽略(A,C)或(C,A)..它会好得多。

我知道我可以使用for循环来实现它,但该方法对于大型数据集效率不高。

2 个答案:

答案 0 :(得分:3)

更新:使用@Divakar's solutionhis hints

In [186]: df = pd.DataFrame(np.triu(df, 1), columns=df.columns, index=df.index)

In [187]: df
Out[187]:
     A         B         C         D         E
A  0.0  0.413454  0.615351  0.479720  0.342612
B  0.0  0.000000  0.568124  0.316543  0.361164
C  0.0  0.000000  0.000000  0.633183  0.790921
D  0.0  0.000000  0.000000  0.000000  0.450248
E  0.0  0.000000  0.000000  0.000000  0.000000

In [188]: df[df > 0.6].stack().reset_index()
Out[188]:
  level_0 level_1         0
0       A       C  0.615351
1       C       D  0.633183
2       C       E  0.790921

OLD回答:

In [96]: df[df > 0.6]
Out[96]:
          A    B         C         D         E
A  1.000000  NaN  0.615351       NaN       NaN
B       NaN  1.0       NaN       NaN       NaN
C  0.615351  NaN  1.000000  0.633183  0.790921
D       NaN  NaN  0.633183  1.000000       NaN
E       NaN  NaN  0.790921       NaN  1.000000

In [97]: df[df > 0.6].stack()
Out[97]:
A  A    1.000000
   C    0.615351
B  B    1.000000
C  A    0.615351
   C    1.000000
   D    0.633183
   E    0.790921
D  C    0.633183
   D    1.000000
E  C    0.790921
   E    1.000000
dtype: float64

或:

In [99]: df[df > 0.6].stack().reset_index()
Out[99]:
   level_0 level_1         0
0        A       A  1.000000
1        A       C  0.615351
2        B       B  1.000000
3        C       A  0.615351
4        C       C  1.000000
5        C       D  0.633183
6        C       E  0.790921
7        D       C  0.633183
8        D       D  1.000000
9        E       C  0.790921
10       E       E  1.000000

数据集:

In [100]: df
Out[100]:
          A         B         C         D         E
A  1.000000  0.413454  0.615351  0.479720  0.342612
B  0.413454  1.000000  0.568124  0.316543  0.361164
C  0.615351  0.568124  1.000000  0.633183  0.790921
D  0.479720  0.316543  0.633183  1.000000  0.450248
E  0.342612  0.361164  0.790921  0.450248  1.000000

答案 1 :(得分:3)

这是基于NumPy的方法 -

# Extract values and row, column names
arr = df.values
index_names = df.index
col_names = df.columns

#  Get indices where such threshold is crossed; avoid diagonal elems
R,C = np.where(np.triu(arr,1)>0.6)

# Arrange those in columns and put out as a dataframe
out_arr = np.column_stack((index_names[R],col_names[C],arr[R,C]))
df_out = pd.DataFrame(out_arr,columns=[['row_name','col_name','value']])

示例运行 -

In [139]: df
Out[139]: 
          A         B         C         D         E
P  1.000000  0.031388  0.263606  0.121490  0.628969
Q  0.031388  1.000000  0.963510  0.497828  0.955238
R  0.263606  0.963510  1.000000  0.917935  0.520522
S  0.121490  0.497828  0.917935  1.000000  0.728386
T  0.628969  0.955238  0.520522  0.728386  1.000000

In [140]: df_out
Out[140]: 
  row_name col_name     value
0        P        E  0.628969
1        Q        C   0.96351
2        Q        E  0.955238
3        R        D  0.917935
4        S        E  0.728386