I have a problem: I have a large dataset (a correlation coefficient matrix) like the one shown below.
A B C D E
A 1, 0.413454352,0.615350574,0.479720098,0.34261232
B 0.413454352,1, 0.568124328,0.316543449,0.361164436
C 0.615350574,0.568124328,1, 0.633182519,0.790921334
D 0.479720098,0.316543449,0.633182519,1, 0.450248008
E 0.34261232, 0.361164436,0.790921334,0.450248008,1
I want to extract every value in this data frame that is greater than 0.6, together with its row name and column name, like this:
  row_name col_name value
1        A        C  0.61
2        C        A  0.61
3        C        D  0.63
4        C        E  0.79
5        D        C  0.63
6        E        C  0.79
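It would be even better if mirrored duplicates could be ignored, so that only one of (A, C) and (C, A) is kept.
I know I can do this with a for loop, but that approach is not efficient for a large dataset.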
Answer 0 (score: 3)
Update: using @Divakar's solution and his hints:
In [186]: df = pd.DataFrame(np.triu(df, 1), columns=df.columns, index=df.index)
In [187]: df
Out[187]:
A B C D E
A 0.0 0.413454 0.615351 0.479720 0.342612
B 0.0 0.000000 0.568124 0.316543 0.361164
C 0.0 0.000000 0.000000 0.633183 0.790921
D 0.0 0.000000 0.000000 0.000000 0.450248
E 0.0 0.000000 0.000000 0.000000 0.000000
In [188]: df[df > 0.6].stack().reset_index()
Out[188]:
level_0 level_1 0
0 A C 0.615351
1 C D 0.633183
2 C E 0.790921
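The update above can also be written as a self-contained sketch that masks with NaN instead of overwriting the lower triangle with zeros, and that names the output columns the way the question asks. The names `corr`, `upper`, `pairs`, and `result` are illustrative and not from the original answer. The explicit `dropna()` is an assumption made for safety, since how `stack()` treats NaN differs between pandas versions.

```python
import numpy as np
import pandas as pd

labels = list("ABCDE")
corr = pd.DataFrame(
    [
        [1.0, 0.413454352, 0.615350574, 0.479720098, 0.34261232],
        [0.413454352, 1.0, 0.568124328, 0.316543449, 0.361164436],
        [0.615350574, 0.568124328, 1.0, 0.633182519, 0.790921334],
        [0.479720098, 0.316543449, 0.633182519, 1.0, 0.450248008],
        [0.34261232, 0.361164436, 0.790921334, 0.450248008, 1.0],
    ],
    index=labels,
    columns=labels,
)

# Boolean mask that is True strictly above the main diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (
    corr.where(upper)   # NaN on the diagonal and below it
        .stack()        # long format, indexed by (row label, column label)
        .dropna()       # drop masked cells regardless of stack's default behaviour
        .rename_axis(["row_name", "col_name"])
        .reset_index(name="value")
)

# Keep only the pairs above the threshold
result = pairs[pairs["value"] > 0.6].reset_index(drop=True)
print(result)
```

Masking with NaN rather than zero means the same code still works if the threshold is zero or negative, where zero-filled cells would otherwise pass the comparison.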
Old answer:
In [96]: df[df > 0.6]
Out[96]:
A B C D E
A 1.000000 NaN 0.615351 NaN NaN
B NaN 1.0 NaN NaN NaN
C 0.615351 NaN 1.000000 0.633183 0.790921
D NaN NaN 0.633183 1.000000 NaN
E NaN NaN 0.790921 NaN 1.000000
In [97]: df[df > 0.6].stack()
Out[97]:
A A 1.000000
C 0.615351
B B 1.000000
C A 0.615351
C 1.000000
D 0.633183
E 0.790921
D C 0.633183
D 1.000000
E C 0.790921
E 1.000000
dtype: float64
Or:
In [99]: df[df > 0.6].stack().reset_index()
Out[99]:
level_0 level_1 0
0 A A 1.000000
1 A C 0.615351
2 B B 1.000000
3 C A 0.615351
4 C C 1.000000
5 C D 0.633183
6 C E 0.790921
7 D C 0.633183
8 D D 1.000000
9 E C 0.790921
10 E E 1.000000
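The output above still includes the diagonal entries (A A, B B, and so on), which the question's expected output does not list. One possible way to keep both orientations of each pair while dropping the diagonal is sketched below. The names `corr`, `off_diag`, `keep`, and `both` are illustrative, and the explicit `dropna()` is an assumption made because the NaN handling of `stack()` depends on the pandas version.

```python
import numpy as np
import pandas as pd

labels = list("ABCDE")
corr = pd.DataFrame(
    [
        [1.0, 0.413454352, 0.615350574, 0.479720098, 0.34261232],
        [0.413454352, 1.0, 0.568124328, 0.316543449, 0.361164436],
        [0.615350574, 0.568124328, 1.0, 0.633182519, 0.790921334],
        [0.479720098, 0.316543449, 0.633182519, 1.0, 0.450248008],
        [0.34261232, 0.361164436, 0.790921334, 0.450248008, 1.0],
    ],
    index=labels,
    columns=labels,
)

# True everywhere except on the main diagonal
off_diag = ~np.eye(len(corr), dtype=bool)
# Keep off-diagonal cells whose value exceeds the threshold
keep = off_diag & (corr.to_numpy() > 0.6)

both = (
    corr.where(keep)    # everything else becomes NaN
        .stack()
        .dropna()
        .rename_axis(["row_name", "col_name"])
        .reset_index(name="value")
)
print(both)
```

This should yield six rows, one for each orientation of the three pairs (A, C), (C, D), and (C, E).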
Dataset:
In [100]: df
Out[100]:
A B C D E
A 1.000000 0.413454 0.615351 0.479720 0.342612
B 0.413454 1.000000 0.568124 0.316543 0.361164
C 0.615351 0.568124 1.000000 0.633183 0.790921
D 0.479720 0.316543 0.633183 1.000000 0.450248
E 0.342612 0.361164 0.790921 0.450248 1.000000
Answer 1 (score: 3)
Here's a NumPy-based approach -
import numpy as np
import pandas as pd

# Extract values and row, column names
arr = df.values
index_names = df.index
col_names = df.columns
# Get indices where the threshold is crossed; triu(arr, 1) zeroes the diagonal and lower triangle
R, C = np.where(np.triu(arr, 1) > 0.6)
# Arrange those in columns and put out as a dataframe
out_arr = np.column_stack((index_names[R], col_names[C], arr[R, C]))
# Pass a flat list of names; a nested list would create MultiIndex columns
df_out = pd.DataFrame(out_arr, columns=['row_name', 'col_name', 'value'])
Sample run -
In [139]: df
Out[139]:
A B C D E
P 1.000000 0.031388 0.263606 0.121490 0.628969
Q 0.031388 1.000000 0.963510 0.497828 0.955238
R 0.263606 0.963510 1.000000 0.917935 0.520522
S 0.121490 0.497828 0.917935 1.000000 0.728386
T 0.628969 0.955238 0.520522 0.728386 1.000000
In [140]: df_out
Out[140]:
row_name col_name value
0 P E 0.628969
1 Q C 0.96351
2 Q E 0.955238
3 R D 0.917935
4 S E 0.728386
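In the sample run, the value 0.96351 is printed without a trailing zero. This is most likely because `np.column_stack` combines labels and numbers into a single object-dtype array, so the `value` column is no longer numeric. A variant of the same idea that builds the frame from separate columns, and so keeps `value` as float64, might look like the sketch below. The dict-based construction and the names `rows` and `cols` are this sketch's choices, not the original answer's.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        [1.000000, 0.031388, 0.263606, 0.121490, 0.628969],
        [0.031388, 1.000000, 0.963510, 0.497828, 0.955238],
        [0.263606, 0.963510, 1.000000, 0.917935, 0.520522],
        [0.121490, 0.497828, 0.917935, 1.000000, 0.728386],
        [0.628969, 0.955238, 0.520522, 0.728386, 1.000000],
    ],
    index=list("PQRST"),
    columns=list("ABCDE"),
)

arr = df.to_numpy()
# Positions strictly above the diagonal whose value exceeds the threshold
rows, cols = np.where(np.triu(arr, k=1) > 0.6)

# One column per field, so each keeps its own dtype
df_out = pd.DataFrame(
    {
        "row_name": df.index.to_numpy()[rows],
        "col_name": df.columns.to_numpy()[cols],
        "value": arr[rows, cols],
    }
)
print(df_out)
print(df_out.dtypes)
```

With a numeric `value` column, follow-up steps such as `df_out.sort_values("value", ascending=False)` order the pairs numerically.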