I have a problem with my result:
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
I start from my correlation matrix:
dataCorr = data.corr(method='pearson')
I convert this matrix into columns:
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
Then I remove the diagonal of the matrix:
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]
But I still have duplicate pairs:
level_0 level_1 0
LiftPushSpeed RT1EntranceSpeed 0.881714
RT1EntranceSpeed LiftPushSpeed 0.881714
How can I avoid this?
Answer 0 (score: 2)
You can convert the lower-triangle values (diagonal included) to NaN, and stack then removes them:
import numpy as np
import pandas as pd

np.random.seed(12)
data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
0 1 2 3 4 5
0 11 6 17 2 3 3
1 12 16 17 5 13 2
2 11 10 0 8 12 13
3 18 3 4 3 1 0
4 18 18 16 6 13 9
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
print (dataCorr)
0 1 2 3 4 5
0 NaN 0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN NaN 0.486901 0.567216 0.914260 0.403469
2 NaN NaN NaN -0.412853 0.157747 -0.354012
3 NaN NaN NaN NaN 0.823628 0.858918
4 NaN NaN NaN NaN NaN 0.635730
5 NaN NaN NaN NaN NaN NaN
# in your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
level_0 level_1 0
0 1 3 0.567216
1 1 4 0.914260
2 3 4 0.823628
3 3 5 0.858918
4 4 5 0.635730
Detail:
# mask built from the shape of the 6x6 correlation matrix
print (np.tril(np.ones(data.corr().shape)))
[[ 1. 0. 0. 0. 0. 0.]
[ 1. 1. 0. 0. 0. 0.]
[ 1. 1. 1. 0. 0. 0.]
[ 1. 1. 1. 1. 0. 0.]
[ 1. 1. 1. 1. 1. 0.]
[ 1. 1. 1. 1. 1. 1.]]
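
The same idea can be wrapped in a small helper. The sketch below is not part of the original answer: the function name high_corr_pairs, the output column names and the demo data are hypothetical. It keeps the strict upper triangle with np.triu(..., k=1) and DataFrame.where instead of masking with np.tril, and it applies the threshold after stacking, so any NaN that survives stack is filtered out as well.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df, threshold=0.7):
    """Return each column pair once, keeping only |corr| >= threshold.

    Hypothetical helper for illustration; names are assumptions.
    """
    corr = df.corr(method='pearson')
    # True strictly above the diagonal, False on and below it
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    # where() keeps values where the mask is True and sets the rest to NaN
    pairs = corr.where(upper).stack().reset_index()
    pairs.columns = ['var_1', 'var_2', 'corr']
    # NaN >= threshold is False, so leftover NaN rows are dropped here too
    keep = pairs['corr'].abs() >= threshold
    return pairs[keep].reset_index(drop=True)


# Made-up demo data: 'b' is almost a multiple of 'a', 'c' is unrelated
demo = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 6.2, 7.9, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
result = high_corr_pairs(demo)
print(result)
```

With this demo data only the a/b pair passes the 0.7 threshold, and it appears once rather than as both (a, b) and (b, a).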
Answer 1 (score: 0)
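You have removed the diagonal elements, but I'm afraid that is all your code currently does.
To get rid of the duplicates, I sorted the two column names in each row, joined them into one key column, filtered out the duplicate keys, and then dropped the key column.
Here is a complete example: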
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))
dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]
# filtering out lower/upper triangular duplicates
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)
print(dataCorr)
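
One caveat with a joined string key is that it needs string labels, and two different pairs can collide when a name itself contains '-' (for example 'a-b' with 'c' and 'a' with 'b-c' both give 'a-b-c'). A possible variant, sketched here under assumptions and not taken from the answer above, sorts the two name columns within each row with np.sort and deduplicates on both columns. The column names var_1, var_2 and corr and the seeded generator are illustrative choices.

```python
import numpy as np
import pandas as pd

# Seeded generator so the sketch is reproducible (an assumption, not in the answer)
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((10, 4)), columns=list('ABCD'))

pairs = data.corr(method='pearson').stack().reset_index()
pairs.columns = ['var_1', 'var_2', 'corr']

# Drop the diagonal (a column paired with itself)
pairs = pairs[pairs['var_1'] != pairs['var_2']].copy()

# Sort the two names within each row so (B, A) becomes (A, B)
names = np.sort(pairs[['var_1', 'var_2']].to_numpy(), axis=1)
pairs['var_1'] = names[:, 0]
pairs['var_2'] = names[:, 1]

# Each unordered pair now appears twice with identical names; keep one copy
unique_pairs = pairs.drop_duplicates(['var_1', 'var_2']).reset_index(drop=True)
print(unique_pairs)
```

Four columns give 12 off-diagonal entries, so 6 unique pairs remain. A threshold filter such as abs(corr) >= 0.7 can be applied to unique_pairs afterwards.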