How to remove duplicate pairs from a pandas correlation matrix?

Date: 2018-01-23 06:15:53

Tags: python pandas correlation data-manipulation

I have a problem with my results:

dataCorr = data.corr(method='pearson')
dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()
dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

Starting from my correlation matrix:

dataCorr = data.corr(method='pearson')

I transform this matrix into columns:

dataCorr = dataCorr[abs(dataCorr) >= 0.7].stack().reset_index()

and then I remove the diagonal of the matrix:

dataCorr = dataCorr[dataCorr.level_0!=dataCorr.level_1]

But I still have duplicated pairs:

level_0             level_1             0
LiftPushSpeed       RT1EntranceSpeed    0.881714
RT1EntranceSpeed    LiftPushSpeed       0.881714

How can I avoid this problem?

2 answers:

Answer 0 (score: 2)

You can convert the lower triangle of values (including the diagonal) to NaNs, which stack then removes:

import numpy as np
import pandas as pd

np.random.seed(12)

data = pd.DataFrame(np.random.randint(20, size=(5,6)))
print (data)
    0   1   2  3   4   5
0  11   6  17  2   3   3
1  12  16  17  5  13   2
2  11  10   0  8  12  13
3  18   3   4  3   1   0
4  18  18  16  6  13   9

dataCorr = data.corr(method='pearson')
# np.tril includes the diagonal, so this masks the diagonal and the lower triangle
dataCorr = dataCorr.mask(np.tril(np.ones(dataCorr.shape)).astype(bool))
print (dataCorr)
    0         1         2         3         4         5
0 NaN  0.042609 -0.041656 -0.113998 -0.173011 -0.201122
1 NaN       NaN  0.486901  0.567216  0.914260  0.403469
2 NaN       NaN       NaN -0.412853  0.157747 -0.354012
3 NaN       NaN       NaN       NaN  0.823628  0.858918
4 NaN       NaN       NaN       NaN       NaN  0.635730
5 NaN       NaN       NaN       NaN       NaN       NaN

# in your data, change 0.5 to 0.7
dataCorr = dataCorr[abs(dataCorr) >= 0.5].stack().reset_index()
print (dataCorr)
   level_0  level_1         0
0        1        3  0.567216
1        1        4  0.914260
2        3        4  0.823628
3        3        5  0.858918
4        4        5  0.635730

Detail:

print (np.tril(np.ones(dataCorr.shape)))
[[ 1.  0.  0.  0.  0.  0.]
 [ 1.  1.  0.  0.  0.  0.]
 [ 1.  1.  1.  0.  0.  0.]
 [ 1.  1.  1.  1.  0.  0.]
 [ 1.  1.  1.  1.  1.  0.]
 [ 1.  1.  1.  1.  1.  1.]]
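
Equivalently (a sketch of an alternative, not part of the original answer), you can build the boolean mask for the strict upper triangle directly with np.triu and keep only those cells via where:

import numpy as np
import pandas as pd

# same toy data as above
np.random.seed(12)
data = pd.DataFrame(np.random.randint(20, size=(5, 6)))

dataCorr = data.corr(method='pearson')

# True strictly above the diagonal (k=1 excludes the diagonal itself)
keep = np.triu(np.ones(dataCorr.shape, dtype=bool), k=1)

# where() keeps values where the mask is True and sets the rest to NaN,
# so stack() drops the diagonal and the lower-triangle duplicates
pairs = dataCorr.where(keep).stack().reset_index()
print(pairs[abs(pairs[0]) >= 0.5])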

Answer 1 (score: 0)

Although you have removed the diagonal elements, I'm afraid that's all your code currently does.

To resolve the duplicates, I concatenate the two column names after sorting them, filter out duplicate rows on that key, and then drop the helper column.

Here is a complete example:

import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'))

dataCorr = data.corr(method='pearson')
# keep pairs with |corr| >= 0.01 and flatten the matrix into rows
dataCorr = dataCorr[abs(dataCorr) >= 0.01].stack().reset_index()
# drop the diagonal (self-correlations)
dataCorr = dataCorr[dataCorr['level_0'].astype(str)!=dataCorr['level_1'].astype(str)]

# filtering out lower/upper triangular duplicates 
dataCorr['ordered-cols'] = dataCorr.apply(lambda x: '-'.join(sorted([x['level_0'],x['level_1']])),axis=1)
dataCorr = dataCorr.drop_duplicates(['ordered-cols'])
dataCorr.drop(['ordered-cols'], axis=1, inplace=True)

print(dataCorr)
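
As a design note (a hedged alternative, not from the original answer): joining the sorted names with '-' can misbehave if a column name itself contains a '-'. Deduplicating on a sorted tuple sidesteps that:

# sketch assuming dataCorr still has its level_0/level_1 columns
key = dataCorr.apply(lambda r: tuple(sorted((r['level_0'], r['level_1']))), axis=1)
dataCorr = dataCorr[~key.duplicated()]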