Efficient pairwise correlation for two matrices of features

Time: 2015-11-11 12:06:12

Tags: python numpy matrix vectorization correlation

In Python, I need to find the pairwise correlation between all features in matrix A and all features in matrix B. In particular, I am interested in finding the strongest Pearson correlation that a given feature in A has across all features in B. I do not care whether the strongest correlation is positive or negative.

I have done an inefficient implementation below using two loops and scipy. However, I would like to compute this efficiently using np.corrcoef or another similar method. Matrix A has shape 40000x400 and matrix B has shape 40000x1440. My attempt at doing this efficiently can be seen below as the method find_max_absolute_corr(A, B). However, it fails with the following error: ValueError: all the input array dimensions except for the concatenation axis must match exactly.

import numpy as np
from scipy.stats import pearsonr

def find_max_absolute_corr(A, B):
    """ Finds for each feature in `A` the highest Pearson
        correlation across all features in `B`. """
    max_corr_A = np.zeros((A.shape[1]))

    for A_col in range(A.shape[1]):
        print "Calculating {}/{}.".format(A_col+1, A.shape[1])
        metric = A[:,A_col]

        pearson = np.corrcoef(B, metric, rowvar=0)

        # takes negative correlations into account as well
        min_p = min(pearson)
        max_p = max(pearson)
        max_corr_A[A_col] = max_absolute(min_p, max_p)

    return max_corr_A

def max_absolute(min_p, max_p):
    if np.isnan(min_p) or np.isnan(max_p):
        raise ValueError("NaN correlation.")
    if abs(max_p) > abs(min_p):
        return max_p
    else:
        return min_p

if __name__ == '__main__':
    A = np.array(
        [[10, 8.04, 9.14, 7.46],
         [8, 6.95, 8.14, 6.77],
         [13, 7.58, 8.74, 12.74],
         [9, 8.81, 8.77, 7.11],
         [11, 8.33, 9.26, 7.81]])
    B = np.array(
        [[-14, -9.96, 8.10, 8.84, 8, 7.04],
         [-6, -7.24, 6.13, 6.08, 5, 5.25],
         [-4, -4.26, 3.10, 5.39, 8, 5.56],
         [-12, -10.84, 9.13, 8.15, 5, 7.91],
         [-7, -4.82, 7.26, 6.42, 8, 6.89]])

    # simple, inefficient method
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr,_ = pearsonr(A[:,A_col], B[:,B_col])
            high_corr = max_absolute(high_corr, corr)
        print high_corr
    # -0.161314601631
    # 0.956781516149
    # 0.621071009239
    # -0.421539304112

    # efficient method
    max_corr_A = find_max_absolute_corr(A, B)
    print max_corr_A
    # [-0.161314601631,
    #  0.956781516149,
    #  0.621071009239,
    #  -0.421539304112]
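As a side note, np.corrcoef itself can produce all pairwise correlations in a single call when both matrices are passed together with rowvar=False, which sidesteps the shape mismatch above. A minimal sketch (the function name corr_via_corrcoef is made up for illustration):

```python
import numpy as np

def corr_via_corrcoef(A, B):
    """Absolute-max Pearson correlation of each column of A
    against all columns of B, via a single np.corrcoef call."""
    nA = A.shape[1]
    # With rowvar=False, columns are treated as variables; corrcoef
    # stacks A and B and returns an (nA+nB) x (nA+nB) matrix whose
    # upper-right block holds the A-vs-B cross-correlations.
    C = np.corrcoef(A, B, rowvar=False)
    cross = C[:nA, nA:]                       # shape (nA, nB)
    idx = np.nanargmax(np.abs(cross), axis=1)
    return cross[np.arange(nA), idx]
```

Note this materializes the full (nA+nB) x (nA+nB) correlation matrix; for the 40000x400 and 40000x1440 case that is a 1840x1840 array, which is still modest.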

2 Answers:

Answer 0 (score: 6)

It seems scipy.stats.pearsonr follows the definition of the Pearson Correlation Coefficient Formula applied to column-wise pairs from A & B -

    r = (N·Σxy − Σx·Σy) / sqrt[ (N·Σx² − (Σx)²) · (N·Σy² − (Σy)²) ]

Based on that formula, you can easily vectorize it, as the pairwise computations between columns of A and B are independent of each other. Here's one vectorized solution using broadcasting -
# Get number of rows in either A or B
N = B.shape[0]

# Store column-wise sums of A and B, as they are used in a few places
sA = A.sum(0)
sB = B.sum(0)

# Basically there are four parts in the formula. We would compute them one-by-one
p1 = N*np.einsum('ij,ik->kj',A,B)
p2 = sA*sB[:,None]
p3 = N*((B**2).sum(0)) - (sB**2)
p4 = N*((A**2).sum(0)) - (sA**2)

# Finally compute Pearson Correlation Coefficient as 2D array 
pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))

# Get the element corresponding to absolute argmax along the columns 
out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]

Sample run -

1) Inputs:

In [12]: A
Out[12]: 
array([[ 10.  ,   8.04,   9.14,   7.46],
       [  8.  ,   6.95,   8.14,   6.77],
       [ 13.  ,   7.58,   8.74,  12.74],
       [  9.  ,   8.81,   8.77,   7.11],
       [ 11.  ,   8.33,   9.26,   7.81]])

In [13]: B
Out[13]: 
array([[-14.  ,  -9.96,   8.1 ,   8.84,   8.  ,   7.04],
       [ -6.  ,  -7.24,   6.13,   6.08,   5.  ,   5.25],
       [ -4.  ,  -4.26,   3.1 ,   5.39,   8.  ,   5.56],
       [-12.  , -10.84,   9.13,   8.15,   5.  ,   7.91],
       [ -7.  ,  -4.82,   7.26,   6.42,   8.  ,   6.89]])

2) Original loopy code run -

In [14]: high_corr_out = np.zeros(A.shape[1])
    ...: for A_col in range(A.shape[1]): 
    ...:     high_corr = 0
    ...:     for B_col in range(B.shape[1]):
    ...:         corr,_ = pearsonr(A[:,A_col], B[:,B_col])
    ...:         high_corr = max_absolute(high_corr, corr)
    ...:     high_corr_out[A_col] = high_corr
    ...:     

In [15]: high_corr_out
Out[15]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])

3) Proposed code run -

In [16]: N = B.shape[0]
    ...: sA = A.sum(0)
    ...: sB = B.sum(0)
    ...: p1 = N*np.einsum('ij,ik->kj',A,B)
    ...: p2 = sA*sB[:,None]
    ...: p3 = N*((B**2).sum(0)) - (sB**2)
    ...: p4 = N*((A**2).sum(0)) - (sA**2)
    ...: pcorr = ((p1 - p2)/np.sqrt(p4*p3[:,None]))
    ...: out = pcorr[np.nanargmax(np.abs(pcorr),axis=0),np.arange(pcorr.shape[1])]
    ...: 

In [17]: pcorr # Pearson Correlation Coefficient array
Out[17]: 
array([[ 0.41895565, -0.5910935 , -0.40465987,  0.5818286 ],
       [ 0.66609445, -0.41950457,  0.02450215,  0.64028344],
       [-0.64953314,  0.65669916,  0.30836196, -0.85127779],
       [-0.41917583,  0.59043266,  0.40364532, -0.58144102],
       [ 0.8067843 ,  0.07947386,  0.74016181,  0.53165395],
       [-0.1613146 ,  0.95678152,  0.62107101, -0.4215393 ]])

In [18]: out # elements corresponding to absolute argmax along columns
Out[18]: array([ 0.8067843 ,  0.95678152,  0.74016181, -0.85127779])
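The org_app and proposed_app used in the runtime test below are not defined in the post; presumably they are wrappers along these lines (a sketch, not the answerer's exact code):

```python
import numpy as np
from scipy.stats import pearsonr

def org_app(A, B):
    # Original double-loop approach: absolute-max Pearson correlation
    # of each column of A across all columns of B.
    out = np.zeros(A.shape[1])
    for A_col in range(A.shape[1]):
        high_corr = 0
        for B_col in range(B.shape[1]):
            corr, _ = pearsonr(A[:, A_col], B[:, B_col])
            if abs(corr) > abs(high_corr):
                high_corr = corr
        out[A_col] = high_corr
    return out

def proposed_app(A, B):
    # Vectorized approach from the answer above.
    N = B.shape[0]
    sA = A.sum(0)
    sB = B.sum(0)
    p1 = N * np.einsum('ij,ik->kj', A, B)
    p2 = sA * sB[:, None]
    p3 = N * ((B**2).sum(0)) - (sB**2)
    p4 = N * ((A**2).sum(0)) - (sA**2)
    pcorr = (p1 - p2) / np.sqrt(p4 * p3[:, None])
    return pcorr[np.nanargmax(np.abs(pcorr), axis=0),
                 np.arange(pcorr.shape[1])]
```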

Runtime tests -

In [36]: A = np.random.rand(4000,40)

In [37]: B = np.random.rand(4000,144)

In [38]: np.allclose(org_app(A,B),proposed_app(A,B))
Out[38]: True

In [39]: %timeit org_app(A,B) # Original approach
1 loops, best of 3: 1.35 s per loop

In [40]: %timeit proposed_app(A,B) # Proposed vectorized approach
10 loops, best of 3: 39.1 ms per loop

Answer 1 (score: 1)

Adding to the above answer, from personal experience,

p1 = N*np.dot(B.T,A)

was faster for me than

p1 = N*np.einsum('ij,ik->kj',A,B)

especially when A and B are high-dimensional matrices.
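The two expressions are numerically equivalent, since 'ij,ik->kj' computes sum_i A[i,j]*B[i,k], which is exactly the matrix product B.T @ A. A quick check (array sizes arbitrary):

```python
import numpy as np

# Illustrative equivalence check between the einsum and dot forms
# of the cross-product term in the Pearson formula.
rng = np.random.default_rng(0)
A = rng.random((500, 40))
B = rng.random((500, 60))
N = B.shape[0]

p1_einsum = N * np.einsum('ij,ik->kj', A, B)
p1_dot = N * np.dot(B.T, A)

assert np.allclose(p1_einsum, p1_dot)
```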