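I want to compute the Pearson correlation coefficient between a vector and each row of an array in Python (assume numpy and/or scipy). Because of the size of the real data arrays and memory constraints, the standard correlation-matrix functions cannot be used. Here is my naive implementation: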
import numpy as np
import scipy.stats as sps
np.random.seed(0)
def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    # Column 0 holds the coefficient, column 1 the p-value.
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr
obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))
pr = correlateOneWithMany(X[0, :], X)
%timeit correlateOneWithMany(X[0, :], X)
# 10 loops, best of 3: 38.9 ms per loop
Any ideas on how to speed this up would be greatly appreciated!
Answer 0 (score: 1)
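The module scipy.spatial.distance implements the "correlation distance", which is simply one minus the correlation coefficient. You can use the function cdist to compute the one-to-many distances, and then obtain the correlation coefficients by subtracting the result from 1.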
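To make that relationship concrete, here is a minimal sketch of what the correlation distance works out to when one vector is compared against many rows. The helper name `pearson_one_vs_many` and the small random `data` array are hypothetical, and the sketch assumes no row is constant (a zero-variance row would lead to a division by zero).

```python
import numpy as np
from scipy.spatial.distance import cdist


def pearson_one_vs_many(one, many):
    """Pearson r of `one` against each row of `many` (hypothetical helper)."""
    # Center the vector and every row around its own mean.
    one_c = one - one.mean()
    many_c = many - many.mean(axis=1, keepdims=True)
    # r = (centered dot product) / (product of the centered norms).
    numer = many_c @ one_c
    denom = np.linalg.norm(many_c, axis=1) * np.linalg.norm(one_c)
    return numer / denom


rng = np.random.default_rng(0)
data = rng.uniform(size=(20, 50))

r_manual = pearson_one_vs_many(data[0], data)
# The same coefficients, recovered from the correlation distance.
r_cdist = 1 - cdist(data[0:1, :], data, metric="correlation")[0]
print(np.allclose(r_manual, r_cdist))
```

Note that this sketch builds a centered copy of `many`, so it needs roughly as much extra memory as the array itself; if that is too much, the same function could be applied to blocks of rows at a time.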
Here is a modified version of your script that also computes the correlation coefficients with cdist:
import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist
np.random.seed(0)
def correlateOneWithMany(one, many):
    """Return Pearson's correlation coef of 'one' with each row of 'many'."""
    # Column 0 holds the coefficient, column 1 the p-value.
    pr_arr = np.zeros((many.shape[0], 2), dtype=np.float64)
    pr_arr[:] = np.nan
    for row_num in np.arange(many.shape[0]):
        pr_arr[row_num, :] = sps.pearsonr(one, many[row_num, :])
    return pr_arr
obs, varz = 10 ** 3, 500
X = np.random.uniform(size=(obs, varz))
pr = correlateOneWithMany(X[0, :], X)
c = 1 - cdist(X[0:1, :], X, metric='correlation')[0]
print(np.allclose(c, pr[:, 0]))
Timing:
In [133]: %timeit correlateOneWithMany(X[0, :], X)
10 loops, best of 3: 37.7 ms per loop
In [134]: %timeit 1 - cdist(X[0:1, :], X, metric='correlation')[0]
1000 loops, best of 3: 1.11 ms per loop
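One difference to keep in mind: the cdist expression yields only the coefficients, whereas the original function also stores the p-value returned by pearsonr. If the p-values are needed, they can probably be recovered from r afterwards. The following is a rough sketch, not part of the answer above; it assumes the usual two-sided test based on a Student t statistic with n - 2 degrees of freedom, and the helper name `correlate_with_pvalues` and the small `data` array are hypothetical.

```python
import numpy as np
import scipy.stats as sps
from scipy.spatial.distance import cdist


def correlate_with_pvalues(one, many):
    """Return an (n_rows, 2) array of [r, two-sided p-value] (hypothetical helper)."""
    n = one.shape[0]
    r = 1 - cdist(one[np.newaxis, :], many, metric="correlation")[0]
    # Guard against rounding that lands just outside [-1, 1].
    r = np.clip(r, -1.0, 1.0)
    # t statistic with n - 2 degrees of freedom; |r| == 1 gives t = +/-inf.
    with np.errstate(divide="ignore"):
        t = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    # Two-sided p-value from the survival function of the t distribution.
    p = 2 * sps.t.sf(np.abs(t), df=n - 2)
    return np.column_stack((r, p))


rng = np.random.default_rng(0)
data = rng.uniform(size=(10, 30))
out = correlate_with_pvalues(data[0], data)

# Compare against the row-by-row reference from scipy.stats.pearsonr.
ref = np.array([tuple(sps.pearsonr(data[0], row)) for row in data])
print(np.allclose(out, ref))
```

The result has the same two-column layout as the array returned by correlateOneWithMany, so it could stand in for it where that layout is expected.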