有谁知道如何从python中的一个非常大的稀疏矩阵计算相关矩阵?基本上,我正在寻找类似numpy.corrcoef
的东西,它可以在scipy稀疏矩阵上工作。
答案 0 :(得分:9)
您可以从协方差矩阵中直接计算相关系数,如下所示:
import numpy as np
from scipy import sparse
def sparse_corrcoef(A, B=None):
if B is not None:
A = sparse.vstack((A, B), format='csr')
A = A.astype(np.float64)
n = A.shape[1]
# Compute the covariance matrix
rowsum = A.sum(1)
centering = rowsum.dot(rowsum.T.conjugate()) / n
C = (A.dot(A.T.conjugate()) - centering) / (n - 1)
# The correlation coefficients are given by
# C_{i,j} / sqrt(C_{i} * C_{j})
d = np.diag(C)
coeffs = C / np.sqrt(np.outer(d, d))
return coeffs
检查它是否正常:
# some smallish sparse random matrices
a = sparse.rand(100, 100000, density=0.1, format='csr')
b = sparse.rand(100, 100000, density=0.1, format='csr')
coeffs1 = sparse_corrcoef(a, b)
coeffs2 = np.corrcoef(a.todense(), b.todense())
print(np.allclose(coeffs1, coeffs2))
# True
计算协方差矩阵C
所需的内存量将在很大程度上取决于A
(和B
的稀疏性结构,如果给出的话)。例如,如果A
是仅包含非零值的单列的(m, n)
矩阵,那么C
将是(n, n)
矩阵所有非零值。如果n
很大,那么就内存消耗而言,这可能是个坏消息。
答案 1 :(得分:6)
只使用numpy:
import numpy as np
C=((A.T*A -(sum(A).T*sum(A)/N))/(N-1)).todense()
V=np.sqrt(np.mat(np.diag(C)).T*np.mat(np.diag(C)))
COV = np.divide(C,V+1e-119)
答案 2 :(得分:0)
不幸的是,Alt's answer对我来说没有效果。给np.sqrt
函数的值大多为负值,因此得到的协方差值为nan。
我也无法使用ali_m's answer,因为我的矩阵太大了,无法将centering = rowsum.dot(rowsum.T.conjugate()) / n
矩阵放入内存中(我的矩阵尺寸为:3.5 * 10 ^ 6 x 33)
相反,我使用scikit-learn's StandardScaler
计算标准稀疏矩阵,然后使用乘法获得相关矩阵。
from sklearn.preprocessing import StandardScaler
def compute_sparse_correlation_matrix(A):
scaler = StandardScaler(with_mean=False)
scaled_A = scaler.fit_transform(A) # Assuming A is a CSR or CSC matrix
corr_matrix = (1/scaled_A.shape[0]) * (scaled_A.T @ scaled_A)
return corr_matrix
我相信这种方法比其他提到的方法更快,更可靠。此外,它还保留了输入矩阵的稀疏模式。
答案 3 :(得分:0)
我提出了一个并行运行的稀疏矩阵的答案。而不是返回巨大的相关矩阵,而是返回在检查所有字段的正和负Pearson相关性之后要保留的字段特征掩码。
我还尝试使用以下策略来最小化计算:
通过保留标记为要删除的列的全局列表并跳过此类列的进一步相关性计算,这可能会进一步加速,因为列将无序执行。但是,我对python中的竞争条件了解不足,无法今晚实现。
显然,与返回整个相关矩阵相比,返回一个列掩码将使代码处理更大的数据集。
使用此功能检查每列:
def get_corr_row(idx_num, sp_mat, thresh):
# slice the column at idx_num
cols = sp_mat.shape[1]
x = sp_mat[:,idx_num].toarray().ravel()
start = idx_num + 1
# Now slice each column to the right of idx_num
for i in range(start, cols):
y = sp_mat[:,i].toarray().ravel()
# Check the pearson correlation
corr, pVal = pearsonr(x,y)
# Pearson ranges from -1 to 1.
# We check both positive and negative correlations >= thresh using abs(corr)
if abs(corr) >= thresh:
# stop checking after finding the 1st correlation > thresh
return False
# Mark column at idx_num for removal in the mask
return True
并行运行列级相关性检查:
from joblib import Parallel, delayed
import multiprocessing
def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
# we must make sure the matrix is in csc format
# before we start doing all these column slices!
sp_mat = sp_mat.tocsc()
cols = sp_mat.shape[1]
if n_jobs == -1:
# Process the work on all available CPU cores
num_cores = multiprocessing.cpu_count()
else:
# Process the work on the specified number of CPU cores
num_cores = n_jobs
# Return a mask of all columns to keep by calling get_corr_row()
# once for each column in the matrix
return Parallel(n_jobs=num_cores, verbose=5)(delayed(get_corr_row)(i, sp_mat, thresh)for i in range(cols))
常规用法:
#Get the mask using your sparse matrix and threshold.
corr_mask = Get_Corr_Mask(X_t_fpr, 0.95)
# Remove features that are >= 95% correlated
X_t_fpr_corr = X_t_fpr[:,corr_mask]