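I want to compute the Matthews correlation coefficient (MCC) between two matrices A and B. I loop over the columns of A, compute the MCC between that column and each of the 2000 rows of B, and then take the index of the maximum. The code is: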
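import numpy as np
import pandas as pd
from sklearn.metrics import matthews_corrcoef as mcc

# squeeze=True is dropped here: it was removed in pandas 2.0 and has no
# effect on a multi-column CSV anyway.
A = pd.read_csv('A.csv')
B = pd.read_csv('B.csv')

ind = {}
for col in A:
    # MCC of this column of A against each of the 2000 rows of B
    ind[col] = np.argmax(list(mcc(B.iloc[i], A[col]) for i in range(2000)))
    print(ind[col])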
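My problem is that this takes a really long time (about one second per column). I have seen almost identical code in R run much faster (e.g. 5 seconds). How can that be? Can I improve my Python code?
R code: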
A <- as.matrix(read.csv(file='A.csv'))
B <- t(as.matrix(read.csv(file='B.csv', check.names = FALSE)))
library('mccr')
C <- rep(NA, ncol(A))
for (query in 1:ncol(A)) {
  mcc <- sapply(1:ncol(B), function(i)
    mccr(A[, query], B[, i]))
  C[query] <- which.max(mcc)
}
Answer 0 (score: 1)
Maybe try this in Python, using numpy and dot products:
import numpy as np

def compute_mcc(true_labels, pred_labels):
    """Compute the Matthews correlation coefficient for all pairs of samples.

    :param true_labels: 2D integer array (features x samples)
    :param pred_labels: 2D integer array (features x samples)
    :return: mcc (true samples x pred samples)
    """
    # prep inputs for confusion matrix calculations
    pred_labels_1 = pred_labels == 1
    pred_labels_0 = pred_labels == 0
    true_labels_1 = true_labels == 1
    true_labels_0 = true_labels == 0

    # dot product of binary matrices counts co-occurrences for every pair
    confusion_dot = lambda a, b: np.dot(a.T.astype(int), b.astype(int)).T
    TP = confusion_dot(pred_labels_1, true_labels_1)
    TN = confusion_dot(pred_labels_0, true_labels_0)
    FP = confusion_dot(pred_labels_1, true_labels_0)
    FN = confusion_dot(pred_labels_0, true_labels_1)

    mcc = (TP * TN) - (FP * FN)
    denom = np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

    # avoid dividing by 0
    denom[denom == 0] = 1
    return mcc / denom
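To connect this back to the question, here is a rough sketch of how the dot-product approach might replace the double loop. It is only an illustration under some assumptions: the random 0/1 arrays stand in for A.csv and B.csv (whose real shapes are not given, apart from B having 2000 rows), and `pairwise_mcc` is a hypothetical compact restatement of the same calculation, done in floating point so the products in the denominator cannot overflow an integer type.

```python
import numpy as np

def pairwise_mcc(x, y):
    """MCC between every column of x and every column of y.

    x, y: 2D arrays of 0/1 values with the same number of rows
    (observations x variables). Returns shape (x columns, y columns).
    Hypothetical helper name, used only for this sketch.
    """
    x1 = (x == 1).astype(np.float64)
    x0 = (x == 0).astype(np.float64)
    y1 = (y == 1).astype(np.float64)
    y0 = (y == 0).astype(np.float64)

    # Each matrix product counts co-occurrences for all column pairs at once.
    tp = x1.T @ y1
    tn = x0.T @ y0
    fp = x0.T @ y1
    fn = x1.T @ y0

    num = tp * tn - fp * fn
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    denom[denom == 0] = 1  # avoid dividing by 0, as in the function above
    return num / denom

# Stand-in data (assumed shapes): A is observations x columns,
# B is 2000 rows x observations, as implied by the loop in the question.
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(50, 4))
B = rng.integers(0, 2, size=(2000, 50))
B[1234] = A[:, 2]  # plant an exact match so the result can be checked

scores = pairwise_mcc(A, B.T)   # one row per column of A, one column per row of B
best = scores.argmax(axis=1)    # index of the best-matching row of B per column of A
print(best)
```

With real data, `A.to_numpy()` and `B.to_numpy().T` would take the place of the random arrays. The point of the design is that four matrix products replace 2000 separate `matthews_corrcoef` calls per column, each of which carries its own input-validation overhead.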