Computing a special correlation distance matrix faster

Time: 2014-08-01 09:30:00

Tags: python matrix pandas distance correlation

I want to build a distance matrix using the Pearson correlation distance. I first tried scipy.spatial.distance.pdist(df, 'correlation'), which is very fast on my dataset of 5000 rows * 20 features.

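Since I want to build a recommender, I would like to change the distance slightly so that it only considers the features for which both users have a non-NaN value. In fact, scipy.spatial.distance.pdist(df, 'correlation') outputs NaN whenever it encounters a feature whose value is float('nan').

Here is my code; df is my 5000 * 20 pandas DataFrame:

import math
import scipy.spatial.distance

dist_mat = []
d = df.shape[1]
for i, row_i in enumerate(df.itertuples()):
    for j, row_j in enumerate(df.itertuples()):
        if i < j:
            print(i, j)
            # Tuple positions where both rows are non-NaN (offset by 1: itertuples() puts the index first)
            ind = [t + 1 for t in range(d) if not (math.isnan(row_i[t + 1]) or math.isnan(row_j[t + 1]))]
            dist_mat.append(scipy.spatial.distance.correlation([row_i[t] for t in ind], [row_j[t] for t in ind]))

This code works, but it is extremely slow compared to scipy.spatial.distance.pdist(df, 'correlation'). My question is: how can I improve my code so that it runs faster? Or where can I find a library that computes the correlation between two vectors using only the features present in both of them?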

Thanks for your answers.

1 Answer:

Answer 0: (score: 2)

I think you need to do this with Cython. Here is an example:

#cython: boundscheck=False, wraparound=False, cdivision=True

import numpy as np

cdef extern from "math.h":
    bint isnan(double x)
    double sqrt(double x)

def pair_correlation(double[:, ::1] x):
    cdef double[:, ::] res = np.empty((x.shape[0], x.shape[0]))
    cdef double u, v
    cdef int i, j, k, count
    cdef double du, dv, d, n, r, um, vm
    cdef double sum_u, sum_v, sum_u2, sum_v2, sum_uv

    for i in range(x.shape[0]):
        for j in range(i, x.shape[0]):
            sum_u = sum_v = sum_u2 = sum_v2 = sum_uv = 0.0
            count = 0            
            for k in range(x.shape[1]):
                u = x[i, k]
                v = x[j, k]
                if u == u and v == v:
                    sum_u += u
                    sum_v += v
                    sum_u2 += u*u
                    sum_v2 += v*v
                    sum_uv += u*v
                    count += 1
            if count == 0:
                res[i, j] = res[j, i] = -9999
                continue

            um = sum_u / count
            vm = sum_v / count
            n = sum_uv - sum_u * vm - sum_v * um + um * vm * count
            du = sqrt(sum_u2 - 2 * sum_u * um + um * um * count) 
            dv = sqrt(sum_v2 - 2 * sum_v * vm + vm * vm * count)
            r = 1 - n / (du * dv)
            res[i, j] = res[j, i] = r
    return res.base
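For readers who cannot compile Cython, the same running-sum idea can be sketched in plain NumPy by replacing the inner loop with matrix products against a validity mask. This is only an illustrative sketch, not part of the answer above: the name `pair_correlation_numpy` and the `fill` parameter are hypothetical, and it builds several n x n temporary arrays, so memory use should be checked before running it on 5000 rows.

```python
import numpy as np

def pair_correlation_numpy(x, fill=-9999.0):
    """Correlation distance over the features that are non-NaN in both rows."""
    x = np.asarray(x, dtype=float)
    valid = ~np.isnan(x)
    m = valid.astype(float)          # 1.0 where a value is present
    x0 = np.where(valid, x, 0.0)     # NaNs replaced by 0 so they drop out of sums

    count = m @ m.T                  # number of shared features per pair
    sum_u = x0 @ m.T                 # sum of row i over features shared with row j
    sum_v = sum_u.T                  # sum of row j over features shared with row i
    sum_u2 = (x0 * x0) @ m.T
    sum_v2 = sum_u2.T
    sum_uv = x0 @ x0.T

    # Pairs with no shared features or zero variance produce 0/0 here.
    with np.errstate(divide="ignore", invalid="ignore"):
        n = sum_uv - sum_u * sum_v / count
        du2 = sum_u2 - sum_u * sum_u / count
        dv2 = sum_v2 - sum_v * sum_v / count
        dist = 1.0 - n / np.sqrt(du2 * dv2)

    dist[count == 0] = fill          # same sentinel as the Cython version
    return dist

# Small usage example: the third row shares no feature with any row.
x_small = np.array([[1.0, 2.0, np.nan, 4.0, 7.0],
                    [2.0, np.nan, 1.0, 3.0, 5.0],
                    [np.nan, np.nan, np.nan, np.nan, np.nan]])
d_small = pair_correlation_numpy(x_small)
```

The centered sums are algebraically the same as in the Cython loop, since `sum_uv - sum_u * vm - sum_v * um + um * vm * count` reduces to `sum_uv - sum_u * sum_v / count`.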

Check the output when there are no NaNs:

import numpy as np
from scipy.spatial.distance import pdist, squareform, correlation
x = np.random.rand(2000, 20)
np.allclose(pair_correlation(x), squareform(pdist(x, "correlation")))

Check the output with NaNs:

x = np.random.rand(2000, 20)
x[x < 0.3] = np.nan
r = pair_correlation(x)

i, j = 200, 60 # change this
mask = ~(np.isnan(x[i]) | np.isnan(x[j]))
u = x[i, mask]
v = x[j, mask]
assert abs(correlation(u, v) - r[i, j]) < 1e-12
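As a further cross-check that needs no compilation, pandas' `DataFrame.corr` excludes NaN values pair by pair, which matches the "only features present in both users" requirement from the question. The sketch below is a possible alternative rather than the answer's method: the helper name `correlation_distance_pandas` is hypothetical, pairs with too few shared features come back as NaN instead of -9999, and its speed relative to the Cython version would need to be timed on the real data.

```python
import numpy as np
import pandas as pd

def correlation_distance_pandas(df, min_periods=2):
    """1 - Pearson correlation between rows, using pairwise-complete features."""
    # DataFrame.corr correlates columns, so transpose to put users in columns.
    corr = df.T.corr(method="pearson", min_periods=min_periods)
    return 1.0 - corr

# Small usage example: the two users share features 0, 3 and 4.
df_small = pd.DataFrame([[1.0, 2.0, np.nan, 4.0, 7.0],
                         [2.0, np.nan, 1.0, 3.0, 5.0]])
dist_small = correlation_distance_pandas(df_small)
```

The result is a square DataFrame indexed by the original row labels, so `dist_small.to_numpy()` gives an array comparable to the output of `pair_correlation`.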