Question

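Following up on this question: Compute the pairwise distance in scipy with missing values

Test case: I want to compute the pairwise distances between series of different lengths that are grouped together, and I have to do it as efficiently as possible (using Euclidean distance).

One way to make it work could be: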

import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist

a = pd.DataFrame(np.random.rand(10, 4), columns=['a','b','c','d'])
a.loc[0, 'a'] = np.nan
a.loc[1, 'a'] = np.nan
a.loc[0, 'c'] = np.nan
a.loc[1, 'c'] = np.nan

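def dropna_on_the_fly(x, y):
    # Euclidean distance that skips coordinates where either value is NaN
    return np.sqrt(np.nansum((x - y)**2))

# pass the underlying ndarray; the callback is invoked once per pair of rows
pdist(a.values, dropna_on_the_fly)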

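But I suspect this is very inefficient, since the built-in metrics of pdist are optimized internally, whereas a custom function is simply passed through and called as-is.

I have a hunch that there is a vectorized solution in numpy, where I broadcast the subtraction and then use np.nansum for a NaN-resistant sum, but I'm not sure how to proceed.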

Answer 1

Inspired by this post, there are two possible solutions.

Approach #1: A vectorized solution would be:

ar = a.values
r,c = np.triu_indices(ar.shape[0],1)
out = np.sqrt(np.nansum((ar[r] - ar[c])**2,1))
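As a sanity check, the vectorized result can be compared against the pdist callback from the question. This is only a sketch: the random data, the seed, and the helper name nan_euclidean are made up for illustration. It relies on np.triu_indices(n, 1) enumerating pairs in the same order as pdist's condensed output, so squareform can expand the result into a full matrix. Note that ar[r] and ar[c] each hold n*(n-1)/2 rows, so memory grows quadratically with the number of rows.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Illustrative data (assumed): 10 rows, 4 columns, a few NaNs
rng = np.random.default_rng(0)
ar = rng.random((10, 4))
ar[0, [0, 2]] = np.nan
ar[1, [0, 2]] = np.nan

def nan_euclidean(x, y):
    # Same callback as in the question: NaN terms are dropped from the sum
    return np.sqrt(np.nansum((x - y) ** 2))

# Approach #1: index every (i, j) pair with i < j at once
r, c = np.triu_indices(ar.shape[0], 1)
vectorized = np.sqrt(np.nansum((ar[r] - ar[c]) ** 2, axis=1))

# Reference: let pdist call the Python function pair by pair
reference = pdist(ar, nan_euclidean)
matches = np.allclose(vectorized, reference)
print(matches)

# Expand the condensed vector into a symmetric n x n matrix
square = squareform(vectorized)
print(square.shape)
```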

Approach #2: A more memory-efficient and faster option for large arrays:

ar = a.values
b = np.where(np.isnan(ar),0,ar)

mask = ~np.isnan(ar)
n = b.shape[0]
N = n*(n-1)//2
idx = np.concatenate(( [0], np.arange(n-1,0,-1).cumsum() ))
start, stop = idx[:-1], idx[1:]
out = np.empty((N),dtype=b.dtype)
for j,i in enumerate(range(n-1)):
    dif = b[i,None] - b[i+1:]
    mask_j = (mask[i] & mask[i+1:])
    masked_vals = mask_j * dif
    out[start[j]:stop[j]] = np.einsum('ij,ij->i',masked_vals, masked_vals)
    # or simply: ((mask_j * dif)**2).sum(1)

out = np.sqrt(out)
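For reuse, the loop above could be wrapped in a function and checked against Approach #1. The sketch below is one way to do that; the function names nan_pdist_loop and nan_pdist_full and the test data are hypothetical, not part of the original answer. It also shows one behaviour worth knowing about: two rows that share no non-NaN column get a distance of 0, because every term of the sum is dropped.

```python
import numpy as np

def nan_pdist_loop(ar):
    # Approach #2 as a function (hypothetical name): one row against the rest
    b = np.where(np.isnan(ar), 0, ar)
    mask = ~np.isnan(ar)
    n = b.shape[0]
    # Offsets of each row's block in the condensed output
    idx = np.concatenate(([0], np.arange(n - 1, 0, -1).cumsum()))
    out = np.empty(n * (n - 1) // 2, dtype=b.dtype)
    for i in range(n - 1):
        # Zero out differences where either value was NaN
        dif = (b[i] - b[i + 1:]) * (mask[i] & mask[i + 1:])
        out[idx[i]:idx[i + 1]] = np.einsum('ij,ij->i', dif, dif)
    return np.sqrt(out)

def nan_pdist_full(ar):
    # Approach #1 as a function (hypothetical name): all pairs at once
    r, c = np.triu_indices(ar.shape[0], 1)
    return np.sqrt(np.nansum((ar[r] - ar[c]) ** 2, axis=1))

# Illustrative data (assumed): roughly 20% of the entries set to NaN
rng = np.random.default_rng(0)
ar = rng.random((50, 6))
ar[rng.random(ar.shape) < 0.2] = np.nan

same = np.allclose(nan_pdist_loop(ar), nan_pdist_full(ar))
print(same)

# Rows with no column in common: every term is masked, so the result is 0
disjoint = np.array([[1.0, np.nan], [np.nan, 5.0]])
print(nan_pdist_loop(disjoint))  # [0.]
```

If a distance of 0 is misleading for rows with no overlap, one option is to count the shared columns per pair from the same mask and treat pairs with a count of zero separately.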
