Question

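I am computing a kind of squared Euclidean distance matrix, in which the distance between two samples is penalized according to certain dissimilarities between them.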

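Using itertools.combinations, a dataset of 500 points yields ~125,000 pairwise distances, and for each of them I check which of 4 possible cases the dissimilarity of the two samples falls into. For 500 six-dimensional samples this takes about 5-6 minutes. If I have to process 1000 ten-dimensional samples, it becomes far too long, especially if I want to do parameter optimization.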
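
For reference, the pair counts quoted above follow from n(n-1)/2 unordered pairs among n points. A minimal check, where the variable names are only illustrative:

```python
import itertools
import math

n_samples = 500

# Number of unordered pairs among n points: n * (n - 1) / 2
n_pairs = math.comb(n_samples, 2)

# itertools.combinations enumerates exactly these pairs
n_enumerated = sum(1 for _ in itertools.combinations(range(n_samples), 2))

print(n_pairs, n_enumerated)   # 124750 124750
print(math.comb(1000, 2))      # 499500 pairs for 1000 samples
```

Doubling the number of samples roughly quadruples the number of pairs, which is why the per-pair cost matters so much here.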

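Here is my code. delta is the dataset (e.g. shape [500,6]); w1, w2, w3 are the weights for the Euclidean distance (w3) and the two kinds of penalty (w1 & w2). The dissimilarities I look for are: whether the two samples share the same subspace, e.g. [1,0,1] and [3,0,-1] do, but [0,0,1] does not; and opposite signs within the shared subspace, e.g. [1,0,1] and [3,0,-1] share the same subspace but have different signs on one of the shared coordinates.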

It is probably not necessary to understand these conditions in order to optimize the code.

import itertools

import numpy as np
import pandas as pd
from scipy.spatial import distance
from scipy.spatial.distance import squareform

# delta: DataFrame of shape (n_samples, n_dims); w1, w2, w3: scalar weights
n_dims = delta.shape[1]
dig = {}
for i, j in itertools.combinations(range(delta.shape[0]), 2):
    xi, xj = delta.iloc[i], delta.iloc[j]

    # 1 where both coordinates are zero or both are nonzero, 0 otherwise
    simi = (xi.astype(bool) == xj.astype(bool)) * 1

    nsimi = sum(simi)  # dimension of the common subspace (just the sum of 1's)

    idx = simi > 0  # boolean mask of this common subspace (place of these 1's)

    # among those shared coordinates, how many don't have the same sign
    nsigns = sum((np.sign(xi[idx]) != np.sign(xj[idx])) * 1)

    sqdist = distance.sqeuclidean(xi, xj)

    # case 1: same subspace and same signs: no penalization
    if nsimi == n_dims and nsigns == 0:
        dig[(i, j)] = w3 * sqdist

    # case 2: same subspace, but not same sign for shared coordinates
    elif nsimi == n_dims and nsigns != 0:
        dig[(i, j)] = w2 * nsigns + w3 * sqdist

    # case 3: not same subspace, but shared coordinates of same sign
    elif nsimi != n_dims and nsigns == 0:
        dig[(i, j)] = w1 * (n_dims - nsimi) + w3 * sqdist

    # case 4: neither same subspace nor same sign for shared coordinates
    else:
        dig[(i, j)] = w1 * (n_dims - nsimi) + w2 * nsigns + w3 * sqdist

# sorted (i, j) keys follow the condensed order that squareform expects
dist_list = [dist[1] for dist in sorted(dig.items())]
distpen = pd.DataFrame(squareform(dist_list))

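I am looking for a faster way to compute this, either by finding another way to check which of the four cases a pair's dissimilarity falls into, or by finding another way to build the data (right now I fill a dictionary and then convert it into a distance matrix).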

Thanks.

Answer 1

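Solved. The key was to use numpy arrays instead of pandas DataFrames, which alone is twice as fast; to create a function that computes the distance between two points i, j; then to apply numpy vectorize to this function over a triangular array, and to let squareform build the distance matrix from the result.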
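
The steps above could look roughly like the following. This is a sketch under assumptions, not the exact code I ended up with: the names penalized_sqdist and penalized_distance_matrix are made up for illustration, and the "triangular array" is assumed to be the index pairs from np.triu_indices. It also relies on the observation that the four cases reduce to one formula, since a penalty term is simply zero when its condition does not apply.

```python
import numpy as np
from scipy.spatial.distance import squareform


def penalized_sqdist(x, y, w1, w2, w3):
    """Penalized squared Euclidean distance between two 1-D arrays."""
    # True where both coordinates are zero or both are nonzero
    same_support = (x != 0) == (y != 0)
    # number of coordinates outside the common subspace
    n_mismatch = x.size - np.count_nonzero(same_support)
    # sign disagreements among the coordinates of the common subspace
    n_signs = np.count_nonzero(
        np.sign(x[same_support]) != np.sign(y[same_support])
    )
    diff = x - y
    return w1 * n_mismatch + w2 * n_signs + w3 * np.dot(diff, diff)


def penalized_distance_matrix(data, w1, w2, w3):
    """Full symmetric distance matrix built from the upper triangle."""
    data = np.asarray(data, dtype=float)
    # index pairs (i, j) with i < j, in the condensed order squareform expects
    rows, cols = np.triu_indices(data.shape[0], k=1)
    pair_dist = np.vectorize(
        lambda i, j: penalized_sqdist(data[i], data[j], w1, w2, w3),
        otypes=[float],
    )
    return squareform(pair_dist(rows, cols))


# The three example vectors from the question
data = np.array([[1.0, 0.0, 1.0], [3.0, 0.0, -1.0], [0.0, 0.0, 1.0]])
print(penalized_distance_matrix(data, w1=10.0, w2=100.0, w3=1.0))
```

With a DataFrame, passing delta.to_numpy() (or delta.values) gives the array this sketch expects.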

For a dataset of 1000 six-dimensional samples, this cut the computation time from 20 minutes to 20 seconds.
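
A possible further step, which is not what I did above and which I have not timed: np.vectorize still calls the Python function once per pair, so the same quantities could instead be computed for all pairs at once with broadcasting. The sketch below assumes the data fits in memory as an (n, n, d) array, and the function name is hypothetical.

```python
import numpy as np


def penalized_distance_matrix_broadcast(data, w1, w2, w3):
    """All-pairs penalized squared distances via NumPy broadcasting."""
    x = np.asarray(data, dtype=float)
    nonzero = x != 0
    sign = np.sign(x)

    # Pairwise comparisons, each of shape (n, n, d)
    same_support = nonzero[:, None, :] == nonzero[None, :, :]
    sign_differs = sign[:, None, :] != sign[None, :, :]

    # Coordinates outside the common subspace, per pair
    n_mismatch = x.shape[1] - same_support.sum(axis=2)
    # Sign disagreements inside the common subspace, per pair
    n_signs = (same_support & sign_differs).sum(axis=2)

    diff = x[:, None, :] - x[None, :, :]
    sqdist = (diff ** 2).sum(axis=2)

    # Symmetric (n, n) matrix with a zero diagonal
    return w1 * n_mismatch + w2 * n_signs + w3 * sqdist


data = np.array([[1.0, 0.0, 1.0], [3.0, 0.0, -1.0], [0.0, 0.0, 1.0]])
print(penalized_distance_matrix_broadcast(data, w1=10.0, w2=100.0, w3=1.0))
```

This returns the full square matrix directly, so squareform is not needed; the trade-off is the temporary (n, n, d) arrays.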

Fast distance matrix computation with a condition check for each pair
