Speeding up pairwise distance matrix computation

Asked: 2020-11-05 10:01:29

Tags: python performance numpy loops scipy

I am using cdist from SciPy to compute pairwise distances between the elements of two 1-D arrays of strings. This is how I use it:

import numpy as np
import textdistance
from scipy.spatial.distance import cdist
from time import time


first_ = np.array(["hello gtjj rgreg", "hellllo  zefze ergee"])
second_ = np.array(["hlo asad gerg", "alle gtrhh gerg"])
first_ = np.tile(first_, 100)
second_ = np.tile(second_, 100)

start_time = time()

mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.cosine(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.hamming.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.prefix.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.postfix.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.jaro_winkler(a[0], b[0]))

execution_time = time() - start_time

print(execution_time)

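I then wanted to compute the distance matrices faster, so I looked at the cdist source code and tried the following, using the loop that builds the matrix: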

start_time = time()


XA, XB = second_[:, np.newaxis], first_[:, np.newaxis]
s, sB = XA.shape, XB.shape

mA = s[0]
mB = sB[0]

dm1 = np.empty((mA, mB), dtype=np.double)
dm2 = np.empty((mA, mB), dtype=np.double)
dm3 = np.empty((mA, mB), dtype=np.double)
dm4 = np.empty((mA, mB), dtype=np.double)
dm5 = np.empty((mA, mB), dtype=np.double)

# Fill all five matrices in a single pass over every (i, j) pair.
for i in range(mA):
    for j in range(mB):
        dm1[i, j] = textdistance.cosine(XA[i][0], XB[j][0])
        dm2[i, j] = textdistance.hamming.normalized_distance(XA[i][0], XB[j][0])
        dm3[i, j] = textdistance.prefix.normalized_distance(XA[i][0], XB[j][0])
        dm4[i, j] = textdistance.postfix.normalized_distance(XA[i][0], XB[j][0])
        dm5[i, j] = textdistance.jaro_winkler(XA[i][0], XB[j][0])

execution_time = time() - start_time

print(execution_time)

However, after trying both solutions, the execution times are nearly identical. Can anyone see a way to speed up the computation of all my matrices?
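One direction that might help, offered as a sketch rather than a definitive answer: both versions above call the Python-level metric once per (i, j) pair, so the number of metric calls is likely what dominates the runtime. The sample arrays are built with np.tile from two strings each, so only 2 x 2 distinct pairs exist. The sketch below evaluates the metric once per distinct pair and expands the result with the inverse indices returned by np.unique. The names prefix_distance and pairwise_on_unique are hypothetical; prefix_distance is a pure-Python stand-in so the example runs without textdistance. With textdistance installed, a callable such as textdistance.prefix.normalized_distance could presumably be passed as the metric instead.

```python
import numpy as np


def prefix_distance(a: str, b: str) -> float:
    """Hypothetical stand-in metric: 1 - (common prefix length / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    common = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        common += 1
    return 1.0 - common / longest


def pairwise_on_unique(xa, xb, metric):
    """Call `metric` once per distinct (a, b) pair, then expand to full size."""
    ua, inv_a = np.unique(xa, return_inverse=True)
    ub, inv_b = np.unique(xb, return_inverse=True)

    # Small matrix holding one result per pair of distinct strings.
    small = np.empty((len(ua), len(ub)), dtype=np.double)
    for i, a in enumerate(ua):
        for j, b in enumerate(ub):
            small[i, j] = metric(str(a), str(b))

    # Index the small matrix with the inverse maps to rebuild the full matrix.
    return small[np.ix_(np.ravel(inv_a), np.ravel(inv_b))]


first_ = np.tile(np.array(["hello gtjj rgreg", "hellllo  zefze ergee"]), 100)
second_ = np.tile(np.array(["hlo asad gerg", "alle gtrhh gerg"]), 100)

dm = pairwise_on_unique(second_, first_, prefix_distance)
print(dm.shape)  # (200, 200)
```

On the sample data this reduces the metric calls from 200 * 200 to 4 per matrix. The gain on real data depends entirely on how many duplicate strings it contains; if every string is distinct, this does the same amount of work as the original loop.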

0 answers:

No answers