I am using cdist from SciPy to compute pairwise distances between the elements of two 1-D arrays of strings. This is how I use it:
import numpy as np
import textdistance
from scipy.spatial.distance import cdist
from time import time
first_ = np.array(["hello gtjj rgreg", "hellllo zefze ergee"])
second_ = np.array(["hlo asad gerg", "alle gtrhh gerg"])
first_ = np.tile(first_, 100)
second_ = np.tile(second_, 100)
start_time = time()
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.cosine(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.hamming.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.prefix.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.postfix.normalized_distance(a[0], b[0]))
mat_to_compare = cdist(second_[:, np.newaxis], first_[:, np.newaxis], lambda a, b: textdistance.jaro_winkler(a[0], b[0]))
execution_time = time() - start_time
print(execution_time)
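Then, because I wanted to compute the distance matrices faster, I looked at the cdist source code and tried the following version, which builds the matrices in an explicit loop: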
start_time = time()
XA, XB = second_[:, np.newaxis], first_[:, np.newaxis]
s, sB = XA.shape, XB.shape
mA = s[0]
mB = sB[0]
dm1 = np.empty((mA, mB), dtype=np.double)
dm2 = np.empty((mA, mB), dtype=np.double)
dm3 = np.empty((mA, mB), dtype=np.double)
dm4 = np.empty((mA, mB), dtype=np.double)
dm5 = np.empty((mA, mB), dtype=np.double)
for i in range(0, mA):
    for j in range(0, mB):
        dm1[i, j] = textdistance.cosine(XA[i][0], XB[j][0])
        dm2[i, j] = textdistance.hamming.normalized_distance(XA[i][0], XB[j][0])
        dm3[i, j] = textdistance.prefix.normalized_distance(XA[i][0], XB[j][0])
        dm4[i, j] = textdistance.postfix.normalized_distance(XA[i][0], XB[j][0])
        dm5[i, j] = textdistance.jaro_winkler(XA[i][0], XB[j][0])
execution_time = time() - start_time
print(execution_time)
However, after trying both solutions, the execution times are almost the same. Can anyone see a way to speed up the computation of all my matrices?
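For context, both versions call the Python metric once per pair of strings, which probably explains the similar timings: when cdist is given a callable, it still invokes that callable for every pair. One direction that might help with data like the above, where np.tile produces many repeated strings, is to evaluate the metric only on the distinct strings and then expand the result. The following is only a minimal sketch under that assumption. The names normalized_hamming and pairwise_unique are hypothetical, and normalized_hamming is a plain-Python stand-in used so the example runs without textdistance; any of the textdistance callables could be passed instead.

```python
import numpy as np


def normalized_hamming(a: str, b: str) -> float:
    """Stand-in metric (hypothetical): share of positions that differ,
    counting the length difference as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / longest


def pairwise_unique(rows, cols, metric):
    """Call `metric` once per distinct (row, col) pair, then expand
    the small result back to the full len(rows) x len(cols) matrix."""
    u_rows, inv_rows = np.unique(rows, return_inverse=True)
    u_cols, inv_cols = np.unique(cols, return_inverse=True)

    # Distance matrix over the distinct strings only.
    small = np.empty((len(u_rows), len(u_cols)), dtype=np.double)
    for i, a in enumerate(u_rows):
        for j, b in enumerate(u_cols):
            small[i, j] = metric(str(a), str(b))

    # Look up every original pair in the small matrix.
    return small[np.ix_(inv_rows, inv_cols)]


first_ = np.tile(np.array(["hello gtjj rgreg", "hellllo zefze ergee"]), 100)
second_ = np.tile(np.array(["hlo asad gerg", "alle gtrhh gerg"]), 100)

# Only 2 x 2 metric calls are made instead of 200 x 200.
dm = pairwise_unique(second_, first_, normalized_hamming)
print(dm.shape)
```

This only pays off when the inputs contain duplicates. With mostly distinct strings the number of metric calls stays the same, and the remaining cost is the metric itself.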