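I have about 30,000 vectors, each with about 300 elements.
Given another vector with the same number of elements, how can I efficiently find the most (cosine-)similar vector?
Here is one implementation that uses a Python loop: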
from time import time
import numpy as np
vectors = np.load("np_array_of_about_30000_vectors.npy")
target = np.load("single_vector.npy")
print vectors.shape, vectors.dtype # (35196, 312) float32
print target.shape, target.dtype # (312,) float32
max_similarity = -np.inf  # start below any possible cosine similarity
max_index = -1
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print "done with loop in %s seconds" % (time() - start_time) # 0.466356039047 seconds
print "Most similar vector to target is index %s with %s" % (max_index, max_similarity) # index 2399 with 0.772758982696
The following removes the Python loop and is 44 times faster, but it computes something different:
print "starting max dot"
start_time = time()
print(np.max(np.dot(vectors, target)))
print "done with max dot in %s seconds" % (time() - start_time) # 0.0105748176575 seconds
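To make the difference concrete, here is a small sketch with made-up two-dimensional data (the values are illustrative assumptions, not the asker's vectors). The raw dot product grows with vector length, so its maximum can land on a different row than the maximum cosine similarity:

```python
import numpy as np

# Toy data chosen for illustration only.
vectors = np.array([[10.0, 0.0],   # long vector, 45 degrees away from target
                    [1.0, 1.0]])   # short vector, same direction as target
target = np.array([1.0, 1.0])

dots = vectors @ target
cosines = dots / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))

print(np.argmax(dots))     # 0: the raw dot product rewards the longer vector
print(np.argmax(cosines))  # 1: cosine similarity only looks at direction
```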
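Is there a way to get the speedup that comes from letting numpy do the iteration without losing the max-index logic and the division by the product of the norms? To optimize a computation like this, would it make sense to just do it in C?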
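Answer 0: (score: 3)
You have the right idea about avoiding the loop to gain performance. You can use argmin to get the index of the minimum distance.
I did, however, also switch the distance calculation to scipy's cdist. That way you can compute distances to multiple targets and choose from several distance metrics as needed.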
import numpy as np
from scipy.spatial import distance
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance
HTH.
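As a rough sketch of the "multiple targets" case mentioned above: cdist takes a 2-D array of queries and returns one row of distances per query. The random data, the shapes, and the names `targets`, `best_index` and `best_similarity` are assumptions made for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Synthetic stand-ins for the real data (shapes are assumptions).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 300))
targets = rng.normal(size=(5, 300))  # several query vectors at once

# One row per target, one column per candidate vector.
distances = distance.cdist(targets, vectors, "cosine")  # shape (5, 1000)

# Best match for each target: smallest cosine distance along each row.
best_index = np.argmin(distances, axis=1)
best_similarity = 1 - distances[np.arange(len(targets)), best_index]
```

Each entry of `best_index` is the row of `vectors` closest to the corresponding target, and `best_similarity` holds the matching cosine similarities.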
Answer 1: (score: 1)
Edit: @Deepak. If you do need the actual computed value, cdist is the fastest.
# Uses time, np, vectors and target from the setup code further down.
from scipy.spatial import distance
start_time = time()
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
print("done with loop in %s seconds" % (time() - start_time))
# The smallest cosine distance corresponds to the largest cosine similarity.
max_index = min_index
max_similarity = 1 - min_distance
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with loop in 0.013602018356323242 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
from time import time
import numpy as np
vectors = np.random.normal(0,100,(35196,300))
target = np.random.normal(0,100,(300))
start_time = time()
myvals = np.dot(vectors, target)
max_index = np.argmax(myvals)
max_similarity = myvals[max_index]
print("done with max dot in %s seconds" % (time() - start_time) )
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with max dot in 0.009701013565063477 seconds
Most similar vector to target is index 12187 with 645549.917200941
max_similarity = 1e-10
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
    if similarity > max_similarity:
        max_similarity = similarity
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with loop in 0.49567198753356934 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
def my_func(candidate,target):
    return np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
start_time = time()
out = np.apply_along_axis(my_func, 1, vectors,target)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with loop in 0.7495708465576172 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
start_time = time()
vnorm = np.linalg.norm(vectors,axis=1)
tnorm = np.linalg.norm(target)
# tnorm is a scalar, so it broadcasts against vnorm; no array of ones is needed
out = np.matmul(vectors,target)/(vnorm*tnorm)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
done with loop in 0.04306602478027344 seconds
Most similar vector to target is index 11001 with 0.2250217098612361
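For reference, the vectorized variant can be wrapped in a small helper that returns both the index and the similarity, which is what the question asks for. This is only a sketch: the function name `most_similar` and the random test data are assumptions, not part of the original answer.

```python
import numpy as np

def most_similar(vectors, target):
    """Return (index, cosine similarity) of the row of `vectors` closest to `target`."""
    dots = vectors @ target                       # one dot product per row
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(target)
    similarities = dots / norms                   # cosine similarity per row
    best = int(np.argmax(similarities))
    return best, float(similarities[best])

# Synthetic data, smaller than the asker's, for a quick check.
rng = np.random.default_rng(0)
vectors = rng.normal(0, 100, (2000, 300))
target = rng.normal(0, 100, 300)

index, similarity = most_similar(vectors, target)
print(index, similarity)
```

Because the norm of `target` is the same positive constant for every row, dividing by it does not change which index wins. It only matters if you need the actual similarity value.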
Answer 2: (score: 1)
According to my measurements, cosine_similarity from sklearn is the most optimized implementation.
import time
import numpy
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity
target = numpy.random.rand(100,300)
vectors = numpy.random.rand(10000,300)
start = time.time()
most_similar_sklearn = cosine_similarity(target, vectors)
print("Sklearn cosine_similarity: {} s".format(time.time()-start))
start = time.time()
most_similar_scipy = 1-cdist(target, vectors, 'cosine')
print("Scipy cdist: {} s".format(time.time()-start))
equals = numpy.allclose(most_similar_sklearn, most_similar_scipy)
print("Equal results: {}".format(equals))
Sklearn cosine_similarity: 0.05303549766540527 s
Scipy cdist: 0.44914913177490234 s
Equal results: True
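You can even get the same result with numpy alone, using matrix multiplication, since cosine similarity is defined as the dot product divided by the product of the norms.
However, it needs some preprocessing so that matmul can be applied.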
import time
import numpy
from sklearn.metrics.pairwise import cosine_similarity
target = numpy.random.rand(100,300)
vectors = numpy.random.rand(10000,300)
most_similar_sklearn = cosine_similarity(target, vectors)
start = time.time()
t_ext = target.reshape((100,300, 1))
v_ext = vectors.T.reshape((1,300,10000))
t_norm = numpy.linalg.norm(t_ext, axis=1)
v_norm = numpy.linalg.norm(v_ext, axis=1)
norm = t_norm @ v_norm
dat = target @ vectors.T
most_similar_numpy = dat / norm
print("Numpy matmul: {} s".format(time.time()-start))
equals = numpy.allclose(most_similar_sklearn, most_similar_numpy)
print("Equal results: {}".format(equals))
Numpy matmul: 0.055016279220581055 s
Equal results: True
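A slightly different way to do the preprocessing, sketched here as an alternative rather than as the method used above, is to scale every row to unit length first. The cosine similarity matrix is then a single matrix product. The helper name `normalize_rows` and the smaller array sizes are assumptions for illustration.

```python
import numpy as np

def normalize_rows(a):
    """Scale each row of `a` to unit Euclidean length (assumes no all-zero rows)."""
    return a / np.linalg.norm(a, axis=1, keepdims=True)

rng = np.random.default_rng(0)
target = rng.random((10, 300))
vectors = rng.random((1000, 300))

# After normalization, a plain dot product of two rows is their cosine similarity.
similarity = normalize_rows(target) @ normalize_rows(vectors).T  # shape (10, 1000)
best_index = similarity.argmax(axis=1)  # most similar vector for each target row
```

If the same `vectors` are queried repeatedly, the normalized copy can be computed once and reused, so each new query costs only one matrix product.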