Python: optimized search for the most cosine-similar vector

Time: 2018-11-24 06:54:46

Tags: python numpy optimization

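I have about 30,000 vectors, each with about 300 elements.

Given another vector with the same number of elements, how can I efficiently find the most (cosine) similar vector?

Here is one implementation using a Python loop: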

from time import time
import numpy as np

vectors = np.load("np_array_of_about_30000_vectors.npy")
target = np.load("single_vector.npy")
print vectors.shape, vectors.dtype  # (35196, 312) float32
print target.shape, target.dtype  # (312,) float32

max_similarity, max_index = -np.inf, -1  # initialize before the loop; any cosine value beats -inf
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
    if similarity > max_similarity: 
        max_similarity = similarity 
        max_index = i
print "done with loop in %s seconds" % (time() - start_time)  # 0.466356039047 seconds
print "Most similar vector to target is index %s with %s" % (max_index, max_similarity)  #  index 2399 with 0.772758982696

The following version removes the Python loop and runs 44x faster, but it does not compute the same thing:

print "starting max dot"
start_time = time()
print(np.max(np.dot(vectors, target)))
print "done with max dot in %s seconds" % (time() - start_time)  # 0.0105748176575 seconds
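
To illustrate why the raw dot product and cosine similarity can pick different winners, here is a minimal sketch on made-up toy data (the two rows below are hypothetical, not taken from the arrays in the question). A long vector pointing in a different direction can have a larger dot product than a short vector that is perfectly aligned with the target:

```python
import numpy as np

# Toy data (hypothetical): row 0 is aligned with the target but short,
# row 1 points in a different direction but is much longer.
vectors = np.array([[1.0, 0.0],
                    [10.0, 10.0]])
target = np.array([1.0, 0.0])

dots = np.dot(vectors, target)  # raw dot products, no normalization
cosines = dots / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(target))

best_by_dot = int(np.argmax(dots))        # favors the long vector
best_by_cosine = int(np.argmax(cosines))  # favors the aligned vector
print(best_by_dot, best_by_cosine)
```

The two criteria only agree in general when every candidate has the same norm.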

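Is there a way to get the speedup that comes from letting numpy do the iteration without losing the max-index logic and the division by the product of the norms? To optimize a computation like this, would it make sense to just write it in C?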

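3 answers:

Answer 0: (score: 3)

You have the right idea about avoiding the loop to get performance. You can use argmin to get the index of the minimum distance.

However, I would also change the distance computation to scipy cdist. That way you can compute distances to multiple targets, and you can choose from several distance metrics as needed.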

import numpy as np
from scipy.spatial import distance

distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
max_similarity = 1 - min_distance
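
As a rough sketch of the "multiple targets" case mentioned above, the same cdist call can take a 2-D array of query vectors. The array sizes and the random data are stand-ins chosen for illustration, not the asker's files:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)          # synthetic stand-in data
vectors = rng.normal(size=(1000, 300))
targets = rng.normal(size=(5, 300))     # several query vectors at once

# One row of cosine distances per target: shape (5, 1000)
distances = distance.cdist(targets, vectors, "cosine")

# For each target, the index of the closest vector and its similarity
min_indices = np.argmin(distances, axis=1)
max_similarities = 1 - distances[np.arange(len(targets)), min_indices]
print(min_indices, max_similarities)
```

Swapping "cosine" for another metric name that cdist supports, such as "euclidean", changes the distance without touching the rest of the code.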

HTH.

Answer 1: (score: 1)

Edit: @Deepak: if you really do need the actual computed value, cdist is the fastest.

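from time import time
import numpy as np
from scipy.spatial import distance

# vectors and target are the arrays created in the setup block below
start_time = time()
distances = distance.cdist([target], vectors, "cosine")[0]
min_index = np.argmin(distances)
min_distance = distances[min_index]
print("done with loop in %s seconds" % (time() - start_time))
max_index = min_index  # smallest cosine distance = largest cosine similarity
max_similarity = 1 - min_distance
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))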

done with loop in 0.013602018356323242 seconds

Most similar vector to target is index 11001 with 0.2250217098612361


from time import time
import numpy as np

vectors = np.random.normal(0,100,(35196,300))
target = np.random.normal(0,100,(300))

start_time = time()
myvals = np.dot(vectors, target)
max_index = np.argmax(myvals)
max_similarity = myvals[max_index]
print("done with max dot in %s seconds" % (time() - start_time) )
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))

done with max dot in 0.009701013565063477 seconds

Most similar vector to target is index 12187 with 645549.917200941

max_similarity = 1e-10
start_time = time()
for i, candidate in enumerate(vectors):
    similarity = np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
    if similarity > max_similarity: 
        max_similarity = similarity 
        max_index = i
print("done with loop in %s seconds" % (time() - start_time))
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))

done with loop in 0.49567198753356934 seconds

Most similar vector to target is index 11001 with 0.2250217098612361

def my_func(candidate,target):
    return np.dot(candidate, target)/(np.linalg.norm(candidate)*np.linalg.norm(target))
start_time = time()
out = np.apply_along_axis(my_func, 1, vectors,target)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]  # read the value from this run, not from the earlier loop
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))

done with loop in 0.7495708465576172 seconds

Most similar vector to target is index 11001 with 0.2250217098612361

start_time = time()
vnorm = np.linalg.norm(vectors,axis=1)
tnorm = np.linalg.norm(target)  # scalar; broadcasts against vnorm
out = np.matmul(vectors,target)/(vnorm*tnorm)
print("done with loop in %s seconds" % (time() - start_time))
max_index = np.argmax(out)
max_similarity = out[max_index]  # read the value from this run, not from the earlier loop
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))

done with loop in 0.04306602478027344 seconds

Most similar vector to target is index 11001 with 0.2250217098612361
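
For reuse, the vectorized variant can be wrapped in a small function that returns both the index and the similarity. This is only a sketch: `most_similar` is a hypothetical helper name, the data is random stand-in data, and the code assumes neither the target nor any candidate has zero norm.

```python
import numpy as np

def most_similar(vectors, target):
    """Return (index, cosine similarity) of the row of `vectors` closest to `target`.

    Hypothetical helper; assumes no zero-norm rows and a non-zero target.
    """
    vnorm = np.linalg.norm(vectors, axis=1)  # one norm per candidate row
    tnorm = np.linalg.norm(target)           # scalar norm of the target
    sims = np.dot(vectors, target) / (vnorm * tnorm)
    idx = int(np.argmax(sims))
    return idx, float(sims[idx])

rng = np.random.default_rng(0)               # synthetic stand-in data
vectors = rng.normal(0, 100, (5000, 300))
target = rng.normal(0, 100, 300)

max_index, max_similarity = most_similar(vectors, target)
print("Most similar vector to target is index %s with %s" % (max_index, max_similarity))
```

If the same set of candidates is queried many times, `vnorm` could be computed once and cached, since it does not depend on the target.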

Answer 2: (score: 1)

According to my measurements, cosine_similarity from sklearn is the fastest of these implementations.

import time
import numpy
from scipy.spatial.distance import cdist
from sklearn.metrics.pairwise import cosine_similarity

target = numpy.random.rand(100,300)
vectors = numpy.random.rand(10000,300)

start = time.time()
most_similar_sklearn = cosine_similarity(target, vectors)
print("Sklearn cosine_similarity: {} s".format(time.time()-start))
start = time.time()
most_similar_scipy = 1-cdist(target, vectors, 'cosine')
print("Scipy cdist: {} s".format(time.time()-start))
equals = numpy.allclose(most_similar_sklearn, most_similar_scipy)
print("Equal results: {}".format(equals))

Sklearn cosine_similarity: 0.05303549766540527 s
Scipy cdist: 0.44914913177490234 s
Equal results: True

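You can even get the same result with numpy alone by using matrix multiplication, since cosine similarity is defined as the dot product divided by the product of the norms. It does need some preprocessing, though, so that matmul can be applied.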

import time
import numpy
from sklearn.metrics.pairwise import cosine_similarity

target = numpy.random.rand(100,300)
vectors = numpy.random.rand(10000,300)

most_similar_sklearn = cosine_similarity(target, vectors)

start = time.time()

t_ext = target.reshape((100,300, 1))
v_ext = vectors.T.reshape((1,300,10000))
t_norm = numpy.linalg.norm(t_ext, axis=1)
v_norm = numpy.linalg.norm(v_ext, axis=1)
norm = t_norm @ v_norm
dat = target @ vectors.T
most_similar_numpy = dat / norm

print("Numpy matmul: {} s".format(time.time()-start))
equals = numpy.allclose(most_similar_sklearn, most_similar_numpy)
print("Equal results: {}".format(equals))

Numpy matmul: 0.055016279220581055 s
Equal results: True
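
A possibly simpler way to write the same idea, shown here only as a sketch, is to scale every row to unit length first and then do a single matrix product. It uses the `keepdims` argument of `numpy.linalg.norm` instead of the reshape step above, assumes no row has zero norm, and runs on random stand-in data of the same shapes:

```python
import numpy as np

rng = np.random.default_rng(0)               # synthetic stand-in data
target = rng.random((100, 300))
vectors = rng.random((10000, 300))

# Scale each row to unit length; keepdims=True keeps a trailing axis of
# size 1 so the division broadcasts row by row.
t_unit = target / np.linalg.norm(target, axis=1, keepdims=True)
v_unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Dot products of unit vectors are cosine similarities: shape (100, 10000)
similarities = t_unit @ v_unit.T

# Index of the most similar vector for each target row
best = np.argmax(similarities, axis=1)
print(best[:5], similarities[np.arange(5), best[:5]])
```

No timing is claimed for this variant; it would need to be measured against the versions above.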