I use L2-normalized vectors, so I want to speed up cdist by using a plain dot product instead of the cosine metric, which also computes the norms (which are 1 in my case). The only extra thing I need is a check whether one of the vectors is a special all-zero vector (it is set deliberately in the earlier stages of the algorithm, so I believe no approximate eps-based check is needed). Here is a comparison:
from scipy.spatial.distance import cdist
import numpy as np
#generate 1001 normalized vectors
vec = np.random.rand(1,1000) # 1 vector
vecs = np.random.rand(1000,1000) # 1000 vectors
#normalize:
vec = vec/np.linalg.norm(vec)
vecs = np.array([vec/np.linalg.norm(vec) for vec in vecs])
# I want this check, it is important for me
nullvect = np.zeros(1000)
import time
# ordinary cos - looks like it is implemented in python in distance.py:
'''
In [72]: cosine(a, b)
/Users/warren/miniconda3/lib/python3.5/site-packages/scipy/spatial/distance.py:329:
dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
'''
start = time.time()
d = cdist(vec,vecs,metric='cosine')
end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))
#simplified distance function - just dot product:
def dist_func(vec1,vec2):
    if np.array_equal(vec1,nullvect) or np.array_equal(vec2,nullvect):
        return np.nan
    return 1-np.dot(vec1,vec2)
# Faster(?) func:
start = time.time()
d = cdist(vec,vecs,dist_func)
end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))
However, the results were quite discouraging:
Time elapsed: 1.184225082397461ms //cosine
Time elapsed: 12.001752853393555ms //simplified
What can I do to get a speedup?
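As a sanity check on the premise (my own addition, not in the original question): for unit vectors, the cosine distance reduces to 1 - u·v, since both norms drop out:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(5)
v = rng.random(5)
u /= np.linalg.norm(u)  # make both unit vectors
v /= np.linalg.norm(v)
# full cosine distance: 1 - u.v/(|u||v|)
full = 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
# simplified form: the norms are 1, so they can be dropped
fast = 1 - np.dot(u, v)
assert np.isclose(full, fast)
```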
EDIT: it looks like the slowest part is np.array_equal(). If I remove it, the code runs only slightly slower than cosine:

Time elapsed: 1.2218952178955078ms //cosine
Time elapsed: 1.550912857055664ms //simplified

But returning np.nan for nullvect is very important to me.
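As an aside (an assumption on my part, not from the post): if the sentinel stays all-zero, `v.any()` gives a cheaper exact test than `np.array_equal` against a stored reference vector, since it runs in C and needs no second array:

```python
import numpy as np

def is_zero_vector(v):
    # True iff every component is exactly 0
    return not v.any()

print(is_zero_vector(np.zeros(1000)))  # True
print(is_zero_vector(np.ones(1000)))   # False
```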
EDIT2: an explanation of why I need the check in cdist:
I compare word vectors from a word2vec model, and I manually assign a zero array as the vector of an unknown word (other approaches could be taken). The nan's are convenient as results, because they are not taken into account when I threshold the distances. For example, here I have two corresponding arrays - word_vectors with the actual vectors, and a list word_data with various data associated with those words (number of occurrences, positions in the text, and so on) - so after thresholding I can easily apply the resulting masks to subset word_data:

with np.errstate(divide='ignore'):
    if word_vectors:
        dists = cdist([word1.vector], word_vectors, distance_func).ravel()
        # nan's are not taken into account in queries like ">" or "<"
        very_close_ones = [i < high_similarity_threshold for i in dists]
        close_ones = [i < similarity_threshold for i in dists]
        distant_ones = [i > unsimilarity_threshold for i in dists]
        # let's select the places of the closest words
        places = list(compress(word_data, very_close_ones))
        places2 = list(compress(word_data, close_ones))
        places3 = list(compress(word_data, distant_ones))
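The masking behaviour described above can be verified directly: any `<` or `>` comparison with nan evaluates to False, so nan entries never pass a threshold:

```python
import numpy as np

dists = np.array([np.nan, 0.1, 0.5, 0.9])
# nan < 0.2 is False, so the unknown word is excluded from the mask
close_ones = [bool(i < 0.2) for i in dists]
print(close_ones)  # [False, True, False, False]
```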
EDIT3:
I found that for my needs I can replace nullvect with a nan vector:

nullvect = np.array([np.nan]*1000)

np.dot then produces nan as the result, which is fine for me. The remaining problem is that the simplified

def dist_func(vec1,vec2):
    return 1-np.dot(vec1,vec2)

is still slower than cosine:

Time elapsed: 1.268625259399414ms
Time elapsed: 1.4104843139648438ms
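The substitution works because nan propagates through `np.dot` (every product and the sum become nan):

```python
import numpy as np

nullvect = np.array([np.nan] * 4)  # nan sentinel instead of zeros
v = np.full(4, 0.5)
d = 1 - np.dot(v, nullvect)  # nan propagates through the dot product
assert np.isnan(d)
```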
EDIT4:
I seem to have found the solution: use a single matrix-vector dot product, which finally gives a speedup:

start = time.time()
#d2 = 1-vecs.dot(vec.T)
d2 = 1-vec.dot(vecs.T) # with this order even the shape is the same, but it seems to be a little slower (~0.02ms on average)
end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))

Time elapsed: 1.238107681274414ms //cdist, cos
Time elapsed: 0.3299713134765625ms //matrix-vector dot product

Working code:
from scipy.spatial.distance import cdist
import numpy as np
#generate 1001 normalized vectors
vec = np.random.rand(1,1000) # 1 vector
vecs = np.random.rand(1000,1000) # 1000 vectors
#normalize:
vec = vec/np.linalg.norm(vec)
vecs = np.array([vec/np.linalg.norm(vec) for vec in vecs])
# I want this check, it is important for me
nullvect = np.array([np.nan]*1000) #np.zeros(1000)
vecs[0] = nullvect
import time
# ordinary cos - looks like it is implemented in python in distance.py:
'''
In [72]: cosine(a, b)
/Users/warren/miniconda3/lib/python3.5/site-packages/scipy/spatial/distance.py:329:
dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
'''
start = time.time()
d = cdist(vec,vecs,metric='cosine')
end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))
#simplified distance function - just dot product:
def dist_func(vec1,vec2):
    return 1-np.dot(vec1,vec2)
# Faster(?) func:
start = time.time()
#d2 = 1-vec.dot(vecs.T)
d2 = 1 - vecs.dot(vec.T)
end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))
#print(d.shape)
#print(d2.shape)
print(d.ravel()[:5])
print(d2.ravel()[:5])
Results:

Time elapsed: 1.190185546875ms
Time elapsed: 0.3883838653564453ms
[ nan 0.24669413 0.25673153 0.24910682 0.26340765]
[ nan 0.24669413 0.25673153 0.24910682 0.26340765]

I checked the numerical results and they match.
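For completeness (my own extension, not from the post): with several query vectors the same trick becomes a single matrix-matrix product, and loop-free normalization along `axis=1` is also faster than the list comprehension used above:

```python
import numpy as np

rng = np.random.default_rng(0)
queries = rng.random((50, 100))   # 50 query vectors
vecs = rng.random((1000, 100))    # 1000 database vectors
# vectorized normalization instead of a Python loop
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
# all 50*1000 cosine distances in one BLAS call
D = 1 - queries @ vecs.T
print(D.shape)  # (50, 1000)
```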
Thanks everyone!

EDIT5:
A note for anyone who wants to use this: nan vectors do not compare equal,

>>> nullvect = np.array([np.nan]*300)
>>> nullvect2 = np.array([np.nan]*300)
>>> np.array_equal(nullvect,nullvect2)
False

so use

>>> np.isnan(nullvect).any()
True

to detect the nan vector instead.
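A small helper wrapping that check (`is_null` is just an illustrative name, not from the post):

```python
import numpy as np

def is_null(v):
    # the nan sentinel cannot be found with array_equal (nan != nan),
    # but isnan().any() detects it reliably
    return bool(np.isnan(v).any())

nullvect = np.array([np.nan] * 300)
print(is_null(nullvect))      # True
print(is_null(np.ones(300)))  # False
```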