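Numpy, Scipy: using a dot product in cdist for normalized vectors is slower than cosine

Asked: 2018-02-04 19:52:53

Tags: python numpy scipy

I work with L2-normalized vectors, so I wanted to speed up cdist by using a plain dot product instead of cosine, which also computes the norms (all equal to one in my case). The only extra thing I need is a check for whether one of the vectors is the specially assigned all-zero vector (it is set deliberately in an earlier stage of the algorithm, so I don't think an approximate check with eps is needed). Here is a comparison: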

from scipy.spatial.distance import cdist

import numpy as np


#generate 1001 normalized vectors

vec = np.random.rand(1,1000) # 1 vector
vecs = np.random.rand(1000,1000) # 1000 vectors

#normalize:
vec = vec/np.linalg.norm(vec)
vecs = np.array([v/np.linalg.norm(v) for v in vecs]) # separate loop name, so `vec` is not shadowed

# I want this check, it is important for me
nullvect = np.zeros(1000)



import time

# ordinary cos - looks like it is implemented in python in distance.py: 
'''
In [72]: cosine(a, b)
/Users/warren/miniconda3/lib/python3.5/site-packages/scipy/spatial/distance.py:329: 
  dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
'''
start = time.time()

d = cdist(vec,vecs,metric='cosine')

end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))


#simplified distance function - just dot product:
def dist_func(vec1,vec2):
    if np.array_equal(vec1,nullvect) or np.array_equal(vec2,nullvect):
        return np.nan

    return 1-np.dot(vec1,vec2)

#  Faster(?) func:
start = time.time()

d = cdist(vec,vecs,dist_func)

end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))

However, the result is rather disappointing:

Time elapsed: 1.184225082397461ms  //cosine
Time elapsed: 12.001752853393555ms   //simplified

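What can I do to get a speedup?

EDIT: It looks like the slowest part is np.array_equal(). If I remove it, the code runs only slightly slower than cosine:

Time elapsed: 1.2218952178955078ms  //cosine
Time elapsed: 1.550912857055664ms   //simplified

But returning np.nan for nullvect is very important to me.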
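One possible way to keep the null-vector check without calling np.array_equal for every pair is to compute all distances with a single matrix-vector product and then overwrite the entries that belong to all-zero rows. When cdist is given a Python callable, it invokes that callable once per pair of rows, so any per-call overhead is multiplied by the number of pairs. The following is only a sketch, assuming every non-zero row is already L2-normalized; the function name dot_distance_with_null_check and the small random data are made up for illustration.

```python
import numpy as np

def dot_distance_with_null_check(query, candidates):
    """Return 1 - dot(query, row) per row; NaN where a vector is all zeros.

    Assumes all non-zero vectors are already unit length (hypothetical helper).
    """
    query = np.asarray(query, dtype=float).ravel()
    candidates = np.asarray(candidates, dtype=float)
    dists = 1.0 - candidates @ query        # one matrix-vector product
    null_rows = ~candidates.any(axis=1)     # True for rows that are entirely zero
    dists[null_rows] = np.nan
    if not query.any():                     # an all-zero query makes every distance NaN
        dists[:] = np.nan
    return dists

# Small illustrative data set: 5 unit vectors, one of them replaced by zeros.
rng = np.random.default_rng(0)
vecs = rng.random((5, 8))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
vecs[2] = 0.0                               # the "unknown word" row
vec = rng.random(8)
vec /= np.linalg.norm(vec)

dists = dot_distance_with_null_check(vec, vecs)
print(dists)
```

The zero-row test is done once per row rather than once per pair, which is where most of the saving would be expected to come from.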

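EDIT2: Explanation of why I need the nan's

The reason I need this check is that I compare word vectors from a word2vec model, and I manually assign a zero array as the vector of unknown words (some other approach could be used instead). It is convenient for me to get nan's as the result, because they are not taken into account when I threshold the results of cdist. For example, here I have two corresponding arrays: word_vectors with the actual vectors, and the list word_data with various data related to these words (number of occurrences, position in the text, and so on). So after thresholding I can easily apply the resulting mask to subset word_data:

with np.errstate(divide='ignore'):
    if word_vectors:
        dists = cdist([word1.vector], word_vectors, distance_func).ravel()
        # nan's are not taken into account in queries like ">" or "<"
        very_close_ones = [i < high_similarity_threshold for i in dists]
        close_ones = [i < similarity_threshold for i in dists]
        distant_ones = [i > unsimilarity_threshold for i in dists]
        # let's select the places of the closest words
        places = list(compress(word_data, very_close_ones))
        places2 = list(compress(word_data, close_ones))
        places3 = list(compress(word_data, distant_ones))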
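The thresholding step relies on the fact that any ordering comparison with NaN evaluates to False, so NaN distances drop out of both the "close" and the "distant" selections. Here is a small sketch of the same idea using NumPy boolean masks instead of list comprehensions; the distances, the word_data entries, and the threshold values are invented purely for illustration.

```python
import numpy as np
from itertools import compress

# Hypothetical distances; the NaN stands for an unknown word.
dists = np.array([0.05, np.nan, 0.40, 0.90])
word_data = ["near", "unknown", "medium", "far"]   # illustrative payload
similarity_threshold = 0.5                          # assumed value
unsimilarity_threshold = 0.8                        # assumed value

# Comparisons involving NaN are False, so the unknown word is never selected.
close_mask = dists < similarity_threshold
distant_mask = dists > unsimilarity_threshold

close_words = list(compress(word_data, close_mask))
distant_words = list(compress(word_data, distant_mask))

print(close_words)    # ['near', 'medium']
print(distant_words)  # ['far']
```

If word_data were itself a NumPy array, the masks could be applied directly by indexing, without itertools.compress.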

EDIT3: I figured out that, for my needs, I can replace nullvect with a vector of nan's:

nullvect = np.array([np.nan]*1000)

Then np.dot yields a single nan as the result, which is fine for me. The problem remains that

def dist_func(vec1,vec2):
    return 1-np.dot(vec1,vec2)

is still slower than cosine:

Time elapsed: 1.268625259399414ms
Time elapsed: 1.4104843139648438ms

EDIT4:

I seem to have found a solution, using matrix multiplication:

start = time.time()
#d2 = 1-vecs.dot(vec.T)
d2 = 1-vec.dot(vecs.T) # this order matches the shape of the cdist output, but seems to be slightly slower (~0.02ms on average)

end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))

This finally yields a speedup:

Time elapsed: 1.238107681274414ms  //cdist, cos
Time elapsed: 0.3299713134765625ms //matrix-vector dot product

Working code:

from scipy.spatial.distance import cdist

import numpy as np


#generate 1001 normalized vectors

vec = np.random.rand(1,1000) # 1 vector
vecs = np.random.rand(1000,1000) # 1000 vectors

#normalize:
vec = vec/np.linalg.norm(vec)
vecs = np.array([v/np.linalg.norm(v) for v in vecs]) # separate loop name, so `vec` is not shadowed

# I want this check, it is important for me
nullvect = np.array([np.nan]*1000) #np.zeros(1000)

vecs[0] = nullvect

import time

# ordinary cos - looks like it is implemented in python in distance.py: 
'''
In [72]: cosine(a, b)
/Users/warren/miniconda3/lib/python3.5/site-packages/scipy/spatial/distance.py:329: 
  dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
'''
start = time.time()

d = cdist(vec,vecs,metric='cosine')

end = time.time()
print("Time elapsed: %sms"%(1000*(end-start)))


#simplified distance function - just dot product:
def dist_func(vec1,vec2):
    return 1-np.dot(vec1,vec2)


#  Faster(?) func:
start = time.time()

#d2 = 1-vec.dot(vecs.T)
d2 = 1 - vecs.dot(vec.T)

end = time.time()
print("Time elapsed: %sms"%(1000*(end-start))) 

#print(d.shape)
#print(d2.shape)
print(d.ravel()[:5])
print(d2.ravel()[:5])

I checked the numerical results, and they appear to match.

Result:

Time elapsed: 1.190185546875ms
Time elapsed: 0.3883838653564453ms
[        nan  0.24669413  0.25673153  0.24910682  0.26340765]
[        nan  0.24669413  0.25673153  0.24910682  0.26340765]
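Comparing printed values by eye works for five entries, but the same check can be automated. A minimal sketch, assuming the two result arrays have been flattened to the same length: np.allclose treats NaN as unequal to everything unless equal_nan=True is passed, so that flag is needed when both arrays contain NaN in the same positions. The arrays below simply reuse a few of the printed values and stand in for d.ravel() and d2.ravel().

```python
import numpy as np

# Stand-ins for d.ravel() and d2.ravel(), using values from the printed output.
d_flat = np.array([np.nan, 0.24669413, 0.25673153, 0.24910682, 0.26340765])
d2_flat = np.array([np.nan, 0.24669413, 0.25673153, 0.24910682, 0.26340765])

# Default comparison: a NaN never equals a NaN, so this reports a mismatch.
strict_match = np.allclose(d_flat, d2_flat)

# With equal_nan=True, NaNs in matching positions count as equal.
nan_aware_match = np.allclose(d_flat, d2_flat, equal_nan=True)

print(strict_match)     # False
print(nan_aware_match)  # True
```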

Thanks, everyone!

EDIT5:

For anyone who wants to use this, note that:

>>> nullvect = np.array([np.nan]*300)
>>> nullvect2 = np.array([np.nan]*300)
>>> np.array_equal(nullvect,nullvect2)
False

so use

>>> np.isnan(nullvect).any()
True

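The behaviour noted in EDIT5 follows from NaN comparing unequal to everything, including itself, so an element-wise equality test on two NaN vectors can never succeed. A short sketch of NaN-based detection of the placeholder vector; the names other and is_placeholder are illustrative only.

```python
import numpy as np

nullvect = np.full(300, np.nan)   # same contents as np.array([np.nan]*300)
other = np.full(300, np.nan)

# Element-wise equality fails because NaN != NaN.
equal_by_value = np.array_equal(nullvect, other)

# Detect the placeholder by testing for NaN instead of comparing values.
has_any_nan = np.isnan(nullvect).any()
is_placeholder = np.isnan(nullvect).all()   # stricter: every entry is NaN

print(equal_by_value)   # False
print(has_any_nan)      # True
print(is_placeholder)   # True
```

Newer NumPy releases (1.19 and later, as far as I know) also accept np.array_equal(a, b, equal_nan=True), which may be an alternative where that version can be assumed.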

0 answers:

No answers yet