Efficiently computing cosine similarity with scikit-learn

Date: 2017-02-04 19:40:11

Tags: python performance optimization scikit-learn cosine-similarity

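After preprocessing and transforming (BOW, TF-IDF) the data, I need to compute the cosine similarity of each element with every other element of the dataset. Currently, I do this: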

cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]

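In this example, each input variable (e.g. tr_title) is a SciPy sparse matrix. However, this code runs extremely slowly. What can I do to optimize the code so that it runs faster?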

2 answers:

Answer 0 (score: 4)

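To improve performance, replace the list comprehensions with vectorized code. This is easy to do with SciPy's pdist and squareform, as shown in the snippet below. Note that pdist's 'cosine' metric returns the cosine distance, that is, 1 minus the cosine similarity, so subtract the result from 1 if you need similarities: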

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo

In [87]: X
Out[87]: 
<9x21 sparse matrix of type '<type 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()          
Out[88]: 
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]: 
[u'attack',
 u'awakens',
 u'back',
 u'clones',
 u'empire',
 u'force',
 u'hope',
 u'jedi',
 u'last',
 u'menace',
 u'new',
 u'of',
 u'phantom',
 u'return',
 u'revenge',
 u'sith',
 u'star',
 u'story',
 u'strikes',
 u'the',
 u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0.    1.    1.    1.    1.    1.    1.    1.    1.  ]
 [ 1.    0.    0.75  0.71  0.75  0.75  0.71  1.    0.71]
 [ 1.    0.75  0.    0.71  0.5   0.5   0.71  1.    0.42]
 [ 1.    0.71  0.71  0.    0.71  0.71  0.67  1.    0.67]
 [ 1.    0.75  0.5   0.71  0.    0.5   0.71  1.    0.71]
 [ 1.    0.75  0.5   0.71  0.5   0.    0.71  1.    0.71]
 [ 1.    0.71  0.71  0.67  0.71  0.71  0.    1.    0.67]
 [ 1.    1.    1.    1.    1.    1.    1.    0.    1.  ]
 [ 1.    0.71  0.42  0.67  0.71  0.71  0.67  1.    0.  ]]

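Note that X.toarray().shape yields (9L, 21L), since the toy example above contains 9 titles and 21 distinct words, whereas cs_title is a 9-by-9 array of pairwise cosine distances.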
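
As a cross-check on the distance-versus-similarity point, here is a minimal sketch (not part of the original answer) that compares 1 minus the pdist result with scikit-learn's cosine_similarity called once on the whole matrix. The three titles are a subset of the list above; the variable names sim and sim_from_pdist are introduced here for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Last Jedi',
]

X = CountVectorizer().fit_transform(titles)  # SciPy sparse matrix

# A single call computes every pairwise similarity; sparse input is accepted as-is.
sim = cosine_similarity(X)

# pdist's 'cosine' metric is a distance, so the similarity is 1 minus it.
sim_from_pdist = 1.0 - squareform(pdist(X.toarray(), 'cosine'))

print(np.round(sim, 2))
print(np.allclose(sim, sim_from_pdist))
```

Calling cosine_similarity on the sparse matrix also avoids the dense copy made by X.toarray(), which may matter for a large vocabulary.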

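Answer 1 (score: 1)

Bearing in mind two characteristics of the cosine similarity of two vectors, you can reduce the workload of the calculation by more than half: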

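  1. The cosine similarity of a vector with itself is 1.
  2. The cosine similarity of vector x with vector y is the same as that of vector y with vector x.
  3. Therefore, only the elements on one side of the diagonal (above or below) need to be calculated.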
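
Both properties follow from the standard definition of cosine similarity, written here for non-zero vectors x and y:

```latex
\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert},
\qquad
\cos(x, x) = \frac{\lVert x \rVert^{2}}{\lVert x \rVert^{2}} = 1,
\qquad
\cos(x, y) = \cos(y, x)
```

The symmetry holds because the dot product and the product of the norms are both unchanged when x and y are swapped.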

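    Edit: Here is how you could compute it. Note in particular that cs is just a dummy function standing in for the actual calculation of the similarity coefficient.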

    title1 = 'A four word title'
    title2 = 'A five word title'
    title3 = 'A six word title'
    title4 = 'A seven word title'
    
    titles = [title1, title2, title3, title4]
    N = len(titles)
    
    import numpy as np
    
    similarity_matrix = np.zeros((N, N), dtype=float)  # np.float was removed in NumPy 1.24
    
    cs = lambda a, b: 10*a + b  # just a 'pretend' calculation of the coefficient
    
    for m in range(N):
        similarity_matrix[m, m] = 1  # a vector's similarity with itself
        for n in range(m + 1, N):
            similarity_matrix[m, n] = cs(m, n)  # compute above the diagonal only
            similarity_matrix[n, m] = similarity_matrix[m, n]  # mirror below it
    
    print(similarity_matrix)
    

    Here is the result.

    [[  1.   1.   2.   3.]
     [  1.   1.  12.  13.]
     [  2.  12.   1.  23.]
     [  3.  13.  23.   1.]]
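
To make the idea concrete, the dummy cs could be swapped for a real cosine calculation. The following is a minimal sketch under stated assumptions: the helper names cosine and symmetric_similarity and the toy vectors are hypothetical, and the inputs are assumed to be dense, non-zero 1-D vectors.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two 1-D dense vectors (assumed non-zero)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def symmetric_similarity(vectors):
    """Fill an N x N similarity matrix, computing each pair only once."""
    n = len(vectors)
    sim = np.eye(n)  # the diagonal is 1: each vector matches itself
    for m in range(n):
        for k in range(m + 1, n):  # strictly above the diagonal
            sim[m, k] = sim[k, m] = cosine(vectors[m], vectors[k])
    return sim

# Hypothetical toy vectors, not taken from the question's data.
vectors = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [0.0, 2.0, 0.0]])
similarity_matrix = symmetric_similarity(vectors)
print(similarity_matrix)
```

This halves the number of similarity evaluations, but the loop still runs in Python, so for large inputs a single vectorized call is likely to remain faster.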