Question

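After preprocessing and transforming the data (BOW, TF-IDF), I need to compute the cosine similarity of each element with every other element in the dataset. Currently, I do this: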

cs_title = [cosine_similarity(a, b) for a in tr_title for b in tr_title]
cs_abstract = [cosine_similarity(a, b) for a in tr_abstract for b in tr_abstract]
cs_mesh = [cosine_similarity(a, b) for a in pre_mesh for b in pre_mesh]
cs_pt = [cosine_similarity(a, b) for a in pre_pt for b in pre_pt]

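In this example, each input variable (e.g. tr_title) is a SciPy sparse matrix. However, this code runs extremely slowly. What can I do to optimize it so that it runs faster?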

Answer 1

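To improve performance, you should replace the list comprehensions with vectorized code. This is easy to do with SciPy's pdist and squareform, as shown in the snippet below: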

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
    ]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
cs_title = squareform(pdist(X.toarray(), 'cosine'))

Demo:

In [87]: X
Out[87]: 
<9x21 sparse matrix of type '<type 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

In [88]: X.toarray()          
Out[88]: 
array([[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]], dtype=int64)

In [89]: vectorizer.get_feature_names()
Out[89]: 
[u'attack',
 u'awakens',
 u'back',
 u'clones',
 u'empire',
 u'force',
 u'hope',
 u'jedi',
 u'last',
 u'menace',
 u'new',
 u'of',
 u'phantom',
 u'return',
 u'revenge',
 u'sith',
 u'star',
 u'story',
 u'strikes',
 u'the',
 u'wars']

In [90]: np.set_printoptions(precision=2)

In [91]: print(cs_title)
[[ 0.    1.    1.    1.    1.    1.    1.    1.    1.  ]
 [ 1.    0.    0.75  0.71  0.75  0.75  0.71  1.    0.71]
 [ 1.    0.75  0.    0.71  0.5   0.5   0.71  1.    0.42]
 [ 1.    0.71  0.71  0.    0.71  0.71  0.67  1.    0.67]
 [ 1.    0.75  0.5   0.71  0.    0.5   0.71  1.    0.71]
 [ 1.    0.75  0.5   0.71  0.5   0.    0.71  1.    0.71]
 [ 1.    0.71  0.71  0.67  0.71  0.71  0.    1.    0.67]
 [ 1.    1.    1.    1.    1.    1.    1.    0.    1.  ]
 [ 1.    0.71  0.42  0.67  0.71  0.71  0.67  1.    0.  ]]

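Note that X.toarray().shape yields (9L, 21L) because the toy example above has 9 titles and 21 distinct words, whereas cs_title is a 9-by-9 array. Also note that pdist with the 'cosine' metric returns the cosine distance, i.e. 1 minus the cosine similarity, which is why the diagonal is 0; the similarity matrix itself is 1 - cs_title.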
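As a possible alternative that avoids converting the sparse matrix to a dense array first, scikit-learn's cosine_similarity accepts the whole sparse matrix in a single call. The sketch below is not part of the original answer; it reuses the toy titles from above and only checks that the result agrees with the pdist output.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

titles = [
    'A New Hope',
    'The Empire Strikes Back',
    'Return of the Jedi',
    'The Phantom Menace',
    'Attack of the Clones',
    'Revenge of the Sith',
    'The Force Awakens',
    'A Star Wars Story',
    'The Last Jedi',
]

X = CountVectorizer().fit_transform(titles)  # sparse 9 x 21 matrix

# One call on the whole sparse matrix; returns a dense 9 x 9 array
# of similarities (1 on the diagonal), not distances.
sim_title = cosine_similarity(X)

# pdist computes the cosine *distance*, so similarity = 1 - distance.
dist_title = squareform(pdist(X.toarray(), 'cosine'))
print(np.allclose(sim_title, 1 - dist_title))
```

The same call should apply to tr_title, tr_abstract and the other matrices from the question, assuming each row is one document. The result is a dense n-by-n array, so memory use grows quadratically with the number of documents.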

Answer 2

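You can cut the amount of computation by more than half by taking two properties of the cosine similarity of two vectors into account:

The cosine similarity of a vector with itself is 1.
The cosine similarity of vector x with vector y is the same as that of vector y with vector x.

Therefore, only the elements on one side of the diagonal need to be computed; the other side follows by symmetry.

Edit: here is how you could compute it. Note in particular that cs is just a dummy function that stands in for the actual calculation of the similarity coefficient.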

import numpy as np

title1 = 'A four word title'
title2 = 'A five word title'
title3 = 'A six word title'
title4 = 'A seven word title'
titles = [title1, title2, title3, title4]
N = len(titles)

# np.float was removed in NumPy 1.24; the builtin float is equivalent
similarity_matrix = np.zeros((N, N), dtype=float)

cs = lambda a, b: 10 * a + b  # just a 'pretend' calculation of the coefficient

for m in range(N):
    similarity_matrix[m, m] = 1  # similarity of a vector with itself
    for n in range(m + 1, N):
        similarity_matrix[m, n] = cs(m, n)  # compute above the diagonal
        similarity_matrix[n, m] = similarity_matrix[m, n]  # mirror below it

print(similarity_matrix)

Here is the result.

[[ 1.  1.  2.  3.]
 [ 1.  1. 12. 13.]
 [ 2. 12.  1. 23.]
 [ 3. 13. 23.  1.]]
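To make the dummy cs concrete, here is a minimal sketch of the same symmetric loop with an actual cosine computation. The helper names cosine and symmetric_similarity and the three sample vectors are hypothetical, chosen only for illustration, and the sketch assumes dense 1-D rows with no all-zero vector.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two dense 1-D vectors (assumes non-zero norms)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def symmetric_similarity(vectors):
    """Compute only the pairs above the diagonal and mirror them below it."""
    n = len(vectors)
    sim = np.eye(n)  # the diagonal is 1 by definition
    for m in range(n):
        for k in range(m + 1, n):
            sim[m, k] = sim[k, m] = cosine(vectors[m], vectors[k])
    return sim

# Hypothetical sample data: three vectors in a 3-dimensional feature space
vectors = np.array([[1.0, 0.0, 1.0],
                    [0.0, 2.0, 0.0],
                    [1.0, 1.0, 0.0]])
sim = symmetric_similarity(vectors)
print(sim)
```

This calls cosine N(N-1)/2 times instead of N**2 times, but the loop still runs in Python, so for large inputs a single vectorized library call is likely to be faster.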

Computing cosine similarity efficiently with scikit-learn