I'm trying to solve a clustering problem. I have a list of tf-idf weighted vectors generated by the CountVectorizer() function. This is the data type:
<1000x5369 sparse matrix of type '<type 'numpy.float64'>'
with 42110 stored elements in Compressed Sparse Row format>
I also have a "centroid" vector with the following dimensions:
<1x5369 sparse matrix of type '<type 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>
When I try to measure the cosine similarity between the centroid and the other vectors in my tfidf_vec_list with the following lines of code:
for centroid in centroids:
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
where the similarity function is:
def cosine_similarity(vector1,vector2):
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
    return score
I get this error:
Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
  File "/home/ashwin/Desktop/Python-2.7.9/programs/test_2.py", line 28, in cosine_similarity
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 287, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 302, in __mul__
    raise ValueError('dimension mismatch')
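I have tried everything I could think of, including converting the matrix to an array and converting each vector to a list, but I get the same error every time.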
Answer 0 (score: 3)
scipy.spatial.distance.cosine does not seem to support sparse matrix input. Specifically, np.linalg.norm(sparse_vector) fails (see "Get norm of numpy sparse matrix rows").
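As a side note on the norm issue, here is a minimal sketch of one way to get row norms without leaving the sparse format. The helper name sparse_row_norms and the small matrix X are made up for illustration; only multiply() and sum() on the sparse matrix are used.

```python
import numpy as np
from scipy.sparse import csr_matrix


def sparse_row_norms(X):
    """Euclidean norm of each row of a sparse matrix (hypothetical helper)."""
    # Element-wise square stays sparse; summing along axis=1 gives an (n, 1) result.
    squared_sums = X.multiply(X).sum(axis=1)
    # Flatten to a plain 1-D array before taking the square root.
    return np.sqrt(np.asarray(squared_sums).ravel())


# Illustrative data, not taken from the question.
X = csr_matrix(np.array([[3.0, 0.0, 4.0],
                         [0.0, 0.0, 2.0]]))
norms = sparse_row_norms(X)
```

This avoids calling np.linalg.norm on the sparse object directly, which is the call that fails inside cosine().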
If you convert both input vectors (which are really row vectors in matrix form) to dense versions before passing them in, it works fine:
>>> xs
<1x4 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> ys
<1x4 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>
>>> cosine(xs, ys)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/scipy/spatial/distance.py", line 296, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python3.4/site-packages/scipy/sparse/base.py", line 308, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
>>> cosine(xs.todense(), ys.todense())
-2.2204460492503131e-16
Densifying should not be a problem for individual 5369-element vectors (as opposed to the whole matrix).
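Applied to the loop in the question, the idea might look like the sketch below. The function name cosine_similarity_dense and the tiny matrices are placeholders of mine, not the asker's data. I use toarray().ravel() rather than todense() on the assumption that newer SciPy releases expect plain 1-D arrays in cosine(); on the older versions shown above, todense() worked as well.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine


def cosine_similarity_dense(row1, row2):
    """Cosine similarity of two 1xN sparse rows, densified one at a time."""
    # toarray() yields a (1, N) ndarray; ravel() flattens it to 1-D for cosine().
    u = row1.toarray().ravel()
    v = row2.toarray().ravel()
    return 1.0 - cosine(u, v)


# Stand-ins for tfidf_vec_list and one centroid (illustrative values only).
tfidf_vec_list = csr_matrix(np.array([[1.0, 0.0, 2.0, 3.0],
                                      [0.0, 5.0, 0.0, 0.0]]))
centroid = csr_matrix(np.array([[2.0, 0.0, 4.0, 6.0]]))

# Index row by row so that only one dense vector pair exists at any time.
sim_scores = [cosine_similarity_dense(tfidf_vec_list[i], centroid)
              for i in range(tfidf_vec_list.shape[0])]
```

Only one row is densified per call, so memory use stays small even when the full matrix is large.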