NumPy matrix dimensions - tf-idf vectors

Time: 2014-12-06 22:38:01

Tags: python numpy vector tf-idf

I am trying to solve a clustering problem. I have a list of tf-idf weighted vectors generated by the CountVectorizer() function. This is the data type:

<1000x5369 sparse matrix of type '<type 'numpy.float64'>'
with 42110 stored elements in Compressed Sparse Row format>

I have a "centroid" vector with the following dimensions:

<1x5369 sparse matrix of type '<type 'numpy.float64'>'
with 57 stored elements in Compressed Sparse Row format>

When I try to measure the cosine similarity between the centroid and the other vectors in my tfidf_vec_list with the following lines of code:

for centroid in centroids:
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]

where the similarity function is:

def cosine_similarity(vector1,vector2):
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
    return score

I get this error:

Traceback (most recent call last):
  File "<pyshell#25>", line 1, in <module>
    sim_scores=[cosine_similarity(vector,centroid) for vector in tfidf_vec_list]
  File "/home/ashwin/Desktop/Python-2.7.9/programs/test_2.py", line 28, in cosine_similarity
    score=1-scipy.spatial.distance.cosine(vector1,vector2)
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 287, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 302, in __mul__
    raise ValueError('dimension mismatch')

I have tried everything, including converting the matrix to an array and converting each vector to a list, but I get the same error!

1 answer:

Answer 0 (score: 3)

scipy.spatial.distance.cosine does not appear to support sparse matrix input. Specifically, np.linalg.norm(sparse_vector) fails (see "Get norm of numpy sparse matrix rows").
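As a side note on the norm issue, the Euclidean norm of a sparse row can be computed from sparse operations alone. The sketch below is only an illustration: the helper name `sparse_row_norm` and the sample vector are hypothetical and do not come from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_row_norm(row):
    # Hypothetical helper: Euclidean norm of a 1xN sparse row.
    # multiply() is element-wise, so this sums the squared stored values.
    return np.sqrt(row.multiply(row).sum())

# Illustrative 1x4 sparse row vector (not the question's data).
xs = csr_matrix(np.array([[3.0, 0.0, 4.0, 0.0]]))
norm_xs = sparse_row_norm(xs)
```

This avoids building a dense copy, which may matter if many rows have to be processed.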

If you convert the two input vectors (which are actually row vectors in matrix form) to dense versions before passing them in, it works fine:

>>> xs
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> ys
<1x4 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
>>> cosine(xs, ys)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/scipy/spatial/distance.py", line 296, in cosine
    dist = 1.0 - np.dot(u, v) / (norm(u) * norm(v))
  File "/usr/lib/python3.4/site-packages/scipy/sparse/base.py", line 308, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
>>> cosine(xs.todense(), ys.todense())
-2.2204460492503131e-16

This should be no problem for individual 5369-element vectors (as opposed to the whole matrix).
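Applied to the loop from the question, the dense conversion might look like the sketch below. It makes some assumptions: the function name `cosine_similarity_dense` and the small `docs` and `centroid` matrices are made up for illustration, and `.toarray().ravel()` is used in place of `.todense()` because recent SciPy releases expect 1-D array inputs to `cosine`.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine

def cosine_similarity_dense(sparse_row1, sparse_row2):
    # Assumption: both inputs are 1xN sparse rows; densify each to a 1-D array.
    u = sparse_row1.toarray().ravel()
    v = sparse_row2.toarray().ravel()
    # scipy's cosine() returns a distance, so similarity is 1 - distance.
    return 1.0 - cosine(u, v)

# Tiny stand-ins for tfidf_vec_list (3 documents) and one centroid.
docs = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0],
                            [0.0, 3.0, 0.0, 0.0],
                            [2.0, 0.0, 4.0, 0.0]]))
centroid = csr_matrix(np.array([[1.0, 0.0, 2.0, 0.0]]))

# Index the rows explicitly so each item is a 1xN sparse row.
sim_scores = [cosine_similarity_dense(docs[i], centroid)
              for i in range(docs.shape[0])]
```

Only one row is densified at a time, so memory use stays proportional to the vocabulary size rather than to the whole matrix.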