Question

I have created a document-term matrix using Sklearn and TfidfVectorizer.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(use_idf=True,
                        norm=normalization,
                        min_df=min_doc_freq)

dtm = tfidf.fit_transform(text)

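This gives a 128,111 x 3,469 sparse matrix in CSR format with 1,865,094 stored elements. I want to multiply it by its transpose, but every time I try, I get a memory error.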

The matrix is 128,111 x 3,469, so the result should be 128,111 x 128,111, which does not seem very large.
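As a rough check on that estimate, the sketch below works out the footprint of a 128,111 x 128,111 result in two hypothetical cases: fully dense float64, and CSR at an assumed fill fraction. The 8-byte values, 4-byte indices, and the 10% fill figure are assumptions for illustration, not measurements of the actual product.

```python
n_rows = 128_111                      # documents (rows) in dtm
n_cells = n_rows * n_rows             # cells in dtm @ dtm.T

# Dense float64 storage: 8 bytes per cell.
dense_bytes = n_cells * 8

def csr_bytes(nnz, n_rows, value_bytes=8, index_bytes=4):
    """Approximate CSR footprint: data + indices + indptr arrays."""
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

assumed_fill = 0.10                   # hypothetical fraction of nonzero cells
sparse_bytes = csr_bytes(int(n_cells * assumed_fill), n_rows)

print(f"cells:           {n_cells:,}")
print(f"dense float64:   {dense_bytes / 1e9:.1f} GB")
print(f"CSR at 10% fill: {sparse_bytes / 1e9:.1f} GB")
```

The real fill fraction depends on how many document pairs share at least one term, so the second figure is only a placeholder for whatever the data produces.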

I am using Python 3.7.2 (64-bit). At the time of writing, the VM I am working on has 84 GB of RAM available (over 125 GB in total).

I have tried each of the following, and every time I get the same error:

sim = dtm * dtm.T #(also used dtm.transpose()) 

sim = dtm @ dtm.T 

sim = dtm.dot(dtm.T)

I expected a sparse matrix to be returned, but I get a "MemoryError" instead.
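One way to probe the problem, sketched here rather than offered as a confirmed fix, is to form the product one block of rows at a time and drop small values before stacking, so that no single allocation has to hold the whole result. The function name `blockwise_gram`, the `block_size`, and the `threshold` are hypothetical choices, and a small random matrix stands in for `dtm`.

```python
import numpy as np
import scipy.sparse as sp

def blockwise_gram(mat, block_size=1000, threshold=0.0):
    """Compute mat @ mat.T in row blocks, keeping only entries > threshold."""
    mat = sp.csr_matrix(mat)
    mat_t = mat.T.tocsr()             # transpose once, reuse for every block
    blocks = []
    for start in range(0, mat.shape[0], block_size):
        block = mat[start:start + block_size] @ mat_t
        if threshold > 0.0:
            # Zero out small similarities, then drop them from storage.
            block.data[block.data <= threshold] = 0.0
            block.eliminate_zeros()
        blocks.append(block.tocsr())
    return sp.vstack(blocks, format="csr")

# Small stand-in for dtm (hypothetical data, not the real corpus).
demo = sp.random(200, 50, density=0.05, format="csr", random_state=0)
sim = blockwise_gram(demo, block_size=64, threshold=0.0)
print(sim.shape, sim.nnz)
```

With `threshold=0.0` the stacked result is as large as the full product, so this only helps if thresholding (or writing each block to disk instead of keeping it) shrinks what is retained.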

 ~/utilities/anaconda3/lib/python3.7/site-packages/scipy/sparse/compressed.py in _mul_sparse_matrix(self, other) 
 500 maxval=nnz) 
 501 indptr = np.asarray(indptr, dtype=idx_dtype) 
 --> 502 indices = np.empty(nnz, dtype=idx_dtype) 
 503 data = np.empty(nnz, dtype=upcast(self.dtype, other.dtype)) 
 504 

 MemoryError:
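Because the traceback stops at `np.empty(nnz, ...)`, it may help to gauge how large `nnz` could be before attempting the product. The sketch below computes an upper bound from per-term document counts: a term that appears in d documents can contribute at most d * d nonzero cells to `dtm @ dtm.T`. The helper name `gram_nnz_upper_bound` is hypothetical, and a small random matrix stands in for `dtm`.

```python
import numpy as np
import scipy.sparse as sp

def gram_nnz_upper_bound(mat):
    """Upper bound on the stored entries of mat @ mat.T for a sparse mat."""
    csc = sp.csc_matrix(mat)
    # Stored entries per column, i.e. documents containing each term.
    docs_per_term = np.diff(csc.indptr).astype(np.int64)
    pair_bound = int((docs_per_term ** 2).sum())
    n_rows = mat.shape[0]
    # The result can never hold more than n_rows * n_rows cells.
    return min(pair_bound, n_rows * n_rows)

# Small stand-in for dtm (hypothetical data, not the real corpus).
demo = sp.random(200, 50, density=0.05, format="csr", random_state=0)
bound = gram_nnz_upper_bound(demo)
actual = (demo @ demo.T).nnz
print(f"upper bound: {bound:,}  actual nnz: {actual:,}")
```

Multiplying the bound by the per-entry cost (index plus value bytes) gives a ceiling on the allocation; the true figure can be much lower when many documents share several terms.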

MemoryError when multiplying CSR sparse matrices

0 answers: