Question

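I am using scikit-learn for some text processing, such as tf-idf. The number of files (~40k) is handled fine. The number of unique words is the problem: I cannot work with the resulting array/matrix, whether to print its size or to dump the numpy array to a file (using savetxt). The traceback is below. It would be enough to get only the top tf-idf values, since I don't need every word for every single document. Alternatively, I could exclude additional words from the calculation (not stop words, but a separate set of words listed in a text file that I could add to), though I don't know whether the words I would take out would alleviate the problem. Finally, if I could somehow grab pieces of the matrix, that would work too. Any example of dealing with this kind of thing would be helpful and give me some starting points. (PS: I looked at and tried HashingVectorizer, but it seems I can't do tf-idf with it?)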

Traceback (most recent call last):
  File "/sklearn.py", line 40, in <module>
    array = X.toarray()
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 790, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 239, in toarray
    B = self._process_toarray_args(order, out)
  File "/home/kba/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 699, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
ValueError: array is too big.

Relevant code:

import os

from sklearn.feature_extraction.text import CountVectorizer

path = "/home/files/"

fh = open('output.txt','w')


filenames = os.listdir(path)

filenames.sort()

try:
    filenames.remove('.DS_Store')
except ValueError:
    pass # or scream: thing not in some_list!
except AttributeError:
    pass # call security, some_list not quacking like a list!

vectorizer = CountVectorizer(input='filename', analyzer='word', strip_accents='unicode', stop_words='english') 
X=vectorizer.fit_transform(filenames)
fh.write(str(vectorizer.vocabulary_))

array = X.toarray()
print array.size
print array.shape

Edit: in case it helps,

print 'Array is:' + str(X.get_shape()[0])  + ' by ' + str(X.get_shape()[1]) + ' matrix.'

gives the dimensions of the sparse matrix that is too big, in my case:

Array is: 39436 by 113214 matrix.

Answer 1

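The traceback holds the answer here: when you call X.toarray() at the end, it converts the sparse matrix representation to a dense one. This means that instead of storing a value only for each word that actually occurs in each document, you are now storing a value for every word in the vocabulary for every document.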
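To put a number on that, here is a rough back-of-the-envelope estimate using the shape reported in the question. The 8 bytes per cell is an assumption (64-bit integer counts); the real dtype on a given system may differ.

```python
# Shape reported in the question's edit.
n_docs, n_terms = 39436, 113214

# Assumption: each cell is a 64-bit integer count (8 bytes).
bytes_per_cell = 8

dense_cells = n_docs * n_terms
dense_bytes = dense_cells * bytes_per_cell
dense_gib = dense_bytes / float(2 ** 30)

print("dense cells: %d" % dense_cells)
print("dense size: %.1f GiB" % dense_gib)
```

Under that assumption the dense array would need roughly 33 GiB, which is why np.zeros gives up, while the sparse matrix only stores the nonzero entries.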

Thankfully, most operations work with sparse matrices, or have sparse variants. Just avoid calling .toarray() or .todense() and you'll be good to go.
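As a minimal sketch of what staying sparse can look like for the things the question asks about (tf-idf, excluding extra words, and keeping only the top values per document), the following may help. The toy corpus, the extra_stop_words list, and the top_terms helper are all hypothetical names made up for illustration; it is written with print() as in Python 3 and is not the asker's actual pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpus standing in for the ~40k files (hypothetical data).
docs = [
    "sparse matrices save memory",
    "dense arrays use a lot of memory",
    "tfidf works on sparse matrices",
]

# Hypothetical user-supplied exclusion list (could be read from a text file).
extra_stop_words = ["lot"]

vectorizer = CountVectorizer(stop_words=extra_stop_words)
counts = vectorizer.fit_transform(docs)            # scipy sparse matrix
tfidf = TfidfTransformer().fit_transform(counts)   # still sparse

# Array mapping column index -> term, built from the fitted vocabulary.
vocab = vectorizer.vocabulary_
terms = np.array(sorted(vocab, key=vocab.get))


def top_terms(row, k=2):
    """Return the k highest-weighted (term, score) pairs of one sparse row."""
    order = np.argsort(row.data)[::-1][:k]
    return [(terms[row.indices[i]], float(row.data[i])) for i in order]


for i in range(tfidf.shape[0]):
    print(i, top_terms(tfidf[i]))
```

Only one row's nonzero entries are looked at per step, so nothing the size of the full dense matrix is ever allocated.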

For more information, check out the scipy sparse matrix documentation.
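For the other two things mentioned in the question (grabbing pieces of the matrix and dumping it to a file), a small sketch using scipy.sparse directly could look like this. The matrix values and the counts.npz file name are made up for illustration, and save_npz / load_npz assume SciPy 0.19 or newer, which an older Python 2.7 Anaconda install may not have.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Small stand-in for the document-term matrix (hypothetical values).
X = sparse.csr_matrix(np.array([
    [0, 2, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 3, 0],
]))

# Shape and number of stored entries are available without densifying.
print(X.shape, X.nnz)

# Row slicing a CSR matrix yields another sparse matrix; converting only
# a small slice to dense is fine.
chunk = X[0:2]
small_dense = chunk.toarray()
print(small_dense)

# Save and reload in sparse form instead of using savetxt on a dense array.
out_path = os.path.join(tempfile.mkdtemp(), "counts.npz")
sparse.save_npz(out_path, X)
X_loaded = sparse.load_npz(out_path)
```

Processing the matrix in row chunks this way keeps peak memory proportional to the chunk size rather than to the full 39436 by 113214 shape.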

Dealing with a large number of unique words for text processing / tf-idf etc.
