How do I keep a TF-IDF array from using too much memory?

Asked: 2018-09-16 20:16:56

Tags: python machine-learning neural-network tf-idf tfidfvectorizer

From articles explaining TF-IDF output, I understand that the result is a sparse matrix, which we then convert with .toarray() so it can be fed into a neural network. But here I get a memory error that I don't understand. Is my computer really using that much memory? And how can I solve this problem?
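For context, the traceback below fails inside np.zeros, which allocates a dense float64 array of shape (n_documents, vocabulary_size), i.e. roughly n · m · 8 bytes. A quick back-of-the-envelope check (the corpus figures below are illustrative placeholders, not taken from the question):

```python
# Rough size of the dense array that .toarray() would try to allocate.
# n_documents and vocabulary_size are hypothetical example figures.
n_documents = 25_000        # e.g. a movie-review corpus
vocabulary_size = 75_000    # a plausible TfidfVectorizer vocabulary
bytes_per_float64 = 8       # default dtype of the TF-IDF matrix

dense_bytes = n_documents * vocabulary_size * bytes_per_float64
print(f"{dense_bytes / 1024**3:.1f} GiB")  # → 14.0 GiB
```

Even a modest corpus can therefore exceed available RAM once densified, while the sparse form stores only the non-zero entries.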

The code is:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer().fit(train_text)

# this dense conversion is the line that raises the MemoryError below
tfidf_vector = vectorizer.transform(train_text).toarray()
tfidf_vector = tfidf_vector[:, :, None]  # extra axis for the network input

print(tfidf_vector.shape)
# train_labels holds the corresponding targets
X_train, X_test, Y_train, Y_test = train_test_split(
    tfidf_vector, train_labels, test_size=0.2, random_state=1)
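One way around the allocation is to keep the TF-IDF matrix sparse for as long as possible: train_test_split accepts scipy sparse matrices, so only small batches ever need to be densified. A minimal sketch, assuming scikit-learn and SciPy (the tiny corpus and labels below are placeholders for train_text and its targets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from scipy import sparse

# placeholder corpus and labels standing in for train_text / train_labels
train_text = ["good movie", "bad movie", "great film", "awful film"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer().fit(train_text)
tfidf = vectorizer.transform(train_text)   # stays a scipy CSR matrix
print(sparse.issparse(tfidf))              # → True

# split the sparse matrix directly; no .toarray() on the full data
X_train, X_test, y_train, y_test = train_test_split(
    tfidf, labels, test_size=0.5, random_state=1)

# densify only one mini-batch at a time when feeding the network
batch = X_train[:2].toarray()[:, :, None]  # small dense slice, extra axis
print(batch.shape)                         # → (2, 6, 1)
```

Densifying per batch keeps peak memory proportional to the batch size rather than to the whole corpus.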

And the error:

  File "C:/Users/xiangli/PycharmProjects/preparing_moviedata/polarity.py", line 60, in <module>
    tfidf_vector = vectorizer.transform(train_text).toarray()
  File "C:\Users\xiangli\Miniconda3\envs\preparing_moviedata\lib\site-packages\scipy\sparse\compressed.py", line 947, in toarray
    out = self._process_toarray_args(order, out)
  File "C:\Users\xiangli\Miniconda3\envs\preparing_moviedata\lib\site-packages\scipy\sparse\base.py", line 1184, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError

I want to use the TF-IDF vectorizer output as the input to a neural network.

0 Answers:

There are no answers.