How to solve memory problems when extracting n-grams with CountVectorizer?

Time: 2017-04-26 04:46:30

Tags: python scikit-learn out-of-memory countvectorizer

I have a 300 MB corpus. I am running 32-bit Python 3.6 on 32-bit Windows. How much memory does this operation need? My code is below.

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

a = load_files(r'D:\Train')  # the folder has two sub folders (one per class)
vectorizer = CountVectorizer(ngram_range=(4, 4), binary=True)
X = vectorizer.fit_transform(a.data)

ERROR:

File "D:/spyder/april.py", line 32, in <module>
X = vectorizer.fit_transform(a.data)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
for feature in analyze(doc):

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 141, in _word_ngrams
tokens.append(" ".join(original_tokens[i: i + n]))

MemoryError

I searched Google for a solution. The suggestion was to use HashingVectorizer, but it was mentioned that it does not give the corresponding tokens and feature names, whereas CountVectorizer provides the features and their indices. Please give me a solution using CountVectorizer itself.
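For reference, here is a minimal sketch of the two options mentioned above, using placeholder documents and parameter values (max_features, min_df and n_features would need tuning for the real 300 MB corpus): HashingVectorizer keeps memory bounded by hashing n-grams into a fixed number of columns but loses the column-to-n-gram mapping, while CountVectorizer can be kept within memory by capping the vocabulary and still exposes feature names through vocabulary_.

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

docs = ["the cat sat on the mat", "the cat sat on the log"]  # placeholder corpus

# HashingVectorizer: memory bounded by n_features, but it has no vocabulary_,
# so columns cannot be mapped back to n-gram strings.
hv = HashingVectorizer(ngram_range=(4, 4), binary=True, n_features=2**18)
X_hashed = hv.fit_transform(docs)

# CountVectorizer with a capped vocabulary: keeps feature names, but
# max_features / min_df must be chosen so the vocabulary fits in RAM.
cv = CountVectorizer(ngram_range=(4, 4), binary=True, max_features=100000, min_df=2)
X_counts = cv.fit_transform(docs)
print(cv.vocabulary_)  # maps each retained 4-gram to its column index

Note that a 32-bit Python process on Windows is normally limited to roughly 2 GB of address space, so even a capped vocabulary of unique 4-grams from a 300 MB corpus may have to be kept quite small.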

0 Answers:

There are no answers yet.