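Following the discussion here, I decided to post another question about a strange memory error that occurs when extracting features in scikit-learn. The following code works as expected: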
import os
from sklearn.feature_extraction.text import CountVectorizer

# DATA_DIR is assumed to be defined earlier (the directory containing a.txt)
data = []
for i in range(0, 1000):
    filename = "a.txt"
    data.append(os.path.join(DATA_DIR, filename))

# input='filename' makes the vectorizer read each path from disk
vectorizer = CountVectorizer(encoding='utf-8-sig', input='filename')
vectors = vectorizer.fit_transform(data)
However, if I change the range to (0, 2000), it raises a MemoryError with the following traceback:
Traceback (most recent call last):
  File "C:\...\main.py", line 16, in <module>
    vectors = vectorizer.fit_transform(data)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 817, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Python27\lib\site-packages\sklearn\feature_extraction\text.py", line 769, in _count_vocab
    values = np.ones(len(j_indices))
  File "C:\Python27\lib\site-packages\numpy\core\numeric.py", line 178, in ones
    a = empty(shape, dtype, order)
MemoryError
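
To get a feel for the size of the allocation that fails in the traceback above, here is a minimal sketch. It assumes that `j_indices` holds one entry per token occurrence across all files (my reading of the traceback, not something I have confirmed), and it relies on `np.ones` defaulting to `float64`. The helper `ones_array_bytes` and the figure `tokens_per_file` are hypothetical and only for illustration; they are not part of scikit-learn or of my actual data.

```python
import numpy as np


def ones_array_bytes(n_entries, dtype=np.float64):
    """Bytes needed by np.ones(n_entries) for the given dtype.

    np.ones uses float64 by default, i.e. 8 bytes per element.
    """
    return n_entries * np.dtype(dtype).itemsize


# Hypothetical figure, for illustration only: the real token count of a.txt
# is not known here.
tokens_per_file = 100000

for n_files in (1000, 2000):
    # Assumption: one entry in j_indices per token occurrence in every file
    n_entries = tokens_per_file * n_files
    mib = ones_array_bytes(n_entries) / (1024.0 ** 2)
    print("%d files: about %.1f MiB for the np.ones array alone" % (n_files, mib))
```

Under these assumptions the array grows linearly with the number of files, so doubling the range doubles this single allocation, on top of whatever `j_indices` itself already occupies.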
Note:
I would be very grateful if someone could explain what is going on. Thanks in advance!