I am reading some text files, splitting each file's text into words, and storing all the words in a list. Then I am doing one-hot encoding. When a text file is larger than 1 MB, I run into a MemoryError. Here is my code:
from numpy import array
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

path = "D://DATA//"
data = []
for file in os.listdir(path):
    print(file)
    if file.endswith(".txt"):
        with open(os.path.join(path, file), 'r', encoding="utf8") as f:
            d1 = f.read()
            data += d1.split()
print(data)
values = data
# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)
This is the error I get:
Traceback (most recent call last):
  File "C:/Users/Desktop/onehot.py", line 21, in <module>
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
  File "C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\preprocessing\_encoders.py", line 516, in fit_transform
    self._categorical_features, copy=True)
  File "C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\preprocessing\base.py", line 52, in _transform_selected
    return transform(X)
  File "C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\preprocessing\_encoders.py", line 489, in _legacy_fit_transform
    return out if self.sparse else out.toarray()
  File "C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\sparse\compressed.py", line 962, in toarray
    out = self._process_toarray_args(order, out)
  File "C:\Users\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\sparse\base.py", line 1187, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
I am using 64-bit Python. I have searched for this kind of problem and was told to change gcv_mode, but I don't know how that applies in this case. Please help me with this. Thanks in advance.
Answer 0 (score: 0)
So it seems you are working with text data. A one-hot encoding scheme over raw text will certainly run out of memory; I suggest you use
sklearn's TfidfVectorizer or CountVectorizer instead:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
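To apply this to the files from the question, treat each file's full text as one document in the corpus. A minimal sketch, assuming the same path and .txt layout as in the question (the corpus variable name is illustrative):

import os
from sklearn.feature_extraction.text import TfidfVectorizer

path = "D://DATA//"
corpus = []
for file in os.listdir(path):
    if file.endswith(".txt"):
        with open(os.path.join(path, file), 'r', encoding="utf8") as f:
            corpus.append(f.read())  # one document per file

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # X stays a scipy sparse matrix, so no dense blow-up
print(X.shape)  # (number of files, vocabulary size)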
Or you can play with CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]
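If you really do need one-hot vectors per word, keep the encoder's output sparse instead of forcing a dense array: the sparse=False flag in your code is exactly what triggers the toarray() call shown in the traceback. A minimal sketch, assuming a word list like the data built in your question (the sample list here is only a placeholder):

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

data = ["this", "is", "a", "small", "sample"]  # placeholder; use your own word list

# integer encode the words first, as in the question
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(data)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)

# leave sparse output on (the default) so the result is a scipy sparse matrix
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded.shape)  # (number of words, number of unique words)

With one row per word and one column per unique word, the sparse matrix stores only the nonzero entries, so even large files should fit in memory.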