I am trying to use scikit-learn to predict values for input text strings. I use HashingVectorizer to vectorize the data and PassiveAggressiveClassifier with partial_fit for learning (see the code below):
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle
a,r = [],[]
vectorizer = TfidfVectorizer()
with open('val', 'rb') as f:
    r = pickle.load(f)
with open('text', 'rb') as f:
    a = pickle.load(f)
L = (vectorizer.fit_transform(a))
training_set = L[:3250]
testing_set = L[3250:]
M = np.array(r)
training_result = M[:3250]
testing_result = M[3250:]
cls = np.unique(r)
model = PassiveAggressiveClassifier()
model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)
print(testing_result)
print(predicted)
Error log:
File "try.py", line 89, in <module>
model.partial_fit(training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
coef_init, intercept_init)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
dtype=np.float64, order="C")
MemoryError
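I previously used CountVectorizer with logistic regression for classification and had no problems. But my training data is several million rows, and I want to implement incremental learning with the script above, which raises a MemoryError on every run.
Update
After applying partial_fit in a loop, it raises a feature-count mismatch error (ValueError: Number of features 8897 does not match previous data 9190.)
Even when I set the max_features attribute, the resulting predictions are incorrect.
Is there any way to use the partial_fit method with a variable number of features?
Execution output: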
(400, 8481)
(400, 9277)
Traceback (most recent call last):
File "f9.py", line 65, in <module>
training_set, training_result, classes=cls)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
coef_init=None, intercept_init=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
% (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.
Any help would be greatly appreciated.
Thanks!
Answer 0 (score: 1)
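The memory error comes from holding too much data in memory. Loading your data takes an amount of memory N; then, when you call partial_fit, the algorithm may store additional data of its own, possibly close to N again.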
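The traceback in the question fails inside _allocate_parameter_mem with dtype=np.float64, which suggests that the allocation of the model's own coefficient array is what runs out of memory. As a rough sketch, assuming that array is dense with one row per class and one column per feature (an assumption about the library internals, not something the thread confirms), its size can be estimated as below. The helper name coef_matrix_bytes and the class counts are hypothetical; 9190 is the feature count from the error message.

```python
def coef_matrix_bytes(n_classes, n_features, itemsize=8):
    """Bytes needed for a dense (n_classes, n_features) float64 array."""
    return n_classes * n_features * itemsize


# Hypothetical class counts; 9190 features as reported in the ValueError.
for n_classes in (2, 100, 3250):
    size_mb = coef_matrix_bytes(n_classes, 9190) / 1e6
    print(n_classes, "classes:", round(size_mb, 1), "MB")
```

If cls = np.unique(r) returns a very large number of distinct values (for example because r holds continuous values rather than a small set of labels), this array can dominate memory use regardless of how the input is chunked.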
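You don't need to hold the data twice. Try reducing the size of the initial data block: split it into several parts and pass each one to the partial_fit method.
You should read the file line by line to build a chunk, fit on that chunk, flush the memory, and repeat: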
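# Assumes `path`, `model`, `cls` and a fitted or stateless `vectorizer` exist;
# `label_of(line)` is a hypothetical helper returning the label for a line.
chunk_size = 1000  # the "X" lines per chunk; tune to the available memory
chunk = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        # Build a chunk of chunk_size lines
        chunk.append(line)
        if len(chunk) == chunk_size:
            # Learn with partial_fit; transform() (not fit_transform())
            # keeps the feature count identical across chunks
            labels = [label_of(text) for text in chunk]
            model.partial_fit(vectorizer.transform(chunk), labels, classes=cls)
            # Flush the chunk
            chunk = []
# Learn from the last, possibly smaller, chunk
if chunk:
    labels = [label_of(text) for text in chunk]
    model.partial_fit(vectorizer.transform(chunk), labels, classes=cls)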
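A minimal, self-contained sketch of the chunked approach, which also addresses the feature-count mismatch from the question's update. It uses HashingVectorizer (the vectorizer the question's text mentions, not the TfidfVectorizer its code uses) because it is stateless and always produces the same number of columns. The helper names iter_chunks and train_incrementally, the (text, label) pair format, and the chunk_size and n_features values are assumptions made for illustration.

```python
from itertools import islice

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier


def iter_chunks(items, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def train_incrementally(pairs, classes, chunk_size=1000, n_features=2 ** 18):
    """Fit a classifier on (text, label) pairs, one chunk at a time."""
    # Stateless: transform() needs no fit() and always returns
    # n_features columns, so every chunk has the same width.
    vectorizer = HashingVectorizer(n_features=n_features)
    model = PassiveAggressiveClassifier()
    all_classes = np.unique(classes)  # every label that can ever appear
    for chunk in iter_chunks(pairs, chunk_size):
        texts, labels = zip(*chunk)
        X = vectorizer.transform(texts)
        model.partial_fit(X, list(labels), classes=all_classes)
    return vectorizer, model


if __name__ == "__main__":
    # Toy data, purely illustrative.
    data = [
        ("cheap pills online", "spam"),
        ("meeting moved to noon", "ham"),
        ("win money now", "spam"),
        ("lunch tomorrow?", "ham"),
        ("free prize, click here", "spam"),
    ]
    vec, clf = train_incrementally(data, ["ham", "spam"], chunk_size=2)
    print(clf.predict(vec.transform(["free money"])))
```

Since pairs can be any iterable, it can be a generator that reads the file lazily, so only one chunk is held in memory at a time. If there are many classes, a smaller n_features keeps the model's coefficient array manageable, at the cost of more hash collisions.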