Memory error with classifier fit and partial_fit

Asked: 2015-06-09 08:52:30

Tags: python machine-learning scikit-learn

I am trying to use scikit-learn to predict a value for an input text string. I vectorize the data with HashingVectorizer and train a PassiveAggressiveClassifier with partial_fit (see the code below):

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn import metrics
from sklearn.metrics import zero_one_loss
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import PassiveAggressiveClassifier, SGDClassifier, Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.externals import joblib
import pickle

a,r = [],[]

vectorizer = TfidfVectorizer()

with open('val', 'rb') as f:
    r = pickle.load(f)

with open('text', 'rb') as f:
    a = pickle.load(f)

L = (vectorizer.fit_transform(a))

training_set = L[:3250]
testing_set = L[3250:]

M = np.array(r)

training_result = M[:3250]
testing_result = M[3250:]

cls = np.unique(r)

model = PassiveAggressiveClassifier()

model.partial_fit(training_set, training_result, classes=cls)
print(model)
predicted = model.predict(testing_set)

print testing_result
print predicted

Error log:

File "try.py", line 89, in <module>
    model.partial_fit(training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 374, in _partial_fit
    coef_init, intercept_init)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 167, in _allocate_parameter_mem
    dtype=np.float64, order="C")
MemoryError

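I previously used CountVectorizer with logistic regression for classification, and that worked without problems. But my training data is roughly several million rows, and I want to implement incremental learning with the script above. Every run ends in a memory error.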

Update

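After applying partial learning in a loop, partial_fit raises a feature-count mismatch error (ValueError: Number of features 8897 does not match previous data 9190.). Even when I set the max_features attribute, the resulting predictions are incorrect. Is there any way to use partial_fit with a variable number of features?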

Execution output:

(400, 8481)
(400, 9277)
Traceback (most recent call last):
  File "f9.py", line 65, in <module>
    training_set, training_result, classes=cls)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/passive_aggressive.py", line 115, in partial_fit
    coef_init=None, intercept_init=None)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/linear_model/stochastic_gradient.py", line 379, in _partial_fit
    % (n_features, self.coef_.shape[-1]))
ValueError: Number of features 9277 does not match previous data 8481.

Any help would be appreciated.

Thanks!

1 Answer:

Answer 0 (score: 1)

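The memory error comes from having too much data in memory. When you load your data you already hold an amount N; then, when you call partial_fit, depending on the algorithm it stores additional data, possibly close to N.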
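The traceback in the question ends in _allocate_parameter_mem with dtype=np.float64, so it may also be worth checking how large the model's own weight array would be. Here is a rough sketch, assuming that for more than two classes the model allocates a dense float64 array with one row per class and one column per feature. The helper name coef_bytes and the class count are made up for illustration; 2**20 is the default n_features of HashingVectorizer.

```python
def coef_bytes(n_classes, n_features, itemsize=8):
    """Approximate bytes needed for a dense (n_classes, n_features) float64 array."""
    return n_classes * n_features * itemsize


# Hypothetical case: 3,000 distinct target values, 2**20 hashed features
size = coef_bytes(3000, 2**20)
print("%.1f GB" % (size / 1e9))
```

With numbers like these the weight array alone would need about 25 GB, no matter how small each chunk of training data is. If np.unique(r) returns thousands of distinct values, that could be the real limit.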

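You don't need to store the data twice. Try reducing the size of the initial chunk of data: split it into several parts and pass them to partial_fit one at a time.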

You should read the file line by line to build a chunk, fit on that chunk, flush the memory, and then repeat:

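chunk_size = 1000  # the "X" above; an assumed value, pick one that fits in memory
arr = []
with open(path, "r", encoding="utf-8") as f:
    for line in f:
        # Build a chunk of chunk_size lines
        arr.append(line)

        # Learn with partial_fit once the chunk is full
        if len(arr) == chunk_size:
            # X_chunk, y_chunk: features and labels you build from arr
            model.partial_fit(X_chunk, y_chunk, classes=classes)
            # Flush the last chunk
            arr = []

# Remember to fit whatever is left in arr after the loop in the same way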
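
To make the loop above concrete, here is a minimal sketch under a few assumptions that are not in the question: each line of the file is "label<TAB>text", the function name fit_from_file is made up, and SGDClassifier is used only as a stand-in (any estimator with partial_fit, such as PassiveAggressiveClassifier, can be passed instead). It uses HashingVectorizer because it is stateless and gives every chunk the same number of columns, which should avoid the "Number of features ... does not match previous data" error that appears when a TfidfVectorizer is refitted on each chunk.

```python
import os
import tempfile

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def fit_from_file(model, path, classes, chunk_size=1000, n_features=2**18):
    """Stream 'label<TAB>text' lines from path and train with partial_fit."""
    # No fit step needed: every transform() call yields n_features columns
    vectorizer = HashingVectorizer(n_features=n_features)
    texts, labels = [], []

    def flush():
        # Train on the current chunk, then empty it to free memory
        if texts:
            X = vectorizer.transform(texts)  # sparse, only chunk_size rows
            model.partial_fit(X, np.array(labels), classes=classes)
            del texts[:]
            del labels[:]

    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)
            texts.append(text)
            if len(texts) == chunk_size:
                flush()
    flush()  # leftover lines that did not fill a whole chunk
    return model, vectorizer


# Tiny made-up data set, only to show the call
rows = ["pos\tgood great fine", "neg\tbad awful poor"] * 5
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("\n".join(rows) + "\n")

model, vectorizer = fit_from_file(
    SGDClassifier(), tmp.name, classes=np.array(["neg", "pos"]), chunk_size=4
)
os.remove(tmp.name)
print(model.coef_.shape)
```

The full list of classes has to be known before the first partial_fit call, so it is passed in rather than computed from each chunk. Use the returned vectorizer to transform new text before calling predict.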