Using the fit() function

Time: 2018-03-28 01:34:51

Tags: python pandas scikit-learn classification sklearn-pandas

I have X_train and y_train as two numpy.ndarrays, of shape (32561, 108) and (32561,) respectively.

Every time I call fit() on a GaussianProcessClassifier, I get a memory error.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X_train.shape
(32561, 108)
>>> y_train.shape
(32561,)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train,y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 613, in fit
    self.base_estimator_.fit(X, y)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 209, in fit
    self.kernel_.bounds)]
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 427, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 199, in fmin_l_bfgs_b
    **opts)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 201, in obj_func
    theta, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 338, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 753, in __call__
    K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1002, in __call__
    K = self.constant_value * np.ones((X.shape[0], Y.shape[0]))
  File "/home/retsim/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 188, in ones
    a = empty(shape, dtype, order)
MemoryError
>>> 

Why am I getting this error, and how can I fix it?

3 Answers:

Answer 0 (score: 3)

On line 400 of gpc.py, the implementation of the classifier you are using creates a matrix of size (N, N), where N is the number of observations. So the code is trying to allocate a matrix of shape (32561, 32561). That will obviously cause problems, since that matrix has over a billion elements.
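
To get a feel for the scale, here is a quick back-of-the-envelope calculation (a sketch, using the sample count from the question) of how much memory a single dense float64 matrix of that shape needs:

```python
import numpy as np

n = 32561  # number of training observations from the question
bytes_needed = n * n * np.dtype(np.float64).itemsize
print(f"{bytes_needed / 1e9:.1f} GB")  # roughly 8.5 GB for one dense kernel matrix
```

And the optimizer evaluates the kernel (and its gradient) repeatedly, so peak usage is a multiple of that.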

As for why it does that, I don't really know scikit-learn's implementation, but in general Gaussian processes need to estimate a covariance matrix over the entire input, which is why they are not so great if you have a lot of data. (The documentation doesn't say much more than "high-dimensional".)

My only suggestion for working around it is to process the data in batches. Scikit-learn may have utilities that generate batches for you, or you can do it manually.
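
Scikit-learn won't batch GP training for you, but the row indexing itself is easy to do by hand. A minimal sketch (the batch size of 5000 is an arbitrary choice):

```python
import numpy as np

def batches(n_rows, batch_size):
    """Yield index arrays that cover range(n_rows) in chunks of batch_size."""
    for start in range(0, n_rows, batch_size):
        yield np.arange(start, min(start + batch_size, n_rows))

# Split the 32561 rows from the question into chunks of 5000
sizes = [len(idx) for idx in batches(32561, 5000)]
print(sizes)  # [5000, 5000, 5000, 5000, 5000, 5000, 2561]
```

Each index array can then be used as `X_train[idx]` / `y_train[idx]` to fit a model per batch (or to subsample once), keeping the per-fit kernel matrix small.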

Answer 1 (score: 2)

According to the Scikit-Learn documentation, the estimator GaussianProcessClassifier (and likewise GaussianProcessRegressor) has a parameter copy_X_train that is set to True by default:

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

The description of the copy_X_train parameter states:

If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.

I tried fitting the estimator on a PC with 32 GB of RAM, using a training dataset of a size (in both observations and features) similar to the one described by the OP. With copy_X_train set to True, the "persistent copy of the training data" presumably exhausted my RAM, causing the MemoryError. Setting this parameter to False resolved the issue.
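
A minimal sketch of that fix (with small synthetic stand-in data, since the original dataset is not available; the label rule is purely illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.rand(200, 4)             # small stand-in for the (32561, 108) data
y = (X[:, 0] > 0.5).astype(int)  # simple separable labels for illustration

# copy_X_train=False stores only a reference to X instead of a full copy
gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                   copy_X_train=False)
gp_opt.fit(X, y)
print(gp_opt.score(X, y))
```

The trade-off, per the documentation quoted above, is that X must not be modified in place afterwards, or predictions may change.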

Scikit-Learn's description states that with this setting, "just a reference to the training data is stored, which might cause predictions to change if the data is modified externally". My interpretation of this statement is:

Instead of storing the entire training dataset (in matrix form, of size n x n based on n observations) inside the fitted estimator, only a reference to the dataset is stored, thereby avoiding high RAM usage. As long as the dataset remains intact externally (i.e. outside the fitted estimator), it can be reliably fetched when predictions have to be made. Modifying the dataset would affect the predictions.

There may well be a better and more theoretical explanation.

Answer 2 (score: 0)

Look into dimensionality reduction techniques such as Principal Component Analysis. This will reduce your features and shrink the size of the input matrix.
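
A brief sketch of that idea with scikit-learn's PCA (random stand-in data; the 95% variance threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_train = rng.rand(1000, 108)  # stand-in for the original feature matrix

# Keep only the components needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
print(X_reduced.shape)  # same number of rows, fewer than 108 columns
```

Note that PCA shrinks the number of columns (features), while the (N, N) matrix from the first answer scales with the number of rows, so for this particular MemoryError, reducing the number of observations (e.g. by subsampling) matters more.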