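I'm currently trying to use SGDRegressor from scikit-learn to solve a multivariate-target problem on a large dataset, where X has shape (10^6, 10^4). Because of its size, I generate the design matrix (X) in parts with the following code, where each iteration yields a batch of shape (10^3, 10^4):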
design = self.__iterX__(events)
reglins = [linear_model.SGDRegressor(fit_intercept=True) for i in range(nTargets)]
for X, times in design:
    for i in range(nTargets):
        reglins[i].partial_fit(X, y.ix[times].values[:, i])
However, I get the following stack trace:
  File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 841, in partial_fit
    coef_init=None, intercept_init=None)
  File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 812, in _partial_fit
    sample_weight, n_iter)
  File ".../Enthought/Canopy_64bit/User/lib/python2.7/site-packages/sklearn/linear_model/stochastic_gradient.py", line 948, in _fit_regressor
    intercept_decay)
  File "sgd_fast.pyx", line 508, in sklearn.linear_model.sgd_fast.plain_sgd (sklearn/linear_model/sgd_fast.c:8651)
ValueError: floating-point under-/overflow occurred.
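From looking around, this seems to happen because X is not properly normalized. I know scikit-learn has various utilities for this, but given that I generate X in blocks, is it enough to simply normalize each block on its own, or do I need to find a way to standardize each entire column at once?
As an aside, is there a particular reason why partial_fit does not allow multivariate targets?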
Answer 0 (score: 3)
You can fit the scaler on one block and then apply it to the other blocks:
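from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
# Fit the scaling statistics on the first block and transform it
x1 = scaler.fit_transform(X_block_1)
# Reuse the same statistics for every later block
xn = scaler.transform(X_block_n)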
You can choose other normalization methods from this page.
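If the first block is not representative of the whole dataset, another option is to accumulate the column statistics over all blocks before training. Below is a minimal sketch of that idea, assuming a scikit-learn version in which StandardScaler provides partial_fit and assuming the blocks can be generated twice. The iter_blocks generator and its synthetic data are hypothetical stand-ins for the block iterator in the question, and a single target is used for brevity.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

n_features = 20
true_coef = np.random.RandomState(0).randn(n_features)


def iter_blocks(n_blocks=10, block_size=200):
    """Hypothetical stand-in for the question's block generator.

    Yields the same sequence of (X, y) blocks every time it is called.
    """
    block_rng = np.random.RandomState(1)
    for _ in range(n_blocks):
        # Deliberately badly scaled features (large mean and variance).
        X = block_rng.randn(block_size, n_features) * 1e3 + 5e3
        y = (X - 5e3).dot(true_coef) / 1e3
        yield X, y


# Pass 1: accumulate per-column mean and variance over all blocks.
scaler = StandardScaler()
for X_block, _ in iter_blocks():
    scaler.partial_fit(X_block)

# Pass 2: scale each block with the global statistics, then update the model.
model = SGDRegressor(fit_intercept=True, random_state=0)
for X_block, y_block in iter_blocks():
    model.partial_fit(scaler.transform(X_block), y_block)
```

With several targets, the second pass could keep the loop over one regressor per target from the question and pass the scaled block to each of them. The cost of this approach is one extra pass over the data, so it only makes sense if regenerating the blocks is affordable.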