Question

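I am running a grid search for an SVR with a time-series split. My problem is that the grid search takes more than 30 minutes, which is too long. I have a large dataset of 17,800 data points, but this duration still seems excessive. Is there any way to reduce the run time? My code is: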

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.preprocessing import MinMaxScaler

# Reshape to column vectors (X_feature and y_label are defined earlier)
X_feature = X_feature.reshape(-1, 1)
y_label = y_label.reshape(-1, 1)

param = [{'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
          'C': [1, 10, 100, 1000]},
         {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [1, 2, 3, 4]}]

reg = SVR(C=1)
timeseries_split = TimeSeriesSplit(n_splits=3)
clf = GridSearchCV(reg, param, cv=timeseries_split, scoring='neg_mean_squared_error')

# Scale features and target to [0, 1]
X = MinMaxScaler(feature_range=(0, 1)).fit(X_feature)
scaled_X = X.transform(X_feature)

y = MinMaxScaler(feature_range=(0, 1)).fit(y_label)
scaled_y = y.transform(y_label)

# SVR expects a 1-D target, so flatten the scaled column vector when fitting
clf.fit(scaled_X, scaled_y.ravel())

My scaled y data is:

 [0.11321139]
 [0.07218848]
 ...
 [0.64844211]
 [0.4926122 ]
 [0.4030334 ]]

My scaled X data is:

[[0.2681013 ]
 [0.03454225]
 [0.02062136]
 ...
 [0.92857565]
 [0.64930691]
 [0.20325924]]

Answer 1

Use GridSearchCV(..., n_jobs=-1) to run the search in parallel on all available CPU cores.

Alternatively, you can use RandomizedSearchCV.
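As a rough illustration of both suggestions together, here is a minimal sketch of a randomized search over the RBF parameters with n_jobs=-1. The synthetic data, the log-uniform ranges, and n_iter=8 are assumptions chosen only to keep the example small; they are not taken from the question.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic stand-in for the scaled series (the real data has about 17,800 rows).
rng = np.random.default_rng(0)
X_demo = rng.random((300, 1))
y_demo = np.sin(4 * X_demo[:, 0]) + 0.1 * rng.standard_normal(300)

# Sample C and gamma from log-uniform ranges instead of enumerating a fixed grid.
param_dist = {
    "kernel": ["rbf"],
    "C": loguniform(1e0, 1e3),
    "gamma": loguniform(1e-5, 1e-2),
}

search = RandomizedSearchCV(
    SVR(),
    param_dist,
    n_iter=8,                        # number of sampled candidates (assumed)
    cv=TimeSeriesSplit(n_splits=3),  # same splitter as in the question
    scoring="neg_mean_squared_error",
    n_jobs=-1,                       # use all available CPU cores
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With 8 sampled candidates and 3 splits this runs 24 fits plus one refit, compared with 96 fits for the full 32-candidate grid in the question.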

Answer 2

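Depending on the data size and the estimator, it may simply take a long time. Alternatively, you can try splitting the process into smaller parts by searching over only one kernel at a time: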

param_rbf = {'kernel': ['rbf'], 'gamma': [1e-2, 1e-3, 1e-4, 1e-5],
                   'C': [1, 10, 100, 1000]}

Then use it like this:

clf = GridSearchCV(reg, param_rbf, cv=timeseries_split, scoring='neg_mean_squared_error')

Similarly, run the search separately for each of the other kernels, using a separate params dictionary:

params_poly = {'kernel': ['poly'], 'C': [1, 10, 100, 1000], 'degree': [1, 2, 3, 4]}

I know this is not exactly a solution, just a few suggestions that may help you reduce the time.
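The per-kernel approach could be sketched roughly as follows. The data is synthetic and the grids are trimmed so the demo finishes quickly; the names grids, results, and best_kernel are hypothetical and do not come from the question.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic stand-in for the scaled data.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 1))
y_demo = np.sin(4 * X_demo[:, 0]) + 0.1 * rng.standard_normal(200)

# One (trimmed) grid per kernel, searched independently.
grids = {
    "rbf": {"kernel": ["rbf"], "gamma": [1e-2, 1e-3], "C": [1, 10]},
    "poly": {"kernel": ["poly"], "C": [1, 10], "degree": [2, 3]},
}

results = {}
for name, grid in grids.items():
    search = GridSearchCV(
        SVR(),
        grid,
        cv=TimeSeriesSplit(n_splits=3),
        scoring="neg_mean_squared_error",
    )
    search.fit(X_demo, y_demo)
    results[name] = (search.best_score_, search.best_params_)

# Scores are negated MSE, so the largest value is the best.
best_kernel = max(results, key=lambda k: results[k][0])
print(best_kernel, results[best_kernel])
```

Splitting the search this way does not reduce the total number of fits, but each run is shorter and its result can be inspected or saved before starting the next one.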

Also set the verbose option to True. This will show you the progress of the search.
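A small sketch of what that looks like, assuming synthetic data and a tiny grid. In scikit-learn, verbose is an integer level, so True behaves like 1; higher values print more detail per fit.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

# Synthetic stand-in for the scaled data.
rng = np.random.default_rng(0)
X_demo = rng.random((150, 1))
y_demo = np.sin(4 * X_demo[:, 0]) + 0.1 * rng.standard_normal(150)

grid = {"kernel": ["rbf"], "gamma": [1e-2, 1e-3], "C": [1, 10]}

search = GridSearchCV(
    SVR(),
    grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_mean_squared_error",
    verbose=2,  # progress messages are printed while the search runs
)
search.fit(X_demo, y_demo)
```

The progress messages make it easier to see which parameter combinations dominate the run time.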

In addition, setting n_jobs=-1 does not necessarily make the search faster. See this answer for reference.

Grid search takes more than 30 minutes, is there any way to reduce this? (Jupyter Azure)
