I'm trying to run a randomized grid search over an sklearn estimator, but I don't want to cross-validate because I already have a train/validation/test split for my data. I've built a function that runs the randomized grid search, but I'd like to parallelize it across threads. I've been looking at joblib and trying to work out how to adapt the Parallel(delayed(func)) pattern, but I can't figure out how to apply it to my code.

Here is my function:
from random import sample

import pandas as pd
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def randomized_grid_search(model=None, param_grid=None, percent=0.5,
                           X_train=None, y_train=None,
                           X_val=None, y_val=None):
    # converts the parameter grid into a list
    param_list = list(ParameterGrid(param_grid))
    # the number of combinations to try from the grid
    n = int(len(param_list) * percent)
    # the reduced grid as a list
    reduced_grid = sample(param_list, n)
    best_score = 0
    best_grid = None
    """
    Loops through each of the possible scenarios and
    then scores each model with predictions from the validation set.
    The best score is kept along with the best parameters.
    """
    for g in reduced_grid:
        model.set_params(**g)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        recall = recall_score(y_val, y_pred)
        if recall > best_score:
            best_score = recall
            best_grid = g
    """
    Combines the training and validation datasets and
    trains the model with the best parameters from the
    grid search.
    """
    best_model = model
    best_model.set_params(**best_grid)
    X2 = pd.concat([X_train, X_val])
    y2 = pd.concat([y_train, y_val])
    return best_model.fit(X2, y2)
From https://joblib.readthedocs.io/en/latest/parallel.html, I think this is the direction I need to go in:
with Parallel(n_jobs=2) as parallel:
    accumulator = 0.
    n_iter = 0
    while accumulator < 1000:
        results = parallel(delayed(sqrt)(accumulator + i ** 2)
                           for i in range(5))
        accumulator += sum(results)  # synchronization barrier
        n_iter += 1
Should I be doing something like this, or am I headed in the wrong direction?
Answer 0 (score: 1)
I found some code on GitHub written by @skylander86, in which the author uses:
param_scores = Parallel(n_jobs=self.n_jobs)(
    delayed(_fit_classifier)(klass, self.classifier_args, param, self.metric,
                             X_train, Y_train, X_validation, Y_validation)
    for param in ParameterGrid(self.param_grid))
I hope that helps.
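Applying that same pattern to your function: each parameter combination is fit and scored independently, so the body of your for loop maps cleanly onto Parallel(delayed(...)). Below is a minimal sketch (the helper name _fit_and_score and the function signature are my own, not from the linked code); each worker gets a cloned estimator so the fits don't interfere:

from random import sample

import pandas as pd
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def _fit_and_score(model, params, X_train, y_train, X_val, y_val):
    # clone so each worker fits an independent copy of the estimator
    m = clone(model).set_params(**params)
    m.fit(X_train, y_train)
    return recall_score(y_val, m.predict(X_val)), params


def parallel_randomized_grid_search(model, param_grid, percent,
                                    X_train, y_train, X_val, y_val,
                                    n_jobs=2):
    param_list = list(ParameterGrid(param_grid))
    reduced_grid = sample(param_list, int(len(param_list) * percent))
    # one (score, params) pair per combination, evaluated in parallel
    results = Parallel(n_jobs=n_jobs)(
        delayed(_fit_and_score)(model, g, X_train, y_train, X_val, y_val)
        for g in reduced_grid)
    best_score, best_grid = max(results, key=lambda t: t[0])
    # refit on train + validation with the best parameters
    X2 = pd.concat([X_train, X_val])
    y2 = pd.concat([y_train, y_val])
    return clone(model).set_params(**best_grid).fit(X2, y2)

Note that joblib's default backend runs separate processes rather than threads, which is usually what you want here since model fitting is CPU-bound.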
Answer 1 (score: 0)
Have you tried using the built-in parallelization via the n_jobs parameter?
grid = sklearn.model_selection.GridSearchCV(..., n_jobs=-1)
The GridSearchCV documentation describes the n_jobs parameter as:

n_jobs : int or None, optional (default=None). Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors...
So although it won't distribute the work across threads, it will distribute it across processors, which achieves a degree of parallelization.
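If the objection to GridSearchCV is only that it cross-validates, note that sklearn also provides PredefinedSplit, which lets GridSearchCV evaluate on a single fixed validation fold instead of k folds, while still using n_jobs. A minimal sketch (the 70/30 split and synthetic data are just for illustration): rows marked -1 in test_fold are training-only, and rows marked 0 form the one validation fold.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X, y = make_classification(n_samples=100, random_state=0)

# first 70 rows: training only (-1); last 30 rows: the validation fold (0)
test_fold = np.r_[np.full(70, -1), np.zeros(30)]

grid = GridSearchCV(LogisticRegression(max_iter=500),
                    {"C": [0.1, 1.0, 10.0]},
                    cv=PredefinedSplit(test_fold),
                    scoring="recall",
                    n_jobs=-1)
grid.fit(X, y)
print(grid.best_params_)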