Using joblib

Time: 2019-03-01 17:50:30

Tags: python parallel-processing scikit-learn joblib

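I am trying to run a randomized grid search on an sklearn estimator, but I don't want to cross-validate because I already have a train/validation/test split of my data. I have written a function that runs the randomized grid search, but I would like to parallelize it across threads. I have been looking at joblib and trying to work out how to adapt the Parallel(delayed(func)) pattern, but I can't figure out how to apply it to my code.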

Here is my function:

from random import sample

import pandas as pd
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid


def randomized_grid_search(model=None, param_grid=None, percent=0.5,
                           X_train=None, y_train=None,
                           X_val=None, y_val=None):
    # converts parameter grid into a list
    param_list = list(ParameterGrid(param_grid))
    # the number of combinations to try in the grid (at least one)
    n = max(1, int(len(param_list) * percent))
    # the reduced grid as a list
    reduced_grid = sample(param_list, n)
    # start below any possible recall so the first candidate is always kept
    best_score = -1.0
    best_grid = None

    # Loops through each of the possible scenarios and
    # then scores each model with predictions from the validation set.
    # The best score is kept and held with the best parameters.
    for g in reduced_grid:
        model.set_params(**g)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        recall = recall_score(y_val, y_pred)
        if recall > best_score:
            best_score = recall
            best_grid = g

    # Combines the training and validation datasets and
    # trains the model with the best parameters from the
    # grid search
    best_model = model
    best_model.set_params(**best_grid)
    X2 = pd.concat([X_train, X_val])
    y2 = pd.concat([y_train, y_val])
    return best_model.fit(X2, y2)

From https://joblib.readthedocs.io/en/latest/parallel.html, I think this is the direction I need to go in:

from math import sqrt

from joblib import Parallel, delayed

with Parallel(n_jobs=2) as parallel:
    accumulator = 0.
    n_iter = 0
    while accumulator < 1000:
        results = parallel(delayed(sqrt)(accumulator + i ** 2)
                           for i in range(5))
        accumulator += sum(results)  # synchronization barrier
        n_iter += 1

Should I be doing something like this, or am I on the wrong track?

2 Answers:

Answer 0: (score: 1)

I found some code on GitHub written by @skylander86, in which the author uses:

param_scores = Parallel(n_jobs=self.n_jobs)(
    delayed(_fit_classifier)(klass, self.classifier_args, param, self.metric,
                             X_train, Y_train, X_validation, Y_validation)
    for param in ParameterGrid(self.param_grid)
)

I hope that helps.
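As a rough sketch of how that pattern might be adapted to the function in the question: each parameter combination is fitted on a fresh clone of the estimator in its own joblib task, and the best-scoring combination is picked once all tasks return. The helper names `fit_and_score` and `parallel_grid_search`, the choice of `LogisticRegression`, and the synthetic data are assumptions made only to keep the example runnable; they are not taken from the linked GitHub code.

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid, train_test_split


def fit_and_score(model, params, X_train, y_train, X_val, y_val):
    """Hypothetical helper: fit a fresh copy of `model` with `params` and
    return (recall on the validation set, params)."""
    # clone() gives every task its own unfitted estimator, so parallel tasks
    # never mutate a shared model object.
    candidate = clone(model).set_params(**params)
    candidate.fit(X_train, y_train)
    return recall_score(y_val, candidate.predict(X_val)), params


def parallel_grid_search(model, param_grid, X_train, y_train, X_val, y_val,
                         n_jobs=2):
    """Hypothetical helper: score all combinations in parallel, keep the best."""
    scored = Parallel(n_jobs=n_jobs)(
        delayed(fit_and_score)(model, params, X_train, y_train, X_val, y_val)
        for params in ParameterGrid(param_grid)
    )
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_score, best_params


# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

best_score, best_params = parallel_grid_search(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    X_train, y_train, X_val, y_val,
)
```

To search only a random subset of the grid, as the question's function does, the generator could iterate over a sampled list of combinations instead of the full `ParameterGrid`.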

Answer 1: (score: 0)

Have you tried using the built-in parallelization via the n_jobs parameter?

grid = sklearn.model_selection.GridSearchCV(..., n_jobs=-1)

The GridSearchCV documentation describes the n_jobs parameter as:

  

n_jobs : int or None, optional (default=None)
    Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors...

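So although it does not distribute the work across threads, it does distribute it across processors, which gives you a degree of parallelism.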
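The question rules out cross-validation, while GridSearchCV always expects a cv argument. One possible way to reconcile the two, sketched here under assumptions, is to pass scikit-learn's `PredefinedSplit` so the search evaluates on a single fixed validation fold. `RandomizedSearchCV` stands in for the question's random sampling of the grid; the estimator, the parameter values, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (PredefinedSplit, RandomizedSearchCV,
                                     train_test_split)

# Synthetic data, used only to make the sketch runnable.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Stack training and validation rows. In test_fold, -1 marks rows that are
# always used for training and 0 marks rows in the single validation fold.
X_all = np.vstack([X_train, X_val])
y_all = np.concatenate([y_train, y_val])
test_fold = np.concatenate([
    np.full(len(X_train), -1),
    np.zeros(len(X_val), dtype=int),
])

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=2,                       # half of the four combinations
    scoring="recall",
    cv=PredefinedSplit(test_fold),  # one fixed split, no cross-validation
    n_jobs=2,                       # or -1 to use all processors
    refit=True,                     # refit the best model on X_all, y_all
    random_state=0,
)
search.fit(X_all, y_all)
```

With `refit=True`, the best parameters are refitted on the stacked training and validation rows, which mirrors the last step of the function in the question; the result is available as `search.best_estimator_`.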