Passing multiple H2O data frames as input in Python multiprocessing

Time: 2019-05-18 10:14:46

Tags: python-3.x multiprocessing h2o

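I am trying to use multiprocessing in Python (version 3.6.8), where two tables are passed as inputs to a function. Inside the function I fit a model with h2o. Here is my code: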

import pandas as pd
from h2o.estimators import H2OGradientBoostingEstimator

def innerFold(params, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold):

    # one row per parameter combination (pd_scores was never initialised before)
    rows = []

    for param in params:

        # copy so the caller's dict is not mutated, then pull out the index
        param = dict(param)
        counter = param.pop('counter')

        print('parameter combination: ', param)
        print('COUNTER: ', counter)

        # define model and fit; random_state is assumed to be defined at module level
        gbm = H2OGradientBoostingEstimator(stopping_rounds = 5,
                                           stopping_metric = 'rmse',
                                           stopping_tolerance = 1e-4,
                                           seed = random_state,
                                           **param)

        print('TRAINING STARTS....')
        gbm.train(x = feats,
                  y = target,
                  training_frame = h2o_train_inner)

        # compute R^2 on the inner test frame once and record it
        score = gbm.model_performance(h2o_test_inner).r2()
        rows.append({'outer_fold': int(outer_fold),
                     'inner_fold': int(inner_fold),
                     'score': score,
                     'param_idx': int(counter)})

    return pd.DataFrame(rows)

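I split the "params" argument into 5 parts so that 5 different cores can process them; the remaining arguments should be identical. To do that, I split "params" as follows: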

# split once into 5 chunks instead of calling np.array_split five times
df0, df1, df2, df3, df4 = np.array_split(param_combs, 5)

and build the argument tuples as follows:

args = [(df0, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold), 
        (df1, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold), 
        (df2, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
        (df3, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
        (df4, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold)]
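The split-and-pair step above can also be sketched without NumPy. This is only an illustration: `split_into_chunks` and `build_args` are hypothetical helpers, and the frame arguments are replaced by placeholder strings so the snippet runs on its own. Chunk sizes follow the same convention as `np.array_split` (the leading chunks receive the extra items).

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks nearly equal parts, larger parts first."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def build_args(param_combs, shared, n_chunks=5):
    """Pair each chunk of parameter dicts with the same shared arguments."""
    return [(chunk, *shared) for chunk in split_into_chunks(param_combs, n_chunks)]


# Hypothetical stand-ins for the real inputs (strings instead of H2O frames).
param_combs = [{'counter': i, 'max_depth': 3 + i % 3} for i in range(12)]
shared = (['x1', 'x2'], 'y', 'train_frame', 'test_frame', 0, 1)

args = build_args(param_combs, shared)
print([len(a[0]) for a in args])  # chunk sizes: [3, 3, 2, 2, 2]
```

Building the list in one comprehension keeps the shared arguments in a single place, so changing the number of workers only means changing `n_chunks`.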

params, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold are of type list (containing dictionaries), string, H2O data frame, H2O data frame, int, int.

Finally, I start the pool like this:

p = mp.Pool(processes=5)
pool_results = p.starmap(innerFold, args)

I get:

  

TypeError: () missing 1 required positional argument: 'keyvals'

The number of arguments looks correct. What am I missing here?

EDIT: Apparently the problem lies with the H2O data frames. If I convert them to pandas DataFrames, it works. Any idea how to use the H2O frames directly?

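EDIT2: As I understand it, the arguments sent to the function (innerFold above) are pickled. Since H2O objects cannot be pickled, the function only works after the frames are converted to pandas DataFrames.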
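To make the pickling point concrete, here is a small sketch that checks which entries of an argument tuple would fail to serialize before they are handed to a pool. `UnpicklableFrame`, `is_picklable` and `find_unpicklable` are hypothetical names for illustration; the class only imitates an object that holds an unpicklable member and is not how H2O frames are implemented.

```python
import pickle


class UnpicklableFrame:
    """Hypothetical stand-in for a frame object that holds an unpicklable member."""

    def __init__(self, key):
        self.key = key
        self._callback = lambda: key  # lambdas cannot be pickled


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, the step used to ship arguments to workers."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


def find_unpicklable(arg_tuple):
    """Return the positions in arg_tuple that would fail when sent to a worker."""
    return [i for i, value in enumerate(arg_tuple) if not is_picklable(value)]


# Same argument layout as innerFold, with placeholder values.
args = ([{'max_depth': 3}], ['x1', 'x2'], 'y',
        UnpicklableFrame('train'), UnpicklableFrame('test'), 0, 1)
print(find_unpicklable(args))  # positions of the two frame stand-ins: [3, 4]
```

One possible workaround, stated here as an untested assumption, is to pass each frame's `frame_id` string instead of the frame itself and fetch it inside the worker with `h2o.get_frame` once the worker has connected to the running cluster. Plain strings pickle without trouble.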

0 answers:

No answers