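I am trying to use multiprocessing in Python (version 3.6.8) to call a function that takes two tables as inputs. Inside the function, I fit a model with h2o. Here is my code: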
import pandas as pd
from h2o.estimators import H2OGradientBoostingEstimator

def innerFold(params, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold):
    rows = []  # one record per parameter combination
    for param in params:
        param = dict(param)  # work on a copy so the caller's dict is not mutated
        counter = param.pop('counter')
        print('parameter combination: ', param)
        print('COUNTER: ', counter)
        # define model and fit
        # (random_state is assumed to be a module-level seed defined elsewhere)
        gbm = H2OGradientBoostingEstimator(stopping_rounds=5,
                                           stopping_metric='rmse',
                                           stopping_tolerance=1e-4,
                                           seed=random_state,
                                           **param)
        print('TRAINING STARTS....')
        gbm.train(x=feats,
                  y=target,
                  training_frame=h2o_train_inner)
        # score once on the inner test frame and record it
        score = gbm.model_performance(h2o_test_inner).r2()
        rows.append({'outer_fold': int(outer_fold),
                     'inner_fold': int(inner_fold),
                     'score': score,
                     'param_idx': int(counter)})
    # build the result frame locally instead of appending to an undefined pd_scores
    pd_scores = pd.DataFrame(rows)
    return pd_scores
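I split the "params" argument into 5 chunks so that they can be processed by 5 different cores; the remaining arguments stay the same. To do this, I split "params" as follows: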
# split once and unpack the five chunks
df0, df1, df2, df3, df4 = np.array_split(param_combs, 5)
and build the argument tuples as follows:
args = [(df0, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
(df1, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
(df2, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
(df3, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold),
(df4, feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold)]
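The chunking and the argument list above can also be built in a loop. The following is only a minimal sketch: `build_args`, the toy `param_combs`, and the placeholder strings standing in for the two frames are hypothetical names and values chosen for illustration, not part of the original code.

```python
import numpy as np

def build_args(param_combs, shared, n_chunks=5):
    """Split param_combs into n_chunks pieces and pair each piece with the shared arguments."""
    chunks = np.array_split(param_combs, n_chunks)
    # each tuple is (chunk of parameter dicts, *shared arguments)
    return [(list(chunk),) + tuple(shared) for chunk in chunks]

# toy stand-ins for the real inputs (assumptions for illustration only)
param_combs = [{'counter': i, 'max_depth': 3 + i % 4} for i in range(12)]
shared = (['x1', 'x2'], 'y', 'train_placeholder', 'test_placeholder', 0, 1)

args = build_args(param_combs, shared)
print([len(a[0]) for a in args])
```

This keeps the number of chunks in one place, so changing the number of worker processes does not require editing five separate lines.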
The first element (df0 to df4), feats, target, h2o_train_inner, h2o_test_inner, outer_fold, inner_fold are of type list (containing dictionaries), string, h2o data frame, h2o data frame, int, int.
Finally, I start the pool as follows:
p = mp.Pool(processes=5)
pool_results = p.starmap(innerFold, args)
I get:
TypeError: __new__() missing 1 required positional argument: 'keyvals'
The number of arguments seems fine. What am I missing here?
EDIT: Apparently the problem is with the H2O data frames. If I convert them to pandas DataFrames, it works. Any idea how to use the H2O frames directly?
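One way the pandas workaround might look is to ship pandas DataFrames to the workers and rebuild the H2O frames there. This is a sketch under assumptions, not a confirmed recipe: `rebuild_h2o_frames` is a hypothetical helper, and it assumes an H2O cluster is reachable from each worker process (`h2o.init()` attaches to a running local cluster or starts one).

```python
def rebuild_h2o_frames(train_pd, test_pd):
    """Hypothetical helper: turn two pandas DataFrames back into H2O frames inside a worker."""
    import h2o  # imported here so the module loads even where h2o is not installed

    # Assumption: a cluster is reachable from this process.
    h2o.init()
    h2o_train_inner = h2o.H2OFrame(train_pd)
    h2o_test_inner = h2o.H2OFrame(test_pd)
    return h2o_train_inner, h2o_test_inner

# Parent side (not executed here): convert before building the argument tuples, e.g.
#   train_pd = h2o_train_inner.as_data_frame()
#   test_pd = h2o_test_inner.as_data_frame()
```

The helper would be called at the top of the worker function. Column types may need to be re-checked after the round trip through pandas.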
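EDIT2: As far as I understand, the arguments sent to the function (e.g. innerFold above) are pickled. Since h2o objects cannot be pickled, the function only works after they are converted to pandas DataFrames.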
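To see which argument breaks pickling before handing the tuples to the pool, each one can be round-tripped through `pickle`. The sketch below is an illustration only: `unpicklable_positions` is a hypothetical helper and `NeedsArg` is a made-up stand-in class, not an h2o class. It shows how a dict subclass whose `__new__` requires an argument can fail when unpickled with a similar TypeError.

```python
import pickle

def unpicklable_positions(args):
    """Return the indices of the arguments that do not survive a pickle round trip."""
    bad = []
    for i, arg in enumerate(args):
        try:
            pickle.loads(pickle.dumps(arg))
        except Exception:
            bad.append(i)
    return bad

class NeedsArg(dict):
    """Made-up stand-in: __new__ requires an argument that pickle does not supply on load."""
    def __new__(cls, keyvals):
        return super().__new__(cls)

# toy argument tuple shaped like the ones passed to starmap (values are placeholders)
example_args = ([{'ntrees': 50}], ['x1', 'x2'], 'y', NeedsArg({'a': 1}), 0, 1)
print(unpicklable_positions(example_args))
```

Running the same check on one of the real argument tuples should point to the positions of the two H2O frames, if they are indeed what fails to pickle.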