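I am very new to parallel processing in Python. I managed to get my code running in parallel, but I still doubt whether I did it in the most efficient way. First, I split the data as follows: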
gbm_param_combs = get_cartesian_prod(gbm_params)
random.Random(23).shuffle(gbm_param_combs)
gbm_param_combs = gbm_param_combs[501:506]
for counter, param in enumerate(gbm_param_combs):
    param['counter'] = int(counter)
gbm_df0 = np.array_split(gbm_param_combs,5)[0]
gbm_df1 = np.array_split(gbm_param_combs,5)[1]
gbm_df2 = np.array_split(gbm_param_combs,5)[2]
gbm_df3 = np.array_split(gbm_param_combs,5)[3]
gbm_df4 = np.array_split(gbm_param_combs,5)[4]
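For reference, the partitioning step above can be sketched with the standard library only. This is a minimal sketch, not the original code: `split_evenly` is a hypothetical helper that mimics how `np.array_split` sizes its pieces, and the `grid` values are made up because the real `gbm_params` is not shown.

```python
import itertools
import random


def split_evenly(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional element.
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


# Hypothetical grid; the real gbm_params is not shown in the post.
grid = {"ntrees": [50, 100], "max_depth": [3, 5, 7]}
combs = [dict(zip(grid, values)) for values in itertools.product(*grid.values())]

# Shuffle reproducibly, then tag each combination with its index.
random.Random(23).shuffle(combs)
for counter, param in enumerate(combs):
    param["counter"] = counter

# Split once and index into the result, instead of re-splitting per chunk.
chunks = split_evenly(combs, 5)
print([len(c) for c in chunks])  # [2, 1, 1, 1, 1]
```

Splitting once and reusing `chunks[i]` avoids repeating the same split five times.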
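Then I created a function that takes multiple inputs, so that it can be called concurrently. Inside the function, I fit the model and compute the error score.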
def finalFold(params, feats_final, target, h2o_train, h2o_test, fold, pd_scores_final):
    h2o_train = h2o.H2OFrame(h2o_train)
    h2o_test = h2o.H2OFrame(h2o_test)
    scores = []
    random_state = 123
    for param in params:
        counter = param.get('counter')
        param = {k:v for k, v in param.items() if k not in ('counter')}
        print('parameter combination: ', param)
        print('COUNTER: ', counter)
        #define model and fit
        gbm = H2OGradientBoostingEstimator(stopping_rounds = 5,
                                           stopping_metric = 'rmse',
                                           stopping_tolerance = 1e-4,
                                           seed = random_state,
                                           **param)
        print('GBM TRAINING STARTS....')
        gbm.train(x = feats_final,
                  y = target,
                  training_frame = h2o_train)
        score = gbm.model_performance(h2o_test).r2()
        pd_scores_final = pd_scores_final.append({'fold': int(fold),
                                                  'score': score,
                                                  'corr' : 0.0,
                                                  'param_idx': int(counter)},
                                                 ignore_index=True)
    return pd_scores_final
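Two details of a worker shaped like this are worth keeping in mind. `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. Also, because the accumulated frame is passed into every call, each returned frame contains all rows accumulated so far plus the new ones, which may be part of why the returned pieces look alike. A minimal sketch of an alternative, where the worker returns only its own rows as plain dicts; `fake_score` and `score_params` are hypothetical names, and `fake_score` merely stands in for the H2O training and scoring step:

```python
def fake_score(model_args):
    # Hypothetical stand-in for gbm.train(...) followed by model_performance(...).r2().
    return round(1.0 / (1 + model_args["max_depth"]), 3)


def score_params(params, fold):
    """Return one result row per parameter combination, without touching shared state."""
    rows = []
    for param in params:
        counter = param["counter"]
        # ("counter",) with a trailing comma is a tuple; ("counter") is just a string,
        # so `k not in ("counter")` would be a substring test.
        model_args = {k: v for k, v in param.items() if k not in ("counter",)}
        rows.append({"fold": int(fold),
                     "score": fake_score(model_args),
                     "corr": 0.0,
                     "param_idx": int(counter)})
    return rows


rows = score_params([{"max_depth": 3, "counter": 0},
                     {"max_depth": 4, "counter": 1}], fold=0)
print(rows)
```

The caller can then build a single frame from all collected rows, for example with `pd.DataFrame(rows)`.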
Finally, I call the function with starmap, as follows:
p = mp.Pool(processes=5)
.....
for fold, (train_index, test_index) in enumerate(kfolds.split(pd_data)):
    .....
    argsGBM = [(gbm_df0, feats_final, target, h2o_train.as_data_frame(), h2o_test.as_data_frame(), fold, pd_scores_final_GBM),
               (gbm_df1, feats_final, target, h2o_train.as_data_frame(), h2o_test.as_data_frame(), fold, pd_scores_final_GBM),
               (gbm_df2, feats_final, target, h2o_train.as_data_frame(), h2o_test.as_data_frame(), fold, pd_scores_final_GBM),
               (gbm_df3, feats_final, target, h2o_train.as_data_frame(), h2o_test.as_data_frame(), fold, pd_scores_final_GBM),
               (gbm_df4, feats_final, target, h2o_train.as_data_frame(), h2o_test.as_data_frame(), fold, pd_scores_final_GBM)]
    pool_results3 = p.starmap(finalFold, argsGBM)
    for k in range(0,len(pool_results3)):
        if k ==0:
            pd_scores_final_GBM = pd.DataFrame(pool_results3[k])
        else:
            pd_scores_final_GBM = pd.concat([pd_scores_final_GBM,pd.DataFrame(pool_results3[k])], axis=0, ignore_index=True)
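For comparison, the calling pattern can be reduced to the following minimal sketch. It is not the original code: `final_fold_stub` is a hypothetical stand-in for `finalFold`, and `ThreadPool` is swapped in for `mp.Pool` only so that the snippet runs in any context; it exposes the same `starmap` interface.

```python
from multiprocessing.pool import ThreadPool


def final_fold_stub(params, fold):
    # Hypothetical stand-in for finalFold: one row per parameter dict in this chunk.
    return [{"fold": fold, "param_idx": p["counter"]} for p in params]


# Five one-element chunks, mirroring gbm_df0 .. gbm_df4.
chunks = [[{"counter": i}] for i in range(5)]
fold = 0

# One argument tuple per chunk; starmap unpacks each tuple into the function's parameters.
args = [(chunk, fold) for chunk in chunks]

with ThreadPool(processes=5) as pool:
    results = pool.starmap(final_fold_stub, args)

# Results come back in the same order as args; flatten them into one list of rows.
all_rows = [row for part in results for row in part]
print([row["param_idx"] for row in all_rows])  # [0, 1, 2, 3, 4]
```

In this reduced form each element of `results` holds only the rows for its own chunk, so distinct inputs give distinct outputs.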
However, what I see is that the results in each part of pool_results3 are identical. That is:
What is wrong with the code?