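I am trying to evaluate multiple pipelines in PySpark. I could run each one in its own CrossValidator / TrainValidationSplit, but I would like a single run that directly returns the best model, and I cannot work out how to make that work.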
lr_assembler and assembler are two instances of VectorAssembler (with different feature selections). pca, lr, rf and gbt are instances of PCA, LinearRegression, RandomForestRegressor and GBTRegressor, respectively.
Pipeline definition:
pipeline = Pipeline()
lr_stages = [lr_assembler, pca, lr]
rf_stages = [assembler, rf]
gbt_stages = [assembler, gbt]
lr_pipeline = Pipeline(stages=lr_stages)
rf_pipeline = Pipeline(stages=rf_stages)
gbt_pipeline = Pipeline(stages=gbt_stages)
paramMaps definition:
lr_grid = ParamGridBuilder().baseOn({pipeline.stages:lr_stages})\
.addGrid(pca.k, [2, 5, 7])\
.build()
rf_grid = ParamGridBuilder().baseOn({pipeline.stages:rf_stages})\
.addGrid(rf.maxDepth, [5, 10])\
.addGrid(rf.featureSubsetStrategy, ['3', '6'])\
.build()
gbt_grid = ParamGridBuilder().baseOn({pipeline.stages:gbt_stages})\
.addGrid(gbt.maxDepth, [5, 10])\
.addGrid(gbt.maxIter, [50, 100])\
.build()
grid = lr_grid + rf_grid + gbt_grid
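To make the size of the combined search explicit, here is a minimal pure-Python sketch of how each grid expands. It assumes that each builder yields the Cartesian product of its `addGrid` values, with every combination merged onto the `baseOn` entries. The helper `expand_grid` and the string keys are hypothetical stand-ins for illustration; the real `ParamGridBuilder` keys its maps by `Param` objects.

```python
from itertools import product


def expand_grid(base, grid):
    """Return one dict per combination of the values in `grid`, merged onto `base`."""
    names = list(grid)
    return [
        {**base, **dict(zip(names, combo))}
        for combo in product(*(grid[name] for name in names))
    ]


# String keys stand in for the Param objects used in the question (assumption).
lr_maps = expand_grid({"stages": "lr_stages"}, {"pca.k": [2, 5, 7]})
rf_maps = expand_grid(
    {"stages": "rf_stages"},
    {"rf.maxDepth": [5, 10], "rf.featureSubsetStrategy": ["3", "6"]},
)
gbt_maps = expand_grid(
    {"stages": "gbt_stages"},
    {"gbt.maxDepth": [5, 10], "gbt.maxIter": [50, 100]},
)

# Concatenating the lists mirrors `grid = lr_grid + rf_grid + gbt_grid`.
all_maps = lr_maps + rf_maps + gbt_maps
print(len(lr_maps), len(rf_maps), len(gbt_maps), len(all_maps))  # 3 4 4 11
```

Under that assumption, a single TrainValidationSplit over the combined grid fits 11 candidate configurations, each carrying its own value for the pipeline's stages.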
TrainValidationSplit definition:
tvs = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=grid, evaluator=rmse_evaluator, trainRatio=0.8, parallelism=3, seed=7)
Model training:
model = tvs.fit(train_val)
Running this last line raises an error (I was not sure whether I should post the whole thing here).
Thanks for your time.
Answer 0: (score: 1)
I ran into the same problem and solved it by initializing the Pipeline's stages.
pipeline = Pipeline(stages=[]) # Must initialize with empty list!
There is a good example here: https://github.com/dsharpc/dsharpc.github.io/blob/master/SparkMLFlights/README.md