Cross-validation with Spark pipelines

Time: 2016-07-13 11:09:37

Tags: apache-spark pipeline apache-spark-mllib cross-validation apache-spark-ml

Cross-validation outside the pipeline (the CrossValidator wraps the whole pipeline):

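import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// naiveBayes (a NaiveBayes estimator) and indexer are assumed to be defined already;
// their definitions were omitted from the question.
val pipeLine = new Pipeline().setStages(Array(indexer, naiveBayes))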

val paramGrid = new ParamGridBuilder()
   .addGrid(naiveBayes.smoothing, Array(1.0, 0.1, 0.3, 0.5))
   .build()
val crossValidator = new CrossValidator().setEstimator(pipeLine)
   .setEvaluator(new MulticlassClassificationEvaluator)
   .setNumFolds(2).setEstimatorParamMaps(paramGrid)

val crossValidatorModel = crossValidator.fit(trainData)

val predictions = crossValidatorModel.transform(testData)

Cross-validation inside the pipeline (the CrossValidator is the last pipeline stage):

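// naiveBayes (a NaiveBayes estimator) and indexer are assumed to be defined already;
// their definitions were omitted from the question.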

// param grid for multiple parameter
val paramGrid = new ParamGridBuilder()
   .addGrid(naiveBayes.smoothing, Array(0.35, 0.1, 0.2, 0.3, 0.5))
   .build()

// validator for naive bayes
val crossValidator = new CrossValidator().setEstimator(naiveBayes)
   .setEvaluator(new MulticlassClassificationEvaluator)
   .setNumFolds(2).setEstimatorParamMaps(paramGrid)

// pipeline to execute compound transformation
val pipeLine = new Pipeline().setStages(Array(indexer, crossValidator))

// pipeline model
val pipeLineModel = pipeLine.fit(trainData)

// transform data
val predictions = pipeLineModel.transform(testData)

So I would like to know which approach is better, and what the pros and cons of each are.

With both approaches I get the same results and accuracy. The second approach is even slightly faster than the first.

1 answer:

Answer 0: (score: 0)

According to a training course I attended, this should be the best practice:

# remaining CrossValidator arguments were elided in the original answer
cv = CrossValidator(estimator=lr,..)
pipeline = Pipeline(stages=[idx, assembler, cv])
pipeline_model = pipeline.fit(train)

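This way, the stages that come before the cross-validator are fit only once, rather than once for every run over the param_grid, which makes it run faster. Hope this helps!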
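
To make the "fit only once" argument concrete, here is a toy count of fit calls for the two layouts from the question. It is a sketch in plain Python, not Spark code: `CountingStage`, `ToyPipeline` and `ToyCrossValidator` are hypothetical stand-ins. It assumes the cross-validator fits its estimator once per fold and grid point and then once more on the full training set, which is my understanding of how Spark's CrossValidator behaves. A grid of 4 values and 2 folds is used for both layouts.

```python
# Toy model of the two layouts. These classes are hypothetical stand-ins,
# not Spark APIs; they only count how often each stage is fitted.

class CountingStage:
    """Stand-in for an estimator stage; records the number of fit() calls."""
    def __init__(self, name):
        self.name = name
        self.fits = 0

    def fit(self, rows):
        self.fits += 1
        return self


class ToyPipeline:
    """Fits its stages in order, one after another."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            stage.fit(rows)
        return self


class ToyCrossValidator:
    """Assumed behaviour: one fit per (fold, grid point), plus a final refit."""
    def __init__(self, estimator, num_folds, grid_size):
        self.estimator = estimator
        self.num_folds = num_folds
        self.grid_size = grid_size

    def fit(self, rows):
        for fold in range(self.num_folds):
            # rows outside the current validation fold
            train = [r for i, r in enumerate(rows) if i % self.num_folds != fold]
            for _ in range(self.grid_size):
                self.estimator.fit(train)
        self.estimator.fit(rows)  # refit the best setting on all rows
        return self


def count_fits(cv_wraps_pipeline, num_folds=2, grid_size=4):
    """Return (indexer fits, classifier fits) for the chosen layout."""
    rows = list(range(10))
    indexer = CountingStage("indexer")
    classifier = CountingStage("naiveBayes")
    if cv_wraps_pipeline:
        top = ToyCrossValidator(ToyPipeline([indexer, classifier]), num_folds, grid_size)
    else:
        top = ToyPipeline([indexer, ToyCrossValidator(classifier, num_folds, grid_size)])
    top.fit(rows)
    return indexer.fits, classifier.fits


print(count_fits(cv_wraps_pipeline=True))   # (9, 9)
print(count_fits(cv_wraps_pipeline=False))  # (1, 9)
```

Under these assumptions the classifier is fitted the same number of times either way, and the saving comes entirely from the indexer. The trade-off is that with the cross-validator inside the pipeline, the indexer is fitted on all training rows, including the rows each fold later uses for validation.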