Spark ML: setParallelism for CrossValidator

Time: 2018-04-22 20:31:55

Tags: scala apache-spark machine-learning apache-spark-mllib cross-validation

I'm trying to set up cross-validation with Spark ML, but I get a runtime error saying:

"value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator" 

I'm currently following the tutorial on the Spark page. I'm new to this, so any help is appreciated. Below is my code snippet:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Tokenizer
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")

// HashingTF
val hash_tf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")

// ML models
val l_regression = new LogisticRegression().setMaxIter(100).setRegParam(0.15)

// Pipeline
val pipe = new Pipeline().setStages(Array(tokenizer, hash_tf, l_regression))

val paramGrid = new ParamGridBuilder()
  .addGrid(hash_tf.numFeatures, Array(10, 100, 1000))
  .addGrid(l_regression.regParam, Array(0.1, 0.01, 0.001))
  .build()

val c_validator = new CrossValidator()
  .setEstimator(pipe)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(2)

1 Answer:

Answer 0 (score: 1)

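setParallelism is available only in Spark 2.3 or later, so you are most likely running an earlier version. From the CrossValidator API documentation:

  (expert-only) Parameter setters
  (...)
  def setParallelism(value: Int): CrossValidator.this.type
  Set the maximum level of parallelism to evaluate models in parallel. Default is 1 for serial evaluation.
  Annotations: @Since("2.3.0")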
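
To confirm which case applies, one option is to compare the running Spark version against 2.3. The following is only a minimal sketch: the object name `SparkVersionCheck` and the helper `supportsParallelism` are hypothetical (not part of Spark), and it assumes the version string starts with numeric `major.minor` components, as in "2.2.1" or "2.3.0".

```scala
// Hypothetical helper, not part of Spark: decides whether a Spark version
// string is new enough (2.3 or later) for CrossValidator.setParallelism.
object SparkVersionCheck {
  def supportsParallelism(version: String): Boolean = {
    // Assumption: the string begins with numeric "major.minor", e.g. "2.3.0"
    val parts = version.split("\\.")
    val major = parts(0).toInt
    val minor = parts(1).toInt
    major > 2 || (major == 2 && minor >= 3)
  }
}

// In spark-shell, where `spark` (the SparkSession) is predefined:
//   SparkVersionCheck.supportsParallelism(spark.version)
```

If the check returns false, either upgrade the Spark dependency to 2.3.0 or later, or remove the `.setParallelism(2)` call from the CrossValidator setup.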