Input with invalid label column error for RandomForestClassifier in Apache Spark

Asked: 2016-04-09 13:18:50

Tags: scala apache-spark machine-learning random-forest apache-spark-mllib

I am trying to compute the accuracy of a random forest classifier model in Scala using 5-fold cross-validation, but at runtime I get the following error:

java.lang.IllegalArgumentException: RandomForestClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.

The error above is thrown at the line ---> val cvModel = cv.fit(trainingData)

The code I am using to cross-validate the dataset with a random forest is shown below:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("exprogram/dataset.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(41).toDouble,
    Vectors.dense(parts(0).split(',').map(_.toDouble)))
}


val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val trainingData = training.toDF()

val testData = test.toDF()

val nFolds: Int = 5
val NumTrees: Int = 5

val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setNumTrees(NumTrees)

val pipeline = new Pipeline()
      .setStages(Array(rf)) 

val paramGrid = new ParamGridBuilder()
          .build()

val evaluator = new  MulticlassClassificationEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("precision") 

val cv = new CrossValidator()
   .setEstimator(pipeline)
   .setEvaluator(evaluator) 
   .setEstimatorParamMaps(paramGrid)
   .setNumFolds(nFolds)

val cvModel = cv.fit(trainingData)

val results = cvModel.transform(testData)
  .select("label", "prediction")
  .collect()

val numCorrectPredictions = results.map(row =>
  if (row.getDouble(0) == row.getDouble(1)) 1 else 0).foldLeft(0)(_ + _)
val accuracy = 1.0D * numCorrectPredictions / results.size

println("Test set accuracy: %.3f".format(accuracy))

Can anyone explain what is wrong in the above code?

1 Answer:

Answer 0 (score: 9)

RandomForestClassifier, like many other ML algorithms, requires specific metadata to be set on the label column, and the label values have to be integers from [0, 1, 2, ..., #classes), represented as doubles. Typically this is handled by upstream Transformers such as StringIndexer. Since you convert the labels manually, the metadata fields are not set and the classifier cannot confirm that these requirements are satisfied.

val df = Seq(
  (0.0, Vectors.dense(1, 0, 0, 0)),
  (1.0, Vectors.dense(0, 1, 0, 0)),
  (2.0, Vectors.dense(0, 0, 1, 0)),
  (2.0, Vectors.dense(0, 0, 0, 1))
).toDF("label", "features")

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setNumTrees(5)

rf.setLabelCol("label").fit(df)
// java.lang.IllegalArgumentException: RandomForestClassifier was given input ...

You can either re-encode the label column using StringIndexer:

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("label_idx")
  .fit(df)

rf.setLabelCol("label_idx").fit(indexer.transform(df))

Alternatively, you can set the required metadata manually:

import org.apache.spark.ml.attribute.NominalAttribute

val meta = NominalAttribute
  .defaultAttr
  .withName("label")
  .withValues("0.0", "1.0", "2.0")
  .toMetadata

rf.setLabelCol("label_meta").fit(
  df.withColumn("label_meta", $"label".as("", meta))
)

Note:

Labels created with StringIndexer depend on the frequency, not the value:

indexer.labels
// Array[String] = Array(2.0, 0.0, 1.0)

PySpark:

In Python the metadata fields can be set directly on the schema:
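
A minimal sketch of the idea, assuming Spark's ml_attr metadata layout for nominal attributes (the exact dictionary keys mirror what StringIndexer writes, but treat them as an assumption here):

from pyspark.sql.types import StructField, DoubleType

# Declare the label field with nominal-attribute metadata so that a
# classifier can read the number of classes directly from the schema.
# The ml_attr dictionary layout is assumed, not taken from the answer above.
label_field = StructField(
    "label", DoubleType(), False,
    {"ml_attr": {
        "name": "label",
        "type": "nominal",
        "vals": ["0.0", "1.0", "2.0"]
    }}
)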