MulticlassClassificationEvaluator "Field does not exist" error - Apache Spark

Time: 2016-10-28 09:35:12

Tags: scala apache-spark

I am new to Spark and am trying out a basic classifier in Scala.

I am trying to compute the accuracy, but when I use MulticlassClassificationEvaluator it fails with the following error:

Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)

The code is as follows:

val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
  .builder()
  .appName("Email Classifier")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
import spark.implicits._

val spamInput = "TRAIN_00000_0.eml"      //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)  
val normalData = spark.read.textFile(normalInput)     

case class Feature(index: Int, value: String)  

val indexer = new StringIndexer()
  .setInputCol("value")
  .setOutputCol("label")                                       

val regexTokenizer = new RegexTokenizer()
  .setInputCol("value")
  .setOutputCol("cleared")      
  .setPattern("\\w+").setGaps(false)

val remover = new StopWordsRemover()
  .setInputCol("cleared")
  .setOutputCol("filtered") 

val hashingTF = new HashingTF()
 .setInputCol("filtered").setOutputCol("features")
 .setNumFeatures(100)

val nb = new NaiveBayes()

val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)  

val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)  

model.write.overwrite().save("myNaiveBayesModel")

val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))  
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)

Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)

The error occurs when initializing MulticlassClassificationEvaluator. How do I specify the column names? Any help is appreciated.

1 answer:

Answer 0: (score: 1)

The error is in this line:

val predictionSpam = sameModel.transform(spamTest).select("prediction")

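After the select, your DataFrame contains only the prediction column and no label column. The evaluator reads both the column set with setLabelCol and the column set with setPredictionCol, so evaluate fails with Field "label" does not exist. The same applies to predictionNorm.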
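
As a rough illustration of the point above, here is a minimal sketch that evaluates a DataFrame which keeps both columns. The object name EvaluatorLabelSketch, the helper accuracyOnToyData, and the four hand-written rows are assumptions made for this example; they stand in for the output of sameModel.transform(...) and are not taken from the question's data.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

object EvaluatorLabelSketch {

  // Hypothetical helper: builds a toy predictions DataFrame and evaluates it.
  def accuracyOnToyData(spark: SparkSession): Double = {
    import spark.implicits._

    // Stand-in for the result of sameModel.transform(...):
    // each row is (true label, predicted label), both as doubles.
    val predictions = Seq(
      (0.0, 0.0),
      (0.0, 1.0),
      (1.0, 1.0),
      (1.0, 1.0)
    ).toDF("label", "prediction")

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    // Selecting only "prediction" would reproduce the error from the question:
    // evaluator.evaluate(predictions.select("prediction"))

    // Keeping both columns lets the evaluator find the label field.
    evaluator.evaluate(predictions.select("label", "prediction"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("EvaluatorLabelSketch")
      .getOrCreate()
    try {
      println("Accuracy: " + accuracyOnToyData(spark))
    } finally {
      spark.stop()
    }
  }
}
```

Applied to the question's code, and assuming the transformed DataFrame does carry a label column, the equivalent change would be to use select("label", "prediction") instead of select("prediction"), or to pass the full output of transform to the evaluator.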