I am using the multinomial NaiveBayes classifier in Apache Spark ML (version 2.1.0) to predict some text categories.
I use StringIndexer to convert the strings to labels, as follows:
val labelIndexer = new StringIndexer().setInputCol("name").setOutputCol("label").fit(trainData).setHandleInvalid("skip")
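It throws an exception when the test data to predict on contains only a single record and that record has an unseen label. If the test data contains a mix of seen and unseen labels, it works fine: the records with unseen labels are skipped in the prediction results.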
Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:64)
at scala.collection.IterableLike$class.head(IterableLike.scala:91)
at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:108)
at scala.collection.IndexedSeqOptimized$class.head(IndexedSeqOptimized.scala:120)
at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:108)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1943)
at org.apache.spark.sql.Dataset.first(Dataset.scala:1950)
at org.apache.spark.ml.feature.VectorAssembler.first$lzycompute$1(VectorAssembler.scala:57)
at org.apache.spark.ml.feature.VectorAssembler.org$apache$spark$ml$feature$VectorAssembler$$first$1(VectorAssembler.scala:57)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply$mcI$sp(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2$$anonfun$1.apply(VectorAssembler.scala:88)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:88)
at org.apache.spark.ml.feature.VectorAssembler$$anonfun$2.apply(VectorAssembler.scala:58)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
at org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:58)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:299)
at org.apache.spark.ml.PipelineModel$$anonfun$transform$1.apply(Pipeline.scala:299)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:299)
at com.infostretch.machinelearning.sample.GroupingByNaiveBayesExample$.main(GroupingByNaiveBayesExample.scala:111)
at com.infostretch.machinelearning.sample.GroupingByNaiveBayesExample.main(GroupingByNaiveBayesExample.scala)
Training data:
id,group,name,text
1,apple,abc,a b c d
2,orange,def,x y z
Test data:
id,name,text
3,pqr,a b x
Here, the value 'pqr' in the name field is unseen by the model when predicting on the test data.
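
The stack trace points at VectorAssembler calling Dataset.first on a dataset with no rows, which is consistent with "skip" having dropped the only test record before the later stages run. Below is a minimal sketch of one possible guard, not a confirmed fix: it rebuilds the sample data above, applies only the label indexer, and checks for an empty result before anything else is transformed. The object name UnseenLabelGuard and the helper indexedTestData are hypothetical, and the remaining pipeline stages (tokenizing, VectorAssembler, NaiveBayes) are not shown in the question, so they appear only as a comment.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object; names are illustrative, not from the original code.
object UnseenLabelGuard {

  // Fits the indexer on the sample training data and applies it to the
  // sample test data. With handleInvalid = "skip", rows whose "name" was
  // not seen during fit are filtered out of the result.
  def indexedTestData(spark: SparkSession): DataFrame = {
    import spark.implicits._

    val trainData = Seq(
      (1, "apple", "abc", "a b c d"),
      (2, "orange", "def", "x y z")
    ).toDF("id", "group", "name", "text")

    val testData = Seq((3, "pqr", "a b x")).toDF("id", "name", "text")

    val labelIndexer = new StringIndexer()
      .setInputCol("name")
      .setOutputCol("label")
      .setHandleInvalid("skip")
      .fit(trainData)

    labelIndexer.transform(testData)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("UnseenLabelGuard")
      .master("local[*]")
      .getOrCreate()

    val indexed = indexedTestData(spark)

    // head(1) fetches at most one row, so this is a cheap emptiness check
    // that is available in Spark 2.1.0.
    if (indexed.head(1).isEmpty) {
      println("All test records had unseen labels; skipping prediction.")
    } else {
      // Assumption: the remaining fitted stages would be applied here,
      // e.g. remainingStages.transform(indexed).
      indexed.show()
    }

    spark.stop()
  }
}
```

This approach assumes the label indexer can be applied separately from the rest of the fitted pipeline; if it is embedded in a single PipelineModel, the same emptiness check would have to be done on an equivalent filter of the test data instead.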