NoSuchElementException in ChiSqSelector fit method (version 1.6.0)

Asked: 2016-03-27 23:47:02

Tags: apache-spark feature-selection apache-spark-mllib chi-squared apache-spark-ml

I'm running into an error that doesn't make much sense to me, and I haven't been able to find enough information online to answer it myself.

I've written code to generate a list of (String, ArrayBuffer[String]) pairs, and I then turn the features column into a vector with HashingTF (because this is for NLP parsing research, I end up with a huge number of unique features; long story). I then transform the string labels with a StringIndexer. When I run ChiSqSelector.fit on the training data I get a "key not found" error. The stack trace points to a hash map lookup of the labels inside ChiSqTest. That strikes me as odd, because I could almost believe I was using it wrong and somehow failing to account for unseen labels, except that this is a fit on the training data itself.

Anyway, here are the interesting bits of my code, followed by the relevant part of the stack trace. Any help would be greatly appreciated!

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.feature.{ChiSqSelector, HashingTF, StringIndexer}

val parSdp = sc.parallelize(sdp.take(10)) // it dies even on a small amount of data
val insts: RDD[(String, ArrayBuffer[String])] =
    parSdp.flatMap(x => TrainTest.transformGraphSpark(x))

val indexer = new StringIndexer()
    .setInputCol("labels")
    .setOutputCol("labelIndex")

val instDF = sqlContext.createDataFrame(insts)
    .toDF("labels","feats")
val hash = new HashingTF()
    .setInputCol("feats")
    .setOutputCol("hashedFeats")
    .setNumFeatures(1000000)
val readyDF = hash.transform(indexer
    .fit(instDF)
    .transform(instDF))

val selector = new ChiSqSelector()
    .setNumTopFeatures(100)
    .setFeaturesCol("hashedFeats")
    .setLabelCol("labelIndex")
    .setOutputCol("selectedFeatures")

val Array(training, dev, test) = readyDF.randomSplit(Array(0.8, 0.1, 0.1), seed = 12345)

val chisq = selector.fit(training)

Stack trace:

java.util.NoSuchElementException: key not found: 23.0                           
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:58)
    at scala.collection.MapLike$class.apply(MapLike.scala:141)
    at scala.collection.AbstractMap.apply(Map.scala:58)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:131)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4$$anonfun$apply$4.apply(ChiSqTest.scala:129)
    at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
    at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:129)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
    at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
    at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
    at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:122)
    ... etc etc
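
Since the trace dies on the label hash-map lookup inside ChiSqTest, one sanity check on the unseen-labels theory is to compare the distinct labelIndex values in the training split against the full DataFrame. Something like this (an untested sketch using the column names from the code above):

// Collect the distinct label indices from the full DataFrame and from the training split.
// If trainLabels covers allLabels, missing labels can't explain "key not found: 23.0".
val allLabels = readyDF.select("labelIndex").distinct().collect().map(_.getDouble(0)).toSet
val trainLabels = training.select("labelIndex").distinct().collect().map(_.getDouble(0)).toSet

println(s"labels missing from training: ${allLabels -- trainLabels}")
println(s"training contains 23.0: ${trainLabels.contains(23.0)}")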

I should also note that if I change the size of sdp.take to something larger (100), I get a different error:

java.lang.IllegalArgumentException: Chi-squared statistic undefined for input matrix due to 0 sum in column [4].
    at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredMatrix(ChiSqTest.scala:229)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:134)
    at org.apache.spark.mllib.stat.test.ChiSqTest$$anonfun$chiSquaredFeatures$4.apply(ChiSqTest.scala:125)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.mllib.stat.test.ChiSqTest$.chiSquaredFeatures(ChiSqTest.scala:125)
    at org.apache.spark.mllib.stat.Statistics$.chiSqTest(Statistics.scala:176)
    at org.apache.spark.mllib.feature.ChiSqSelector.fit(ChiSqSelector.scala:193)
    at org.apache.spark.ml.feature.ChiSqSelector.fit(ChiSqSelector.scala:86)
    at $iwC$$iwC.<init>(<console>:96)
    at $iwC.<init>(<console>:130)

0 Answers:

There are no answers yet.