Spark ML - Getting the feature from a vector index

Asked: 2017-08-29 13:14:03

Tags: apache-spark spark-dataframe apache-spark-ml

I am using the Spark 2.2 ML RandomForestClassifier to make some predictions.

I get a result like this:

+-----+----------------------------------------+----+----------+
|label|features                                |prob|prediction|
+-----+----------------------------------------+----+----------+
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,5,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,6,7,12,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
+-----+----------------------------------------+----+----------+

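Now I want to decode the features back into a human-readable representation. For example, I want to know exactly which feature index 4 corresponds to.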

I assumed I could get this information from the indexers' labels, but my code looks like this:

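import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// `in` (the input DataFrame) and `opts` are defined elsewhere (not shown).
private def transform(): Unit = {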
  val aIndexer = indexer("a")
  val bIndexer = indexer("b")
  val cIndexer = indexer("c")

  val aEncoder = encoder("a")
  val bEncoder = encoder("b")
  val cEncoder = encoder("c")

  val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("aVec", "bVec", "cVec"))
    .setOutputCol("features")

  val indexers = Array[PipelineStage](aIndexer, bIndexer, cIndexer)
  val encoders = Array[PipelineStage](aEncoder, bEncoder, cEncoder)

  val pipeline = new Pipeline().setStages(indexers ++ encoders :+ vectorAssembler)
  val model = pipeline.fit(in)
  model.write.overwrite().save(opts.pipelineFileName)

  model.transform(in).show(false)
}

private def indexer(name: String): StringIndexer = {
  new StringIndexer().setInputCol(name).setOutputCol(s"${name}Idx").setHandleInvalid("keep")
}

private def encoder(name: String): OneHotEncoder = {
  new OneHotEncoder().setInputCol(s"${name}Idx").setOutputCol(s"${name}Vec").setDropLast(false)
}

There seems to be no way to access the indexers' labels to do any matching.
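For reference, here is a minimal sketch of one way the labels might be read back. After `fit`, each `StringIndexer` stage of the pipeline becomes a `StringIndexerModel`, which exposes a `labels` array. The helper name `indexerLabels` is hypothetical, and the sketch assumes the fitted `PipelineModel` is still in scope (or has been reloaded with `PipelineModel.load`).

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.StringIndexerModel

// Hypothetical helper: for every fitted StringIndexerModel stage,
// return (input column name, labels ordered by their index).
def indexerLabels(model: PipelineModel): Seq[(String, Array[String])] =
  model.stages.collect {
    case m: StringIndexerModel => m.getInputCol -> m.labels
  }.toSeq
```

With `setHandleInvalid("keep")` the indexer appears to reserve one extra index after the known labels, and `labels` does not list it, so that slot would have to be accounted for separately. This is worth checking against the Spark version in use.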

For simplicity, assume the following: I have 3 categorical features, A, B and C, with the values A1, A2, B1, B2, C1 and C2.

What I want to do is work out that the feature at index 4 of the resulting vector means B2.
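The mapping I have in mind can be sketched in plain Scala, without Spark. It rests on several assumptions: that `VectorAssembler` concatenates its input vectors in the order given, that each one-hot block has one slot per label in index order, and that `handleInvalid = "keep"` together with `dropLast = false` adds one extra slot per feature. The helper `slotNames` and the `<unknown ...>` placeholder are hypothetical names used only for illustration.

```scala
// Hypothetical helper: flatten per-feature labels into one name per vector slot.
// `features` lists each feature with its labels in indexer order.
def slotNames(features: Seq[(String, Seq[String])], extraUnknownSlot: Boolean): Vector[String] =
  features.flatMap { case (name, labels) =>
    // Assumption: the "keep" option appends one slot for unseen values.
    if (extraUnknownSlot) labels :+ s"<unknown $name>" else labels
  }.toVector

val names = slotNames(
  Seq("a" -> Seq("A1", "A2"), "b" -> Seq("B1", "B2"), "c" -> Seq("C1", "C2")),
  extraUnknownSlot = true)

// Slots: A1, A2, <unknown a>, B1, B2, <unknown b>, C1, C2, <unknown c>
println(names(4)) // prints "B2"
```

Under these assumptions index 4 falls in the second block (offset 3) at position 1, which is B2. The label order within each block depends on how the indexer ranked the values, so it should be taken from the fitted model rather than assumed.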

Is there a way to do this?

0 answers:

No answers yet.