I'm using the Spark 2.2 ML RandomForestClassifier to make some predictions.
The result looks like this:
+-----+----------------------------------------+----+----------+
|label|features                                |prob|prediction|
+-----+----------------------------------------+----+----------+
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[0,4,9,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,5,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,6,7,12,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
|0.0  |(80,[1,4,7,11,16],[1.0,1.0,1.0,1.0,1.0])|... |...       |
+-----+----------------------------------------+----+----------+
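Now I'd like to decode the features back into a human-readable representation; for example, I'd like to know exactly which feature index 4 corresponds to.
I assumed I could get this information from the indexers' labels, but my code looks like this: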
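import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// `in` (the input DataFrame) and `opts` are assumed to be defined in the enclosing scope.
private def transform(): Unit = {
  // One StringIndexer and one OneHotEncoder per categorical column.
  val aIndexer = indexer("a")
  val bIndexer = indexer("b")
  val cIndexer = indexer("c")
  val aEncoder = encoder("a")
  val bEncoder = encoder("b")
  val cEncoder = encoder("c")
  // Concatenate the one-hot vectors into a single "features" column.
  val vectorAssembler = new VectorAssembler()
    .setInputCols(Array("aVec", "bVec", "cVec"))
    .setOutputCol("features")
  val indexers = Array[PipelineStage](aIndexer, bIndexer, cIndexer)
  val encoders = Array[PipelineStage](aEncoder, bEncoder, cEncoder)
  val pipeline = new Pipeline().setStages(indexers ++ encoders :+ vectorAssembler)
  val model = pipeline.fit(in)
  model.write.overwrite().save(opts.pipelineFileName)
  model.transform(in).show(false)
}

// Maps the string column `name` to a numeric index column `${name}Idx`.
private def indexer(name: String): StringIndexer = {
  new StringIndexer().setInputCol(name).setOutputCol(s"${name}Idx").setHandleInvalid("keep")
}

// One-hot encodes `${name}Idx` into the vector column `${name}Vec`.
private def encoder(name: String): OneHotEncoder = {
  new OneHotEncoder().setInputCol(s"${name}Idx").setOutputCol(s"${name}Vec").setDropLast(false)
}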
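There seems to be no way to access the indexers' labels to do any matching.
For simplicity, let's assume the following: I have 3 categorical features, A, B and C, with the values A1, A2, B1, B2, C1, C2.
What I want to do is determine that the feature at index 4 in the resulting vector means B2.
Is there a way to do this?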
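To make the intended mapping concrete, here is a minimal sketch in plain Scala (no Spark dependency) of how a vector index could be traced back to a column and label. It rests on several assumptions that are not confirmed above: that each indexer's labels are available as an array (in a fitted pipeline they might be read from the `StringIndexerModel` stages via `labels`), that `handleInvalid = "keep"` adds one extra slot per column for unseen values, that `dropLast = false` keeps every slot, and that the assembler concatenates the columns in the order a, b, c. The names `FeatureIndexDecoder`, `slotNames` and `UnknownLabel` are hypothetical, and the label order shown is illustrative (StringIndexer orders labels by frequency).

```scala
// Hypothetical helper: rebuilds the slot layout of the assembled feature vector
// from per-column label arrays. In a real pipeline the labels might come from
//   model.stages.collect { case m: StringIndexerModel => m.getInputCol -> m.labels }
object FeatureIndexDecoder {
  // Assumption: handleInvalid = "keep" appends one extra slot for unseen labels.
  val UnknownLabel = "__unknown"

  // Concatenates the slots of each column in assembler order (dropLast = false assumed).
  def slotNames(columns: Seq[(String, Seq[String])]): Vector[String] =
    columns.flatMap { case (col, labels) =>
      (labels :+ UnknownLabel).map(label => s"$col=$label")
    }.toVector

  def main(args: Array[String]): Unit = {
    // Illustrative labels from the simplified example; the order is an assumption.
    val columns = Seq(
      "a" -> Seq("A1", "A2"),
      "b" -> Seq("B1", "B2"),
      "c" -> Seq("C1", "C2")
    )
    val names = slotNames(columns)
    names.zipWithIndex.foreach { case (name, i) => println(s"$i -> $name") }
  }
}
```

Under these assumptions, the slots for column a occupy indices 0 to 2 and those for column b occupy 3 to 5, so index 4 would resolve to B2, which matches the simplified example.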