Question

I am training a random forest model as follows:

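import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Index each categorical column into a numeric "<name>Idx" column
val stringIndexers = categoricalColumns.map { colName =>
  new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "Idx")
    .setHandleInvalid("keep")
    .fit(training)
}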

// One-hot encode each indexed column into a "<name>Enc" vector column
val encoders = featuresEnconding.map { colName =>
  new OneHotEncoderEstimator()
    .setInputCols(Array(colName + "Idx"))
    .setOutputCols(Array(colName + "Enc"))
    .setHandleInvalid("keep")
}

// Assemble the features into a single feature vector column
val assembler = new VectorAssembler()
  .setInputCols(featureColumns)
  .setOutputCol("features")

val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")

val stepsRF = stringIndexers ++ encoders ++ Array(assembler, rf)

val pipelineRF = new Pipeline()
  .setStages(stepsRF)


val paramGridRF = new ParamGridBuilder()
  .addGrid(rf.maxBins, Array(800))
  .addGrid(rf.featureSubsetStrategy, Array("all"))
  .addGrid(rf.minInfoGain, Array(0.05))
  .addGrid(rf.minInstancesPerNode, Array(1))
  .addGrid(rf.maxDepth, Array(28, 29, 30))
  .addGrid(rf.numTrees, Array(20))
  .build()


// Define the evaluator
val evaluatorRF = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

// Use cross-validation to train the model
// (TrainValidationSplit is a cheaper starting point; cross-validation has been very slow so far)
val cvRF = new CrossValidator()
  .setEstimator(pipelineRF)
  .setEvaluator(evaluatorRF)
  .setEstimatorParamMaps(paramGridRF)
  .setNumFolds(10)
  .setParallelism(3)

// Fit the model on the training dataset
val cvRFModel = cvRF.fit(training)

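What I want now is to find out, after training, how important each feature is in the model.

I am able to get the importance of each feature as an Array[Double] like this: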

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

val size = bestModel.stages.size-1

val ftrImp = bestModel.stages(size).asInstanceOf[RandomForestRegressionModel].featureImportances.toArray

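However, this only gives me each importance value together with a numeric index; I don't know which feature name in the model corresponds to each importance value.

I should also mention that, because I am using a one-hot encoder, the final number of features is much larger than the original featureColumns array.

How can I extract the feature names that were used during model training?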

Answer 1

I found a possible solution:

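import org.apache.spark.ml.attribute._
import org.apache.spark.sql.functions.desc

val bestModel = cvRFModel.bestModel.asInstanceOf[PipelineModel]

val lstModel = bestModel.stages.last.asInstanceOf[RandomForestRegressionModel]
// predictions is not defined above; it is assumed to be the output of
// bestModel.transform(...), so its features column carries ML attribute metadata
val schema = predictions.schema

// Read the per-slot attribute names from the metadata of the features column
val featureAttrs = AttributeGroup.fromStructField(schema(lstModel.getFeaturesCol)).attributes.get
val mfeatures = featureAttrs.map(_.name.get)

// ftrImp is the importance array computed in the question;
// toDF assumes spark.implicits._ is in scope (as it is in most notebooks)
val mdf = sc.parallelize(mfeatures zip ftrImp).toDF("featureName", "Importance")
  .orderBy(desc("Importance"))
display(mdf) // display is Databricks-specific; elsewhere use mdf.show()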
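
Since the name and importance arrays are small and already on the driver, the pairing can also be done locally without building a DataFrame. The following is only a sketch: FeatureRanking and rankFeatures are hypothetical names, not part of Spark. The helper also checks that both inputs have the same length, because zip silently truncates to the shorter one and a mismatch would otherwise go unnoticed.

```scala
// Hypothetical helper, not part of Spark: pairs feature names with
// importances on the driver and sorts them by importance, highest first.
object FeatureRanking {
  def rankFeatures(names: Seq[String], importances: Seq[Double]): Seq[(String, Double)] = {
    // zip would silently drop trailing elements on a length mismatch,
    // so fail loudly instead of returning misaligned pairs.
    require(
      names.length == importances.length,
      s"got ${names.length} names but ${importances.length} importances"
    )
    names.zip(importances).sortBy { case (_, importance) => -importance }
  }
}
```

With the values from the snippet above this would be called as FeatureRanking.rankFeatures(mfeatures, ftrImp), which returns (featureName, importance) pairs ordered from most to least important.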
