How do I get the coefficients of the best logistic regression in a spark-ml CrossValidatorModel?

Time: 2017-01-29 12:36:46

Tags: scala apache-spark logistic-regression cross-validation apache-spark-ml

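I train a simple CrossValidatorModel using logistic regression and spark-ml pipelines. I can predict on new data, but I would like to go beyond the black box and analyze the coefficients.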

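import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// maxIter and alpha are defined elsewhere
val lr = new LogisticRegression().
  setFitIntercept(true).
  setMaxIter(maxIter).
  setElasticNetParam(alpha).
  setStandardization(true).
  setFamily("binomial").
  setWeightCol("weight").
  setFeaturesCol("features").
  setLabelCol("response")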

val assembler = new VectorAssembler().
  setInputCols(Array("feat1", "feat2")).
  setOutputCol("features")

val modelPipeline = new Pipeline().
  setStages(Array(assembler,lr))

val evaluator = new BinaryClassificationEvaluator().
  setLabelCol("response")

Then I define a parameter grid and train over it to find the best model by AUC:

val paramGrid = new ParamGridBuilder().
  addGrid(lr.regParam, lambdas).
  build()

val pipeline = new CrossValidator().
  setEstimator(modelPipeline).
  setEvaluator(evaluator).
  setEstimatorParamMaps(paramGrid).
  setNumFolds(nfolds)

val cvModel = pipeline.fit(train)

How can I get the coefficients (betas) of the best logistic regression model?

1 Answer:

Answer 0 (score: 7)

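Extract the best model:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// bestModel is typed as Model[_]; match to recover the fitted pipeline
val bestModel = cvModel.bestModel match {
  case pm: PipelineModel => Some(pm)
  case _ => None
}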

Find the logistic regression model:

val lrm = bestModel
  .map(_.stages.collect { case lrm: LogisticRegressionModel => lrm })
  .flatMap(_.headOption)

Extract the coefficients:

lrm.map(m => (m.intercept, m.coefficients))

Quick and dirty equivalent:

val lrm: LogisticRegressionModel = cvModel
  .bestModel.asInstanceOf[PipelineModel]
  .stages
  .last.asInstanceOf[LogisticRegressionModel]

(lrm.intercept, lrm.coefficients)
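
For analysis it can help to see each coefficient next to its feature name and to know which regParam the winning model used. The following is only a sketch under assumptions, not part of the original answer: the object CoefficientReport and its two methods are hypothetical names, and the name-to-coefficient pairing is only valid when every VectorAssembler input column is a single numeric column (as with feat1 and feat2 in the question), so that the assembled vector has one slot per column.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.CrossValidatorModel

// Hypothetical helper object, introduced only for illustration.
object CoefficientReport {

  // Pair each feature name with its coefficient.
  // Assumption: names and coefficients are in the same order, one name per slot.
  def pairWithNames(names: Array[String], coefficients: Vector): Seq[(String, Double)] = {
    require(names.length == coefficients.size, "expected one name per coefficient")
    names.zip(coefficients.toArray).toSeq
  }

  // Returns (regParam, intercept, named coefficients) for the best model,
  // or None if the best model is not a pipeline that contains both a
  // VectorAssembler and a LogisticRegressionModel.
  def describeBest(cvModel: CrossValidatorModel): Option[(Double, Double, Seq[(String, Double)])] =
    cvModel.bestModel match {
      case pm: PipelineModel =>
        for {
          assembler <- pm.stages.collectFirst { case a: VectorAssembler => a }
          lrm       <- pm.stages.collectFirst { case m: LogisticRegressionModel => m }
        } yield (
          lrm.getRegParam,
          lrm.intercept,
          pairWithNames(assembler.getInputCols, lrm.coefficients)
        )
      case _ => None
    }
}
```

With the cvModel from the question, CoefficientReport.describeBest(cvModel) should yield the selected regParam, the intercept, and a sequence of (feature name, coefficient) pairs. If any assembler input is itself a vector column, the length check fails and the pairing has to be built from the feature vector's metadata instead.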