How do I print the best model's parameters in an Apache Spark Pipeline?

Time: 2015-09-14 13:04:38

Tags: java apache-spark machine-learning apache-spark-mllib

I am using Apache Spark's Pipeline API to validate parameters. I build a TrainValidationSplitModel as follows:

Pipeline pipeline = ...
ParamMap[] paramGrid = ...

TrainValidationSplit trainValidationSplit = new TrainValidationSplit()
    .setEstimator(pipeline)
    .setEvaluator(new MulticlassClassificationEvaluator())
    .setEstimatorParamMaps(paramGrid)
    .setTrainRatio(0.8);
TrainValidationSplitModel model = trainValidationSplit.fit(training);

My question is: how can I extract and print the parameters of the best trained model?

1 answer:

Answer 0 (score: 3)

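I figured it out in the end. Spark logs these metrics and the best parameter set after training. I had Spark's log level set to ERROR, so I had not seen this output: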

2015-10-21 12:57:33,828 [INFO  org.apache.spark.ml.tuning.TrainValidationSplit]
Train validation split metrics: WrappedArray(0.7141940371838821, 0.7358721053749735)

2015-10-21 12:57:33,831 [INFO  org.apache.spark.ml.tuning.TrainValidationSplit]
Best set of parameters:
{
    hashingTF_79cf758f5ab1-numFeatures: 2000000,
    nb_67d55ce4e1fc-smoothing: 1.0
}

2015-10-21 12:57:33,831 [INFO  org.apache.spark.ml.tuning.TrainValidationSplit]
Best train validation split metric: 0.7358721053749735.

I have now set the level to INFO for the TrainValidationSplit class in my log4j.properties file:

log4j.logger.org.apache.spark.ml.tuning.TrainValidationSplit=INFO
log4j.additivity.org.apache.spark.ml.tuning.TrainValidationSplit=false
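
If you would rather not depend on the log level, the log output above suggests the same answer can be recovered in code: the best parameter set is the grid entry whose validation metric is best. The following is only a minimal sketch of that selection step in plain Java, with no Spark dependency. The class BestParamSelector and its methods are hypothetical names made up for this illustration, and the sketch assumes a larger-is-better metric, which matches the log above, where the best metric is the larger of the two. The Spark calls mentioned in the comments (validationMetrics() and getEstimatorParamMaps() on the fitted model) are my assumption about what your Spark version exposes, so check them against its API docs.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): picks the candidate whose
// validation metric is largest.
final class BestParamSelector {

    // Returns the index of the largest metric. Assumes larger is better;
    // for an error-style metric you would look for the smallest instead.
    static int indexOfBest(double[] metrics) {
        if (metrics == null || metrics.length == 0) {
            throw new IllegalArgumentException("metrics must not be empty");
        }
        int best = 0;
        for (int i = 1; i < metrics.length; i++) {
            if (metrics[i] > metrics[best]) {
                best = i;
            }
        }
        return best;
    }

    // Returns the candidate that lines up with the best metric.
    static <T> T selectBest(double[] metrics, List<T> candidates) {
        if (candidates == null || candidates.size() != metrics.length) {
            throw new IllegalArgumentException("need one candidate per metric");
        }
        return candidates.get(indexOfBest(metrics));
    }

    public static void main(String[] args) {
        // The two metrics from the log output in this answer.
        double[] metrics = {0.7141940371838821, 0.7358721053749735};

        // Placeholders standing in for the entries of paramGrid.
        // With Spark (assumption, verify for your version) these would be:
        //   double[] metrics = model.validationMetrics();
        //   ParamMap[] grid  = model.getEstimatorParamMaps();
        List<String> grid = Arrays.asList("paramGrid[0]", "paramGrid[1]");

        int best = indexOfBest(metrics);
        System.out.println("Best index: " + best);
        System.out.println("Best metric: " + metrics[best]);
        System.out.println("Best candidate: " + selectBest(metrics, grid));
    }
}
```

With a real fitted model you would pass the model's metrics and its ParamMap array instead of the placeholders and print the selected ParamMap, which should match the "Best set of parameters" block in the log.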