Visualizing random forest trees in PySpark?

Asked: 2021-06-11 10:50:03

Tags: pyspark random-forest decision-tree apache-spark-ml mlflow

How can I visualize the trees of the best RandomForestClassifier model found by TrainValidationSplit?

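Displaying an ordinary decision tree is no problem. In that case I simply take Pipeline.stages[-1] to get the DecisionTree model. With RandomForest the same approach does not work, because the model contains many trees.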

My goal is to visualize the best tuned tree model after the parameter grid search.

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import mlflow

rT = RandomForestClassifier(featuresCol="features",labelCol="label",maxBins=50)

# Pipeline (indexer, labelToIndex and assembler are defined elsewhere, not shown here)

stagesrT = [indexer, labelToIndex, assembler, rT]

pipeline = Pipeline(stages=stagesrT)

evaluator = MulticlassClassificationEvaluator(labelCol="label",metricName="accuracy")

grid = ParamGridBuilder() \
  .addGrid(rT.maxDepth, [2, 3, 5, 6]) \
  .addGrid(rT.maxBins, [50, 60, 70, 80]) \
  .addGrid(rT.numTrees, [10,20,30,40,100,200]) \
  .build()

tuning = TrainValidationSplit(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=grid, parallelism=25)

rT.setMinInstancesPerNode(100)

with mlflow.start_run(run_name="random-forest") as run:

  # Fit every grid combination, then log the tuned model
  tunedModel = tuning.fit(trainDF)
  mlflow.spark.log_model(tunedModel, "model")
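A minimal sketch of how the individual trees might be pulled out of the tuned model, for illustration only. It assumes that tunedModel is the fitted TrainValidationSplitModel from the code above and that the forest is the last pipeline stage, as with the single decision tree. The helper name best_forest_trees is hypothetical. Note that a random forest has no single "best" tree: the best model is the whole ensemble, so the sketch returns every tree together with its weight.

```python
def best_forest_trees(tuned_model):
    """Return (index, weight, debug_string) for each tree of the best forest.

    Assumption: tuned_model.bestModel is a fitted PipelineModel whose last
    stage is a RandomForestClassificationModel.
    """
    best_pipeline = tuned_model.bestModel   # pipeline with the best validation metric
    forest = best_pipeline.stages[-1]       # last stage: the fitted forest
    return [
        # toDebugString is the textual if/else description of one tree
        (i, weight, tree.toDebugString)
        for i, (tree, weight) in enumerate(zip(forest.trees, forest.treeWeights))
    ]


# Possible usage with the model from the question (not executed here):
# for i, weight, text in best_forest_trees(tunedModel):
#     print(f"--- tree {i} (weight {weight}) ---")
#     print(text)
```

Each element of forest.trees is itself a decision tree model, so whatever already works for displaying a single DecisionTree model should apply to the trees one at a time.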

0 answers:

No answers