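How can I visualize the best random forest tree from a RandomForestClassifier tuned with TrainValidationSplit?
Displaying a plain decision tree is no problem. In that case I simply take Pipeline.stages[-1] to get the DecisionTree model. With a RandomForest this does not work, because the model contains many trees.
My goal is to visualize the best-tuned tree model after a parameter grid search.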
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
import mlflow
rT = RandomForestClassifier(featuresCol="features",labelCol="label",maxBins=50)
# Pipeline
stagesrT = [indexer, labelToIndex, assembler, rT]
pipeline = Pipeline(stages=stagesrT)
evaluator = MulticlassClassificationEvaluator(labelCol="label",metricName="accuracy")
grid = ParamGridBuilder() \
    .addGrid(rT.maxDepth, [2, 3, 5, 6]) \
    .addGrid(rT.maxBins, [50, 60, 70, 80]) \
    .addGrid(rT.numTrees, [10, 20, 30, 40, 100, 200]) \
    .build()
tuning = TrainValidationSplit(estimator=pipeline, evaluator=evaluator, estimatorParamMaps=grid, parallelism=25)
rT.setMinInstancesPerNode(100)
with mlflow.start_run(run_name="random-forrest") as run:
    # Fit every parameter combination, then log the tuned model
    tunedModel = tuning.fit(trainDF)
    mlflow.spark.log_model(tunedModel, "model")
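
For reference, here is a minimal sketch of how the individual trees might be reached after tuning. It assumes that `tunedModel.bestModel` is the fitted `PipelineModel` and that its last stage is the fitted random forest model, which exposes `trees`, `treeWeights`, and, on each tree, `depth`, `numNodes`, and `toDebugString`. The helper name `describe_forest` is hypothetical. Note that a random forest predicts by combining all of its trees, so this lists every tree rather than picking a single "best" one.

```python
def describe_forest(tuned_model):
    """Return one summary dict per tree of the best fitted forest.

    Assumption: tuned_model.bestModel is a PipelineModel whose last
    stage is the fitted random forest model (as in the pipeline above).
    """
    forest = tuned_model.bestModel.stages[-1]
    return [
        {
            "index": i,
            "weight": weight,
            "depth": tree.depth,
            "num_nodes": tree.numNodes,
            # Plain-text dump of the tree's split rules
            "debug_string": tree.toDebugString,
        }
        for i, (tree, weight) in enumerate(zip(forest.trees, forest.treeWeights))
    ]


# Hypothetical usage with the tunedModel fitted above:
# summaries = describe_forest(tunedModel)
# print(summaries[0]["debug_string"])
```

Each `debug_string` could then be printed or passed to whatever approach already works for a single decision tree.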