I am trying to understand the following behaviour when training a random forest in PySpark. As usual, I define the pipeline, evaluator, param grid, and so on:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
rfClassifier = RandomForestClassifier(labelCol='Target')
cv_pipeline = Pipeline(stages=[rfClassifier])
evaluator = MulticlassClassificationEvaluator(labelCol='Target')
paramGrid = ParamGridBuilder() \
    .addGrid(RandomForestClassifier.numTrees, [100, 120]) \
    .addGrid(RandomForestClassifier.maxDepth, [7, 10]) \
    .addGrid(RandomForestClassifier.subsamplingRate, [0.8, 0.9]) \
    .addGrid(RandomForestClassifier.featureSubsetStrategy, ['auto', 'log2', 'onethird']) \
    .build()
Then I run cross-validation:
train.cache()
crossval = CrossValidator(estimator = cv_pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel = crossval.fit(train)
and inspect the best model:
model = cvModel.bestModel
java_model = model.stages[-1]._java_obj
# read the tuned values back from the underlying Java model
{param.name: java_model.getOrDefault(java_model.getParam(param.name))
 for param in paramGrid[0]}
Unexpectedly, I get the following result:
{'featureSubsetStrategy': 'auto',
'maxDepth': 5,
'numTrees': 20,
'subsamplingRate': 1.0}
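These are exactly the library defaults, so I wondered whether the problem is that I built the grid from the class (`RandomForestClassifier.numTrees`) rather than from my instance (`rfClassifier.numTrees`). Here is a minimal pure-Python sketch of what I suspect is happening; the classes below are hypothetical mocks, not pyspark's real ones, but they illustrate how a setting keyed by a param's owning object could be silently ignored:

```python
# Hypothetical mock of a params system where each Param knows its owner.
class Param:
    def __init__(self, parent, name):
        self.parent, self.name = parent, name

    def __hash__(self):
        return hash((self.parent, self.name))

    def __eq__(self, other):
        return (self.parent, self.name) == (other.parent, other.name)


class Estimator:
    # class-level placeholder param, owned by no particular instance
    numTrees = Param("undefined", "numTrees")

    def __init__(self, uid):
        # each instance gets its own copy, re-parented to the instance uid
        self.numTrees = Param(uid, "numTrees")


est = Estimator("rf_1")
grid_key = Estimator.numTrees      # grid built from the class, parent "undefined"
param_map = {grid_key: 100}

# The fitted instance looks up its own param, which is a different key,
# so the grid value is never found and the default is used instead:
print(est.numTrees in param_map)   # → False
```

If pyspark's params behave anything like this, a grid built from the class attributes would never match the params of the `rfClassifier` instance being fitted, and the defaults would win every time.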
Can someone explain why the best parameters were not selected from the grid? Many thanks.