PySpark: best parameters obtained during cross-validation are not from the grid

Time: 2020-04-26 16:00:39

Tags: pyspark cross-validation

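I am trying to understand the following problem, which occurs when training a random forest in PySpark. As usual, we define the pipeline, the evaluator, the grid, and so on: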

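from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder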

rfClassifier = RandomForestClassifier(labelCol='Target')
cv_pipeline = Pipeline(stages=[rfClassifier])

evaluator = MulticlassClassificationEvaluator(labelCol='Target')

paramGrid = ParamGridBuilder().addGrid(RandomForestClassifier.numTrees, [100, 120]).\
addGrid(RandomForestClassifier.maxDepth, [7, 10] ).addGrid(RandomForestClassifier.subsamplingRate, [0.8,0.9]).\
addGrid(RandomForestClassifier.featureSubsetStrategy, ['auto','log2', 'onethird']).build()

Run the cross-validation:

train.cache()
crossval = CrossValidator(estimator = cv_pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=3)
cvModel=crossval.fit(train)

and inspect the best model:

model = cvModel.bestModel
java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name)) 
    for param in paramGrid[0]}

Unexpectedly, I get the following result:

{'featureSubsetStrategy': 'auto',
 'maxDepth': 5,
 'numTrees': 20,
 'subsamplingRate': 1.0}
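One way to cross-check this result is to pair each grid point with its average cross-validation metric and see which one actually scored best. The sketch below assumes `cvModel.avgMetrics` is in the same order as `paramGrid`; `rank_grid` is a hypothetical helper name, not part of PySpark.

```python
def rank_grid(param_maps, avg_metrics, larger_is_better=True):
    """Return (readable_params, metric) pairs sorted best-first.

    param_maps  -- list of dicts keyed by Param objects (e.g. paramGrid)
    avg_metrics -- one average metric per param map (e.g. cvModel.avgMetrics)
    """
    rows = []
    for param_map, metric in zip(param_maps, avg_metrics):
        # Use the Param's name when available, otherwise fall back to str().
        readable = {getattr(p, "name", str(p)): v for p, v in param_map.items()}
        rows.append((readable, metric))
    return sorted(rows, key=lambda row: row[1], reverse=larger_is_better)


# Usage with the objects defined above (requires the fitted cvModel):
# for params, metric in rank_grid(paramGrid, cvModel.avgMetrics):
#     print(metric, params)
```

If every grid point reports the same metric, that would suggest the grid values were never applied to the estimator.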

Can someone explain why the best parameters were not chosen from the grid? Many thanks.

0 answers:

No answers