How do I run cross-validation in pyspark without using ParamGridBuilder?

Time: 2018-10-10 13:05:39

Tags: pyspark apache-spark-mllib

I want to do sklearn-style cross-validation in pyspark without having to use ParamGridBuilder:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
# The keyword is elasticNetParam; LinearRegression has no elasticNet argument
lr = LinearRegression(regParam=0.1, elasticNetParam=0.2, maxIter=100)
# A regression model needs a regression metric, not a binary-classification one
crossval = CrossValidator(estimator=lr,
                          evaluator=RegressionEvaluator(),
                          numFolds=2)

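Is it possible to run cross-validation this way, without a ParamGridBuilder? My use case is that I want to pass the hyperparameters as arguments to the LinearRegression class rather than as a paramGrid object.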

1 Answer:

Answer 0: (score: 0)

A simple solution is to put only the values you want to use into the ParamGrid, one value per parameter:

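from pyspark.ml.evaluation import RegressionEvaluator

# One value per parameter, so the "grid" contains a single combination
paramGrid = ParamGridBuilder() \
  .addGrid(lr.regParam, [0.1]) \
  .addGrid(lr.elasticNetParam, [0.2]) \
  .addGrid(lr.maxIter, [100]) \
  .build()

# RegressionEvaluator matches the LinearRegression estimator
crossval = CrossValidator(estimator=lr,
                estimatorParamMaps=paramGrid,
                evaluator=RegressionEvaluator(),
                numFolds=2)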

You can also always write your own version of K-fold by splitting the dataset into K parts:

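from functools import reduce
from pyspark.ml.evaluation import RegressionEvaluator

# Split the data into K roughly equal folds (K = 2 here)
folds = df.randomSplit([0.5, 0.5], seed=42)
evaluator = RegressionEvaluator()
res = []
for i, test in enumerate(folds):
    # Train on every fold except the held-out one, then score on the held-out fold
    train = reduce(lambda a, b: a.union(b), folds[:i] + folds[i + 1:])
    model = lr.fit(train)
    res.append(evaluator.evaluate(model.transform(test)))

# Placeholder: aggregate the per-fold metrics however you like
do_what_you_want(res)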