I am using Spark's CrossValidator to tune the parameters of an explicit ALS model. The evaluator is a BinaryClassificationEvaluator with metricName='areaUnderROC' to compute the AUC, but this fails with an error. My code is as follows:
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator

alsExplicit = ALS(
    implicitPrefs=is_implicit,  # False here, since this is the explicit-feedback model
    numItemBlocks=100,
    numUserBlocks=100,
    userCol='device_id',
    itemCol='item_id',
    ratingCol='rating',
)
paramMapExplicit = ParamGridBuilder() \
    .addGrid(alsExplicit.rank, [30, 40]) \
    .addGrid(alsExplicit.maxIter, [10, 15]) \
    .addGrid(alsExplicit.regParam, [0.01, 0.1]) \
    .build()

evaluator_AUC = BinaryClassificationEvaluator(
    labelCol='rating',
    rawPredictionCol='prediction',
    metricName='areaUnderROC'
)
cvExplicit = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluator_AUC, numFolds=5)
cvModelExplicit = cvExplicit.fit(train_data)  # this line raises the error
The error is:
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column prediction must be of type equal to one of the following types: [DoubleType, org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7] but was actually of type FloatType.'
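For reference, the FloatType is easy to confirm by checking the dtype of the column that ALS produces (a quick check I ran, using the same train_data/test_data split as in the snippets below; pred_check is just a throwaway name):

# ALSModel writes its predictions as FloatType, which is exactly what the evaluator rejects
pred_check = alsExplicit.fit(train_data).transform(test_data)
print(dict(pred_check.dtypes)['prediction'])   # prints 'float'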
When I change the evaluator to a RegressionEvaluator, it runs fine, as shown below:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator_RMSE = RegressionEvaluator(
    metricName='rmse',
    labelCol='rating',
    predictionCol='prediction'
)
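For completeness, this is what I mean by "runs fine": plugging this evaluator into the same CrossValidator setup (same estimator and grid; the variable names here are just for illustration) finishes without the type error:

cvExplicitRMSE = CrossValidator(estimator=alsExplicit, estimatorParamMaps=paramMapExplicit, evaluator=evaluator_RMSE, numFolds=5)
cvModelRMSE = cvExplicitRMSE.fit(train_data)   # no type error with the regression metric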
Also, if I train a model with fixed parameters, transform the test data with that model, and then compute the AUC with the BinaryClassificationEvaluator, I get the same error:
model = als.fit(train_data)          # als: an ALS estimator with fixed rank/maxIter/regParam
pred = model.transform(test_data)
auc = evaluator_AUC.evaluate(pred)   # fails with the same FloatType error
I then tried casting the type manually:
from pyspark.sql.types import DoubleType

pred = pred.withColumn('prediction', pred['prediction'].cast(DoubleType()))
auc = evaluator_AUC.evaluate(pred)
This works.
However, with cross-validation the predictions are generated inside CrossValidator.fit, so there is no place where I can insert this cast on the DataFrame myself. What should I do?
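The only idea I have come up with so far (untested, so I am not sure it is correct) is to wrap the ALS estimator and a casting step into a single Pipeline, so that whatever the CrossValidator fits already exposes a DoubleType column for the evaluator. The names caster, prediction_double and evaluator_AUC_double below are just placeholders of mine:

from pyspark.ml import Pipeline
from pyspark.ml.feature import SQLTransformer

# SQLTransformer casts the float 'prediction' column to double inside the pipeline itself
caster = SQLTransformer(
    statement="SELECT *, CAST(prediction AS DOUBLE) AS prediction_double FROM __THIS__"
)
pipeline = Pipeline(stages=[alsExplicit, caster])

evaluator_AUC_double = BinaryClassificationEvaluator(
    labelCol='rating',
    rawPredictionCol='prediction_double',   # point the evaluator at the casted column
    metricName='areaUnderROC'
)

cvExplicit = CrossValidator(
    estimator=pipeline,                      # the grid still refers to alsExplicit's params
    estimatorParamMaps=paramMapExplicit,
    evaluator=evaluator_AUC_double,
    numFolds=5
)
cvModelExplicit = cvExplicit.fit(train_data)

Is this a reasonable way to do it, or is there a cleaner approach?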