evaluator = BinaryClassificationEvaluator()
grid = ParamGridBuilder().build() # no hyper parameter optimization
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Returns:
cvModel.avgMetrics = [1.602872634746238]
evaluator.evaluate(cvModel.transform(dataset)) = 0.7267754950388204
Question: why does cvModel.avgMetrics differ from evaluator.evaluate(cvModel.transform(dataset)), when both are fit and evaluated on the same dataset?

Answer (score: 3):
This is a bug that was recently fixed, but the fix has not yet been released.
Based on what you provided, I used the following code to reproduce the issue:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
dataset = sc.parallelize([
Row(features=Vectors.dense([1., 0.]), label=1.),
Row(features=Vectors.dense([1., 1.]), label=0.),
Row(features=Vectors.dense([0., 0.]), label=1.),
]).toDF()
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
lr = LogisticRegression()
# addGrid expects the Param object (lr.maxIter), not the string 'maxIter';
# a two-value grid, so avgMetrics will contain two entries
grid = ParamGridBuilder().addGrid(lr.maxIter, [100, 10]).build()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
evaluator.evaluate(cvModel.transform(dataset))
Out[23]: 1.0
cvModel.avgMetrics
Out[34]: [2.0, 2.0]
Simply put, avgMetrics sums the metric over the folds instead of averaging it; a workaround is sketched below.
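Until a release with the fix is available, you can recover the true averages yourself. A minimal sketch, assuming the summing behavior shown above; getNumFolds() is the standard CrossValidator accessor:
num_folds = cv.getNumFolds()  # CrossValidator defaults to 3 folds
avg_metrics = [m / num_folds for m in cvModel.avgMetrics]  # undo the erroneous sum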
Edit:
Regarding the second question, the simplest way to verify this is to provide a separate test dataset:
to_test = sc.parallelize([
Row(features=Vectors.dense([1., 0.]), label=1.),
Row(features=Vectors.dense([1., 1.]), label=0.),
Row(features=Vectors.dense([0., 1.]), label=1.),
]).toDF()
evaluator.evaluate(cvModel.transform(to_test))
Out[2]: 0.5
This confirms that the call returns the metric computed on the test dataset.
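More generally, the usual pattern is to hold out a test set before fitting. A minimal sketch, assuming you start from the original dataset; the split weights and seed are arbitrary:
train, test = dataset.randomSplit([0.8, 0.2], seed=42)  # hold out a test set
cvModel = cv.fit(train)  # cross-validate on the training portion only
print(evaluator.evaluate(cvModel.transform(test)))  # metric on unseen data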