Question

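I am confused about the input to BinaryClassificationMetrics (MLlib). According to {{3}}, we need to pass predictedAndLabel of type RDD[(Double, Double)], built from the transformed DataFrame that already contains the prediction, probability (Vector) and rawPrediction (Vector) columns.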

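I created an RDD[(Double, Double)] from the prediction and label columns. After running BinaryClassificationMetrics on the output of a NaiveBayesModel, I can retrieve ROC, PR and so on, but there are too few values to plot a curve from them: ROC contains 4 values and PR contains 3.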

Is this the right way to prepare predictedAndLabel, or should I use the rawPrediction column or the probability column instead of the prediction column?

Answer 1

Prepare it like this:

import org.apache.spark.mllib.linalg.Vector // on Spark 2.0+, use org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}

val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val predictions = new NaiveBayes().fit(df).transform(df)

// Pair each row's score for the positive class (label 1.0) with its true label.
val preds = predictions.select("probability", "label").rdd.map(row =>
  (row.getAs[Vector](0)(1), row.getAs[Double](1)))

And evaluate:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

new BinaryClassificationMetrics(preds, 10).roc
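
To plot the curves, the points can be collected to the driver. The following is a minimal sketch, not a definitive recipe: `curvePoints` is a hypothetical helper, and the small `toy` RDD is made-up data standing in for `preds`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of Spark): pulls curve points to the driver for plotting.
def curvePoints(scoreAndLabels: RDD[(Double, Double)], numBins: Int) = {
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins)
  val roc = metrics.roc().collect() // (false positive rate, true positive rate)
  val pr = metrics.pr().collect()   // (recall, precision)
  (roc, pr, metrics.areaUnderROC())
}

// Made-up data standing in for `preds`; reuses the active context in spark-shell.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[1]").setAppName("roc-demo"))
val toy = sc.parallelize(Seq((0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)))

val (roc, pr, auc) = curvePoints(toy, 10)
roc.foreach { case (fpr, tpr) => println(s"$fpr\t$tpr") }
```

The collected arrays are ordinary Scala collections, so they can be written to a file or handed to whichever plotting tool you use.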

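If the prediction is only 0 or 1, the number of buckets can be low, as in your case: two distinct scores give only two thresholds, so the ROC curve has those two points plus the (0, 0) and (1, 1) end points. Try more complex data like this: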

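import org.apache.spark.sql.functions.rand // random scores in [0, 1)
import sqlContext.implicits._ // enables the $"label" column syntax
val anotherPreds = df.select(rand(), $"label").rdd.map(row => (row.getDouble(0), row.getDouble(1)))
new BinaryClassificationMetrics(anotherPreds, 10).roc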
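
The link between the number of distinct scores and the number of curve points can also be seen without Spark. The plain-Scala sketch below is an illustration only and is not Spark's implementation: `rocPoints` and both toy data sets are hypothetical. It emits one ROC point per distinct score, plus the (0, 0) and (1, 1) end points.

```scala
// Illustration only (no Spark): one ROC point per distinct score,
// plus the (0, 0) and (1, 1) end points.
def rocPoints(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Double)] = {
  val positives = scoreAndLabels.count(_._2 > 0.5).toDouble
  val negatives = scoreAndLabels.size - positives
  // Distinct scores, highest first; each one is used as a decision threshold.
  val thresholds = scoreAndLabels.map(_._1).distinct.sorted.reverse
  val inner = thresholds.map { t =>
    val predictedPositive = scoreAndLabels.filter(_._1 >= t)
    val tp = predictedPositive.count(_._2 > 0.5)
    val fp = predictedPositive.size - tp
    (fp / negatives, tp / positives) // (false positive rate, true positive rate)
  }
  Seq((0.0, 0.0)) ++ inner ++ Seq((1.0, 1.0))
}

// Hard 0/1 predictions: two distinct scores, so 2 + 2 = 4 points.
val hard = Seq((1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
// Probability-like scores: five distinct values, so 5 + 2 = 7 points.
val soft = Seq((0.9, 1.0), (0.6, 0.0), (0.4, 1.0), (0.1, 0.0), (0.8, 1.0))

println(rocPoints(hard).size) // 4
println(rocPoints(soft).size) // 7
```

This matches the four ROC values reported in the question: scoring with the prediction column leaves only two thresholds, while the probability column gives many more.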

How to supply the prediction and label columns to BinaryClassificationMetrics evaluation for a Naive Bayes model
