I have a dataframe that stores the scores and labels for various binary classification problems I have. For example:
| problem | score | label |
|:--------|:------|:------|
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics for each problem (for example, areaUnderROC; see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification), with the end result looking like:
| problem | areaUnderROC |
|:--------|:-------------|
| a | 0.83 |
| b | 0.68 |
| ... | ... |
I would like to do something like:
```scala
df.groupBy("problem").agg(getMetrics)
```
but I am not sure how to write getMetrics as an Aggregator (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
Answer 0 (score: 1)
There is a module built specifically for binary metrics - see it in the python docs.
This code should work:
```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# BinaryClassificationMetrics expects an RDD of (score, label) pairs,
# so convert the boolean labels to doubles and drop down to the RDD API
score_and_labels_a = df.filter("problem = 'a'") \
    .select("score", "label") \
    .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)

score_and_labels_b = df.filter("problem = 'b'") \
    .select("score", "label") \
    .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)
```
... and so on for the rest of the problems.
In my opinion this is the easiest way :)
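If there are many problems, the same idea can be wrapped in a loop. A minimal sketch, assuming the dataframe is named `df` as in the question and that `label` is a boolean column:

```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Hypothetical generalization of the snippet above: iterate over every
# distinct problem instead of hard-coding 'a' and 'b'
problems = [row["problem"] for row in df.select("problem").distinct().collect()]
for p in problems:
    # BinaryClassificationMetrics wants an RDD of (score, label) pairs
    score_and_labels = df.filter(df["problem"] == p) \
        .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
    metrics = BinaryClassificationMetrics(score_and_labels)
    print(p, metrics.areaUnderROC, metrics.areaUnderPR)
```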
Answer 1 (score: 0)
Spark has very useful classes for computing metrics for binary and multiclass classification, but they are only available in the RDD-based API. So, with a little code to move between DataFrames and RDDs, it can be done. A full example could be the following:
```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestMetrics {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Example")
        .master("local[1]")
        .getOrCreate()

    import spark.implicits._

    // Test data with your schema
    val someData = Seq(
      Row("a", 0.8, true),
      Row("a", 0.7, true),
      Row("a", 0.2, false),
      Row("b", 0.9, false),
      Row("b", 0.3, true),
      Row("b", 0.1, false)
    )

    // Set your threshold to turn a score into a positive or negative prediction
    val threshold: Double = 0.5

    import org.apache.spark.sql.functions._

    // First udf to convert a probability into a positive or negative prediction
    def _thresholdUdf(threshold: Double): Double => Double = prob => if (prob > threshold) 1.0 else 0.0

    // Cast boolean to double
    val thresholdUdf = udf { _thresholdUdf(threshold) }
    val castToDouUdf = udf { (label: Boolean) => if (label) 1.0 else 0.0 }

    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))

    // Apply the udfs to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))

    // First pass to get the list of problems. Maybe it would be possible to do it all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()

    // Get the RDD from the dataframe and build the Array[(String, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(
      df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
        .filter(el => el._1 == a).map { case (_, predict, label) => (predict, label) }.rdd)))

    // And the metrics for each 'problem' are available
    val results = dfList.toMap.mapValues(metrics =>
      Seq(metrics.areaUnderROC(),
          metrics.areaUnderPR()))
    val moreMetrics = dfList.toMap.map(metrics => (metrics._1, metrics._2.scoreAndLabels))

    // Get metrics by key, in your case the 'problem'
    results.foreach(element => println(element))

    // Score and labels
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
  }
}
```