I have a dataframe that stores the scores and labels for various binary classification problems I have. For example:
| problem | score | label |
|:--------|:------|:------|
| a | 0.8 | true |
| a | 0.7 | true |
| a | 0.2 | false |
| b | 0.9 | false |
| b | 0.3 | true |
| b | 0.1 | false |
| ... | ... | ... |
Now my goal is to get binary evaluation metrics for each problem (for example, areaUnderROC; see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification), with the end result looking like:
| problem | areaUnderROC |
|:--------|:-------------|
| a | 0.83 |
| b | 0.68 |
| ... | ... |
I would like to do something like:
```scala
df.groupBy("problem").agg(getMetrics)
```
but I am not sure how to write getMetrics as an Aggregator (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?
Answer 0 (score: 1)
There is a module built specifically for binary metrics - see it in the python docs.
This code should work:
```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# BinaryClassificationMetrics expects an RDD of (score, label) pairs,
# so convert the boolean labels to doubles and drop down to the RDD API
score_and_labels_a = df.filter("problem = 'a'") \
    .select("score", "label") \
    .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)

score_and_labels_b = df.filter("problem = 'b'") \
    .select("score", "label") \
    .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)
```
... and so on for the rest of the problems.
In my opinion this is the easiest way :)
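If there are many problems, the same idea can be wrapped in a loop. A minimal sketch, assuming the dataframe is named `df` as in the question and that `label` is a boolean column:

```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Hypothetical generalization of the snippet above: iterate over every
# distinct problem instead of hard-coding 'a' and 'b'
problems = [row["problem"] for row in df.select("problem").distinct().collect()]
for p in problems:
    # BinaryClassificationMetrics wants an RDD of (score, label) pairs
    score_and_labels = df.filter(df["problem"] == p) \
        .rdd.map(lambda row: (row["score"], 1.0 if row["label"] else 0.0))
    metrics = BinaryClassificationMetrics(score_and_labels)
    print(p, metrics.areaUnderROC, metrics.areaUnderPR)
```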
Answer 1 (score: 0)
Spark has very useful classes for computing metrics for binary and multiclass classification, but they are only available in the RDD-based API. So, with a little code to move between DataFrames and RDDs, it can be done. A full example could be the following:
```scala
import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestMetrics {

  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.OFF)
    Logger.getLogger("akka").setLevel(Level.OFF)

    implicit val spark: SparkSession =
      SparkSession
        .builder()
        .appName("Example")
        .master("local[1]")
        .getOrCreate()

    import spark.implicits._

    // Test data with your schema
    val someData = Seq(
      Row("a", 0.8, true),
      Row("a", 0.7, true),
      Row("a", 0.2, false),
      Row("b", 0.9, false),
      Row("b", 0.3, true),
      Row("b", 0.1, false)
    )

    // Set your threshold to turn a score into a positive or negative prediction
    val threshold: Double = 0.5

    import org.apache.spark.sql.functions._

    // First udf to convert a probability into a positive or negative prediction
    def _thresholdUdf(threshold: Double): Double => Double = prob => if (prob > threshold) 1.0 else 0.0

    // Cast boolean to double
    val thresholdUdf = udf { _thresholdUdf(threshold) }
    val castToDouUdf = udf { (label: Boolean) => if (label) 1.0 else 0.0 }

    // Schema to build the dataframe
    val schema = List(StructField("problem", StringType), StructField("score", DoubleType), StructField("label", BooleanType))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(schema))

    // Apply the udfs to get the double representation of all fields
    val df0 = df.withColumn("binarypredict", thresholdUdf('score)).withColumn("labelDouble", castToDouUdf('label))

    // First pass to get the list of problems. Maybe it would be possible to do it all in one cycle
    val pbl = df0.select("problem").distinct().as[String].collect()

    // Get the RDD from the dataframe and build the Array[(String, BinaryClassificationMetrics)]
    val dfList = pbl.map(a => (a, new BinaryClassificationMetrics(
      df0.select("problem", "binarypredict", "labelDouble").as[(String, Double, Double)]
        .filter(el => el._1 == a).map { case (_, predict, label) => (predict, label) }.rdd)))

    // And the metrics for each 'problem' are available
    val results = dfList.toMap.mapValues(metrics =>
      Seq(metrics.areaUnderROC(),
          metrics.areaUnderPR()))
    val moreMetrics = dfList.toMap.map(metrics => (metrics._1, metrics._2.scoreAndLabels))

    // Get metrics by key, in your case the 'problem'
    results.foreach(element => println(element))

    // Score and labels
    moreMetrics.foreach(element => element._2.foreach { pr => println(s"${element._1} ${pr}") })
  }
}
```