I trained a DecisionTree model on a PySpark DataFrame. The resulting DataFrame looks like this:
rdd = sc.parallelize(
[
(0., 1.),
(0., 0.),
(0., 0.),
(1., 1.),
(1., 0.),
(1., 0.),
(1., 1.),
(1., 1.)
]
)
df = sqlContext.createDataFrame(rdd, ["prediction", "target_index"])
df.show()
+----------+------------+
|prediction|target_index|
+----------+------------+
| 0.0| 1.0|
| 0.0| 0.0|
| 0.0| 0.0|
| 1.0| 1.0|
| 1.0| 0.0|
| 1.0| 0.0|
| 1.0| 1.0|
| 1.0| 1.0|
+----------+------------+
Let's compute a metric, recall:
metricsp = MulticlassMetrics(df.rdd)
print metricsp.recall()
0.625
OK. Let's try to verify that this is correct:
tp = df[(df.target_index == 1) & (df.prediction == 1)].count()
tn = df[(df.target_index == 0) & (df.prediction == 0)].count()
fp = df[(df.target_index == 0) & (df.prediction == 1)].count()
fn = df[(df.target_index == 1) & (df.prediction == 0)].count()
print "True Positives:", tp
print "True Negatives:", tn
print "False Positives:", fp
print "False Negatives:", fn
print "Total", df.count()
True Positives: 3
True Negatives: 2
False Positives: 2
False Negatives: 1
Total 8
And compute recall by hand:
r = float(tp)/(tp + fn)
print "recall", r
recall 0.75
The results differ. What am I doing wrong?
By the way, all functions of the MulticlassMetrics class give the same result:
print metricsp.recall()
print metricsp.precision()
print metricsp.fMeasure()
0.625
0.625
0.625
Answer (score: 4)
The problem is that you are applying MulticlassMetrics to the output of a binary classifier. From the docs:
recall()
Returns recall (equals to precision for multiclass classifier because sum of all false positives is equal to sum of all false negatives)
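The quoted behaviour is easy to check without Spark: with micro-averaging, every false positive for one label is a false negative for some other label, so unlabeled recall, precision, and fMeasure all collapse to plain accuracy. A minimal sketch using the same eight (prediction, label) pairs from the question:

```python
# Same eight (prediction, label) pairs as in the question.
pairs = [(0., 1.), (0., 0.), (0., 0.), (1., 1.),
         (1., 0.), (1., 0.), (1., 1.), (1., 1.)]

# Micro-averaged recall/precision/fMeasure all reduce to accuracy:
# the fraction of rows where prediction == label.
correct = sum(1 for pred, label in pairs if pred == label)
print(float(correct) / len(pairs))  # 0.625, matching metricsp.recall()
```

Five of the eight predictions match their label, which is exactly the 0.625 reported for all three unlabeled metrics.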
To get the result you expect, use recall(label=1):
>>> print metricsp.recall(label=1)
0.75
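For reference, the per-label figure can be reproduced in plain Python (a sketch standing in for Spark, using a hypothetical helper `recall_for`): recall for a given label only considers rows whose true label matches.

```python
pairs = [(0., 1.), (0., 0.), (0., 0.), (1., 1.),
         (1., 0.), (1., 0.), (1., 1.), (1., 1.)]  # (prediction, label)

def recall_for(label, pairs):
    # Rows whose true label matches, and the subset predicted correctly.
    relevant = [(p, l) for p, l in pairs if l == label]
    tp = sum(1 for p, l in relevant if p == l)
    return float(tp) / len(relevant)

print(recall_for(1., pairs))  # 0.75 (3 of the 4 label-1 rows are caught)
print(recall_for(0., pairs))  # 0.5  (2 of the 4 label-0 rows)
```

The label=1 value matches tp / (tp + fn) = 3 / 4 from the manual check in the question.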
BTW, the column headers in df.show() seem to be mixed up; it should be:
+------------+----------+
|target_index|prediction|
+------------+----------+
|         0.0|       1.0|
|         0.0|       0.0|
|         0.0|       0.0|
|         1.0|       1.0|
|         1.0|       0.0|
|         1.0|       0.0|
|         1.0|       1.0|
|         1.0|       1.0|
+------------+----------+