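I'm starting out with PySpark and building a binary classification model (logistic regression), and I need to find the optimal threshold (cut-off point) for the model.
I want to use the ROC curve to find this point, but I don't know how to extract the threshold value for each point on the curve. Is there a way to get these values?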
Things I have found:
Other facts
Answer 0: (score: 1)
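If you specifically need to generate ROC curves for different thresholds, one approach could be to generate a list of the thresholds you are interested in and fit/transform on your dataset for each threshold. Or you could manually calculate the ROC curve for each threshold point using the probability field in the response from model.transform(test).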
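The manual calculation from the probability field could be sketched as follows, assuming the (score, label) pairs have already been collected to the driver as a plain Python list. The function name `roc_points` and the sample data are hypothetical and only serve as an illustration; a row is treated as positive when its score is at or above the threshold.

```python
def roc_points(scored, thresholds):
    """Return a (threshold, fpr, tpr) tuple for each threshold.

    scored: list of (positive-class probability, label) pairs, label in {0, 1}.
    A row is predicted positive when its probability is >= the threshold.
    """
    positives = sum(1 for _, label in scored if label == 1)
    negatives = len(scored) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for p, label in scored if p >= t and label == 1)
        fp = sum(1 for p, label in scored if p >= t and label == 0)
        # Guard against a dataset that contains only one class
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((t, fpr, tpr))
    return points


# Hypothetical sample data standing in for collected model output
sample = [(0.9, 1), (0.8, 1), (0.7, 0), (0.4, 1), (0.2, 0), (0.1, 0)]
for t, fpr, tpr in roc_points(sample, [0.25, 0.5, 0.75]):
    print(t, fpr, tpr)
```

Because every point carries its own threshold, the cut-off behind any position on the curve can be read off directly. Each threshold costs one pass over the collected rows, so this is only reasonable for data that fits comfortably on the driver.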
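Alternatively, you can use BinaryClassificationMetrics to extract a curve that plots various metrics (F1 score, precision, recall) by threshold.
Unfortunately, it appears that the PySpark version does not implement most of the methods that the Scala version does, so you would need to wrap the class to do this in Python.
For example: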
from pyspark.mllib.evaluation import BinaryClassificationMetrics

# Scala version implements .roc() and .pr()
# Python: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/mllib/common.html
# Scala: https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html
class CurveMetrics(BinaryClassificationMetrics):
    def __init__(self, *args):
        super(CurveMetrics, self).__init__(*args)

    def _to_list(self, rdd):
        points = []
        # Note this collect could be inefficient for large datasets
        # considering there may be one probability per datapoint (at most)
        # The Scala version takes a numBins parameter,
        # but it doesn't seem possible to pass this from Python to Java
        for row in rdd.collect():
            # Results are returned as type scala.Tuple2,
            # which doesn't appear to have a py4j mapping
            points += [(float(row._1()), float(row._2()))]
        return points

    def get_curve(self, method):
        rdd = getattr(self._java_model, method)().toJavaRDD()
        return self._to_list(rdd)
Usage:
import matplotlib.pyplot as plt

# predictions is the DataFrame returned by model.transform(test)
preds = predictions.select('label','probability').rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))
# Returns a list of (false positive rate, true positive rate) tuples
roc = CurveMetrics(preds).get_curve('roc')

plt.figure()
x_val = [x[0] for x in roc]
y_val = [x[1] for x in roc]
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.plot(x_val, y_val)
Answer 1: (score: 0)
One way is to use sklearn.metrics.roc_curve.
First, use your fitted model to make predictions:
from pyspark.ml.classification import LogisticRegression
lr = LogisticRegression(labelCol="label", featuresCol="features")
model = lr.fit(trainingData)
predictions = model.transform(testData)
Then collect your scores and labels [1]:
preds = predictions.select('label','probability')\
.rdd.map(lambda row: (float(row['probability'][1]), float(row['label'])))\
.collect()
Now transform preds so that it works with roc_curve:
from sklearn.metrics import roc_curve
y_score, y_true = zip(*preds)
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label = 1)
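With fpr, tpr, and thresholds in hand, a cut-off still has to be chosen by some criterion. The sketch below uses Youden's J statistic (TPR minus FPR), which is one common choice and not something this answer prescribes. The helper name `best_threshold` and the sample values are hypothetical; the inputs only need to be three equal-length, index-aligned sequences, like those returned by roc_curve.

```python
def best_threshold(fpr, tpr, thresholds):
    """Return (threshold, J) for the point maximising Youden's J = TPR - FPR.

    On ties, the first maximising point in the sequence is returned.
    """
    best_index = max(range(len(thresholds)), key=lambda i: tpr[i] - fpr[i])
    return thresholds[best_index], tpr[best_index] - fpr[best_index]


# Illustrative values only, shaped like the output of roc_curve
fpr = [0.0, 0.0, 0.25, 0.25, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
thresholds = [1.8, 0.8, 0.4, 0.35, 0.1]

threshold, j = best_threshold(fpr, tpr, thresholds)
print(threshold, j)
```

Other criteria, such as the point closest to the top-left corner or a cost-weighted trade-off between false positives and false negatives, fit the same shape: change the key function and leave the rest as is.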
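Notes:
[1] The code above takes the positive-class probability from index 1 of the probability vector. However, in a binary classification problem you will know right away if your AUC is less than 0.5. In that case, simply use 1-p as the probability (since the class probabilities sum to 1).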