Question

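I am using the ScikitLearn flavor of the DecisionTree.jl package to create a random forest model for a binary classification problem on one of the RDatasets data sets (see the bottom of the DecisionTree.jl home page for what I mean by ScikitLearn flavor). I am also using the MLBase package for model evaluation.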

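I have built a random forest model of my data and would like to create a ROC curve for it. From reading the available documentation, I understand what a ROC curve is in theory. What I cannot figure out is how to create one for a specific model.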

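From the Wikipedia page, the last part of the first sentence (the phrase "as its discrimination threshold is varied") is what confuses me: "In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied." The article says more about the threshold throughout, but it still confuses me for binary classification problems. What is the threshold, and how do I vary it?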

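Also, the MLBase documentation on ROC Curves says "compute an ROC instance or an ROC curve (a vector of ROC instances), based on given scores and a threshold thres." But it does not mention this threshold anywhere else.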

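Sample code for my project is given below. Basically, I want to create a ROC curve for the random forest, but I am not sure how to do it, or whether it is even appropriate.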

using DecisionTree
using RDatasets
using MLBase

quakes_data = dataset("datasets", "quakes");

# Add in a binary column as feature column for classification
quakes_data[:MagGT5] = convert(Array{Int32,1}, quakes_data[:Mag] .> 5.0)

# Getting features and labels where label = 1 is mag > 5 and label = 2 is mag <= 5
features = convert(Array, quakes_data[:, [1:3;5]]);
labels = convert(Array, quakes_data[:, 6]);
labels[labels.==0] = 2

# Create a random forest model with the tuning parameters I want
r_f_model = RandomForestClassifier(nsubfeatures = 3, ntrees = 50, partialsampling=0.7, maxdepth = 4)

# Train the model in-place on the dataset (there isn't a fit function without the in-place functionality)
DecisionTree.fit!(r_f_model, features, labels)

# Apply the trained model to the test features data set (here I haven't partitioned into training and test)
r_f_prediction = convert(Array{Int64,1}, DecisionTree.predict(r_f_model, features))

# Applying the model to the training set and looking at model stats
TrainingROC = roc(labels, r_f_prediction) #getting the stats around the model applied to the train set
#     p::T    # positive in ground-truth
#     n::T    # negative in ground-truth
#     tp::T   # correct positive prediction
#     tn::T   # correct negative prediction
#     fp::T   # (incorrect) positive prediction when ground-truth is negative
#     fn::T   # (incorrect) negative prediction when ground-truth is positive

I also read this question and did not find it really helpful.

Answer 1

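The task in binary classification is to assign a 0/1 (or true/false, red/blue) label to a new, unlabeled data point. Most classification algorithms are designed to output a continuous real value. This value is optimized to be higher for points with known or predicted label 1, and lower for points with known or predicted label 0. To turn this value into a 0/1 prediction, an additional threshold is used. Points with a value above the threshold are predicted to have label 1, and points below the threshold are predicted to have label 0.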
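As a minimal sketch of that thresholding step, assuming made-up scores (the `scores` vector and the `apply_threshold` helper are hypothetical, not part of DecisionTree.jl or MLBase):

```julia
# Hypothetical scores from some classifier: higher means "more likely label 1".
scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.95]

# Turn continuous scores into 0/1 labels: 1 if the score is above the threshold.
apply_threshold(scores, threshold) = Int.(scores .> threshold)

low  = apply_threshold(scores, 0.3)   # lower threshold: label 1 is predicted more often
high = apply_threshold(scores, 0.7)   # higher threshold: label 0 is predicted more often

println(low)
println(high)
```

The same scores give different 0/1 predictions depending only on where the threshold sits.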

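Why is this setup useful? Because sometimes mispredicting 0 instead of 1 is more costly, and then you can set the threshold lower, so that the algorithm predicts 1 more often.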

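In the extreme case where predicting 0 instead of 1 costs the application nothing, you can set the threshold to infinity, so that it always outputs 0 (which is clearly the best solution, since it incurs no cost).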

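The threshold trick cannot eliminate errors from the classifier: no classifier on a real-world problem is perfect or free from noise. What it can do is change the ratio between 0-when-really-1 errors and 1-when-really-0 errors in the final classification.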
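To make the trade-off concrete, here is a small sketch that counts both kinds of error at several thresholds. The labels, the scores, and the `error_counts` helper are all invented for illustration:

```julia
# Hypothetical true labels and classifier scores (higher score = more likely 1).
truth  = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.95]

# Count both kinds of error for a given threshold.
function error_counts(truth, scores, threshold)
    predicted = scores .> threshold                       # true means "predict 1"
    missed      = count((truth .== 1) .& .!(predicted))   # predicted 0, really 1
    false_alarm = count((truth .== 0) .& predicted)       # predicted 1, really 0
    return (missed = missed, false_alarm = false_alarm)
end

for t in (0.3, 0.5, 0.7, Inf)
    println("threshold = ", t, "  ", error_counts(truth, scores, t))
end
```

On this toy data, raising the threshold trades 1-when-really-0 errors for 0-when-really-1 errors, and at an infinite threshold only the second kind remains.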

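As you increase the threshold, more points are classified with the 0 label. Consider a graph with the fraction of points classified as 0 on the x-axis, and the fraction of points with a 0-when-really-1 error on the y-axis. For each value of the threshold, plot the point for the resulting classifier on this graph. Plotting a point for every threshold gives you a curve. This is (some variant of) a ROC curve, which summarizes the capabilities of the classifier. An often used metric for classification quality is the AUC, or area under the curve, of this graph, but in fact the whole curve can be of interest in an application.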
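The threshold sweep can be sketched in plain Julia as follows. This uses the conventional ROC axes (false positive rate on x, true positive rate on y) rather than the variant described above, and the data and the `roc_points` and `auc` helpers are hypothetical, written only for this illustration. For a real model, the scores would have to be some continuous output of the classifier rather than its hard 0/1 predictions.

```julia
# Hypothetical true labels and classifier scores (higher score = more likely 1).
truth  = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.40, 0.65, 0.80, 0.95]

# One (false positive rate, true positive rate) point per threshold.
function roc_points(truth, scores, thresholds)
    n_pos = count(truth .== 1)
    n_neg = count(truth .== 0)
    points = Tuple{Float64,Float64}[]
    for t in thresholds
        predicted = scores .> t
        tpr = count(predicted .& (truth .== 1)) / n_pos
        fpr = count(predicted .& (truth .== 0)) / n_neg
        push!(points, (fpr, tpr))
    end
    return points
end

# Area under the curve by the trapezoidal rule; points must be ordered by FPR.
function auc(points)
    area = 0.0
    for i in 2:length(points)
        x0, y0 = points[i-1]
        x1, y1 = points[i]
        area += (x1 - x0) * (y0 + y1) / 2
    end
    return area
end

# Sweep from a threshold above every score (nothing predicted 1)
# down to one below every score (everything predicted 1).
thresholds = vcat(Inf, sort(scores, rev = true), -Inf)
curve = roc_points(truth, scores, thresholds)

println(curve)
println("AUC = ", auc(curve))
```

On this toy data the curve runs from (0, 0) to (1, 1) and the area comes out to 8/9, about 0.889. Plotting the `curve` points with any plotting package gives the ROC curve itself.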

Summaries like this one appear in many texts on machine learning, which are a Google query away.

Hope this clarifies the role of the threshold and its relation to ROC curves.
