Question

我有一些预测变量和二进制目标的数据。例如：

df <- data.frame(a=sort(sample(1:100,30)), b= sort(sample(1:100,30)), 
                 target=c(rep(0,11),rep(1,4),rep(0,4),rep(1,11)))

我使用glm()

训练了一个逻辑回归模型

model1 <- glm(formula= target ~ a + b, data=df, family=binomial)

现在我试图预测输出（例如，相同的数据就足够了）

predict(model1, newdata=df, type="response")

这会生成概率数的向量。但我想预测实际的课程。我可以在概率数字上使用round（），但这假设低于0.5的任何东西都是类＆＃39; 0＆＃39;，以上任何东西都是class＆＃39; 1＆＃39;。这是正确的假设吗？即使每个阶级的人口可能不相等（或接近相等）？或者有没有办法估算这个门槛？

Answer 1

在glm模型中使用的最佳阈值（或截止）点是最大化特异性和灵敏度的点。此阈值点可能不会在您的模型中给出最高预测，但它不会偏向正面或负面。 ROCR包中包含可以帮助您执行此操作的功能。检查此包中的performance()功能。它将为您提供您正在寻找的东西。这是您期望获得的图片：

enter image description here

在找到截止点后，我通常自己编写一个函数来查找预测值高于截止值的数据点数，并将其与它们所属的组匹配。

Answer 2

用于确定良好模型参数的黄金标准，包括“我应该为逻辑回归设置的阈值”，是交叉验证。

一般的想法是保留训练集的一个或多个部分，并选择最大化此保留集上正确分类数量的阈值，但Wikipedia可以为您提供更多详细信息。< / p>

Answer 3

围绕尝试复制第一个图形的工具。给定一个predictions <- prediction(pred,labels)对象，然后：

baseR方法

plot(unlist(performance(predictions, "sens")@x.values), unlist(performance(predictions, "sens")@y.values), 
     type="l", lwd=2, ylab="Specificity", xlab="Cutoff")
par(new=TRUE)
plot(unlist(performance(predictions, "spec")@x.values), unlist(performance(predictions, "spec")@y.values), 
     type="l", lwd=2, col='red', ylab="", xlab="")
axis(4, at=seq(0,1,0.2),labels=z)
mtext("Specificity",side=4, padj=-2, col='red')

ggplot2方法

sens <- data.frame(x=unlist(performance(predictions, "sens")@x.values), 
                   y=unlist(performance(predictions, "sens")@y.values))
spec <- data.frame(x=unlist(performance(predictions, "spec")@x.values), 
                   y=unlist(performance(predictions, "spec")@y.values))

sens %>% ggplot(aes(x,y)) + 
  geom_line() + 
  geom_line(data=spec, aes(x,y,col="red")) +
  scale_y_continuous(sec.axis = sec_axis(~., name = "Specificity")) +
  labs(x='Cutoff', y="Sensitivity") +
  theme(axis.title.y.right = element_text(colour = "red"), legend.position="none")

Answer 4

PresenceAbsence::optimal.thresholds包的函数PresenceAbsence中实现了12种方法。

Freeman，E. A.和Moisen，G. G.（2008）也对此进行了介绍。根据预测的患病率和kappa对二元分类阈值标准的性能进行比较。生态模型，217（1-2），48-58。

Answer 5

要以编程方式获得具有最接近的灵敏度和特异性值（即，上图中的交叉点）的数据阈值，您可以将此代码变得非常接近：

>>> from collections import defaultdict
>>> result = defaultdict(dict)
>>> for item in list_dict :
...     result[item[0]].update(item[1])
... 
>>> result
defaultdict(<type 'dict'>, {'key2': {'subkey2': 4, 'subkey1': 8}, 'key1': {'subkey5': 5, 'subkey2': 2, 'sybkey5': 10, 'subkey1': 0}})
>>>

Answer 6

您可以尝试以下方法：

perfspec <- performance(prediction.obj = pred, measure="spec", x.measure="cutoff")

plot(perfspec)

par(new=TRUE)

perfsens <- performance(prediction.obj = pred, measure="sens", x.measure="cutoff")

plot(perfsens)

R中glm逻辑回归模型的确定阈值

6 个答案: