Suppose I have 200 subjects, 100 in group A and 100 in group B, and for each one I measure some continuous parameter.
library(ggplot2)
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep('A', 100), rep('B', 100))
data <- data.frame(value, group)
# Overlaid histograms of the two groups
ggplot(data = data, aes(x = value)) +
  geom_histogram(aes(fill = group), bins = 30, position = "identity", alpha = 0.5)
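I want to determine the value (threshold? breakpoint?) that maximizes separation and minimizes misclassification between the groups. Does such a function exist in R? I tried searching along the lines of "r breakpoint maximal separation between groups" and "r threshold minimize misclassification", but my google-fu seems to be off today.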
EDIT
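In response to @Thomas's comment, I tried fitting the data with logistic regression and then solving for the threshold, but I haven't gotten very far.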
lr <- glm(group~value)
coef(lr)
# (Intercept) value
# 1.1857435 -0.0911762
So B0 = 1.1857435, B1 = -0.0911762
From Wikipedia, I see that F(x) = 1 / (1 + e^-(B0 + B1*x)), and solving for x:
x = (ln(F(x) / (1 - F(x))) - B0) / B1
But trying this in R, I get an obviously incorrect answer:
(log(0.5/(1 - 0.5)) - 1.1857435)/-0.0911762 # 13.00497
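For reference, the inversion used above can be written out step by step. This is only a restatement of the formula from the post, with p standing for F(x) and B_0, B_1 for the intercept and slope. At p = 0.5 the log-odds term is zero, so the threshold reduces to -B_0/B_1, which is what the R line computes (-1.1857435 / -0.0911762 ≈ 13.005). The arithmetic therefore matches the formula.

```latex
\begin{aligned}
p &= \frac{1}{1 + e^{-(B_0 + B_1 x)}} \\
\frac{p}{1 - p} &= e^{B_0 + B_1 x} \\
\ln\frac{p}{1 - p} &= B_0 + B_1 x \\
x &= \frac{\ln\frac{p}{1 - p} - B_0}{B_1},
\qquad x\Big|_{p = 0.5} = -\frac{B_0}{B_1}
\end{aligned}
```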
Answer 0 (score: 2)
A simple approach is to write a function that computes the accuracy for a given threshold:
accuracy = Vectorize(function(th) mean(c("A", "B")[(value > th) + 1] == group))
Then use optimize:
optimize(accuracy, c(min(value), max(value)), maximum=TRUE)
# $maximum
# [1] 8.050888
#
# $objective
# [1] 0.86
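One caveat worth illustrating: accuracy as a function of the threshold is a step function, and optimize is designed for smooth, unimodal objectives, so it may settle on a value that is good but not guaranteed to be the best. A minimal sketch of an exhaustive alternative follows, assuming the same simulated data as in the question. The names candidates, acc and best are illustrative and not part of the original answer.

```r
# Recreate the data from the question
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep("A", 100), rep("B", 100))

# Classify as "B" when value > th, otherwise "A"
accuracy <- function(th) mean(ifelse(value > th, "B", "A") == group)

# Accuracy only changes when th crosses an observed value, so the midpoints
# between consecutive sorted values cover every split of the data into two
# non-empty parts
v <- sort(unique(value))
candidates <- (head(v, -1) + tail(v, -1)) / 2

acc <- vapply(candidates, accuracy, numeric(1))
best <- candidates[which.max(acc)]  # first of possibly several tied maxima

best
max(acc)
```

With 200 observations this evaluates only 199 candidates, so the brute-force search is cheap. Several thresholds can tie for the best accuracy; which.max returns the first one.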
Answer 1 (score: 1)
Thanks to help from @Thomas and @BenBolker, I got the answer I needed.
Summary
Code
# Include libraries
library(ggplot2)
library(MASS)
# Set seed
set.seed(100)
# Put together some dummy data
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
data <- data.frame(value, group)
# Plot the distribution -- visually
# The answer appears to be between 7 and 8
ggplot(data = data, aes(x = value)) +
  geom_histogram(aes(fill = factor(group)), bins = 30, position = "identity", alpha = 0.5)
# Fit a glm model, specifying the binomial distribution
my.glm <- glm(group~value, data = data, family = binomial)
b0 <- coef(my.glm)[[1]]
b1 <- coef(my.glm)[[2]]
# See what the probability function looks like
lr <- function(x, b0, b1) {
  prob <- 1 / (1 + exp(-1 * (b0 + b1 * x)))
  return(prob)
}
# The line appears to cross 0.5 just above 7.5
x <- 0:12
y <- lr(x, b0, b1)
lr.val <- data.frame(x, y)
ggplot(lr.val, aes(x = x, y = y)) +
  geom_line()
# The inverse of this function computes the threshold for a given probability
inv.lr <- function(p, b0, b1) {
  x <- (log(p / (1 - p)) - b0) / b1
  return(x)
}
# With the coefficients from the fitted model, we get 7.686814
inv.lr(0.5, b0, b1)
# Or, feeding the glm model into dose.p from MASS, we get the same answer
dose.p(my.glm, p = 0.5)
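As a quick sanity check on the result, here is a minimal sketch, assuming the same simulated data and model as above. At p = 0.5 the log-odds are zero, so the threshold is simply -b0/b1, and classifying by value > threshold should agree with classifying by a fitted probability above 0.5. The names fit, th, acc_glm and acc_th are illustrative. For two equal-sized groups drawn from normal distributions with the same standard deviation, the ideal cut point is the midpoint of the means, here (5 + 10) / 2 = 7.5, so an estimate near that value is plausible.

```r
# Recreate the data and the logistic fit
set.seed(100)
value <- c(rnorm(100, mean = 5, sd = 3), rnorm(100, mean = 10, sd = 3))
group <- c(rep(0, 100), rep(1, 100))
fit <- glm(group ~ value, family = binomial)

b <- unname(coef(fit))
# log(0.5 / (1 - 0.5)) is 0, so the p = 0.5 threshold reduces to -b0 / b1
th <- -b[1] / b[2]

# Accuracy when classifying by fitted probability > 0.5 ...
acc_glm <- mean(as.numeric(predict(fit, type = "response") > 0.5) == group)
# ... and when classifying directly by value > threshold
acc_th <- mean(as.numeric(value > th) == group)

th
c(acc_glm = acc_glm, acc_th = acc_th)
```

Note that this threshold comes from the fitted logistic model rather than from maximizing accuracy on the sample directly, so it need not coincide with the value found by a direct accuracy search.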
Thanks everyone for the help!