Question

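The following two questions are related, but they are not duplicates of mine: the first has a solution specific to its data set, and the second concerns glm failing when start is supplied together with an offset.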

https://stackoverflow.com/questions/31342637/error-please-supply-starting-values
https://stackoverflow.com/questions/8212063/r-glm-starting-values-not-accepted-log-link

I have the following data set:

library(data.table)
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response = lapply(df$probs, function(i){
  rbinom(50, 1, i)  
})



dt <- data.table(df)

dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

so that dt is:

> dt
     names     probs response 
  1:     1 0.0000000        0 
  2:     1 0.0000000        0 
  3:     1 0.0000000        0 
  4:     1 0.0000000        0 
  5:     1 0.0000000        0 
 ---                                     
496:    10 0.9446753        0 
497:    10 0.9446753        1 
498:    10 0.9446753        1 
499:    10 0.9446753        1 
500:    10 0.9446753        1

I am trying to fit a logistic regression model with an identity link, using:

lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'))

This gives an error:

Error: no valid set of coefficients has been found: please supply starting values

I tried to fix it by supplying a start argument, but that produces a different error.

> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0, 1))
Error: cannot find valid starting values: please specify some

At this point these errors make no sense to me, and I don't know what to do.

EDIT: @iraserd has shed some more light on this problem. Using start = c(0.5, 0.5), I get:

> lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start = c(0.5, 0.5))
There were 25 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: step size truncated: out of bounds
2: step size truncated: out of bounds
3: step size truncated: out of bounds
4: step size truncated: out of bounds
5: step size truncated: out of bounds
6: step size truncated: out of bounds
7: step size truncated: out of bounds
8: step size truncated: out of bounds
9: step size truncated: out of bounds
10: step size truncated: out of bounds
11: step size truncated: out of bounds
12: step size truncated: out of bounds
13: step size truncated: out of bounds
14: step size truncated: out of bounds
15: step size truncated: out of bounds
16: step size truncated: out of bounds
17: step size truncated: out of bounds
18: step size truncated: out of bounds
19: step size truncated: out of bounds
20: step size truncated: out of bounds
21: step size truncated: out of bounds
22: step size truncated: out of bounds
23: step size truncated: out of bounds
24: step size truncated: out of bounds
25: glm.fit: algorithm stopped at boundary value

and

> summary(lm2)

Call:
glm(formula = response ~ probs, family = binomial(link = "identity"), 
    data = dt, start = c(0.5, 0.5))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4023  -0.6710   0.3389   0.4641   1.7897  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) 1.486e-08  1.752e-06   0.008    0.993    
probs       9.995e-01  2.068e-03 483.372   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 69312  on 49999  degrees of freedom
Residual deviance: 35984  on 49998  degrees of freedom
AIC: 35988

Number of Fisher Scoring iterations: 24

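I strongly suspect this has to do with the fact that some of the responses were generated with a true probability of zero, which causes problems as the coefficient of probs approaches 1.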

Answer 1

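There are two places in the glm.fit code that terminate with the error "no valid set of coefficients has been found: please supply starting values". One is reached when a computed deviance becomes infinite; the other appears to be reached when invalid etastart and mustart options are supplied.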

See also this answer, which goes into more detail: How do I use a custom link function in glm?

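Since you are regressing on probabilities (values between 0 and 1), my guess is that you need to specify starting values that are not equal to 0 or 1: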

lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='identity'), start=c(0.5,0.5))

This throws a lot of warnings and stops at a boundary value, probably because of the artificial nature of the example.
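As a rough illustration of why the two start vectors behave differently, the sketch below applies the family's own validmu check to the means implied by each one. It assumes that glm.fit computes the initial means as linkinv(X %*% start) and rejects them when validmu returns FALSE; implied_mu is a hypothetical helper written for this example, not part of R.

```r
# Rebuild the covariate from the question (same seed, same construction).
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))

fam <- binomial(link = "identity")
X <- cbind(1, probs)  # design matrix for response ~ probs

# Hypothetical helper: means implied by a start vector, mu = linkinv(X %*% start).
implied_mu <- function(start) fam$linkinv(drop(X %*% start))

# start = c(0, 1) gives mu = probs, which contains exact zeros.
ok_01 <- fam$validmu(implied_mu(c(0, 1)))

# start = c(0.5, 0.5) gives mu = 0.5 + 0.5 * probs, strictly inside (0, 1).
ok_55 <- fam$validmu(implied_mu(c(0.5, 0.5)))

print(c(start_0_1 = ok_01, start_0.5_0.5 = ok_55))
```

Under that assumption, start = c(0, 1) is rejected before the first iteration because two of the implied means are exactly 0, while start = c(0.5, 0.5) passes the initial check and only runs into trouble later.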

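Changing the call to use a logit link (since, according to your question, you want a logistic regression) eliminates the warnings (and no start argument is needed):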

    lm2 <- glm(data = dt, formula = response ~ probs, family = binomial(link='logit'))

Answer 2

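iraserd points out that the error can come from either of two places in the glm.fit source. Both are in the main loop of the iteratively reweighted least squares (IRLS) method.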

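The first check fails if the deviance is not finite. In your case (and for every link function with the binomial family), the deviance residuals come from binomial("identity")$dev.resids, which calls a C function. If the mean mu is outside (0,1) (i.e. outside the valid range), that function can in some cases evaluate the log of a negative value.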
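A small sketch of that first check, calling the family's dev.resids function directly. The response values, means, and weights below are made up for illustration.

```r
fam <- binomial(link = "identity")

# Means inside (0, 1): the deviance residuals are finite.
d_valid <- fam$dev.resids(y = c(1, 0), mu = c(0.5, 0.5), wt = c(1, 1))

# Means outside (0, 1): a log of a negative ratio is taken.
d_invalid <- fam$dev.resids(y = c(1, 0), mu = c(-0.1, 1.2), wt = c(1, 1))

# glm.fit sums these values to get the deviance and tests it with is.finite().
print(is.finite(sum(d_valid)))
print(is.finite(sum(d_invalid)))
```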

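We reach the second branch if any linear predictor eta or mean mu is invalid, and it stops with this error when we are in the first iteration, where coefold is NULL: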

if (!(valideta(eta) && validmu(mu))) {
  if(is.null(coefold))
    stop("no valid set of coefficients has been found: please supply starting values", call. = FALSE)
  # ...
}

Look at valideta and validmu for the family you are using:

with(binomial("identity"), {
    print(valideta)
    print(validmu)
})
#R> function (eta) 
#R> TRUE
#R> <environment: namespace:stats>
#R> function (mu) 
#R> all(is.finite(mu)) && all(mu > 0 & mu < 1)
#R> <bytecode: 0x55de9ffd4448>
#R> <environment: 0x55dea8ee2418>

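This makes sense, since the probabilities (the means) must lie in (0,1). We can therefore conclude that some of the means must fall outside (0,1) at some point during the iteratively reweighted least squares.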
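To see how a single update can move the means, here is a hand-written sketch of one Fisher scoring step for the identity link. It is not glm.fit's actual code (there is no step halving or convergence test), and the data are only generated in the spirit of the question, so whether the updated means leave (0,1) depends on the simulated responses.

```r
# Data in the spirit of the question: 10 groups of 50 Bernoulli draws.
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))
x <- rep(probs, each = 50)
y <- rbinom(length(x), 1, x)

fam <- binomial(link = "identity")
X <- cbind(1, x)

# Current coefficients, linear predictor and means.
beta <- c(0.5, 0.5)
eta <- drop(X %*% beta)
mu <- fam$linkinv(eta)

# Working response and working weights for one scoring step.
mu_eta <- fam$mu.eta(eta)            # d mu / d eta, equal to 1 for the identity link
z <- eta + (y - mu) / mu_eta
w <- mu_eta^2 / fam$variance(mu)

# Weighted least squares gives the updated coefficients and means.
beta_new <- lm.wfit(X, z, w)$coefficients
mu_new <- fam$linkinv(drop(X %*% beta_new))

print(beta_new)
print(range(mu_new))
print(fam$validmu(mu_new))  # FALSE means the step left the valid range
```

Nothing in the weighted least squares update constrains X %*% beta_new to (0,1), which is why glm.fit has to check validmu after every step.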

The link function you are using does not guarantee that the means stay within (0,1), since the inverse link function is

binomial("identity")$linkinv
#R> function (eta) 
#R> eta
#R> <environment: namespace:stats>

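This is your problem. There is no guarantee, and no check in glm, that ensures everything stays valid. Some link functions, however, always satisfy this constraint. Specifying starting values may keep the iteratively reweighted least squares away from the region where the means are invalid.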
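For comparison, a short sketch that pushes the same linear predictor values through both inverse links. The eta values are arbitrary and inside is a hypothetical helper for this example.

```r
# Arbitrary linear predictor values, some outside [0, 1].
eta <- c(-0.2, 0, 0.5, 1, 1.3)

mu_identity <- binomial("identity")$linkinv(eta)  # returns eta unchanged
mu_logit <- binomial("logit")$linkinv(eta)        # 1 / (1 + exp(-eta))

# Hypothetical helper: is each mean strictly inside (0, 1)?
inside <- function(mu) mu > 0 & mu < 1

print(rbind(eta, mu_identity, mu_logit))
print(inside(mu_identity))
print(inside(mu_logit))
```

With the logit link every finite eta of moderate size maps into (0,1), so validmu is satisfied whatever coefficients an iteration produces; with the identity link the means are only valid when the linear predictor itself stays in range.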

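"I strongly suspect this has to do with the fact that some of the responses were generated with a true probability of zero, which causes problems as the coefficient of probs approaches 1."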

Yes, that is exactly the problem. Replace your example with the following:

library(data.table)
df <- data.frame(names = factor(1:10))
set.seed(0)
df$probs <- c(0, 0, runif(8, 0, 1))
df$response = lapply(df$probs, function(i){
    rbinom(50, 1, i)  
})

dt <- data.table(df)
dt <- dt[, list(response = unlist(response)), by = c('names', 'probs')]

tmp <- dt$probs
tmp <- pmin(pmax(tmp, .Machine$double.eps), 1 - .Machine$double.eps)
dt$probs_logit <- log(tmp / (1 - tmp))
fit <- glm(data = dt, formula = response ~ probs_logit - 1, family = binomial("logit"))
#R> Warning message:
#R> glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(fit)
#R> 
#R> Call:
#R> glm(formula = response ~ probs_logit - 1, family = binomial("logit"), 
#R>     data = dt)
#R> 
#R> Deviance Residuals: 
#R>     Min       1Q   Median       3Q      Max  
#R> -2.4320  -0.6616   0.0000   0.4519   1.8038  
#R> 
#R> Coefficients:
#R>             Estimate Std. Error z value Pr(>|z|)    
#R> probs_logit  1.02336    0.09468   10.81   <2e-16 ***
#R> ---
#R> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#R> 
#R> (Dispersion parameter for binomial family taken to be 1)
#R> 
#R>     Null deviance: 693.15  on 500  degrees of freedom
#R> Residual deviance: 355.18  on 499  degrees of freedom
#R> AIC: 357.18
#R> 
#R> Number of Fisher Scoring iterations: 8
#R>

This gives you a warning, but it lets you fit an almost correct model after truncating and transforming the probabilities.
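As a side note on the transformation itself, the sketch below checks two things under the same truncation: that log(tmp / (1 - tmp)) matches R's built-in qlogis, and that a logit model with no intercept and a coefficient of exactly 1 on probs_logit would reproduce the truncated probabilities. The coefficient of 1 is an idealized value used for illustration, not the fitted estimate.

```r
set.seed(0)
probs <- c(0, 0, runif(8, 0, 1))

# Truncate away from exact 0 and 1, then transform to the logit scale.
eps <- .Machine$double.eps
tmp <- pmin(pmax(probs, eps), 1 - eps)
probs_logit <- log(tmp / (1 - tmp))

# The hand-written logit agrees with qlogis().
same <- all.equal(probs_logit, qlogis(tmp))

# With coefficient 1 and no intercept, the model mean is plogis(probs_logit).
back <- plogis(1 * probs_logit)

print(same)
print(max(abs(back - tmp)))
print(probs_logit[1:2])  # large negative values from the truncated zeros
```

This is why an estimate close to 1 for probs_logit in the summary above corresponds to fitted probabilities close to probs, and why the truncated zeros trigger the "fitted probabilities numerically 0 or 1" warning.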

What do these R glm error messages mean: "Error: no valid set of coefficients has been found: please supply starting values"
