Question

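I'm having some trouble running a logistic regression in R with glm. There are two ways to pass a binary response variable to glm for logistic regression. You can pass the data in serial format (i.e., one row per observation, with the response coded 0 or 1 and the independent variable taking whatever value it has), or you can pass it as a table with at least three columns: the first gives the number of trials, the second the number of successes, and the third is the independent variable.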

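When I use glm with the latter format (e.g., a data frame with three columns), I get the expected output, but when I supply the data in the former (serial) format, I don't get the answer I expect.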

Here is an example:

prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)

model <- glm(trials ~ prices, family=binomial(link="logit"))

> summary(model)

Call:
glm(formula = trials ~ prices, family = binomial)

Deviance Residuals: 
     1       2       3  
 1.332  -4.440   1.553  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.923863   0.241677   -7.96 1.71e-15 ***
prices       0.044995   0.002593   17.35  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 715.832  on 2  degrees of freedom
Residual deviance:  23.897  on 1  degrees of freedom
AIC: 49.228

Number of Fisher Scoring iterations: 4

In this case I get the values I expect, but with the serial data:

> head(atable)
  ordered sale_price
1       0     149.99
2       0     149.99
3       0     149.99
4       0     149.99
5       0     149.99
6       0     149.99
> summary(atable)
    ordered          sale_price    
 Min.   :0.00000   Min.   : 89.99  
 1st Qu.:0.00000   1st Qu.: 89.99  
 Median :0.00000   Median : 89.99  
 Mean   :0.07843   Mean   :105.87  
 3rd Qu.:0.00000   3rd Qu.: 99.99  
 Max.   :1.00000   Max.   :149.99 

> conv_model <- glm(ordered ~ sale_price, family=binomial(link="logit"), data=atable)
> summary(conv_model)

Call:
glm(formula = ordered ~ sale_price, family = binomial(link = "logit"), 
    data = atable)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4743  -0.4743  -0.4743  -0.1209   3.1376  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.549136   0.095341    5.76 8.43e-09 ***
sale_price  -0.019949   0.001002  -19.90  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11322  on 20591  degrees of freedom
Residual deviance: 10623  on 20590  degrees of freedom
AIC: 10627

Number of Fisher Scoring iterations: 7

Just to show that it is the same data:

> table(atable$ordered, atable$sale_price)

    89.99 99.99 149.99
  0 11907  2024   5046
  1  1369   215     31

The output I get is completely different, and I'm thoroughly confused. Can anyone help? I assume I'm doing something simple wrong.

Answer 1

I think your problem is that you are changing the definition of "success".

From ?glm:

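For binomial and quasibinomial families the response can also be specified as ... a two-column matrix with the columns giving the numbers of successes and failures.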

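So the first column is "success". In your code you use cbind(non_purchases, purchases), which makes non_purchases the "success" column. In your table, however, non-purchases are coded as 0, i.e. failure. With the code below, both formats give the same result: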

prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)
trials <- cbind(non_purchases, purchases)

dd = data.frame(
    price = c(rep(prices, non_purchases), rep(prices, purchases)),
    purchase = c(rep(0, sum(non_purchases)), rep(1, sum(purchases)))
)

coef(glm(purchase ~ price, data = dd, family = "binomial"))
# (Intercept)       price 
#  1.92386320 -0.04499477 

coef(glm(cbind(purchases, non_purchases) ~ prices, family = "binomial"))
# (Intercept)      prices 
#  1.92386320 -0.04499477
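
As a further check, here is a minimal sketch of why the question's grouped fit came out with the signs reversed. It reuses the counts from the question; the object names fit_purchase, fit_non_purchase, fit_prop and n are only illustrative. Swapping the two columns models the complementary event, so the coefficients should differ only in sign. The last fit uses the proportion-plus-weights form of the response, which glm also accepts for binomial families.

```r
prices <- c(89.99, 99.99, 149.99)
non_purchases <- c(11907, 2024, 5046)
purchases <- c(1369, 215, 31)

# Purchases in the first column: a purchase counts as "success",
# matching the 0/1 coding of the row-level data.
fit_purchase <- glm(cbind(purchases, non_purchases) ~ prices, family = binomial)

# Non-purchases in the first column: this is what the question's code fitted.
fit_non_purchase <- glm(cbind(non_purchases, purchases) ~ prices, family = binomial)

# The two fits model complementary probabilities, so with the logit link
# the coefficients should be equal in size and opposite in sign.
print(all.equal(coef(fit_purchase), -coef(fit_non_purchase), tolerance = 1e-6))

# Equivalent third form: observed proportion as the response,
# number of trials as prior weights.
n <- purchases + non_purchases
fit_prop <- glm(purchases / n ~ prices, family = binomial, weights = n)
print(all.equal(coef(fit_purchase), coef(fit_prop), tolerance = 1e-6))
```

Whichever form is used, it helps to decide up front which outcome counts as "success" and keep that choice consistent between the grouped and the row-level data.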
