What happens to the coefficients when we switch the labels (0/1)?

Time: 2015-08-26 13:49:52

Tags: r regression

I am trying to see in practice what is explained here: what happens to the coefficients once labels are switched. However, I am not getting the expected result. Here is my attempt:

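I am using the natality public-use data example from "Practical Data Science with R". The outcome, atRisk, is a logical variable that classifies newborn babies by risk level, FALSE or TRUE.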

load(url("https://github.com/WinVector/zmPDSwR/tree/master/CDC/NatalRiskData.rData"))
train <- sdata[sdata$ORIGRANDGROUP<=5,]
test <- sdata[sdata$ORIGRANDGROUP>5,]
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
                 "URF_ECLAM")
y <- "atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications,  riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))
summary(model)

This gives the following coefficients:

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
PWGT                      0.003762   0.001487   2.530 0.011417 *  
UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .  
GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 ** 
DPLURALtwin               0.312319   0.241088   1.295 0.195163    
ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951    
ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187    
URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676    
URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029    
URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489

OK, now let's switch the labels in the atRisk variable:

sdata$atRisk  <- factor(sdata$atRisk)
levels(sdata$atRisk) <- c("TRUE", "FALSE")

After re-running the analysis above, I expected the signs of the reported coefficients to flip. Instead, I get exactly the same coefficients:

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
PWGT                      0.003762   0.001487   2.530 0.011417 *  
UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .  
GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 ** 
DPLURALtwin               0.312319   0.241088   1.295 0.195163    
ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951    
ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187    
URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676    
URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029    
URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489

What am I doing wrong here? Can you help?

1 answer:

Answer 0 (score: 0)

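This happens because you created train <- sdata[sdata$ORIGRANDGROUP<=5,] first and only afterwards changed sdata$atRisk <- factor(sdata$atRisk). Your model is fit on the train data set, which is a separate copy, so its atRisk column is not affected by the change to sdata.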
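To make the copy behaviour concrete, here is a minimal sketch on a made-up data frame. The names sdata_toy and train_toy and their values are hypothetical stand-ins for sdata and train, not part of the original data.

```r
# Toy stand-in for sdata (hypothetical values, for illustration only).
sdata_toy <- data.frame(
  atRisk = c(TRUE, FALSE, FALSE, TRUE),
  ORIGRANDGROUP = c(1, 3, 7, 9)
)

# Subsetting produces an independent copy of the selected rows.
train_toy <- sdata_toy[sdata_toy$ORIGRANDGROUP <= 5, ]

# Modify the original data frame after the split, as in the question.
sdata_toy$atRisk <- factor(sdata_toy$atRisk)
levels(sdata_toy$atRisk) <- c("TRUE", "FALSE")

class(sdata_toy$atRisk)  # "factor": the original was changed
class(train_toy$atRisk)  # "logical": the earlier subset was not
```

Any relabelling therefore has to be applied to train itself, or to sdata before train is created.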

Instead, you can negate the response in the formula:

y <- "!atRisk"
x <- c("PWGT", "UPREVIS", "CIG_REC", "GESTREC3", "DPLURAL", complications,  riskfactors)
fmla <- paste(y, paste(x, collapse="+"), sep="~")
model <- glm(fmla, data=train, family=binomial(link="logit"))


Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2641   0.1358   0.1511   0.1818   0.9732  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               4.412189   0.289352  15.249  < 2e-16 ***
PWGT                     -0.003762   0.001487  -2.530 0.011417 *  
UPREVIS                   0.063289   0.015252   4.150 3.33e-05 ***
CIG_RECTRUE              -0.313169   0.187230  -1.673 0.094398 .  
GESTREC3< 37 weeks       -1.545183   0.140795 -10.975  < 2e-16 ***
DPLURALtriplet or higher -1.394193   0.498866  -2.795 0.005194 ** 
DPLURALtwin              -0.312319   0.241088  -1.295 0.195163    
ULD_MECOTRUE             -0.818426   0.235798  -3.471 0.000519 ***
ULD_PRECIPTRUE           -0.191720   0.357680  -0.536 0.591951    
ULD_BREECHTRUE           -0.749237   0.178129  -4.206 2.60e-05 ***
URF_DIABTRUE              0.346467   0.287514   1.205 0.228187    
URF_CHYPERTRUE           -0.560025   0.389678  -1.437 0.150676    
URF_PHYPERTRUE           -0.161599   0.250003  -0.646 0.518029    
URF_ECLAMTRUE            -0.498064   0.776948  -0.641 0.521489    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2698.7  on 14211  degrees of freedom
Residual deviance: 2463.0  on 14198  degrees of freedom
AIC: 2491

Number of Fisher Scoring iterations: 7
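The sign flip follows from the logit link: if p is the probability of the original outcome, then log((1-p)/p) = -log(p/(1-p)), so modelling the complementary outcome negates every coefficient, including the intercept. The sketch below checks this on simulated data. The data frame d, its columns, and the true coefficients are made up for illustration and have nothing to do with the natality data. It also compares two ways of changing a factor response. As far as I can tell from the glm documentation, a factor response treats its first level as "failure", so reordering the levels flips the signs, while merely renaming them with levels<- (as in the question) would leave the fit unchanged even if it were applied to train.

```r
set.seed(1)

# Simulated data (hypothetical; not the natality data set).
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 - 0.5 * x2)) == 1
d  <- data.frame(y = y, x1 = x1, x2 = x2)

# Original model and the model on the negated response.
m_orig <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))
m_neg  <- glm(!y ~ x1 + x2, data = d, family = binomial(link = "logit"))

# Reordering levels: "TRUE" becomes the first level, i.e. the "failure".
d$y_flip <- factor(d$y, levels = c("TRUE", "FALSE"))
m_flip <- glm(y_flip ~ x1 + x2, data = d, family = binomial(link = "logit"))

# Renaming levels: the underlying order (and hence the model) is unchanged.
d$y_renamed <- factor(d$y)
levels(d$y_renamed) <- c("TRUE", "FALSE")
m_renamed <- glm(y_renamed ~ x1 + x2, data = d, family = binomial(link = "logit"))

# Compare coefficient vectors side by side.
cbind(orig = coef(m_orig), neg = coef(m_neg),
      flip = coef(m_flip), renamed = coef(m_renamed))

# Negated and reordered responses give the negated coefficients.
isTRUE(all.equal(coef(m_neg),  -coef(m_orig), tolerance = 1e-6))
isTRUE(all.equal(coef(m_flip), -coef(m_orig), tolerance = 1e-6))
# Renamed levels give the original coefficients.
isTRUE(all.equal(coef(m_renamed), coef(m_orig), tolerance = 1e-6))
```

Negating the response in the formula, as in the answer above, is the most direct option for a logical outcome. Reordering levels with factor(..., levels = ...) is the equivalent when the outcome is already a factor.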