How can I use post-stratification output to affect the variables in a predictive model in R?

Asked: 2014-04-20 14:19:39

Tags: r logistic-regression

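My current dataset oversamples women to the point that they make up about 74% of the total sample, when the split should be 50/50. How can I use my post-stratification output to affect my (logistic regression) predictive model?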

Here is what I did to get the new mean and coefficients for Support after adjusting the number of women surveyed:

> library(foreign)
> library(survey)
> 
> mydata <- read.csv("~/Desktop/R/mydata.csv")
> 
> #Enter Actual Population Size
> mydata$fpc <- 1200
> 
> #Enter ID Column Name
> id <- mydata$My.ID
> 
> #Enter Column to Post-Stratify
> type <- mydata$Male
> 
> #Enter Column Variables
> x1 <- 0
> y1 <- 1
> 
> #Enter Corresponding Frequencies
> x2 <- 600
> y2 <- 600
> 
> #Enter the Variable of Interest
> mydata$interest <- mydata$Support
> 
> preliminary.design <- svydesign(id = ~1, data = mydata, fpc = ~fpc)
> 
> ps.weights <- data.frame(type = c(x1,y1), Freq = c(x2, y2))
> 
> mydesign <- postStratify(preliminary.design, ~type, ps.weights)
> 
> #Print Original Mean of Variable of Interest
> mean(mydata$Support)
[1] 0.6666666667
> 
> #Total Actual Population Size
> sum(ps.weights$Freq)
[1] 1200
> 
> #Unweighted Observations Where the Variable of Interest is Not Missing
> unwtd.count(~interest, mydesign)
       counts SE
counts    411  0
> 
> #Print the Post-Stratified Mean and SE of the Variable
> svymean(~interest, mydesign)
               mean      SE
interest 0.71077946 0.01935
> 
> #Print the Weighted Total and SE of the Variable
> svytotal(~interest, mydesign)
             total       SE
interest 852.93535 23.21552
> 
> #Print the Mean and SE of the Interest Variable, by Type
> svyby(~interest, ~type, mydesign, svymean)
  type     interest            se
0    0 0.6196721311 0.02256768435
1    1 0.8018867925 0.03142947839
> 
> mysvyby <- svyby(~interest, ~type, mydesign, svytotal)
> 
> #Print the Coefficients of each Type
> coef(mysvyby)
          0           1 
371.8032787 481.1320755 
> 
> #Print the Standard Error of each Type
> SE(mysvyby)
[1] 13.54061061 18.85768704
> 
> #Print Confidence Intervals for the Coefficient Estimates
> confint(mysvyby)
        2.5 %      97.5 %
0 345.2641696 398.3423878
1 444.1716880 518.0924629
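As a sanity check on the transcript above, the post-stratified figures can be reproduced by hand from the by-type means. This is only a sketch: the numbers are copied from the svyby() output and from ps.weights, and the variable names (mean_by_type, pop_by_type, ps_mean, ps_totals) are made up for illustration.

```r
# By-type means copied from the svyby() output above (0 and 1 are the levels of type)
mean_by_type <- c(`0` = 0.6196721311, `1` = 0.8018867925)

# Target population counts, as entered in ps.weights
pop_by_type <- c(`0` = 600, `1` = 600)

# Post-stratified mean: average of the stratum means, weighted by population share
ps_mean <- sum(mean_by_type * pop_by_type) / sum(pop_by_type)

# Post-stratified totals: stratum mean times stratum population count
ps_totals <- mean_by_type * pop_by_type

ps_mean         # compare with svymean()
ps_totals       # compare with coef(mysvyby)
sum(ps_totals)  # compare with svytotal()
```

With equal population counts, the post-stratified mean is just the simple average of the two group means, which is why it moves from the raw 0.667 toward the higher mean of the under-represented group.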

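All of the output above looks correct, but I can't figure out how to use it to affect the output of a logistic regression model. Here is the code without any post-stratification applied: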

> mydata <- read.csv("~/Desktop/R/mydata.csv")
> 
> attach(mydata) 
> 
> # Define variables 
> 
> Y <- cbind(Support)
> X <- cbind(Black, vote, Male) 
> 
> # Descriptive statistics 
> 
> summary(Y) 
    Support         
 Min.   :0.0000000  
 1st Qu.:0.0000000  
 Median :1.0000000  
 Mean   :0.6666667  
 3rd Qu.:1.0000000  
 Max.   :1.0000000  
> 
> summary(X) 
     Black            vote                   Male          
 Min.   :0.0000000   Min.   : 0.8100   Min.   :0.0000000  
 1st Qu.:0.0000000   1st Qu.:24.0350   1st Qu.:0.0000000  
 Median :0.0000000   Median :47.6300   Median :0.0000000  
 Mean   :0.4355231   Mean   :48.0447   Mean   :0.2579075  
 3rd Qu.:1.0000000   3rd Qu.:72.1300   3rd Qu.:1.0000000  
 Max.   :1.0000000   Max.   :91.3200   Max.   :1.0000000  
> 
> table(Y) 
Y
  0   1 
137 274 
> 
> table(Y)/sum(table(Y)) 
Y
           0            1 
0.3333333333 0.6666666667 
> 
> 
> # Logit model coefficients 
> 
> logit<- glm(Y ~ X, family=binomial (link = "logit")) 
> 
> summary(logit) 

Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-2.1658288  -1.1277933   0.5904486   0.9190314   1.3256407  

Coefficients:
                  Estimate   Std. Error  z value   Pr(>|z|)    
(Intercept)    0.462496014  0.265017604  1.74515  0.0809584 .  
XBlack         1.329633506  0.244053422  5.44812 5.0904e-08 ***
Xvote         -0.008839950  0.004262016 -2.07412  0.0380678 *  
XMale          0.781144950  0.283218355  2.75810  0.0058138 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 523.21465  on 410  degrees of freedom
Residual deviance: 469.48706  on 407  degrees of freedom
AIC: 477.48706

Number of Fisher Scoring iterations: 4

> 
> # Logit model odds ratios 
> 
> exp(logit$coefficients) 
 (Intercept)       XBlack        Xvote        XMale 
1.5880327947 3.7796579101 0.9911990073 2.1839713716 

Is there a way to combine these two scripts in R to update my logit model so that, when I predict, gender is treated as 50/50 rather than 74% female / 26% male?

Thanks!

1 Answer:

Answer 0 (score: 0)

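Since you want to generate predictions from the model, here is one possible solution: (1) fit the logistic regression model to the data you have (i.e., 74% female and 26% male); (2) extract the predicted probabilities from your model with the gender variable set to 0.5. See ?predict.glm for details.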
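A minimal sketch of those two steps might look like the following. Everything here is an assumption for illustration: the data frame is simulated (the real mydata.csv is not available), with column names borrowed from the question, and the model is written with named columns and a data argument rather than cbind() matrices so that predict() can take a newdata frame. Because the logit link is nonlinear, plugging in Male = 0.5 is not identical to averaging the female and male predictions 50/50, so both variants are shown.

```r
# Simulated stand-in for mydata (hypothetical values; column names follow the question)
set.seed(1)
n <- 411
mydata <- data.frame(
  Black = rbinom(n, 1, 0.44),
  vote  = runif(n, 0, 90),
  Male  = rbinom(n, 1, 0.26)
)
lp <- 0.46 + 1.33 * mydata$Black - 0.009 * mydata$vote + 0.78 * mydata$Male
mydata$Support <- rbinom(n, 1, plogis(lp))

# (1) Fit the model on the sample as it is (roughly 74% female / 26% male)
fit <- glm(Support ~ Black + vote + Male, data = mydata,
           family = binomial(link = "logit"))

# (2a) Plug-in prediction: every row gets Male = 0.5
p_half <- predict(fit, newdata = transform(mydata, Male = 0.5),
                  type = "response")

# (2b) Alternative: predict each row as female and as male, then average 50/50
p_female <- predict(fit, newdata = transform(mydata, Male = 0), type = "response")
p_male   <- predict(fit, newdata = transform(mydata, Male = 1), type = "response")
p_5050   <- 0.5 * p_female + 0.5 * p_male

# Average predicted support under each variant
mean(p_half)
mean(p_5050)
```

Variant (2b) corresponds more directly to a population that is half female and half male, since it averages probabilities rather than linear predictors; in practice the two numbers are usually close.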