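My current dataset oversampled women so heavily that they make up roughly 74% of the total sample of 411 respondents - the split should be 50/50. How can I use my post-stratification output to influence my (logistic regression) prediction model?
Here is what I did to get the new mean and coefficients for Support after adjusting the number of women surveyed: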
> library(foreign)
> library(survey)
>
> mydata <- read.csv("~/Desktop/R/mydata.csv")
>
> #Enter Actual Population Size
> mydata$fpc <- 1200
>
> #Enter ID Column Name
> id <- mydata$My.ID
>
> #Enter Column to Post-Stratify
> type <- mydata$Male
>
> #Enter Column Variables
> x1 <- 0
> y1 <- 1
>
> #Enter Corresponding Frequencies
> x2 <- 600
> y2 <- 600
>
> #Enter the Variable of Interest
> mydata$interest <- mydata$Support
>
> preliminary.design <- svydesign(id = ~1, data = mydata, fpc = ~fpc)
>
> ps.weights <- data.frame(type = c(x1,y1), Freq = c(x2, y2))
>
> mydesign <- postStratify(preliminary.design, ~type, ps.weights)
>
> #Print Original Mean of Variable of Interest
> mean(mydata$Support)
[1] 0.6666666667
>
> #Total Actual Population Size
> sum(ps.weights$Freq)
[1] 1200
>
> #Unweighted Observations Where the Variable of Interest is Not Missing
> unwtd.count(~interest, mydesign)
counts SE
counts 411 0
>
> #Print the Post-Stratified Mean and SE of the Variable
> svymean(~interest, mydesign)
mean SE
interest 0.71077946 0.01935
>
> #Print the Weighted Total and SE of the Variable
> svytotal(~interest, mydesign)
total SE
interest 852.93535 23.21552
>
> #Print the Mean and SE of the Interest Variable, by Type
> svyby(~interest, ~type, mydesign, svymean)
type interest se
0 0 0.6196721311 0.02256768435
1 1 0.8018867925 0.03142947839
>
> mysvyby <- svyby(~interest, ~type, mydesign, svytotal)
>
> #Print the Coefficients of each Type
> coef(mysvyby)
0 1
371.8032787 481.1320755
>
> #Print the Standard Error of each Type
> SE(mysvyby)
[1] 13.54061061 18.85768704
>
> #Print Confidence Intervals for the Coefficient Estimates
> confint(mysvyby)
2.5 % 97.5 %
0 345.2641696 398.3423878
1 444.1716880 518.0924629
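All of the output above appears to be correct - but I can't figure out how to use these results to influence the output of the logistic regression model. Here is the code without any post-stratification applied: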
> mydata <- read.csv("~/Desktop/R/mydata.csv")
>
> attach(mydata)
>
> # Define variables
>
> Y <- cbind(Support)
> X <- cbind(Black, vote, Male)
>
> # Descriptive statistics
>
> summary(Y)
Support
Min. :0.0000000
1st Qu.:0.0000000
Median :1.0000000
Mean :0.6666667
3rd Qu.:1.0000000
Max. :1.0000000
>
> summary(X)
Black vote Male
Min. :0.0000000 Min. : 0.8100 Min. :0.0000000
1st Qu.:0.0000000 1st Qu.:24.0350 1st Qu.:0.0000000
Median :0.0000000 Median :47.6300 Median :0.0000000
Mean :0.4355231 Mean :48.0447 Mean :0.2579075
3rd Qu.:1.0000000 3rd Qu.:72.1300 3rd Qu.:1.0000000
Max. :1.0000000 Max. :91.3200 Max. :1.0000000
>
> table(Y)
Y
0 1
137 274
>
> table(Y)/sum(table(Y))
Y
0 1
0.3333333333 0.6666666667
>
>
> # Logit model coefficients
>
> logit<- glm(Y ~ X, family=binomial (link = "logit"))
>
> summary(logit)
Call:
glm(formula = Y ~ X, family = binomial(link = "logit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1658288 -1.1277933 0.5904486 0.9190314 1.3256407
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.462496014 0.265017604 1.74515 0.0809584 .
XBlack 1.329633506 0.244053422 5.44812 5.0904e-08 ***
Xvote -0.008839950 0.004262016 -2.07412 0.0380678 *
XMale 0.781144950 0.283218355 2.75810 0.0058138 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 523.21465 on 410 degrees of freedom
Residual deviance: 469.48706 on 407 degrees of freedom
AIC: 477.48706
Number of Fisher Scoring iterations: 4
>
> # Logit model odds ratios
>
> exp(logit$coefficients)
(Intercept) XBlack Xvote XMale
1.5880327947 3.7796579101 0.9911990073 2.1839713716
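Is there a way to combine these two scripts in R to update my logit model, so that gender is treated as 50/50 rather than 74% female / 26% male when I predict?
Thanks!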
Answer 0 (score: 0)
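Since you want to create predictions from the model, here is one possible solution: (1) fit the logistic regression model to the data you have on hand (i.e., 74% female and 26% male); (2) extract predicted probabilities from your model with the gender variable set to 0.5. See ?predict.glm for details.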
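The two steps could be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame is simulated as a stand-in for mydata.csv (the real file is not available here), and the model is written as a formula on named columns with `data =` instead of `Y ~ X` on `cbind()` matrices, because `predict()` needs named columns to match against `newdata`. The names `fit`, `p_half`, `p_female`, `p_male`, and `p_mix` are hypothetical.

```r
# Simulated stand-in for mydata.csv (assumption: NOT the asker's real data)
set.seed(1)
n <- 411
mydata <- data.frame(
  Black = rbinom(n, 1, 0.44),
  vote  = runif(n, 0, 92),
  Male  = rbinom(n, 1, 0.26)   # men under-represented, as in the question
)
eta <- 0.46 + 1.33 * mydata$Black - 0.009 * mydata$vote + 0.78 * mydata$Male
mydata$Support <- rbinom(n, 1, plogis(eta))

# Step 1: fit the logit model on the data as collected
fit <- glm(Support ~ Black + vote + Male,
           family = binomial(link = "logit"), data = mydata)

# Step 2: predicted probabilities with the gender variable set to 0.5
p_half <- predict(fit, newdata = transform(mydata, Male = 0.5),
                  type = "response")
mean(p_half)

# Alternative reading of "50/50": average the predictions for an
# all-female and an all-male version of the sample
p_female <- predict(fit, newdata = transform(mydata, Male = 0), type = "response")
p_male   <- predict(fit, newdata = transform(mydata, Male = 1), type = "response")
p_mix    <- 0.5 * p_female + 0.5 * p_male
mean(p_mix)
```

Note that the two versions are not identical: setting Male to 0.5 averages on the log-odds scale, while `p_mix` averages the probabilities themselves. Which one is appropriate depends on what the prediction is meant to represent.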