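Introduction
I have multilevel survey data on teachers nested within schools. I manually computed design weights and non-response adjustment weights from the selection probabilities and response rates (oldwt below). Now I want to create post-stratification weights by raking on two margins: the teacher's gender (male or female) and employment status (full-time or part-time). With help from the kind people at Statalist (see here), I seem to have done this successfully in Stata. When I try to replicate the results in R, however, I get very different output.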
Sample data
#Variables
#school : unique school id
#caseid : unique teacher id
#oldwt : the product of the design weight and the non-response adjustment
#gender : male or female
#timecat : employment status (full-time or part-time)
#scgender : a combined factor variable of school x gender
#sctime : a combined factor variable of school x timecat
#genderp : the school's true population for gender
#fullp : the school's true population for timecat
#Sample Data
foo <- structure(list(caseid = 1:11, school = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), oldwt = c(1.8, 1.8, 1.8, 1.8, 1.8, 1.3, 1.3, 1.3, 1.3, 1.3, 1.3), gender = structure(c(2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor"), timecat = structure(c(2L, 2L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Full-time", "Part-time"), class = "factor"), scgender = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 4L, 4L, 3L, 4L, 4L), .Label = c("1.Female", "1.Male", "2.Female", "2.Male"), class = "factor"), sctime = structure(c(2L, 2L, 1L, 1L, 1L, 4L, 4L, 3L, 3L, 3L, 3L), .Label = c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time"), class = "factor"), genderp = c(0.444, 0.556, 0.556, 0.444, 0.444, 0.25, 0.75, 0.75, 0.25, 0.75, 0.75), fullp = c(0.222, 0.222, 0.778, 0.778, 0.778, 0.375, 0.375, 0.625, 0.625, 0.625, 0.625)), .Names = c("caseid", "school", "oldwt", "gender", "timecat", "scgender", "sctime", "genderp", "fullp"), class = "data.frame", row.names = c(NA, -11L))
Raking code
(For in-depth examples of using anesrake in R, see here and here.)
# extract true population proportions into a vector
genderp <- c(aggregate(foo$genderp, by=list(foo$scgender), FUN=max))
fullp <- c(aggregate(foo$fullp, by=list(foo$sctime), FUN=max))
genderp <- as.vector(genderp$x)
fullp <- as.vector(fullp$x)
# align the levels/labels of the population total with the variables
names(genderp) <- c("1.Female", "1.Male", "2.Female", "2.Male")
names(fullp) <- c("1.Full-time", "1.Part-time", "2.Full-time", "2.Part-time")
# create target list of true population proportions for variables
targets <- list(genderp, fullp)
names(targets) <- c("scgender", "sctime")
# rake
library(anesrake)
outsave <- anesrake(targets, foo, caseid = foo$caseid, weightvec = foo$oldwt, verbose = FALSE, choosemethod = "total", type = "nolim", nlim = 2, force1 = FALSE)
outsave
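Comparison with Stata output
The problem is that the R output does not match the Stata output (even when I set force1 = TRUE). The Stata output appears to be the correct one, which makes me think my sloppy R code is wrong. Is that the case?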
caseid R Stata
1 0.070 0.633
2 0.152 1.367
3 0.404 3.633
4 0.187 1.683
5 0.187 1.683
6 0.143 1.146
7 0.232 1.854
8 0.173 1.382
9 0.107 0.854
10 0.173 1.382
11 0.173 1.382
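A quick way to see how the two columns relate is to sum each one within school and look at the per-case ratio. The sketch below only re-enters the numbers from the table above; the data frame name cmp and its column names are made up for illustration.

```r
# Weights copied from the comparison table (cases 1-5 are school 1, 6-11 school 2)
cmp <- data.frame(
  school = rep(1:2, times = c(5, 6)),
  r      = c(0.070, 0.152, 0.404, 0.187, 0.187,
             0.143, 0.232, 0.173, 0.107, 0.173, 0.173),
  stata  = c(0.633, 1.367, 3.633, 1.683, 1.683,
             1.146, 1.854, 1.382, 0.854, 1.382, 1.382)
)

# Sum of each weight column within school
sums <- aggregate(cbind(r, stata) ~ school, data = cmp, FUN = sum)
print(sums)

# Ratio of the Stata weight to the R weight for every case
cmp$ratio <- cmp$stata / cmp$r
print(round(cmp$ratio, 2))
```

Within each school the R weights add up to roughly 1, while the Stata weights add up to roughly 9 and 8, and the per-case ratio is close to constant within a school. That suggests the two results differ mainly by a school-level scale factor rather than in the relative weights.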
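Answer 0 (score: 1)
Your target distribution in R should sum to one and represent the distribution in your population. See my example. I don't think the force1 option will compute the distribution you want, unless every school carries the same population weight. This is what force1 is doing: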
targets[[1]]/sum(targets[[1]])
1.Female 1.Male 2.Female 2.Male
0.278 0.222 0.125 0.375
Is that what you want?
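If the goal is instead to hit each school's own margins, one option is to rake within school to population counts. The following is a minimal base-R sketch of iterative proportional fitting, not the anesrake algorithm and not necessarily what the Stata command does. The school sizes (9 and 8) are an assumption: they are not given in the question and are only inferred from the proportions (0.444 is about 4/9, 0.375 is 3/8). The helper rake_school and the column newwt are hypothetical names.

```r
# Question's sample data, rebuilt compactly
foo <- data.frame(
  caseid  = 1:11,
  school  = rep(1:2, times = c(5, 6)),
  oldwt   = rep(c(1.8, 1.3), times = c(5, 6)),
  gender  = factor(c("Male", "Female", "Female", "Male", "Male", "Female",
                     "Male", "Male", "Female", "Male", "Male")),
  timecat = factor(c("Part-time", "Part-time", "Full-time", "Full-time",
                     "Full-time", "Part-time", "Part-time", "Full-time",
                     "Full-time", "Full-time", "Full-time")),
  genderp = c(0.444, 0.556, 0.556, 0.444, 0.444,
              0.25, 0.75, 0.75, 0.25, 0.75, 0.75),
  fullp   = c(0.222, 0.222, 0.778, 0.778, 0.778,
              0.375, 0.375, 0.625, 0.625, 0.625, 0.625)
)

# ASSUMPTION: school population sizes, inferred from the proportions
school_size <- c("1" = 9, "2" = 8)

# Rake one school's weights to its gender and employment-status counts
rake_school <- function(d, N, tol = 1e-10, max_iter = 1000) {
  w  <- d$oldwt
  g  <- as.character(d$gender)
  tm <- as.character(d$timecat)
  # Target counts per category = population proportion * assumed school size
  gender_target <- tapply(d$genderp, g, max) * N
  time_target   <- tapply(d$fullp, tm, max) * N
  for (i in seq_len(max_iter)) {
    w_prev <- w
    # Scale so the weighted gender totals match, then the employment totals
    w <- w * as.numeric((gender_target / tapply(w, g, sum))[g])
    w <- w * as.numeric((time_target / tapply(w, tm, sum))[tm])
    if (max(abs(w - w_prev)) < tol) break
  }
  w
}

foo$newwt <- NA_real_
for (s in unique(foo$school)) {
  idx <- foo$school == s
  foo$newwt[idx] <- rake_school(foo[idx, ], school_size[[as.character(s)]])
}
print(round(foo$newwt, 3))
```

Under these assumed sizes the raked weights sum to 9 and 8 within the two schools, which is the scale of the Stata column in the question. Dividing each school's weights by its assumed size gives weights that sum to 1 per school instead.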