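I have more than 10 survey samples (each one a svydesign object) and I need to compute proportions for several variables. I think the most efficient way to do this is with a loop, because I cannot use svymean(~var1+var2, data): var2 is NA for some specific values of var1, so if I use na.rm = TRUE I lose some of the var1 information, and if I leave na.rm out the result is NA.
I tried to create a function like this: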
svymean_all <- function(data, ...) {
  x <- c(...)
  for (i in length(x)) {
    svymean(x[i], data)
  }
}
But it does not work.
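For reference, two things in the attempt above likely explain the failure: `for (i in length(x))` runs only once (with `i` equal to `length(x)`) rather than once per element, and the value of each `svymean()` call is discarded, so the function returns NULL. The following is only a minimal sketch of the intended behavior, not a definitive fix. It assumes the variables are passed as a character vector (the `vars` argument is hypothetical), and it uses the `api` example data shipped with the survey package as a stand-in for the PNAD designs.

```r
library(survey)

# Hypothetical helper: one svymean() call per variable name, so that
# na.rm = TRUE drops missing rows separately for each variable instead
# of dropping every row where any of the variables is NA.
svymean_all <- function(design, vars) {
  results <- lapply(vars, function(v) {
    # reformulate("v0602") builds the one-sided formula ~v0602
    svymean(reformulate(v), design, na.rm = TRUE)
  })
  names(results) <- vars
  results
}

# Toy check on the api data from the survey package (not the PNAD data)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
res <- svymean_all(dclus1, c("api00", "enroll"))
```

Under the same assumptions, the call inside the yearly loop would look like `results[[i]] <- svymean_all(pnad_design, c("v0602", "v6002"))`, giving one named list of estimates per year.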
# download data
library(lodown)
pnad_cat <- get_catalog("pnad", output_dir = file.path(path.expand( "~" ), "PNAD"))
pnad_cat <- subset( pnad_cat , year >= 2001 )
pnad_cat <- lodown( "pnad" , pnad_cat )
options(survey.lonely.psu = "adjust")
library(survey)
# Here is the first part that I think is inefficient
# I have to do this for every variable I want
result1 <- as.list(NULL)
result2 <- as.list(NULL)
# do the analysis for each year
for (i in 1:nrow(pnad_cat)) {
  pnad_df <- readRDS( pnad_cat[ i , 'output_filename' ] )
  pop_types <- data.frame(v4609 = unique(pnad_df$v4609),
                          Freq = unique(pnad_df$v4609))
  prestratified_design <- svydesign(id = ~v4618,
                                    strata = ~v4617,
                                    data = pnad_df,
                                    weights = ~pre_wgt,
                                    nest = TRUE)
  rm(pnad_df) ; gc()
  pnad_design <- postStratify(design = prestratified_design,
                              strata = ~v4609,
                              population = pop_types)
  rm(prestratified_design) ; gc()
  # Here is the second part that I think is inefficient
  result1[[i]] <- svymean(~v0602, pnad_design, na.rm = TRUE)
  result2[[i]] <- svymean(~v6002, pnad_design, na.rm = TRUE)
}
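I need to do something similar for many variables and conditions (not only v0602 and v6002), in some cases on subsets of the data. Is there a simpler way to do this?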