Handling several variables across multiple survey samples

Asked: 2018-08-31 22:36:34

Tags: r survey

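I have more than 10 survey samples (each one is a svydesign object), and I need to compute proportions for several variables. I think a loop would be the most efficient way to do this, because I cannot use svymean(~var1+var2, data): var2 is NA for some specific values of var1, so with na.rm = TRUE I lose some of the var1 information, and without na.rm = TRUE the result is NA.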
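The effect described above can be reproduced on a small scale. The sketch below does not use the PNAD data: it assumes the `api` example data shipped with the survey package, and `var2` is a hypothetical column that is set to NA for one stratum purely for illustration.

```r
library(survey)
data(api)  # example data shipped with survey; not the PNAD files

# Hypothetical var2: a copy of api99, set to NA for high schools (stype == "H")
df <- apistrat
df$var2 <- df$api99
df$var2[df$stype == "H"] <- NA

des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = df, fpc = ~fpc)

# Joint call: na.rm = TRUE drops every row that has an NA in either variable,
# so the api00 estimate silently excludes the high schools
joint <- svymean(~api00 + var2, des, na.rm = TRUE)

# Separate call: api00 keeps all of its own non-missing rows
alone <- svymean(~api00, des, na.rm = TRUE)

coef(joint)["api00"]
coef(alone)["api00"]
```

The two `api00` estimates differ, which is why estimating each variable in its own call looks like the safer route here.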

I tried to write a function like this:

svymean_all <- function(data, ...) {
  x <- c(...)
  for (i in length(x)) {
     svymean(x[i], data)
  }
}

But it does not work.
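Three things appear to go wrong in the attempt above. `for (i in length(x))` runs only once, for the last index (`seq_along(x)` would visit every element). If formulas are passed, `c(...)` builds a list, so `x[i]` is a one-element list rather than a formula (`x[[i]]` would be needed). Finally, nothing is stored or returned from the loop. A minimal sketch of one possible fix follows. It assumes the variables are passed as a character vector, which is a changed interface chosen for illustration, and the demo again uses the survey package's `api` data rather than PNAD.

```r
library(survey)

# Run svymean() separately for each variable, so that NAs in one variable
# do not remove rows from another. `vars` is a character vector of names.
svymean_all <- function(design, vars, na.rm = TRUE) {
  results <- lapply(vars, function(v) {
    svymean(reformulate(v), design, na.rm = na.rm)  # reformulate("x") gives ~x
  })
  setNames(results, vars)  # named list: one svystat object per variable
}

# Demo on stand-in data
data(api)
des <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
res <- svymean_all(des, c("api00", "api99"))
res$api00
```

Returning a named list keeps each estimate together with its standard error, so `coef()` and `SE()` can still be applied to every element afterwards.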

To be more specific:

# download data
library(lodown)
pnad_cat <- get_catalog("pnad", output_dir = file.path(path.expand( "~" ), "PNAD"))
pnad_cat <- subset( pnad_cat , year >= 2001 )
pnad_cat <- lodown( "pnad" , pnad_cat )

options(survey.lonely.psu = "adjust")
library(survey)

# Here is the first part that I think is inefficient
# I have to do this for every variable I want
result1 <- as.list(NULL)
result2 <- as.list(NULL)

# do the analysis for each year
for (i in 1:nrow(pnad_cat)) {
  pnad_df <- readRDS( pnad_cat[ i , 'output_filename' ] )
  pop_types <- data.frame(v4609 = unique(pnad_df$v4609), 
                          Freq = unique(pnad_df$v4609))

  prestratified_design <- svydesign(id = ~v4618,
                                    strata = ~v4617,
                                    data = pnad_df,
                                    weights = ~pre_wgt,
                                    nest = TRUE)

  rm(pnad_df) ; gc()

  pnad_design <- postStratify(design = prestratified_design,
                              strata = ~v4609,
                              population = pop_types)

  rm(prestratified_design) ; gc()


  # Here is the second part that I think is inefficient
  result1[[i]] <- svymean(~v0602, pnad_design, na.rm = TRUE)
  result2[[i]] <- svymean(~v6002, pnad_design, na.rm = TRUE)
}

I need to do something similar for many variables and conditions, using certain subsets (not only v0602 and v6002). Is there a simpler way to do this?
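One way to avoid a separate `resultN` list per variable is to describe each estimate once (variable name plus optional subset condition) and apply that description to every design. The sketch below is only an illustration under stated assumptions: `specs` and `estimate_one` are hypothetical names, and the two designs built from the survey package's `api` data stand in for the per-year PNAD designs.

```r
library(survey)
data(api)  # stand-in data; not the PNAD files

# Stand-ins for the per-year designs built inside the loop
designs <- list(
  strat = svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc),
  clus  = svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)
)

# One entry per estimate: a variable name and an optional subset condition
specs <- list(
  api00_all   = list(var = "api00", cond = NULL),
  api00_elem  = list(var = "api00", cond = quote(stype == "E")),
  stype_share = list(var = "stype", cond = NULL)  # factor: svymean gives proportions
)

# Hypothetical helper: restrict the design if a condition is given, then estimate
estimate_one <- function(design, spec) {
  if (!is.null(spec$cond)) {
    keep <- eval(spec$cond, design$variables)
    design <- design[keep & !is.na(keep), ]  # rows with an NA condition are dropped
  }
  svymean(reformulate(spec$var), design, na.rm = TRUE)
}

# Nested named list: results[[design]][[estimate]]
results <- lapply(designs, function(d) lapply(specs, estimate_one, design = d))
results$clus$api00_elem
```

Inside the original loop, where only one design is held in memory at a time, the equivalent call would be something like `results[[i]] <- lapply(specs, estimate_one, design = pnad_design)`. Note that svymean returns proportions only for factor variables; a variable stored as numeric codes would need to be wrapped in `factor()` first.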
