SCF data release in the lodown package

Time: 2019-07-23 06:37:36

Tags: r survey

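While analyzing the SCF with the lodown package, I ran into a very strange problem. The data for the Black group under age 35 with "some college" education must be wrong: the share/mean for that group is far too high.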

I am trying to combine three factors (race, age, and education) to see what share of total wealth each specific group holds.

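# packages used below
library(survey)     # svrepdesign, svyby, svytotal
library(mitools)    # imputationList
library(tidyverse)  # str_detect, %>%, mutate, rownames_to_column
# scf_MIcombine() is assumed to be available in the session already;
# the post does not show where it is defined

# input data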
scf_imp <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016.rds" ) )

scf_rw <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016 rw.rds" ) )

scf_design <-
  svrepdesign(
    weights = ~wgt ,
    repweights = scf_rw[ , -1 ] ,
    data = imputationList( scf_imp ) ,
    scale = 1 ,
    rscales = rep( 1 / 998 , 999 ) ,
    mse = FALSE ,
    type = "other" ,
    combined.weights = TRUE
  )

# Variable Recoding
scf_design <- update(scf_design ,

                     racecl4 = factor(racecl4 ,
                                      labels = c("White" ,
                                                 "Black" ,
                                                 "Hispanic/Latino" ,
                                                 "Other" )),
                     edcl = factor(edcl ,
                                   labels = c("less than high school" ,
                                              "high school or GED" ,
                                              "some college" ,
                                              "college degree" )),
                     agecl = factor(agecl ,
                                    labels = c("less than 35" ,
                                               "35-44" ,
                                               "45-54" ,
                                               "55-64" ,
                                               "65-74" ,
                                               "75 or more"))
)
# calculation
trible <- scf_MIcombine( with( scf_design ,
                               svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , svytotal )
) )

sum_black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% sum()
black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% matrix(nrow = 4)
black <- as.data.frame(black/sum_black)
colnames(black) <- c("less than 35" , "35-44" , "45-54" , "55-64" ,"65-74" , "75 or more")
black <- black %>% mutate(total = rowSums(black))
black <- rbind(black,total = colSums(black))
black <- sapply(black,scales::percent) %>% as.data.frame()
rownames(black) <- c("less than high school" , "high school or GED" , "some college" , "college degree", "total" )
black <- rownames_to_column(black,"share for black")

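I applied the same method to compute the mean. The results show a very high share/mean for the Black group under age 35 with "some college" education, but that cannot be right. Is something wrong with the data or with the method I am using?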

http://ww4.sinaimg.cn/large/006tNc79ly1g59rj71tfgj312m07i43t.jpg http://ww2.sinaimg.cn/large/006tNc79ly1g59rhfq18aj30zc07e79l.jpg

1 Answer:

Answer 0 (score: 1)

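The Survey of Consumer Finances has about 6,000 unweighted records, and you are splitting the results into nearly 100 groups (4 x 4 x 6 = 96), so each cell holds N = 60 on average. Look at how small that is.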

counts <- scf_MIcombine( with( scf_design ,
                               svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , unwtd.count )
) )
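
As a rough sketch of the arithmetic above, in base R: the three recoded factors in the question have 4, 4, and 6 levels, which gives the number of cells in the interaction. The record count of 6,000 is the approximate figure quoted above, not an exact sample size, and the variable names are made up for illustration.

```r
# number of levels in each recoded factor (taken from the update() call in the question)
n_race <- 4
n_educ <- 4
n_age  <- 6

# number of cells produced by interaction( racecl4 , edcl , agecl )
n_cells <- n_race * n_educ * n_age

# approximate unweighted record count quoted above (an assumption, not an exact figure)
n_records <- 6000

# average unweighted records per cell if records were spread evenly
avg_per_cell <- n_records / n_cells

n_cells
avg_per_cell
```

Real cells are unlikely to be filled evenly, so some of them will probably hold far fewer records than this average.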

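This is not a hard-and-fast rule, but if the standard error exceeds 30% of the statistic, the statistic is probably unstable. Look at SE( trible ) / coef( trible ) > 0.3 and you will find that nearly all of the statistics are unstable.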
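
A minimal sketch of that check in base R. The helper name flag_unstable and the toy estimates and standard errors are hypothetical, invented purely for illustration; they are not SCF results. With the objects from the question, one would pass coef( trible ) and SE( trible ) instead.

```r
# hypothetical helper: flag statistics whose relative standard error
# (standard error divided by the absolute estimate) exceeds a threshold
flag_unstable <- function( est , se , threshold = 0.3 ) {
  rse <- se / abs( est )
  data.frame( estimate = est , se = se , rse = rse , unstable = rse > threshold )
}

# toy numbers for illustration only, NOT SCF results
toy_est <- c( a = 100 , b = 50 , c = 10 )
toy_se  <- c( a = 10 , b = 20 , c = 9 )

res <- flag_unstable( toy_est , toy_se )
res

# with the question's objects (needs the survey design, so not run here):
# flag_unstable( coef( trible ) , SE( trible ) )
```

The 0.3 threshold is the rule of thumb mentioned above, exposed as an argument so it can be tightened or relaxed.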

The SCF is an amazing data set, but the sample size may not be large enough to support such a fine-grained breakout.