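While analyzing the SCF (Survey of Consumer Finances) with the lodown package, I ran into a very strange problem. Something must be wrong with the data for Black respondents under 35 with "some college" education: that group's share/mean is far too high.
I am trying to cross race, age, and education to see what share of total wealth specific groups hold.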
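# Load the survey tools (assumption: both packages are installed)
library(survey)
library(mitools)

# Input data: imputed records and replicate weights
scf_imp <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016.rds" ) )
scf_rw <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016 rw.rds" ) )

# Replicate-weight design over the multiply imputed data
scf_design <-
    svrepdesign(
        weights = ~wgt ,
        repweights = scf_rw[ , -1 ] ,
        data = imputationList( scf_imp ) ,
        scale = 1 ,
        rscales = rep( 1 / 998 , 999 ) ,
        mse = FALSE ,
        type = "other" ,
        combined.weights = TRUE
    )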
# Variable recoding
scf_design <- update(scf_design ,
    racecl4 = factor(racecl4 ,
        labels = c("White" ,
                   "Black" ,
                   "Hispanic/Latino" ,
                   "Other" )),
    edcl = factor(edcl ,
        labels = c("less than high school" ,
                   "high school or GED" ,
                   "some college" ,
                   "college degree" )),
    agecl = factor(agecl ,
        labels = c("less than 35" ,
                   "35-44" ,
                   "45-54" ,
                   "55-64" ,
                   "65-74" ,
                   "75 or more"))
)
# Calculation (scf_MIcombine() is assumed to be defined already in the session)
library(stringr)
library(dplyr)
library(tibble)

trible <- scf_MIcombine( with( scf_design ,
    svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , svytotal )
) )

# Keep the Black cells; matrix(nrow = 4) assumes all 24 cells are present,
# ordered with education varying fastest within each age class
sum_black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% sum()
black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% matrix(nrow = 4)
black <- as.data.frame(black/sum_black)
colnames(black) <- c("less than 35" , "35-44" , "45-54" , "55-64" ,"65-74" , "75 or more")

# Add row and column totals, then format as percentages
black <- black %>% mutate(total = rowSums(black))
black <- rbind(black,total = colSums(black))
black <- sapply(black,scales::percent) %>% as.data.frame()
rownames(black) <- c("less than high school" , "high school or GED" , "some college" , "college degree", "total" )
black <- rownames_to_column(black,"share for black")
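I applied the same method to compute means. The results show that Black respondents under 35 with some college education have a very high share/mean, which cannot be right. Is something wrong with the data or with my method?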
http://ww4.sinaimg.cn/large/006tNc79ly1g59rj71tfgj312m07i43t.jpg http://ww2.sinaimg.cn/large/006tNc79ly1g59rhfq18aj30zc07e79l.jpg
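Answer 0 (score: 1)
The Survey of Consumer Finances has roughly 6,000 unweighted records, and you are splitting the results into nearly 100 groups (4 x 4 x 6 = 96), so each cell holds about N = 60 on average. Look at how small the cells are: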
counts <- scf_MIcombine( with( scf_design ,
    svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , unwtd.count )
) )
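To make the cell-size arithmetic concrete, here is a minimal base-R sketch. It uses simulated records, not real SCF data, and the group probabilities are made up purely for illustration. It only shows that an average of about 60 per cell can hide much smaller cells for minority groups.

```r
# Hypothetical illustration: simulated records, NOT real SCF data
set.seed(1)
n_records <- 6000                 # rough unweighted size mentioned above
n_cells   <- 4 * 4 * 6            # race x education x age classes
avg_cell  <- n_records / n_cells  # average records per cell

# Uneven group memberships (the probabilities are invented for this sketch)
sim <- data.frame(
    race = sample(c("White", "Black", "Hispanic/Latino", "Other"), n_records,
                  replace = TRUE, prob = c(0.70, 0.13, 0.10, 0.07)),
    educ = sample(1:4, n_records, replace = TRUE),
    age  = sample(1:6, n_records, replace = TRUE)
)

# Unweighted count in every race x education x age cell
cell_counts <- table(interaction(sim$race, sim$educ, sim$age))
summary(as.vector(cell_counts))
```

With the real design, the `counts` object above plays the role of `cell_counts`.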
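It is not a hard-and-fast rule, but if the standard error exceeds 30% of the statistic, the statistic is probably unstable. Look at SE( trible ) / coef( trible ) > 0.3 and you will see that nearly all of the statistics are unstable.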
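The 30% rule of thumb can be sketched in base R as follows. The estimates and standard errors here are hypothetical placeholders rather than SCF results; with the real objects, `coef( trible )` and `SE( trible )` would take their place.

```r
# Hypothetical estimates and standard errors (placeholders, not SCF results)
est <- c(cell_a = 1200, cell_b = 450, cell_c = 90)
se  <- c(cell_a = 150,  cell_b = 200, cell_c = 20)

# Relative standard error: SE as a fraction of the statistic
rse <- se / est

# Flag statistics whose SE exceeds 30% of the estimate
unstable <- rse > 0.3

data.frame(estimate = est, se = se, rse = round(rse, 2), unstable = unstable)

# Share of cells flagged as unstable
mean(unstable)
```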
The SCF is an amazing dataset, but the sample size may not be large enough to support such a fine breakdown.
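One possible way to act on this, assuming a coarser breakdown is acceptable for the analysis, is to collapse categories before crossing them. The sketch below merges the six age classes into three in base R. The new groupings and the name `agecl3` are hypothetical choices for illustration, not a recommendation from the SCF documentation.

```r
# Toy factor with the same six age labels used in the question
agecl <- factor(c(1, 2, 3, 4, 5, 6, 1, 3),
                labels = c("less than 35", "35-44", "45-54",
                           "55-64", "65-74", "75 or more"))

# Collapse six classes into three (the groupings are an assumption)
agecl3 <- agecl
levels(agecl3) <- list(
    "less than 45" = c("less than 35", "35-44"),
    "45-64"        = c("45-54", "55-64"),
    "65 or more"   = c("65-74", "75 or more")
)

# Cross-check the mapping from old to new classes
table(agecl, agecl3)
```

Crossing race and education with three age classes instead of six halves the number of cells from 96 to 48. The same recode could presumably go inside an `update()` call on the design, as in the question's variable recoding step.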