Question

(Edit: fully refined question)

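Using the packages mitools and survey, and following Anthony Damico's code, I have been working with the Survey of Consumer Finances dataset for a few days. The original list of datasets is "scf_imp", and the multiply-imputed design object built from it is "scf_design". The problem is as follows: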

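The five multiply-imputed data frames hold different values in the same column, so if I subset the sample using that column ("houses" in my case), the rows selected differ from one data frame to another.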

What I tried:

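Subsetting the whole list by a criterion (houses > 0 & income > 0) and, as suggested by the last line here (http://r-survey.r-forge.r-project.org/survey/svymi.html), including all = TRUE to keep only those observations that are in the subset for all imputations.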

scf_design_owner <- subset(scf_design, houses > 0 & income > 0, all = TRUE)

or

I even tried to get rid of the NA values before creating the imputation list, as follows:

lapply(scf_imp, function(x) { tidyr::replace_na(x, list(houses = 0, income = 0)) })

I also tried filter, but something did not work on the imputation list.

After those attempts, I checked the messages and still found this warning:

Warning message:
In subset.svyimputationList(scf_design, houses > 0 & income > 0,  :
subset differed between imputations

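I am completely stuck and have spent more than three days on this. In short, my plan is to filter the imputation list by "houses > 0 and income > 0" (two column names in the list) and to use only the observations (rows) that satisfy this in all five imputed data frames.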
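The plan described above can be sketched in base R on toy data. This is only an illustration under stated assumptions: `toy_imp` and its values are hypothetical stand-ins for the imputed data frames (three here instead of five), and the sketch works on plain data frames rather than on the survey design object. If the same idea were applied to `scf_imp` before building the design, the rows of the replicate-weights table would presumably need the same filter.

```r
# Toy stand-ins for the imputed data frames (hypothetical values, not SCF data).
toy_imp <- list(
  data.frame(id = 1:4, houses = c(100, 0, 50, 20), income = c(10, 5, 0, 30)),
  data.frame(id = 1:4, houses = c(100, 0, 50, 20), income = c(10, 5, 8, 30)),
  data.frame(id = 1:4, houses = c(100, 10, 50, 20), income = c(10, 5, 0, 30))
)

# One logical vector per imputation: does the row meet the criterion there?
# Rows where the comparison is NA are treated as not meeting it.
masks <- lapply(toy_imp, function(d) {
  r <- d$houses > 0 & d$income > 0
  r & !is.na(r)
})

keep_all <- Reduce(`&`, masks)  # TRUE only if the row qualifies in every imputation
keep_any <- Reduce(`|`, masks)  # TRUE if the row qualifies in at least one imputation

# Apply the same row mask to every data frame so all of them keep identical rows.
toy_owner <- lapply(toy_imp, function(d) d[keep_all, , drop = FALSE])

print(sapply(toy_owner, nrow))
```

Because one shared mask is applied to every data frame, the filtered data frames necessarily have the same number of rows; `keep_any` is shown only to contrast the "any imputation" rule with the "all imputations" rule.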

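I am only a beginner in R, so please bear with me. I am working with the SCF dataset and doing simple statistical analysis. I have to trim the data so that the sample includes only positive values of houses and income.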

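At first, I tried to do this by adding extra columns to the list of data frames in the variable recoding step described by Anthony Damico (http://asdfree.com/survey-of-consumer-finances-scf.html). I could not get it to work there, so I decided to restrict the whole design object (scf_design) with the condition shown below.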

Here is my R code (up to the subset):

setwd( "D:/Dropbox/Data/SCF 2016" )
library(mitools)    # allows analysis of multiply-imputed survey data
library(survey)     # load survey package (analyzes complex design surveys)
library(downloader) # downloads and then runs the source() function on scripts from github
library(foreign)    # load foreign package (converts data files into R)
library(Hmisc)      # load Hmisc package (loads a simple wtd.quantile function)

scf_imp <- readRDS("scf 2016.rds" )
scf_rw <- readRDS("scf 2016 rw.rds" )

scf_design <- svrepdesign( 

     # use the main weight within each of the imp# objects
     weights = ~wgt , 

     # use the 999 replicate weights stored in the separate replicate weights file, -1 drops first id column
     repweights = scf_rw[ , -1 ] , 

     # read the data directly from the scf data, list of all five imputation data frames
     data = imputationList( scf_imp ) , 

     scale = 1 ,

     rscales = rep( 1 / 998 , 999 ) ,

     # use the mean of the replicate statistics as the center
     # when calculating the variance, as opposed to the main weight's statistic
     mse = TRUE ,

     type = "other" ,

     combined.weights = TRUE
 )

 scf_design_owner <- subset(scf_design, houses > 0 & income > 0)

If you are short on time, please look at the last line. What I get is the following message:

scf_design_owner <- subset(scf_design, houses > 0 & income > 0)
It seemed to work at first (when I used only one criterion). However,
it shows the following warning.

Warning message:
In subset.svyimputationList(scf_design, houses > 0 & income > 0) :
subset differed between imputations

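The problem is that the number of observations seems to differ between the imputed data frames. (Five imputed data frames were created from the SCF, which uses a multiple imputation technique, so 'scf_design' is a list of five data frames.)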

> lodown:::scf_MIcombine( with( scf_design_owner , svyby( ~ one , ~ one , 
unwtd.count ) ) )
Multiple imputation results:
  with(scf_design_owner, svyby(~one, ~one, unwtd.count))
  lodown:::scf_MIcombine(with(scf_design_owner, svyby(~one, ~one, unwtd.count)))
  results        se
1  4131.6 0.9797959

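The original sample size is 6248. It has certainly been reduced, but now the count has a decimal part. I suspect this is because the number of observations differs between the imputed data frames.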
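The suspicion above can be checked with simple arithmetic. The sketch below uses hypothetical per-imputation counts (they are not taken from the data) and assumes that the combined result is the average across imputations, with a standard error that comes only from between-imputation variation (within-imputation variance assumed to be zero for an unweighted count). Whether `lodown:::scf_MIcombine` computes exactly this is an assumption here.

```r
# Hypothetical per-imputation unweighted counts, chosen only for illustration.
counts <- c(4131, 4131, 4131, 4132, 4133)
m <- length(counts)

# Pooled point estimate: the average over the imputations.
pooled_count <- mean(counts)

# Between-imputation variance of the counts.
between_var <- var(counts)

# Pooled standard error, assuming zero within-imputation variance.
pooled_se <- sqrt((1 + 1 / m) * between_var)

print(pooled_count)
print(pooled_se)
```

Integer counts that differ across imputations average to a non-integer value, which would be consistent with a fractional count and a non-zero standard error in the combined output.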

I am stuck here. Long story short, these are my questions.

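Is there any way to subset the data frames correctly, so that all of the modified imputed data frames have the same number of observations?
If my approach is inefficient, how can I do this in the "variable recoding" section instead? (That was my original attempt.) I was able to add an extra variable for houses, because the SCF macro has a variable hhouses, a logical variable that identifies homeowners. But there is no similar variable for income, so I gave up. (SCF income starts at 0, so there are observations at exactly 0.)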

The variable recoding part I mean, written by Anthony Damico, is as follows:

Example:

scf_design <- 
    update( 
    scf_design , 
    hhsex = factor( hhsex , labels = c( "male" , "female" ) ) ,
    married = as.numeric( married == 1 ) ,
    edcl = 
        factor( 
            edcl , 
            labels = 
                c( 
                    "less than high school" , 
                    "high school or GED" , 
                    "some college" , 
                    "college degree" 
                ) 
        )

    )
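In the same spirit as the recode above, an indicator for positive income could be derived directly from the income column, since no ready-made flag like hhouses exists for it. The sketch below is a minimal illustration on toy data: `toy_imp`, `pos_income`, and `owner_pos_income` are hypothetical names, and the code works on a plain list of data frames rather than on the design object.

```r
# Toy stand-ins for two imputed data frames (hypothetical values, not SCF data).
toy_imp <- list(
  data.frame(houses = c(100, 0, 50), income = c(10, 5, 0)),
  data.frame(houses = c(100, 0, 50), income = c(10, 5, 8))
)

# Add the same derived columns to every imputed data frame.
toy_imp <- lapply(toy_imp, function(d) {
  d$pos_income <- as.numeric(d$income > 0)                        # 1 if income is positive
  d$owner_pos_income <- as.numeric(d$houses > 0 & d$income > 0)   # 1 if both are positive
  d
})

print(toy_imp[[1]])

# On the design object itself, the analogous recode would presumably look like
# (not run here, since it needs the SCF data):
#   scf_design <- update(scf_design, pos_income = as.numeric(income > 0))
```

The indicator can still take different values for the same household across imputations, so a recode like this names the condition but does not by itself make the subsets identical.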

(Added)

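I found this, and it solved the problem. If the subset differs between the multiple imputations, the default is to take the observations that are in the subset for any of the imputations, with a warning.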

 d3<-subset(des, HAB1MI>3) 
 Warning message: In subset.svyimputationList(des, HAB1MI > 3) : 
 subset differed between imputations 
 To keep only those observations in the subset for all imputations 
 use the all=TRUE argument to subset

(from the multiple imputation list)

0 answers: