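I'm using the mice package in R for a project and noticed that pooling the results seems to change the dummy coding of one of the variables in the output.
To elaborate, say I have a factor foo with two levels: 0 and 1. A regular lm would typically produce an estimate for foo1. With mice and the pool function, however, I get an estimate for foo2. I've included a reproducible example below using the nhanes dataset from the mice package. Any ideas why this might be happening?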
require(mice)
# Create age as: 0, 1, 2
nhanes$age <- as.factor(nhanes$age - 1)
head(nhanes)
#   age  bmi hyp chl
# 1   0   NA  NA  NA
# 2   1 22.7   1 187
# 3   0   NA   1 187
# 4   2   NA  NA  NA
# 5   0 20.4   1 113
# 6   2   NA  NA 184
# Use a regular lm with missing data just to see output
# age1 and age2 come up as expected
lm(chl ~ age + bmi, data = nhanes)
# Call:
# lm(formula = chl ~ age + bmi, data = nhanes)
# Coefficients:
# (Intercept)         age1         age2          bmi
#     -28.948       55.810      104.724        6.921
imp <- mice(nhanes)
str(complete(imp)) # still the same coding
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
# Now the estimates are for age2 and age3
# Call: pool(object = fit)
# Pooled coefficients:
# (Intercept)        age2        age3         bmi
#    29.88431    43.76159    56.57606     5.05537
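Answer 0 (score: 4)
Apparently the mice function sets the contrasts for factors, which relabels the contrast columns. So you get the following (look at the column names):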
contrasts(nhanes$age)
##   1 2
## 0 0 0
## 1 1 0
## 2 0 1
contrasts(imp$data$age)
##   2 3
## 0 0 0
## 1 1 0
## 2 0 1
You can reset the contrasts on the imputed data to match the original factor, and then you get the same dummy coding:
imp <- mice(nhanes)
contrasts(imp$data$age) <- contrasts(nhanes$age)
fit <- with(imp, lm(chl ~ age + bmi))
pool(fit)
## Call: pool(object = fit)
##
## Pooled coefficients:
## (Intercept)        age1        age2         bmi
##   0.9771566  47.6351257  63.1332336   6.2589887
##
## Fraction of information about the coefficients missing due to nonresponse:
## (Intercept)        age1        age2         bmi
##   0.3210118   0.5554399   0.6421063   0.3036489
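The relabelling can be reproduced in base R without mice. This is only a minimal sketch on made-up toy data (f, g and y are hypothetical names, not from the question), and it assumes the effect comes purely from the column names of the contrast matrix attached to the factor. It mimics what was observed in imp$data and is not a claim about what mice does internally:

```r
# Toy data: a three-level factor coded 0/1/2, as in the question.
f <- factor(c(0, 1, 2, 0, 1, 2))
y <- c(1.0, 2.0, 3.0, 1.5, 2.5, 3.5)

# With the default treatment contrasts, the contrast columns are named
# after the factor levels, so the coefficients are labelled f1 and f2.
default_names <- names(coef(lm(y ~ f)))
print(default_names)

# Assumption: mimic the situation seen in imp$data by attaching a contrast
# matrix whose columns are named "2" and "3" instead of "1" and "2".
g <- f
contrasts(g) <- contr.treatment(3)
relabelled_names <- names(coef(lm(y ~ g)))
print(relabelled_names)

# The design matrix is the same in both fits, so the estimates agree;
# only the coefficient labels differ.
same_estimates <- isTRUE(all.equal(unname(coef(lm(y ~ f))),
                                   unname(coef(lm(y ~ g)))))
print(same_estimates)
```

If that assumption holds, age2 and age3 in the pooled output refer to the same levels as age1 and age2 in the plain lm; the estimates themselves differ there because the pooled fit uses imputed data rather than complete cases.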