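I am running multiple imputation with mice, and I was surprised to find that original values that were never NA get changed and distorted.
See below for a reproducible example. I use mtcars (base R) and insert random NAs into two columns, disp and hp. I flag the positions of those NAs. Then I impute the missing values and compare the result with the original values. Finally, I plot the results as scatter plots of original vs. imputed values. I would expect the original and imputed values to coincide wherever there was no NA, because nothing should have been imputed there. But that is not the case. The code and plots follow: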
library(data.table)
library(ggplot2)
library(mice)
library(ggthemes)  # provides theme_economist(), used in the plots below
data(mtcars)
setDT(mtcars)
dim(mtcars)
# 32 11
mtcars_original <- copy(mtcars)
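set.seed(1)  # arbitrary seed, added only so the sampled rows are reproducible
# set 7 randomly chosen rows to NA in hp, then 7 (independently chosen) rows in disp
mtcars[as.numeric(sample(row.names(mtcars), 7)), ]$hp <- NA
mtcars[as.numeric(sample(row.names(mtcars), 7)), ]$disp <- NA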
mtcars[, ":="(hp_NA = ifelse(is.na(hp), 1, 0) , disp_NA = ifelse(is.na(disp), 1, 0))]
mtcars_imputed <- complete(mice(mtcars))
mtcars_imputed$disp_original <- mtcars_original$disp
mtcars_imputed$hp_original <- mtcars_original$hp
ggplot(mtcars_imputed, aes(x = disp_original, y = disp, color = as.factor(disp_NA))) +
  geom_point(size = 2) + ggtitle("Match between original and imputed values \n disp") +
  geom_smooth(method = "lm", color = "red", alpha = 0.3, size = 2) + theme_economist()
ggplot(mtcars_imputed, aes(x = hp_original, y = hp, color = as.factor(hp_NA))) +
  geom_point(size = 2) + ggtitle("Match between original and imputed values \n hp") +
  geom_smooth(method = "lm", color = "red", alpha = 0.3, size = 2) + theme_economist()
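The comparison the plots are meant to show can also be done numerically. The following is only a minimal sketch: `check_observed_unchanged` is a hypothetical helper (it is not part of mice or data.table), and it assumes that the original, NA-injected, and completed tables all have their rows in the same order. It uses toy data rather than a real imputation, so it runs in base R alone.

```r
# Hypothetical helper (not part of mice): for one column, return the row
# positions where a value that was never set to NA differs between the
# original data and the completed data.
# Assumption: all three tables share the same row order.
check_observed_unchanged <- function(original, with_na, completed, col) {
  observed <- !is.na(with_na[[col]])  # rows that were not set to NA
  changed  <- observed & (original[[col]] != completed[[col]])
  which(changed)
}

# Toy data: row 2 plays the role of a missing value, and the "completed"
# column deliberately alters row 3 to show what a mismatch looks like.
original  <- data.frame(hp = c(110, 93, 175, 105))
with_na   <- data.frame(hp = c(110, NA, 175, 105))
completed <- data.frame(hp = c(110, 100, 180, 105))

changed_rows <- check_observed_unchanged(original, with_na, completed, "hp")
print(changed_rows)
```

Applied to the objects in this post, the call would be `check_observed_unchanged(mtcars_original, mtcars, mtcars_imputed, "hp")`, and likewise for `"disp"`. An empty result would mean that every value that was never NA survived imputation unchanged.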
Any suggestions would be greatly appreciated.