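I have a data set with some imputed values. According to a predefined edit rule, some of these imputed values are implausible. For this reason, I want to adjust these implausible imputed values, but the adjustment should be as small as possible.
Here is a simplified example: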
# Seed
set.seed(111)
# Example data
data <- data.frame(x1 = round(rnorm(200, 5, 5), 0),
                   x2 = factor(round(runif(200, 1, 3), 0)),
                   x3 = round(rnorm(200, 2, 10), 0),
                   x4 = factor(round(runif(200, 0, 5), 0)))
data[data$x1 > 5 & data$x2 == 1, ]$x3 <- 4
data[data$x1 > 5 & data$x2 == 1, ]$x4 <- 5
# Missings
data$x1[sample(1:nrow(data), 25)] <- NA
data$x2[sample(1:nrow(data), 50)] <- NA
data$x3[sample(1:nrow(data), 40)] <- NA
data$x4[sample(1:nrow(data), 35)] <- NA
# Imputation
library("mice")
imp <- mice(data, m = 1)
# Imputed data
data_imp <- complete(imp, 1)  # first (and only) completed data set, with the original column names
# So far everything works well.
# However, there is a predefined edit rule, which should not be violated.
# Edit Rule:
# If x1 > 5 and x2 == 1
# Then x3 > 3 and x4 > 4
# Because of the imputation, some of the observations have implausible values.
implausible <- data_imp[data_imp$x1 > 5 & data_imp$x2 == 1 &
                          (data_imp$x3 <= 3 | (data_imp$x4 != 4 & data_imp$x4 != 5)), ]
implausible
# Example 1)
# In row 26, x1 is greater than 5 and x2 equals 1.
# Therefore, x3 would have to be larger than 3 (here x3 is -17).
# As you can see in the original data, x2 was imputed in row 26.
data[rownames(implausible), ]
# Hence, x2 would have to be adjusted, so that it randomly gets a different category.
# Example 2)
# Row 182 also contains implausible values.
# Three of the variables have been imputed in this row.
# Therefore, all/some of the imputed cells would have to be adjusted,
# but the adjustment should be as small as possible.
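
To make the goal more concrete, the following is a minimal base R sketch of the two ingredients any adjustment would need: a check for the edit rule as stated above, and a mask of the cells that were originally missing (and are therefore the only ones that may be changed). The toy data frames `toy_orig` and `toy_imp` and the helper `violates_rule` are made up for illustration; they stand in for `data` and `data_imp` and are not taken from any package.

```r
# Toy stand-in for `data` (before imputation); the values are invented.
toy_orig <- data.frame(
  x1 = c(8, 7, 2, NA),
  x2 = factor(c(NA, 1, 1, 1), levels = 1:3),
  x3 = c(-17, 10, 0, NA),
  x4 = factor(c(5, 5, 0, NA), levels = 0:5)
)

# Toy stand-in for `data_imp`: the missing cells are filled with invented values.
toy_imp <- toy_orig
toy_imp$x2[1] <- "1"
toy_imp$x1[4] <- 9
toy_imp$x3[4] <- 6
toy_imp$x4[4] <- "2"

# Hypothetical helper: TRUE for rows where the condition of the edit rule
# holds (x1 > 5 and x2 == 1) but the consequence (x3 > 3 and x4 > 4) does not.
violates_rule <- function(d) {
  x4_num <- as.numeric(as.character(d$x4))  # factor labels -> numbers
  condition <- d$x1 > 5 & d$x2 == 1
  consequence <- d$x3 > 3 & x4_num > 4
  condition & !consequence
}

bad <- violates_rule(toy_imp)

# Cells that were NA before imputation are the candidates for adjustment.
adjustable <- is.na(toy_orig)

# For each violating row, show which cells could be changed.
adjustable[bad, , drop = FALSE]
```

With the real data, `violates_rule(data_imp)` and `is.na(data)` would play the same roles. What is still missing is the step that picks the smallest change among the adjustable cells.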
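I have already done some research and found some related papers/books that describe optimization algorithms for this problem:
Pannekoek & Zhang (2011): https://www.researchgate.net/publication/269410841_Partial_donor_Imputation_with_Adjustments
de Waal, Pannekoek & Scholtus (2011): Handbook of Statistical Data Editing and Imputation
However, I am struggling to implement these algorithms in R. Is there a package available that helps with these calculations? I would very much appreciate some help with my code or some hints on the topic!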