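I am trying to do hot deck imputation in R using the dplyr package. I have non-finite values that I want to replace with random values drawn from within the same group.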
library(dplyr)
myData <- data.frame(value = sample(c(Inf, NaN, 1:8), 100, replace = TRUE),
                     group = sample(letters[1:4], 100, replace = TRUE))
  value group
1     4     c
2     6     d
3   Inf     c
4     8     c
5     7     a
6     2     b
This code runs, but it also samples the Inf and NaN values.
myData <- myData %>%
group_by(group) %>%
mutate(imputedvalue = sample(value, n(), replace = TRUE))
table(is.finite(myData$imputedvalue), is.infinite(myData$imputedvalue))
        FALSE TRUE
  FALSE    16    7
  TRUE     77    0
This code does not run.
myData <- myData %>%
group_by(group) %>%
mutate(imputedvalue = ifelse(is.finite(value), value,
sample(value, n(), replace = TRUE)))
Error in n() : This function should not be called directly
I feel like there should be some kind of filter() command, but I really don't know how that would work...
Answer 0 (score: 1)
Here is an approach that involves splitting the dataset first:
# keep only the records with finite values (these are the donors)
myDataOK <- myData %>%
  filter(is.finite(value))
# how many replacements are needed in each group?
# sample that many donor records from the same group, a la @eddi
myDataimputed <- myData %>%
group_by(group) %>%
summarise(n_inf = sum(!is.finite(value))) %>%
group_by(group) %>%
do(sample_n(filter(myDataOK,group == .$group),size = .$n_inf,replace = TRUE))
## and combine!
myData2 <- rbind(myDataOK,myDataimputed)
## here are some various checks:
## same size as original?
nrow(myData2) == nrow(myData)
## all infinites replaced?
with(myData2,table(is.finite(value), is.infinite(value)))
## there should be no *decreases* after imputation:
## the count of each value x group combination may increase but should never decrease
check1 <- myDataOK %>%
group_by(group,value) %>%
tally %>%
arrange(group,value)
check2 <- myData2 %>%
group_by(group,value) %>%
tally %>%
arrange(group,value)
if(any((check2$n-check1$n) < 0)) stop("something went wrong!")
## finally, the increase in each group's size should equal its number of non-finite values
Ninf <- myData %>%
group_by(group) %>%
summarise(n_inf = sum(!is.finite(value)))
## weight by n so that the tallies are group sizes rather than row counts
if (any(tally(check2, wt = n)$n - tally(check1, wt = n)$n - Ninf$n_inf != 0))
  stop("group sizes changed!")
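
As an alternative to splitting and recombining, the replacement can probably also be done inside a single grouped mutate(). The sketch below is not part of the answer above: `impute_hot_deck` is a hypothetical helper name, and `set.seed(1)` is only there to make the example reproducible. It assumes each group has at least one finite value to draw from; a group with no donors is left unchanged.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only for reproducibility
myData <- data.frame(value = sample(c(Inf, NaN, 1:8), 100, replace = TRUE),
                     group = sample(letters[1:4], 100, replace = TRUE))

# Hypothetical helper: replace non-finite entries of x with random draws
# (with replacement) from the finite entries of the same vector.
impute_hot_deck <- function(x) {
  ok <- is.finite(x)
  donors <- x[ok]
  if (length(donors) == 0) return(x)  # no donors: leave the vector as it is
  # Sample positions rather than values, because sample(x, ...) treats a
  # length-one numeric x as the range 1:x.
  x[!ok] <- donors[sample.int(length(donors), sum(!ok), replace = TRUE)]
  x
}

imputed <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = impute_hot_deck(value)) %>%
  ungroup()

# Same kind of check as in the question
table(is.finite(imputed$imputedvalue), is.infinite(imputed$imputedvalue))
```

Because the helper only ever sees the values of one group at a time, the donors always come from the same group, and the values that were already finite are kept as they are.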