Hot deck imputation in dplyr

Date: 2014-07-02 19:28:48

Tags: r dplyr

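I am trying to do hot deck imputation in R using the dplyr package. I have non-finite values that I want to replace with random values drawn from within the same group.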

library(dplyr)
myData <- data.frame(value = sample(c(Inf, NaN, 1:8), 100, replace=TRUE), 
                     group = sample(letters[1:4], 100, replace=TRUE))
head(myData)
  value group
1     4     c
2     6     d
3   Inf     c
4     8     c
5     7     a
6     2     b

This code runs, but it also samples the Inf and NaN values.

myData <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = sample(value, n(), replace = TRUE))

table(is.finite(myData$imputedvalue), is.infinite(myData$imputedvalue))

        FALSE TRUE
  FALSE    16    7
  TRUE     77    0

This code does not run.

myData <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = ifelse(is.finite(value), value, 
                               sample(value, n(), replace = TRUE)))
Error in n() : This function should not be called directly

I feel there should be some kind of filter() command involved, but I really don't know how that should work...

1 answer:

Answer 0 (score: 1)

Here is an approach that involves splitting the dataset first:

# keep only the records with finite values (drops Inf, -Inf, NaN and NA)

myDataOK <- myData %>%
  filter(value %>% is.finite)

# how many replacements are needed? 
# sample these, a la @eddi

myDataimputed <- myData %>%
  group_by(group) %>%
  summarise(n_inf = sum(!is.finite(value))) %>% 
  group_by(group) %>%
  do(sample_n(filter(myDataOK,group == .$group),size = .$n_inf,replace = TRUE))

## and combine!
myData2 <- rbind(myDataOK,myDataimputed)

## here are some various checks:

## same size as original?
nrow(myData2) == nrow(myData)

## all infinites replaced?
with(myData2,table(is.finite(value), is.infinite(value)))

## should be no *decreases* after shuffling.  
## value x block combinations might increase but should never decrease
check1 <- myDataOK %>% 
  group_by(group,value) %>%
  tally %>%
  arrange(group,value)
check2 <- myData2 %>% 
  group_by(group,value) %>%
  tally %>%
  arrange(group,value)
if(any((check2$n-check1$n) < 0)) stop("something went wrong!")


## finally, the increases in group size should equal the number of missing values

Ninf <- myData %>%
  group_by(group) %>%
  summarise(n_inf = sum(!is.finite(value)))

if(any(tally(check2)$n - tally(check1)$n - Ninf$n_inf !=0) ) 
  stop("group sizes changed!")
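
For comparison, the same idea can probably be written as a single grouped mutate() that samples only from the finite values of each group. This is a minimal sketch, not part of the approach above: the helper name impute_hot_deck is made up for illustration, the seed is an assumption added for reproducibility, and groups without any finite donor are simply left unchanged.

```r
library(dplyr)

set.seed(42)  # assumption: fixed seed so the sketch is reproducible
myData <- data.frame(value = sample(c(Inf, NaN, 1:8), 100, replace = TRUE),
                     group = sample(letters[1:4], 100, replace = TRUE))

# Hypothetical helper (not from the original answer): replace each non-finite
# entry of x with a random draw from the finite entries of the same vector.
impute_hot_deck <- function(x) {
  ok <- is.finite(x)
  donors <- x[ok]
  # nothing to fill, or no finite donor available: return x unchanged
  if (all(ok) || length(donors) == 0) return(x)
  # index into donors instead of calling sample(donors, ...) directly,
  # because sample() treats a single number k as the range 1:k
  x[!ok] <- donors[sample.int(length(donors), sum(!ok), replace = TRUE)]
  x
}

imputed <- myData %>%
  group_by(group) %>%
  mutate(imputedvalue = impute_hot_deck(value)) %>%
  ungroup()

# are all imputed values finite now?
all(is.finite(imputed$imputedvalue))
```

Unlike the split-and-rbind version, this keeps the rows in their original order and leaves the finite values untouched, so the original and imputed columns can be compared side by side.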