NA values by unique ID

Time: 2019-10-17 03:00:54

Tags: r data.table

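I have a data frame with two columns. I want the id column to be unique: ids that are not duplicated should keep their value, and duplicated ids should have their value set to NA.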

library(data.table)

DT <- data.table(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))
DT
   id value
1:  1    17
2:  2    13
3:  3     8
4:  3    NA
5:  4     9
6:  5    NA
7:  5    11

Expected output

   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA

3 answers:

Answer 0 (score: 6)

Here is one option:

> DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA
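
To see what the `.N == 1` condition is testing, the per-group row counts can be inspected directly. This is only an illustrative sketch on the question's data; the name `counts` is introduced here and is not part of the answer.

```r
library(data.table)

DT <- data.table(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))

# .N is the number of rows in the current group,
# so this gives one row per id with its row count in column N
counts <- DT[, .N, by = id]
counts
```

Ids 3 and 5 have a count of 2, which is why the answer replaces their value with NA_real_.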

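Answer 1 (score: 4)

This should do the trick: take the minimum value by id. If a group contains an NA, min() returns NA (na.rm defaults to FALSE), so this matches the expected output as long as every duplicated id has at least one NA value, as in the example.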

DT[,.(value=min(value)),.(id)]
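
As a rough check of when the min() approach and the row-count approach agree, the sketch below uses a hypothetical table `DT2` (not from the question) in which id 3 is duplicated but has no NA among its values. The names `by_min` and `by_count` are likewise made up for illustration.

```r
library(data.table)

# Hypothetical data: id 3 is duplicated, but none of its values is NA
DT2 <- data.table(id = c(1, 3, 3), value = c(17, 8, 5))

# min() only yields NA when the group already contains an NA
by_min <- DT2[, .(value = min(value)), by = .(id)]

# Counting rows per group yields NA for every duplicated id
by_count <- DT2[, .(value = if (.N == 1) value else NA_real_), by = .(id)]

by_min
by_count
```

Here `by_min` keeps 5 for id 3 while `by_count` returns NA, so the two approaches are interchangeable only when every duplicated id carries at least one NA.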

Edit: I timed this solution and @mt1022's solution on a 100-million-row data table; the times are similar.

library(data.table)
set.seed(1)
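DT <- data.table(id = sample(1:1e6, 1e8, replace = TRUE),
                 # roughly 1% of the values are set to NA;
                 # the test vector has 1e8 elements to match the row count
                 value = ifelse(runif(1e8) < 0.99,
                                sample(1:1e6, 1e8, replace = TRUE),
                                NA))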

 # my solution with min
 ptm <- proc.time()
 DT[,.(value=min(value)),.(id)]
 proc.time() - ptm

 #   user  system elapsed 
 #   6.34    1.67    2.89 

 # mt1022's solution
 ptm <- proc.time()
 DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
 proc.time() - ptm


 #   user  system elapsed 
 #   6.61    1.35    4.61 

Answer 2 (score: 1)

I see you are mainly interested in a data.table solution, but for completeness, one dplyr possibility could be:

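library(dplyr)

# within each id, keep the first row whose value is NA,
# or the first row if the group has no NA
DT %>%
 group_by(id) %>%
 slice(which.max(is.na(value)))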

     id value
  <dbl> <dbl>
1     1    17
2     2    13
3     3    NA
4     4     9
5     5    NA
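
For comparison, the row-count logic from the data.table answer can also be written in dplyr. This is a minimal sketch, not part of the original answer: it rebuilds the question's data as a plain data frame and stores the result in a made-up name, `res`.

```r
library(dplyr)

DT <- data.frame(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))

# n() is the number of rows in the current group;
# ids with more than one row get NA regardless of their values
res <- DT %>%
  group_by(id) %>%
  summarise(value = if (n() == 1) value else NA_real_)

res
```

Unlike the slice() version, this does not depend on a duplicated id having an NA among its values.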