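I have a data frame with two columns. I want the "id" column to be unique: for ids that are not duplicated, the value should be kept as is, and for duplicated ids the value should be set to NA.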
library(data.table)
DT <- data.table(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))
DT
   id value
1:  1    17
2:  2    13
3:  3     8
4:  3    NA
5:  4     9
6:  5    NA
7:  5    11
Expected output
   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA
Answer 0 (score: 6)
Here is one option:
> DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
   id value
1:  1    17
2:  2    13
3:  3    NA
4:  4     9
5:  5    NA
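To see why the `.N == 1` test works, it may help to look at the group sizes directly. A minimal sketch on the same DT as in the question, where `.N` is data.table's per-group row count (the names `counts` and `res` are just for illustration):

```r
library(data.table)

# Same example data as in the question
DT <- data.table(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))

# .N is the number of rows in each group; ids 3 and 5 occur twice
counts <- DT[, .N, by = id]
print(counts)

# Keep the value only when the id occurs exactly once, otherwise return NA
res <- DT[, .(value = if (.N == 1) value else NA_real_), by = id]
print(res)
```

Using `NA_real_` rather than plain `NA` keeps the result column of type double in every group, matching the type of `value` here.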
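Answer 1 (score: 4)
This should do the trick: take the minimum of value by id; if a group contains an NA, min() returns NA. Note that this relies on every duplicated id having at least one NA value, as in the example.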
DT[,.(value=min(value)),.(id)]
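# Benchmark on a larger data set
library(data.table)
set.seed(1)
n <- 1e8  # number of rows; every vector below must have this length
DT <- data.table(id = sample(1:1e6, n, replace = TRUE),
                 value = ifelse(runif(n) < 0.99,
                                sample(1:1e6, n, replace = TRUE),
                                NA))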
# my solution with min
ptm <- proc.time()
DT[,.(value=min(value)),.(id)]
proc.time() - ptm
# user system elapsed
# 6.34 1.67 2.89
# mt1022's solution
ptm <- proc.time()
DT[, .(value = if(.N == 1) value else NA_real_), by = .(id)]
proc.time() - ptm
# user system elapsed
# 6.61 1.35 4.61
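One thing to keep in mind when comparing the two: they only agree when each duplicated id has an NA among its values. A small sketch with hypothetical data (`DT2` is made up for illustration, not taken from the question) where a duplicated id has no NA:

```r
library(data.table)

# Hypothetical data: id 3 is duplicated but none of its values is NA
DT2 <- data.table(id = c(1, 3, 3), value = c(17, 8, 7))

by_min <- DT2[, .(value = min(value)), by = id]
by_n   <- DT2[, .(value = if (.N == 1) value else NA_real_), by = id]

print(by_min)  # id 3 gets 7, the smaller of its two values
print(by_n)    # id 3 gets NA
```

So the faster `min()` version seems fine for data shaped like the example, while the `.N` version is probably the safer choice if duplicates can carry only non-NA values.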
Answer 2 (score: 1)
I see that you are mainly interested in data.table solutions, but for completeness, one dplyr possibility could be:
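library(dplyr)
# For each id, keep the first row whose value is NA; if there is none, keep the first row
DT %>%
  group_by(id) %>%
  slice(which.max(is.na(value)))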
     id value
  <dbl> <dbl>
1     1    17
2     2    13
3     3    NA
4     4     9
5     5    NA
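If duplicated ids are not guaranteed to contain an NA, a count-based variant might be safer than picking the NA row. A sketch, assuming dplyr is installed; it mirrors the `.N == 1` idea from the data.table answer using `n()`, and the name `res` is just for illustration:

```r
library(dplyr)

# Same example data as in the question, as a plain data frame
DT <- data.frame(id = c(1,2,3,3,4,5,5), value = c(17,13,8,NA,9,NA,11))

# Keep the value when the id occurs once, otherwise return NA
res <- DT %>%
  group_by(id) %>%
  summarise(value = if (n() == 1) value else NA_real_)

print(res)
```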