Flagging rows in data.table

Time: 2018-05-11 13:45:08

Tags: r data.table

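In a data.table, I would like to flag N randomly selected rows per group (C1) in a new column C3. Several similar questions have already been asked on SO (here, here, and here), but based on their answers I still cannot figure out a solution for my task.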

library(data.table)
set.seed(1)
dt = data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                C2 = c(2,1,3,1,2,3,4,5,4,5))

dt
    C1 C2
 1:  A  2
 2:  A  1
 3:  A  3
 4:  B  1
 5:  C  2
 6:  C  3
 7:  C  4
 8:  D  5
 9:  D  4
10:  D  5

The following returns the row indices of two randomly selected rows for each group C1 (it does not work for group B):

dt[, sample(.I, min(.N, 2)), by = C1]$V1
[1]  1  3  3  7  5 10  9

Note: for B, only one row should be selected, because group B contains only one row.
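A likely reason for the stray index in group B is a documented quirk of base R's `sample()`: when its first argument is a single number `x >= 1`, it samples from `1:x` rather than from the one-element vector. The sketch below illustrates this and shows one possible way around it, by sampling positions with `sample.int()` and then indexing `.I`. The variable names `idx` and `picked` are only for illustration.

```r
library(data.table)

dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))

# A one-row group passes a single row index to sample(), e.g. 4 for group B.
idx <- 4L
sample(idx, 1)                      # drawn from 1:4, so not necessarily 4
idx[sample.int(length(idx), 1)]     # position-based draw, always 4

# Same idea per group: sample positions within the group, then map back to .I.
picked <- dt[, .I[sample.int(.N, min(.N, 2L))], by = C1]
picked$V1
```

With this form, the index returned for group B is always its own row, and no row index can appear twice.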

The following is a solution for one randomly selected row per group, which usually does not work for group B:

dt[, C3 := .I == sample(.I, 1), by = C1]
dt
    C1 C2    C3
 1:  A  2 FALSE
 2:  A  1  TRUE
 3:  A  3 FALSE
 4:  B  1 FALSE
 5:  C  2  TRUE
 6:  C  3 FALSE
 7:  C  4 FALSE
 8:  D  5  TRUE
 9:  D  4 FALSE
10:  D  5 FALSE

What I actually want is to extend this to N rows. I tried (for two rows):

dt[, C3 := .I==sample(.I, min(.N, 2)), by = C1]

which of course does not work.
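One way to see why the attempt fails: `==` compares two vectors position by position and recycles the shorter one, whereas flagging needs a set-membership test such as `%in%`. The small base R sketch below uses made-up values (`rows`, `chosen`) standing in for `.I` of group C and one possible draw of two indices.

```r
rows   <- c(5L, 6L, 7L)   # stands in for .I of group C
chosen <- c(7L, 5L)       # one possible draw of two row indices

# `==` recycles `chosen` to c(7, 5, 7) and compares element by element
# (R also warns, because 3 is not a multiple of 2).
by_position <- suppressWarnings(rows == chosen)
by_position     # FALSE FALSE TRUE

# `%in%` asks, for each row index, whether it is among the chosen ones.
by_membership <- rows %in% chosen
by_membership   # TRUE FALSE TRUE
```

Only the `%in%` version flags both chosen rows regardless of the order in which they were drawn.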

Any help is much appreciated!

1 Answer:

Answer 0: (score: 1)

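N = 2
# Groups with fewer than N rows are flagged entirely; otherwise N random positions within the group are flagged.
dt[, C3 := if (.N < N) rep(TRUE, .N) else 1:.N %in% sample(.N, N), by = C1]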
dt
#     C1 C2    C3
#  1:  A  2  TRUE
#  2:  A  1 FALSE
#  3:  A  3  TRUE
#  4:  B  1  TRUE
#  5:  C  2 FALSE
#  6:  C  3  TRUE
#  7:  C  4  TRUE
#  8:  D  5  TRUE
#  9:  D  4  TRUE
# 10:  D  5 FALSE
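As a possible variation on the same idea, the `if`/`else` can be folded into the sample size by capping it at the group size. This is only a sketch, not the answerer's code; the summary table `check` is a hypothetical name added to verify the result. No seed is set, so the flagged rows will differ between runs.

```r
library(data.table)

dt <- data.table(C1 = c("A","A","A","B","C","C","C","D","D","D"),
                 C2 = c(2,1,3,1,2,3,4,5,4,5))
N <- 2L

# Draw min(.N, N) positions within each group and flag them.
dt[, C3 := seq_len(.N) %in% sample.int(.N, min(.N, N)), by = C1]

# Sanity check: every group should have min(group size, N) flagged rows.
check <- dt[, .(rows = .N, flagged = sum(C3)), by = C1]
check
```

Because positions within the group are sampled rather than the row indices themselves, one-row groups such as B need no special case.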