I have spent some time trying to work out what happens when rows are randomly sampled within the groups of a data.table in the i argument, as done here, here or here.
library(data.table)
library(magrittr)
seed <- 2016
set.seed(seed)
size <- 10
dt <- data.table(
id = 1:size, A = sample(letters[1:3], size, replace = TRUE), B = 'N',
C = sample(1:100, size, replace = TRUE) + sample(30:70, size, replace = TRUE))
set.seed(seed)
dt[, .N, by = A]
dt[, .N, by = A][, N] %>% sapply(function(x) { sample(x, round(x*0.5)) })
#    A N
# 1: a 6
# 2: c 2
# 3: b 2
#
# Which gives the following rows:
# a: 2, 1, 4
# c: 1
# b: 1
#
# So the result should be:
#     id A B   C | order in sampled dt
#  1:  1 a N  82 | 2
#  2:  2 a N  86 | 1
#  3:  3 c N  68 | 4
#  4:  4 a N 140 |
#  5:  5 b N  92 | 5
#  6:  6 a N  94 | 3
#  7:  7 b N 102 |
#  8:  8 c N  69 |
#  9:  9 a N 126 |
# 10: 10 a N  56 |
# Results as below or just columns A and I (or V1) with ids:
#    id A B  C
# 1:  2 a N 86
# 2:  1 a N 82
# 3:  6 a N 94
# 4:  3 c N 68
# 5:  5 b N 92
# Get .I and sample from them:
set.seed(seed)
dt[, .I, by = A] %>%
.[, .SD[sample(.N, round(.N*0.5))], by = A]
set.seed(seed)
dt[, .I[sample(.N, round(.N*0.5))], by = A]
# Both returning expected ids
# Sample from .SD
set.seed(seed)
dt[, .SD[sample(.N, round(.N*0.5))], by = A]
# Correct, but materializes .SD for every group, so it can be slow
# Sample from .I and use in i (to e.g. change some values in j)
set.seed(seed)
dt[dt[, .I[sample(.N, round(.N*0.5))], by = A]$V1, ]
set.seed(seed)
dt[dt[, sample(.I, round(.N*0.5)), by = A]$V1, ]
# Correct and faster than above
To confirm: is the order of A the order in which its values appear in dt?
set.seed(seed)
dt[sample(.N, round(.N*0.5)), .I, by = A]
#    A  I
# 1: a  2
# 2: a 10
# 3: a  1
# 4: b  7
# 5: c  3
set.seed(seed)
dt[sample(.N, round(.N*0.5)), .SD, by = A]
#    id A B   C
# 1:  2 a N  86
# 2: 10 a N  56
# 3:  1 a N  82
# 4:  7 b N 102
# 5:  3 c N  68
This obviously has to do with the sampling being done inside i, but I cannot work out what exactly is happening.
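One way to probe this is the small check below. It is only a sketch, and it assumes that i is evaluated once against the whole table before by splits it into groups, so that .N inside i is the total row count rather than the group size. The table toy and the names in_i and in_j are made up for illustration and are not part of the data above.

```r
library(data.table)

# Hypothetical toy table (not the dt from the question): 8 rows in "a", 2 in "b"
toy <- data.table(id = 1:10, A = rep(c("a", "b"), times = c(8, 2)))

# .N used in i refers to the whole table, so toy[.N] is simply the last row
toy[.N]

# Sampling in i: draws round(10 * 0.5) = 5 rows from all 10 rows,
# regardless of which group they belong to
set.seed(1)
in_i <- toy[sample(.N, round(.N * 0.5))]

# Sampling in j with by: .N is now the group size, so this draws
# 4 rows from "a" and 1 row from "b"
set.seed(1)
in_j <- toy[toy[, .I[sample(.N, round(.N * 0.5))], by = A]$V1]

in_i[, .N, by = A]  # per-group counts depend on the seed
in_j[, .N, by = A]  # always 4 for "a" and 1 for "b"
```

If that assumption holds, the last two calls in the question sample 5 rows from the whole of dt first and only then group that subset by A, which would account for groups appearing in the order in which they occur among the sampled rows.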