Following up on this question, I would like to know how to efficiently sample from a stratified person-period file.
I have a dataset that looks like this
id time var clust
1: 1 1 a clust1
2: 1 2 c clust1
3: 1 3 c clust1
4: 2 1 a clust1
5: 2 2 a clust1
...
where individuals (id) are grouped into clusters (clust). What I would like is to sample id by clust, keeping the person-period format.
The solution I came up with is to sample id and then merge back. However, it is not a very elegant solution.
library(data.table)
library(dplyr)
setDT(dt)
dt[,.SD[sample(.N,1)],by = clust] %>%
merge(., dt, by = 'id')
which gives
id clust.x time.x var.x time.y var.y clust.y
1: 2 clust1 1 a 1 a clust1
2: 2 clust1 1 a 2 a clust1
3: 2 clust1 1 a 3 c clust1
4: 3 clust2 3 c 1 a clust2
5: 3 clust2 3 c 2 b clust2
6: 3 clust2 3 c 3 c clust2
7: 5 clust3 1 a 1 a clust3
8: 5 clust3 1 a 2 a clust3
9: 5 clust3 1 a 3 c clust3
Is there a more straightforward solution?
library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2",
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L,
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names = c("id",
"time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))
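Answer 0 (score: 3)
Here is a variation on @Frank's comment that may help: you can sample one unique id from each clust group and use .I to find the corresponding row indices for subsetting: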
dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]
# id time var clust
#1: 2 1 a clust1
#2: 2 2 a clust1
#3: 2 3 c clust1
#4: 3 1 a clust2
#5: 3 2 b clust2
#6: 3 3 c clust2
#7: 4 1 a clust3
#8: 4 2 b clust3
#9: 4 3 c clust3
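The same .I idiom appears to extend to drawing more than one id per cluster. Below is a minimal sketch, assuming a small stand-in table and a hypothetical n_ids value (neither is from the original post); the only real change is swapping == for %in%:

```r
library(data.table)

# Toy person-period data (an assumption for illustration):
# 6 ids, 3 periods each, 3 ids per cluster.
dt <- data.table(
  id    = rep(1:6, each = 3),
  time  = rep(1:3, times = 6),
  clust = rep(c("clust1", "clust2"), each = 9)
)

n_ids <- 2  # hypothetical number of ids to draw per cluster

set.seed(1)  # only to make the draw reproducible
# Within each cluster, pick n_ids distinct ids, collect the row numbers (.I)
# of every row belonging to them, then subset dt by those row numbers.
idx <- dt[, .I[id %in% sample(unique(id), n_ids)], by = clust]$V1
sampled <- dt[idx]
```

Note that sample() raises an error if a cluster contains fewer than n_ids distinct ids, so unbalanced clusters would need a guard such as min(n_ids, uniqueN(id)).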
Answer 1 (score: 2)
I think tidy data here would have an ID table in which the cluster is an attribute:
idDT = unique(dt[, .(id, clust)])
id clust
1: 1 clust1
2: 2 clust1
3: 3 clust2
4: 4 clust3
5: 5 clust3
6: 6 clust2
From there, sample...
my_selection = idDT[, .(id = sample(id, 1)), by=clust]
and merge or subset
dt[ my_selection, on=names(my_selection) ]
# or
dt[ id %in% my_selection$id ]
I would keep the intermediate table my_selection around, expecting it to come in handy later.
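One caveat with sample() is worth keeping in mind: when it is given a single positive number x, base R samples from 1:x instead of returning x. The id column in the example data is a factor, so it is not affected, but numeric ids in a cluster with only one person would be. Below is a minimal sketch of the ID-table approach that guards against this, using a toy table and a hypothetical helper pick_one (both are assumptions, not part of the original answer):

```r
library(data.table)

# Toy data (assumption): numeric ids; clust2 contains a single person, id 7.
dt <- data.table(
  id    = rep(c(1, 2, 7), each = 3),
  time  = rep(1:3, times = 3),
  clust = rep(c("clust1", "clust1", "clust2"), each = 3)
)

# Hypothetical helper: draw one element by position, so a length-one
# numeric vector is returned as-is rather than being expanded to 1:x.
pick_one <- function(x) x[sample.int(length(x), 1L)]

idDT <- unique(dt[, .(id, clust)])
my_selection <- idDT[, .(id = pick_one(id)), by = clust]

# Join back to recover every period of the selected ids.
sampled <- dt[my_selection, on = names(my_selection)]
```

With plain sample(id, 1), the single-person cluster could yield any value from 1 to 7, and the join would then return a row of NAs for that cluster instead of the person's periods.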