Question

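Following up on this question, I would like to know how to efficiently draw a stratified sample from a person-period file.

I have a data set that looks like this: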

    id time var  clust
 1:  1    1   a clust1
 2:  1    2   c clust1
 3:  1    3   c clust1
 4:  2    1   a clust1
 5:  2    2   a clust1
...

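Individuals (id) are grouped into clusters (clust). What I want is to sample id within each clust while keeping the person-period format, that is, keeping every row of each sampled id.

The solution I came up with is to sample the ids and then merge back onto the original table. However, it is not a very elegant solution.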

library(data.table) 
library(dplyr) 

setDT(dt) 

dt[,.SD[sample(.N,1)],by = clust] %>% 
  merge(., dt, by = 'id')

which gives

   id clust.x time.x var.x time.y var.y clust.y
1:  2  clust1      1     a      1     a  clust1
2:  2  clust1      1     a      2     a  clust1
3:  2  clust1      1     a      3     c  clust1
4:  3  clust2      3     c      1     a  clust2
5:  3  clust2      3     c      2     b  clust2
6:  3  clust2      3     c      3     c  clust2
7:  5  clust3      1     a      1     a  clust3
8:  5  clust3      1     a      2     a  clust3
9:  5  clust3      1     a      3     c  clust3

Is there a more straightforward solution?

library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2", 
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
 3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L, 
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L, 
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names =  c("id", 
 "time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))

Answer 1

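Here is a variant of @Frank's comment that may help: you can sample one unique id from each clust group, then use .I to find the corresponding row numbers and subset on them: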

dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]

#   id time var  clust
#1:  2    1   a clust1
#2:  2    2   a clust1
#3:  2    3   c clust1
#4:  3    1   a clust2
#5:  3    2   b clust2
#6:  3    3   c clust2
#7:  4    1   a clust3
#8:  4    2   b clust3
#9:  4    3   c clust3
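
To make the one-liner easier to follow, here is a minimal sketch that splits it into two steps. The toy table rebuilds the question's data by hand so the block runs on its own, and `set.seed()` is added only for reproducibility; the names `idx` and `sampled` are illustrative, not part of the original answer.

```r
library(data.table)

# Toy data mirroring the question's layout (rebuilt here so the block is self-contained)
dt <- data.table(
  id    = factor(rep(1:6, each = 3)),
  time  = factor(rep(1:3, times = 6)),
  var   = factor(c("a", "c", "c", "a", "a", "c", "a", "b", "c",
                   "a", "b", "c", "a", "a", "c", "b", "b", "c")),
  clust = factor(rep(c("clust1", "clust1", "clust2",
                       "clust3", "clust3", "clust2"), each = 3))
)

set.seed(1)  # assumption: fixed seed, only to make the draw reproducible

# Step 1: within each clust, draw one id and return the row numbers (.I)
# of all rows belonging to that id. .I indexes into the full table, not the group.
idx <- dt[, .I[id == sample(unique(id), 1)], by = clust]

# Step 2: idx has columns clust and V1; V1 holds the row numbers to keep
sampled <- dt[idx$V1]
```

Because `.I` refers to positions in the whole table, `dt[idx$V1]` returns the complete person-period records of the sampled ids without any merge.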

Answer 2

I think that, with tidy data, there would be an ID table here in which the cluster is an attribute:

idDT = unique(dt[, .(id, clust)])


   id  clust
1:  1 clust1
2:  2 clust1
3:  3 clust2
4:  4 clust3
5:  5 clust3
6:  6 clust2

From there, sample...

my_selection = idDT[, .(id = sample(id, 1)), by=clust]

and merge or subset

dt[ my_selection, on=names(my_selection) ]
# or 
dt[ id %in% my_selection$id ]

I would keep the intermediate table my_selection around, expecting it to come in handy later.
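
The ID-table approach extends naturally to drawing more than one id per cluster. The following is a sketch under stated assumptions rather than part of the original answer: the toy data (numeric ids, three per cluster) and the parameter `k` are hypothetical. It samples positions with `sample.int()` instead of values, because `sample(x, 1)` on a single numeric `x` draws from `1:x`, which could bite if ids are numeric and a cluster holds only one id.

```r
library(data.table)

# Hypothetical toy data: 9 numeric ids, 3 per cluster, 3 periods each
dt <- data.table(
  id    = rep(1:9, each = 3),
  time  = rep(1:3, times = 9),
  clust = rep(c("clust1", "clust2", "clust3"), each = 9)
)

# One row per id, with its cluster as an attribute
idDT <- unique(dt[, .(id, clust)])

k <- 2L        # hypothetical: number of ids to draw per cluster
set.seed(42)   # assumption: fixed seed, only for reproducibility

# Sample positions rather than values; min(k, .N) avoids an error
# when a cluster has fewer than k ids
my_selection <- idDT[, .(id = id[sample.int(.N, min(k, .N))]), by = clust]

# Join back to recover every person-period row of the selected ids
sampled <- dt[my_selection, on = c("id", "clust")]
```

With `k <- 1L` this reduces to the one-id-per-cluster case shown above.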

R - Stratified sampling of a person-period file
