Question

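I am working with a huge person-period file, and I thought a good way to deal with such a large dataset would be to use sampling and resampling techniques.

My person-period file looks like this: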

   id code time
1   1    a    1
2   1    a    2
3   1    a    3
4   2    b    1
5   2    c    2
6   2    b    3
7   3    c    1
8   3    c    2
9   3    c    3
10  4    c    1
11  4    a    2
12  4    c    3
13  5    a    1
14  5    c    2
15  5    a    3

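I actually have two different questions.

The first is that I am having trouble simply sampling a person-period file.

For example, I would like to sample 2 id sequences, such as: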

  id code time
   1    a    1
   1    a    2
   1    a    3
   2    b    1
   2    c    2
   2    b    3

The following line of code works for sampling a person-period file:

dt[which(dt$id %in% sample(dt$id, 2)), ]

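However, I would like a dplyr solution, because I am interested in resampling and, in particular, I would like to use replicate.

I am interested in doing something like replicate(100, sample_n(dt, 2), simplify = FALSE).

I am struggling with the dplyr solution because I am not sure what the grouping variable should be.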

library(dplyr)
dt %>% group_by(id) %>% sample_n(1)

gives me an incorrect result, because it does not keep the full sequence for each id.

Is there a way to sample and resample a person-period file?

Data

dt = structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L), .Label = c("1", "2", "3", "4", "5"
), class = "factor"), code = structure(c(1L, 1L, 1L, 2L, 3L, 
2L, 3L, 3L, 3L, 3L, 1L, 3L, 1L, 3L, 1L), .Label = c("a", "b", 
"c"), class = "factor"), time = structure(c(1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1", "2", 
"3"), class = "factor")), .Names = c("id", "code", "time"), row.names = c(NA, 
-15L), class = "data.frame")

Answer 1

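I think the idiomatic way would probably look like

set.seed(1)
# draw 2 distinct ids, then join back to recover each id's full sequence
samp = dt %>% select(id) %>% distinct %>% sample_n(2)
left_join(samp, dt, by = "id")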

  id code time
1  2    b    1
2  2    c    2
3  2    b    3
4  5    a    1
5  5    c    2
6  5    a    3

This extends straightforwardly to more grouping variables and fancier sampling rules.
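
As one example of a fancier sampling rule, here is a minimal sketch of drawing ids with replacement (a bootstrap-style resample). It is not part of the original answer: the toy data frame rebuilds the question's dt, and the names boot_ids, boot and the draw column are made up for illustration. Recent dplyr versions may warn about a many-to-many join when an id is drawn more than once.

```r
library(dplyr)

# Toy person-period data mirroring the question's `dt` (5 ids x 3 periods).
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

set.seed(1)

# Draw ids WITH replacement; `draw` numbers the draws so that an id
# selected twice still shows up as two separate sequences.
boot_ids <- data.frame(
  draw = 1:5,
  id   = sample(unique(dt$id), size = 5, replace = TRUE)
)

# A join repeats an id's full sequence once per draw, which
# filter(id %in% ...) cannot do because %in% ignores duplicates.
boot <- left_join(boot_ids, dt, by = "id")
```

Grouping boot by draw instead of id then treats each drawn sequence as its own unit.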

If you need to do this many times...

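nrep = 100   # number of replicates
ng   = 2     # ids drawn per replicate
# stack nrep copies of the distinct ids, label each copy with a replicate
# index r, then draw ng ids within each replicate
samps = dt %>% select(id) %>% distinct %>% 
  slice(rep(1:n(), nrep)) %>% mutate(r = rep(1:nrep, each = n()/nrep)) %>%
  group_by(r) %>% sample_n(ng)
repdat = left_join(samps, dt, by = "id")

# then do stuff with it (do_stuff is a placeholder for your own function):
repdat %>% group_by(r) %>% do_stuff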

Answer 2

We can use filter together with sample:

dt %>%
    filter(id %in% sample(unique(id),2, replace = FALSE))

Note: the OP asked for a dplyr approach, and this solution does use dplyr.

If we need to replicate this, we can use map from purrr:

library(purrr)
dt %>% 
    distinct(id) %>% 
    replicate(2, .) %>%
    map(~sample(., 2, replace=FALSE)) %>%
    map(~filter(dt, id %in% .))
#$id
#  id code time
#1  1    a    1
#2  1    a    2
#3  1    a    3
#4  4    c    1
#5  4    a    2
#6  4    c    3

#$id
#  id code time
#1  4    c    1
#2  4    a    2
#3  4    c    3
#4  5    a    1
#5  5    c    2
#6  5    a    3
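
If purrr is not needed, a plainer sketch is to wrap the filter call from the top of this answer in base replicate, which is close to what the question asks for. This is an illustrative variant and not part of the original answer; the toy data frame rebuilds the question's dt, and reps is a made-up name.

```r
library(dplyr)

# Toy person-period data mirroring the question's `dt` (5 ids x 3 periods).
dt <- data.frame(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5))
)

set.seed(1)

# Each replicate draws 2 distinct ids and keeps every row of those ids.
# simplify = FALSE returns a plain list of data frames.
reps <- replicate(
  100,
  dt %>% filter(id %in% sample(unique(id), 2, replace = FALSE)),
  simplify = FALSE
)
```

Each element of reps is a 6-row data frame, so a per-replicate statistic can be computed with lapply(reps, ...).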

Answer 3

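I guess you are running some kind of simulation and may want to do the subsetting many times. You may also want to try this data.table approach, which uses the fast binary search feature on the key column: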

library(data.table)
setDT(dt)
setkey(dt, id)
replicate(2, dt[list(sample(unique(id), 2))], simplify = FALSE)

#[[1]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  5    a    1
#5:  5    c    2
#6:  5    a    3

#[[2]]
#   id code time
#1:  3    c    1
#2:  3    c    2
#3:  3    c    3
#4:  4    c    1
#5:  4    a    2
#6:  4    c    3
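
To work with all replicates at once, the list returned by replicate can be stacked into one table. The sketch below is an illustrative extension and not part of the original answer: it assumes rbindlist with its idcol argument is an acceptable way to label replicates, and DT, reps, all_reps and the r column are made-up names.

```r
library(data.table)

# Toy person-period data mirroring the question's `dt`, keyed on id.
DT <- data.table(
  id   = factor(rep(1:5, each = 3)),
  code = factor(c("a", "a", "a", "b", "c", "b", "c", "c", "c",
                  "c", "a", "c", "a", "c", "a")),
  time = factor(rep(1:3, times = 5)),
  key  = "id"
)

set.seed(1)

# 100 replicates, each a keyed join that pulls the full sequence of 2 ids.
reps <- replicate(100, DT[list(sample(unique(id), 2))], simplify = FALSE)

# Stack the replicates; the new column r holds the replicate index.
all_reps <- rbindlist(reps, idcol = "r")

# Example per-replicate summary: number of rows in each replicate.
rows_per_rep <- all_reps[, .N, by = r]
```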

R - Sampling and resampling a person-period file

3 answers: