Question

我正在尝试执行分层引导，以从具有嵌套数据结构的大型数据集中获取一些样本均值。

我有一个与此类似的数据集：

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
     '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- data.frame(cbind(ball, box, triangles))
df
--
ball box triangles
   1   1         1
   2   1         0
   3   1         1
   4   1         3
   5   2         1
   6   2         1
   7   2         2
   8   3         2
   9   3         0
  10   3         1
  11   3         1
  12   3         0
  13   3         4

这个想法是有三个盒子，每个盒子里都有许多球。每个球上都有许多三角形，因此看起来像这样：

我在这里的目标是使用自举法来估计每个球上三角形的平均数量，同时控制球位于哪个盒子中。

我希望模拟从盒子中替换10,000次，每次随机拉出一个盒子，然后随机替换n次，对球进行采样，其中n是盒子中的球数（即，如果盒子1是选中后，仿真将随机对这四个球进行四次采样，最终得到任意数量的响应，例如，球1，球1，球3，球4）。

我希望它然后计算它采样的球上的三角形数量的平均值，存储该值，然后对一个新的盒子进行采样，从而重复该过程。

到目前为止，我已经尝试过使用rsample方法（在此描述：https://www.r-bloggers.com/bootstrapping-clustered-data/），如下所示：

#we need to sample groups aka boxes from 
#the dataframe so use list-columns in 
#tibbles
library(tidyverse)
library(tibble)
library(rsample)

Test <- df %>% nest(-box)
head(Test)

#now use bootstraps on this new tibble to 
#sample by ID
set.seed(002)
testbs <- bootstraps(Test, times = 10)
testbs

#let's look at one of the bootstrap 
#samples
as_tibble(testbs$splits[[1]]) %>% head()

#we can unnest the tibble and assess the 
#averages by box 
bs_avgtri<- map(testbs$splits, 
      ~as_tibble(.) %>% unnest() %>% 
                   group_by(box) %>% 
                   summarize(mean_tri = 
                   mean(triangles))) %>% 
                  bind_rows(.id = 'boots')
bs_avgtri

但是，我认为这是有缺陷的，因为我是如何嵌套数据的。而且我得到的输出没有意义，经常显示多个引导程序级别。因此，我倾向于认为它出了问题，但是我也不确定如何真正解析出不同功能的作用。

我也知道我所借鉴的方法并不是真正意义上的我正在做的事情，我正在尝试用杰瑞钻机做这件事的一种方式，但我认为它并没有满足我的需求。要做。

我能想到的唯一另一种方法是编写一对嵌套的for循环，但是我对R中的for循环并不满意，我相当确定有更好的方法。

如果有人对此有任何见解，我将非常感谢！！！！

Answer 1

我对rsample不太了解。

但是根据您的描述，我认为基本函数sample就足够了。

我写了一个简单的版本来达到平均值（基于我的理解）。看看那是否是您想要的。

set.seed(100)

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)

names(ball) = box
names(triangles) = ball

sample_balls = function(input_ball){
  chosen_box = sample(names(input_ball), 1, replace = T)
  chosen_balls = ball[which(names(input_ball) == chosen_box)]
  sampled_balls = sample(chosen_balls, length(chosen_balls), replace = T)
  return(sampled_balls)
}

nTriangles = unlist(lapply(1:100, function(x){
  nTriangle = triangles[sample_balls(ball)]
}))

mean(nTriangles)
#> [1] 1.331237

Answer 2

tidyr::crossing对于仿真非常方便。

library("tidyverse")

ball <- c(1:13)
box <- c('1', '1', '1', '1', '2', '2', '2',
         '3', '3', '3', '3', '3', '3')
triangles <- c(1,0,1,3,1,1,2,2,0,1,1,0,4)
df <- tibble(ball, box, triangles)

df %>%
  # How many times do you want to run the simulation?
  crossing(rep = seq(3)) %>%
  # Next describe the sampling.
  # For each simulation and for each box...
  group_by(rep, box) %>%
  # randomly sample n() balls with replacement,
  # where n() is the number of balls in the box.
  sample_n(n(), ball, replace = TRUE) %>%
  # Compute the mean number of triangles (for each replicate, for each box)
  summarise(triangles = mean(triangles))
#> # A tibble: 9 x 3
#> # Groups:   rep [3]
#>     rep box   triangles
#>   <int> <chr>     <dbl>
#> 1     1 1          1.5 
#> 2     1 2          1.67
#> 3     1 3          2   
#> 4     2 1          2   
#> 5     2 2          1.33
#> 6     2 3          1.33
#> 7     3 1          2   
#> 8     3 2          1.67
#> 9     3 3          1.5

^{由reprex package（v0.2.1）于2019-03-04创建}

分层/嵌套自举方法

2 个答案: