modelr::bootstrap or broom::bootstrap and a grouping issue

Date: 2017-05-25 09:43:10

Tags: r bootstrapping tidyverse broom modelr

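I have a long dataset made up of several datasets produced by multiple imputation (say 10 imputations). An id variable identifies the imputation. On each of these imputed datasets I want to bootstrap 10 datasets. After bootstrapping, I want to run a model on each one (100 imputation-bootstrap combinations).

In this example, I am not sure whether to use the broom::bootstrap() function or the modelr::bootstrap() function. In addition, the grouping seems to get lost in my pipeline.

Here is a reproducible example using the mtcars dataset: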

library(tidyverse)
library(broom)

cars <- mtcars %>%
  mutate(am = as.factor(am)) %>% # This is standing in for my imputation id variable
  group_by(am) 

Source: local data frame [32 x 11]
Groups: am [2]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0      0     3     2

As you can see, the output currently shows two groups, as it should. In my dataset it would show 10, one for each imputed dataset. Now:

cars2 <- cars %>%
  broom::bootstrap(10, by_group = TRUE)

cars2

Source: local data frame [32 x 11]
Groups: replicate [10]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1

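Now it looks like there are only 10 groups, one per replicate. The earlier grouping does not seem to have been kept. At this point I expected 20 groups (2 x 10).

If I now do this: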

cars3 <- cars2 %>%
  group_by(am)

cars3

Source: local data frame [32 x 11]
Groups: am [2]

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs     am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fctr> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0      1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0      1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1      1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1      0     3     1

Now only the groups for am appear; the replicate grouping is gone.

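Is there a way to bootstrap after I have grouped the original dataset? Also, ideally, after bootstrapping there would be an id indicating which bootstrapped dataset I am looking at.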

In my ideal world, my code would be able to do something like this:

cars <- mtcars %>%
  mutate(am = as.factor(am)) %>%
  group_by(am) %>%
  bootstrap(10, by_group = TRUE) %>%
  nest() %>% # create a condensed tidy dataset that has one row per imputation, bootstrap combo
  mutate(model = map(data, ~lm(mpg~, data = .))) # Create a model for each row

1 Answer:

Answer 0 (score: 4)

I am trying to learn purrr and modelr at the same time, and they are really making my head hurt. I think I finally figured this one out.

Group the data frame, then create 10 nested bootstrap replicates within each group

library(modelr)
library(dplyr)
library(tidyr)
library(broom)
mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) 

Source: local data frame [2 x 2]
Groups: <by row>

# A tibble: 2 x 2
     am                rs
* <dbl>            <list>
1     0 <tibble [10 x 2]>
2     1 <tibble [10 x 2]>

Regroup and unnest to expand into 2 columns, one for the bootstraps and one for the id

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest

# A tibble: 20 x 3
# Groups:   am [2]
      am          strap   .id
   <dbl>         <list> <chr>
 1     0 <S3: resample>    01
 2     0 <S3: resample>    02
 3     0 <S3: resample>    03
 4     0 <S3: resample>    04
 5     0 <S3: resample>    05
 6     0 <S3: resample>    06
 7     0 <S3: resample>    07
 8     0 <S3: resample>    08
 9     0 <S3: resample>    09
10     0 <S3: resample>    10
11     1 <S3: resample>    01
12     1 <S3: resample>    02
13     1 <S3: resample>    03
14     1 <S3: resample>    04
15     1 <S3: resample>    05
16     1 <S3: resample>    06
17     1 <S3: resample>    07
18     1 <S3: resample>    08
19     1 <S3: resample>    09
20     1 <S3: resample>    10

Group to the lowest level of replication and create the models

You have to use as.data.frame on the strap column to re-expand it into usable data. See ?resample. This one took me forever to figure out. It seems like it should just work, the way tidyr::unnest does.

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest %>% 
  group_by(am, .id) %>% 
  do(model = lm(mpg~wt, data = as.data.frame(.$strap)))

Source: local data frame [20 x 3]
Groups: <by row>

# A tibble: 20 x 3
      am   .id    model
 * <dbl> <chr>   <list>
 1     0    01 <S3: lm>
 2     0    02 <S3: lm>
 3     0    03 <S3: lm>
 4     0    04 <S3: lm>
 5     0    05 <S3: lm>
 6     0    06 <S3: lm>
 7     0    07 <S3: lm>
 8     0    08 <S3: lm>
 9     0    09 <S3: lm>
10     0    10 <S3: lm>
11     1    01 <S3: lm>
12     1    02 <S3: lm>
13     1    03 <S3: lm>
14     1    04 <S3: lm>
15     1    05 <S3: lm>
16     1    06 <S3: lm>
17     1    07 <S3: lm>
18     1    08 <S3: lm>
19     1    09 <S3: lm>
20     1    10 <S3: lm>

Call your function/summary on each model

mtcars %>% group_by(am) %>% 
    do(rs = modelr::bootstrap(., 10)) %>% 
  group_by(am) %>% 
  unnest %>% 
  group_by(am, .id) %>% 
  do(model = lm(mpg~wt, data = as.data.frame(.$strap))) %>% 
  tidy(model)

# A tibble: 40 x 7
# Groups:   am, .id [20]
      am   .id        term  estimate std.error statistic      p.value
   <dbl> <chr>       <chr>     <dbl>     <dbl>     <dbl>        <dbl>
 1     0    01 (Intercept) 25.800592 2.1055145 12.253818 7.300379e-10
 2     0    01          wt -2.608827 0.5377694 -4.851201 1.497729e-04
 3     0    02 (Intercept) 37.012664 4.7369213  7.813654 5.023424e-07
 4     0    02          wt -5.272094 1.2884870 -4.091693 7.602571e-04
 5     0    03 (Intercept) 26.145563 2.2114832 11.822637 1.263234e-09
 6     0    03          wt -2.428845 0.5541412 -4.383080 4.056524e-04
 7     0    04 (Intercept) 31.502481 4.0753463  7.730013 5.806324e-07
 8     0    04          wt -3.584863 1.1510368 -3.114464 6.305972e-03
 9     0    05 (Intercept) 31.739921 2.2216473 14.286661 6.690920e-11
10     0    05          wt -3.716515 0.5627808 -6.603841 4.471168e-06
# ... with 30 more rows

Visualize

Note that I bumped the number of bootstraps up to 1000, which takes about 10 seconds.

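No plotting code accompanies this step, so the following is only a rough sketch of one way to get a comparable picture. It also rewrites the pipeline with nest() and purrr::map() instead of do(), on the assumption that a recent tidyverse is installed, where tidy() on a rowwise data frame may no longer work. The names n_boot, boot_coefs and slopes, the seed, and the choice of a histogram of the wt slope are all illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)
library(broom)
library(ggplot2)

set.seed(1)     # arbitrary seed (assumption) so the resamples are repeatable
n_boot <- 1000  # hypothetical name; 1000 replicates per group, as in the note above

boot_coefs <- mtcars %>%
  group_by(am) %>%
  nest() %>%                                                    # one row per am group
  mutate(rs = map(data, ~ modelr::bootstrap(.x, n_boot))) %>%   # resample within each group
  select(am, rs) %>%
  unnest(rs) %>%                                                # columns: am, strap, .id
  ungroup() %>%
  mutate(
    model = map(strap, ~ lm(mpg ~ wt, data = as.data.frame(.x))),  # one model per replicate
    coefs = map(model, broom::tidy)                                # coefficient table per model
  ) %>%
  select(am, .id, coefs) %>%
  unnest(coefs)                                                 # two rows (terms) per model

# Keep only the slope on wt and compare its bootstrap distribution across groups
slopes <- boot_coefs %>% filter(term == "wt")

p <- ggplot(slopes, aes(x = estimate, fill = factor(am))) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  labs(x = "Bootstrap estimate of the wt slope", fill = "am")
print(p)
```

Because am and .id are carried through as ordinary columns, the same pattern should extend to an imputation id in place of am, with one bootstrap set per imputed dataset.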