Using the grouping variable as an input to the function called in group_map()

Asked: 2020-05-25 22:20:26

Tags: r dplyr subset purrr

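I have a data frame with 30 years of a response variable. I want to write code that draws "n" subsets of "x" years each from the df and runs a regression of the response on every one of those subsets.

So if we start with 30 years and x = 5, n = 2, we end up with 2 regressions, each using 5 random years out of the 30 available. I wrote a function for this: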

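# load packages (dplyr for the pipeline verbs, tidyr for nest/unnest)
library(dplyr)
library(tidyr)

# build df
df <- data.frame(year = 1:30,
                 response = runif(30, 1, 100))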



# create function
subsample <- function(df, x, n ){
  df %>%
    # collapse the tibble into a single nested row
    nest(data=everything()) %>%

    # repeat the tibble for number of simulations
    slice(rep(1:n(), each = n)) %>%

    # add group number, which will be the "nth" trial
    mutate(group = c(1:n)) %>%

    # expand data
    unnest(cols = c(data)) %>%

    # group by group number, then sample x years from each group
    group_by(group) %>%
    group_map(~ sample_n(.x, x, replace = FALSE)) %>%

    # stitch back together and add group number col back
    bind_rows(.id="trial") %>%

    # arrange by group and year
    mutate(trial=as.numeric(trial)) %>%
    arrange(trial,year) %>%

    # group by subsample and run regression
    group_by(trial) %>%

    do({
      mod = lm(response ~ year, data = .)
      data.frame(Intercept = coef(mod)[1],
                 Slope = coef(mod)[2])
    }) 
}


# test function
subsample(df, x=5, n=2)


# A tibble: 2 x 3
# Groups:   trial [2]
  trial Intercept  Slope
  <dbl>     <dbl>  <dbl>
1     1      48.5 -0.895
2     2      35.4 -0.275

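Great, so this works: we get two regressions (all I want is the slope and intercept), each using a subset of 5 of the 30 years.

However, now I want to do this for every possible number of years (so x = c(2:30)) and end up with a df that looks like this: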

# Desired output (first 10 rows shown)
   number_of_years trial Intercept  Slope
             <dbl> <dbl>     <dbl>  <dbl>
 1               2     1      48.5 -0.895
 2               2     2      35.4 -0.275
 3               3     1      55.2  0.333
 4               3     2      34.1  0.224
 5               4     1      63.2 -0.359
 6               4     2      45.5 -0.241
 7               5     1      43.1  0.257
 8               5     2      37.9 -0.657
 9               6     1      51.0 -0.456
10               6     2      65.6  0.126

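This would show the regression values for 2 trials ("n") that each use 2 random years (number_of_years, "x"), then for 2 trials using 3 random years, 2 trials using 4 random years, and so on, all the way up to 30.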
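For reference, the target shape described above could also be reached without replicating the data frame, by iterating over every (number_of_years, trial) pair. This is only a minimal sketch: one_fit() is a hypothetical helper written for illustration (it is not my subsample() function), and the seed is arbitrary.

```r
library(dplyr)
library(purrr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(year = 1:30, response = runif(30, 1, 100))

# hypothetical helper: regress response on year using x randomly chosen years
one_fit <- function(d, x) {
  s <- d[sample(nrow(d), x), ]
  mod <- lm(response ~ year, data = s)
  data.frame(Intercept = coef(mod)[[1]], Slope = coef(mod)[[2]])
}

# every (number_of_years, trial) pair; trial varies fastest
grid <- expand.grid(trial = 1:2, number_of_years = 2:nrow(df))

# one regression per pair, stacked into a single data frame
fits <- map2(grid$number_of_years, grid$trial, function(x, trial) {
  data.frame(number_of_years = x, trial = trial, one_fit(df, x))
}) %>%
  bind_rows()

head(fits)
```

This gives one row per trial per number of years, with columns number_of_years, trial, Intercept and Slope. It sidesteps group_map() entirely, so it does not answer how to pass the grouping variable through.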

So I tried to follow the same logic as above, but now calling my custom function inside group_map():


df %>%
  # collapse the tibble into a single nested row
  nest(data=everything()) %>%

  # repeat the tibble for the number of simulations we want to test (29, in this case)
  slice(rep(1:n(), each = (nrow(df)-1))) %>%

  # add column for number out of total and unnest
  mutate(number_of_years = c(2:(nrow(.)+1))) %>% 
  select(number_of_years,data) %>%  #reorder
  unnest(cols =c(data)) %>%

  # group by out of total
  group_by(number_of_years) %>%

  group_map(~ subsample(.x, x = 5, n = 2))
  ### this is the problematic line! 
  ### this is giving us 2 trials (n=2) of a regression, each using 
  ### x=5 years of sampling. but instead of x=5 years, I want x=number_of_years
  ### so x should be the same as the grouping variable. 

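So the problem here is that my subsample() function takes 3 inputs (df, x, n), and I need to figure out how to make "x" the same as the dataset's grouping variable, i.e. x should be number_of_years. I have tried variations such as group_map( ~ subsample(.x, .x$number_of_years, 2)), but I can't work out how to get it to return one chunk per value of number_of_years, each holding 2 trials. That means 2 regressions on subsamples of the original df per chunk, with each chunk using a different number of years for its regressions.

If possible, I would like to stay within the tidyverse / dplyr / purrr workflow.

Thanks!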
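One direction that may work, sketched under assumptions: in dplyr, the lambda passed to group_map() or group_modify() receives the rows of the current group as .x and a one-row tibble holding the grouping key as .y. By default .x does not contain the grouping column, which would explain why .x$number_of_years fails. The sketch below uses group_modify() so that the key column is attached to the result again. fit_trials() is a hypothetical, simplified stand-in for subsample() (same three arguments), and the seed is arbitrary.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
df <- data.frame(year = 1:30, response = runif(30, 1, 100))

# hypothetical stand-in for subsample(): n regressions, each on x random years
fit_trials <- function(d, x, n) {
  bind_rows(lapply(seq_len(n), function(i) {
    s <- d[sample(nrow(d), x), ]
    mod <- lm(response ~ year, data = s)
    data.frame(trial = i,
               Intercept = coef(mod)[[1]],
               Slope = coef(mod)[[2]])
  }))
}

result <- tibble(number_of_years = 2:nrow(df), data = list(df)) %>%
  # one full copy of df per value of number_of_years
  unnest(cols = data) %>%
  group_by(number_of_years) %>%
  # .x = rows of the group (without the key), .y = one-row tibble with the key
  group_modify(~ fit_trials(.x, x = .y$number_of_years, n = 2)) %>%
  ungroup()

result
```

With group_map() the same .y$number_of_years should be usable, but the result is a list of data frames without the key column, so it would have to be added back by hand before binding.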

0 answers:

No answers yet