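I have a data frame with 30 years of a response variable. I want to write code that draws "n" subsets of "x" years each from the df and runs a regression of the response on every one of those subsets.
So if we start with 30 years and x = 5 and n = 2, we end up with 2 regressions, each using 5 random years out of the 30 available. I wrote a function for this here: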
# load the packages the pipeline relies on
library(dplyr)
library(tidyr)
# build df
df = data.frame(year=c(1:30),
                response = runif(30,1,100))
# create function
subsample <- function(df, x, n ){
  df %>%
    # collapse the tibble
    nest(data=everything()) %>%
    # repeat the tibble once per trial
    slice(rep(1:n(), each = n)) %>%
    # add group number, which will be the "nth" trial
    mutate(group = c(1:n)) %>%
    # expand data
    unnest(cols = c(data)) %>%
    # group by group number, then sample x rows (years) from each group
    group_by(group) %>%
    group_map(~ sample_n(.x, x, replace = F)) %>%
    # stitch back together and add the trial number back as a column
    bind_rows(.id="trial") %>%
    # arrange by trial and year
    mutate(trial=as.numeric(trial)) %>%
    arrange(trial,year) %>%
    # group by trial and run one regression per trial
    group_by(trial) %>%
    do({
      mod = lm(response ~ year, data = .)
      data.frame(Intercept = coef(mod)[1],
                 Slope = coef(mod)[2])
    })
}
# test function
subsample(df, x=5, n=2)
# A tibble: 2 x 3
# Groups: trial [2]
trial Intercept Slope
<dbl> <dbl> <dbl>
1 1 48.5 -0.895
2 2 35.4 -0.275
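Great, so that works: we get two regressions (all I want is the slope and intercept), each using a subset of 5 of the 30 years.
But now I want to do this for every possible number of years (so x = c(2:30)) and end up with a df that looks like this: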
# A tibble: 2 x 3
number_of_years trial Intercept Slope
<dbl> <dbl> <dbl> <dbl>
1 2 1 48.5 -0.895
2 2 2 35.4 -0.275
3 3 1 55.2 0.333
4 3 2 34.1 0.224
5 4 1 63.2 -0.359
6 4 2 45.5 -0.241
7 5 1 43.1 0.257
8 5 2 37.9 -0.657
9 6 1 51.0 -0.456
10 6 2 65.6 0.126
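This would show the regression values for 2 trials ("n") that each use 2 random years (number_of_years, "x"), then for 2 trials that use 3 random years, 4 random years, and so on, all the way up to 30.
So I tried to follow the same logic as above, this time calling my custom function inside group_map():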
df %>%
  # collapse the tibble
  nest(data=everything()) %>%
  # repeat the tibble once per number of years we want to test (29, in this case)
  slice(rep(1:n(), each = (nrow(df)-1))) %>%
  # add column for number of years and unnest
  mutate(number_of_years = c(2:(nrow(.)+1))) %>%
  select(number_of_years,data) %>% #reorder
  unnest(cols =c(data)) %>%
  # group by number of years
  group_by(number_of_years) %>%
  group_map( ~ subsample(.x, x=5, n=2))
### this is the problematic line!
### this is giving us 2 trials (n=2) of a regression, each using
### x=5 years of sampling. but instead of x=5 years, I want x=number_of_years
### so x should be the same as the grouping variable.
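So the problem here is that my subsample() function takes 3 inputs (df, x, n), and I need to figure out how to make "x" equal to the grouping variable of the dataset; x should be number_of_years. I have tried variations such as group_map( ~ subsample(.x,.x$number_of_years,2)), but I can't figure out how to get it to return one result per group (29 of them, for 2 through 30 years), each holding 2 trials, meaning 2 regressions on subsamples of the original df, with each group using a different number of years for its regressions.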
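One possible way around this, offered as a sketch rather than a confirmed fix: with the formula interface, dplyr's group_map() and group_modify() pass the rows of the current group as .x and a one-row tibble of the grouping keys as .y. By default .x no longer contains the grouping column, so number_of_years has to be read from .y instead. The example uses group_modify() so that the key column is kept in the result. fit_trials() is a hypothetical, simplified stand-in for subsample(), and the seed is there only to make the sketch reproducible.

```r
library(dplyr)
library(tidyr)

set.seed(42)  # assumption: fixed seed, only for reproducibility of the sketch
df <- data.frame(year = 1:30, response = runif(30, 1, 100))

# Hypothetical stand-in for subsample(): n regressions on x random years each
fit_trials <- function(data, x, n) {
  bind_rows(lapply(seq_len(n), function(i) {
    mod <- lm(response ~ year, data = data[sample(nrow(data), x), ])
    data.frame(trial = i,
               Intercept = coef(mod)[[1]],
               Slope = coef(mod)[[2]])
  }))
}

result <- expand_grid(number_of_years = 2:nrow(df), year = df$year) %>%
  # attach the response so every group holds a full copy of df
  left_join(df, by = "year") %>%
  group_by(number_of_years) %>%
  # .x = rows of the group, .y = one-row tibble holding number_of_years
  group_modify(~ fit_trials(.x, x = .y$number_of_years, n = 2)) %>%
  ungroup()
```

Under these assumptions, result has one row per (number_of_years, trial) pair. The same .y$number_of_years expression can be used inside group_map(), which returns a list of data frames instead of a single tibble.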
If possible, I would like to stay within the tidyverse / dplyr / purrr workflow.
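For comparison, a purrr-based sketch that avoids copying the data frame once per group might look like the following. It builds a grid with one row per (number_of_years, trial) pair and fits one model per row. This is an assumption about what would fit the workflow, not the only way to do it; the column name fit and the fixed seed are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(42)  # assumption: fixed seed, only for reproducibility of the sketch
df <- data.frame(year = 1:30, response = runif(30, 1, 100))

result <- expand_grid(number_of_years = 2:nrow(df), trial = 1:2) %>%
  # one model per row, fitted on k randomly chosen years (without replacement)
  mutate(fit = map(number_of_years, function(k) {
    lm(response ~ year, data = df[sample(nrow(df), k), ])
  })) %>%
  # pull the two coefficients out of each fitted model
  mutate(Intercept = map_dbl(fit, function(m) coef(m)[[1]]),
         Slope = map_dbl(fit, function(m) coef(m)[[2]])) %>%
  select(-fit)
```

The number of trials is set by the trial = 1:2 column of the grid, so changing n only means changing that one vector.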
Thanks!