Question

我有以下1961:2018每年都在计算两个预测变量的影响之间的代码：基于每场比赛的球数（BB）和每场比赛的本垒打（HR），基于每场比赛的响应变量（R）：

rm(list = ls())

library(dbplyr)
library(tidyverse)
library(broom)
library(Lahman)

fit <- Teams %>% 
  filter(yearID %in% 1961:2018) %>% 
  mutate(BB = BB / G, 
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  do(tidy(lm(R ~ BB + HR, data = .), conf.int = TRUE)) %>% filter(term=="BB")
fit

> fit
# A tibble: 58 x 8
# Groups:   yearID [58]
   yearID term  estimate std.error statistic p.value conf.low conf.high
    <int> <chr>    <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
 1   1961 BB      0.0845     0.168     0.502   0.623  -0.274      0.443
 2   1962 BB      0.142      0.273     0.520   0.610  -0.434      0.718
 3   1963 BB      0.339      0.242     1.40    0.178  -0.171      0.849
 4   1964 BB     -0.105      0.302    -0.349   0.731  -0.742      0.532
 5   1965 BB      0.235      0.253     0.928   0.366  -0.299      0.768
 6   1966 BB      0.104      0.216     0.482   0.636  -0.351      0.559
 7   1967 BB      0.0660     0.223     0.296   0.771  -0.405      0.537
 8   1968 BB     -0.199      0.203    -0.983   0.340  -0.627      0.229
 9   1969 BB      0.153      0.163     0.942   0.357  -0.185      0.492
10   1970 BB      0.239      0.157     1.52    0.143  -0.0874     0.566
# ... with 48 more rows

我现在想将这个“拟合”输出为ggplot，实际上是一个小步（或现代化的数据框），以显示每年的估算值（以点为单位），以及回归线以及由{{1}计算的CI。 }模型，而不仅仅是使用lm重新计算。

我尝试了以下操作，但未成功。我知道扫帚上的geom_smooth(method = "lm")应该直接对augment模型输出进行操作，因此以下代码是错误的，但是它说明了我要实现的目标：

lm

如何做到这一点而无需“作弊”（对augment(fit) %>% ggplot() + geom_point(aes(yearID, estimate)) + geom_line(aes(yearID, .fitted), col = "blue")进行两次计算，然后再在ggplot上进行计算）并执行以下操作：

lm

Answer 1

您可以尝试tidyverse中包含的purrr软件包中的map函数。下面列出了您描述的问题的可能代码。如果您不熟悉purrr软件包，也可以使用lapply。

library(tidyverse)
library(broom)
library(Lahman)

fit <- Teams %>% 
  filter(yearID %in% 1961:2018) %>% 
  mutate(BB = BB / G, 
         HR = HR / G,
         R = R / G) %>%
  group_by(yearID) %>%
  # consolidate your data
  nest() %>% 
  # creates new nested column with your regression data
  mutate(model = map(data, function(df) 
    tidy(lm(R ~ BB + HR, data = df), conf.int = TRUE) %>%
      filter(term=="BB")
    ),
    # extract the column estimate
    model_est = map_dbl(model, function(df) 
      df %>% pull(estimate)
    ), 
    # extract the column conf.low
    model_conf.low = map_dbl(model, function(df) 
      df %>% pull(conf.low)
    ), 
    # extract the column conf.high
    model_conf.high = map_dbl(model, function(df) 
      df %>% pull(conf.high)
    )
   ) 


fit %>% ggplot(aes(yearID,model_est)) + geom_point() +
  geom_line(aes(yearID, model_conf.low)) + 
  geom_line(aes(yearID, model_conf.high))

Answer 2

我使用map()和nest()到帕特里克采取了类似的路线：

library(tidyverse)
library(broom)
library(Lahman)
library(magrittr)

fit <- Teams %>%
  filter(yearID %in% 1961:2018) %>%
  mutate(
    BB = BB / G,
    HR = HR / G,
    R  = R / G
  ) %>%
  nest(data = -yearID) %>%
  mutate(
    model  = map(data, ~ lm(R ~ BB + HR, .x)),  # apply model to all nested groups
    m_tidy = map(model, tidy),                  # tidy up
    est    = map_dbl(m_tidy, ~ .x %>%           # pull BB estimate from each group
      filter(term == "BB") %>%
      pull(estimate)),
  )

现在，您可以直接%$%进入下一部分，但是我在这里将它们分开，因此请讨论正确模拟置信区间。 geom_smooth()置信区间基于t分布而不是正态分布。因此，我们将不得不做一些额外的工作来确定工作间隔：

fit %$%
  lm(est ~ yearID) %>%
  augment() %>% 
  mutate(m.se.fit = .se.fit * qt(1 - (1-0.95)/2, nrow(fit))) %>% # 95% conf int calc
  ggplot(aes(yearID, est)) +
  geom_point() +
  geom_line(aes(y = .fitted), col = "blue") + 
  geom_ribbon(aes(ymin = .fitted - m.se.fit, ymax = .fitted + m.se.fit), alpha = .2)

此图基本上反映了所需的图：

fit %>% ggplot(aes(yearID, est)) + 
  geom_point() + 
  geom_smooth(method = "lm")

^{由reprex package（v0.3.0）于2019-10-23创建}

如何通过管道将整洁的lm模型CI插入ggplot2？

2 个答案: