Is it possible to create a separate linear model for each group in dplyr's summarise?

Time: 2019-06-16 01:44:55

Tags: r dplyr

I have some data like this:

group_name | x | y
------------------
a          | 1 | 2
a          | 2 | 4
a          | 3 | 6
b          | 1 | 4
b          | 2 | 3
b          | 3 | 2
c          | 1 | 2
c          | 2 | 5
c          | 3 | 8

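I want to group it by group_name and use dplyr's summarise function to create a column that holds a linear model lm(y ~ x) for each group. Is that possible? If not, what is an alternative way to create a model for each group?

Thanks in advance.

2 Answers:

Answer 0: (score: 2)

Adapted from the example at https://cran.r-project.org/web/packages/broom/vignettes/broom_and_dplyr.html: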

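library(tidyverse); library(broom)
df %>%
  # one row per group, with that group's rows in a list column named `data`
  # (tidyr >= 1.0 syntax; older versions used nest(-group_name))
  nest(data = -group_name) %>%
  mutate(fit = map(data, ~lm(y ~ x, data = .x)),  # one model per group
         tidied = map(fit, tidy)) %>%             # coefficient table per model
  select(group_name, tidied) %>%                  # drop the other list columns
  unnest(cols = tidied)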

  group_name        term estimate    std.error     statistic      p.value
1          a (Intercept)        0 0.000000e+00           NaN          NaN
2          a           x        2 0.000000e+00           Inf 0.000000e+00
3          b (Intercept)        5 1.017536e-15  4.913830e+15 1.295567e-16
4          b           x       -1 4.710277e-16 -2.123017e+15 2.998656e-16
5          c (Intercept)       -1 1.356715e-15 -7.370745e+14 8.637116e-16
6          c           x        3 6.280370e-16  4.776789e+15 1.332736e-16
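
As for the literal question, a model can also be stored straight from summarise by wrapping it in list(), which gives a list column with one lm object per group. The following is only a minimal sketch, assuming dplyr is installed; the data frame is rebuilt from the question's table so the snippet runs on its own, and the names `models` and `slopes` are made up for illustration.

```r
library(dplyr)

# The question's table, rebuilt here so the snippet is self-contained
df <- data.frame(
  group_name = rep(c("a", "b", "c"), each = 3),
  x = rep(1:3, times = 3),
  y = c(2, 4, 6, 4, 3, 2, 2, 5, 8)
)

models <- df %>%
  group_by(group_name) %>%
  # list() is needed because an lm object is not a length-1 vector;
  # each group contributes one element of the list column `fit`
  summarise(fit = list(lm(y ~ x)))

# Pull the slope of x out of each stored model (hypothetical follow-up step)
slopes <- models %>%
  mutate(slope = vapply(fit, function(m) coef(m)[["x"]], numeric(1)))

print(slopes[, c("group_name", "slope")])
```

The list column keeps the full model objects, so anything that works on a single lm (coef, predict, broom's tidy) can be mapped over `fit` afterwards.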

Edit: one way to get predictions is to use augment from broom:

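library(tidyverse); library(broom)
df %>%
  # same nesting as above (tidyr >= 1.0 syntax)
  nest(data = -group_name) %>%
  mutate(fit = map(data, ~lm(y ~ x, data = .x)),   # one model per group
         predictions = map(fit, augment)) %>%      # fitted values and residuals per row
  select(group_name, predictions) %>%              # drop the other list columns
  unnest(cols = predictions)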

   group_name y x .fitted      .se.fit        .resid      .hat .sigma .rownames .cooksd .std.resid
1 a           2 1       2 0.000000e+00  0.000000e+00 0.8333333    NaN      <NA>      NA         NA
2 a           4 2       4 0.000000e+00  0.000000e+00 0.3333333    NaN      <NA>      NA         NA
3 a           6 3       6 0.000000e+00  0.000000e+00 0.8333333    NaN      <NA>      NA         NA
4 b           4 1       4 6.080942e-16  2.719480e-16 0.8333333    NaN         4    2.50          1
5 b           3 2       3 3.845925e-16 -5.438960e-16 0.3333333    NaN         5    0.25         -1
6 b           2 3       2 6.080942e-16  2.719480e-16 0.8333333    Inf         6    2.50          1
7 c           2 1       2 8.107923e-16 -3.625973e-16 0.8333333    NaN         7    2.50         -1
8 c           5 2       5 5.127900e-16  7.251946e-16 0.3333333    NaN         8    0.25          1
9 c           8 3       8 8.107923e-16 -3.625973e-16 0.8333333    Inf         9    2.50         -1

Answer 1: (score: 0)

Here is one way.

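I had to change your test data slightly, because I think there is a perfect collinearity problem (within each group, y is an exact linear function of x, so every fit is perfect and the residuals are zero).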

df <- data.frame(stringsAsFactors=FALSE,
   group.name = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
            x = c(1, 2, 3.5, 1, 2.5, 3, 1, 2, 3.5),
            y = c(2, 4, 6, 4, 3, 2, 2, 5, 8)
)

library(dplyr)
groups <- unique(df$group.name)
groups
for (i in groups){
  df_subgroup <- filter(df, group.name==i)
  print(paste("group", i))
  model <- lm(y ~ x, data = df_subgroup)
  print(summary(model))
}

This is what you get. I used the stargazer package to make the output easier to read, but you can just use summary(model).


    #> [1] "group a"
    #> 
    #> ===============================================
    #>                         Dependent variable:    
    #>                     ---------------------------
    #>                                  y             
    #> -----------------------------------------------
    #> x                             1.579*           
    #>                               (0.182)          
    #>                                                
    #> Constant                       0.579           
    #>                               (0.437)          
    #>                                                
    #> -----------------------------------------------
    #> Observations                     3             
    #> R2                             0.987           
    #> Adjusted R2                    0.974           
    #> Residual Std. Error       0.324 (df = 1)       
    #> F Statistic             75.000* (df = 1; 1)    
    #> ===============================================
    #> Note:               *p<0.1; **p<0.05; ***p<0.01
    #> [1] "group b"
    #> 
    #> ===============================================
    #>                         Dependent variable:    
    #>                     ---------------------------
    #>                                  y             
    #> -----------------------------------------------
    #> x                             -0.923           
    #>                               (0.266)          
    #>                                                
    #> Constant                      5.000*           
    #>                               (0.620)          
    #>                                                
    #> -----------------------------------------------
    #> Observations                     3             
    #> R2                             0.923           
    #> Adjusted R2                    0.846           
    #> Residual Std. Error       0.392 (df = 1)       
    #> F Statistic             12.000 (df = 1; 1)     
    #> ===============================================
    #> Note:               *p<0.1; **p<0.05; ***p<0.01
    #> [1] "group c"
    #> 
    #> ===============================================
    #>                         Dependent variable:    
    #>                     ---------------------------
    #>                                  y             
    #> -----------------------------------------------
    #> x                             2.368*           
    #>                               (0.273)          
    #>                                                
    #> Constant                      -0.132           
    #>                               (0.656)          
    #>                                                
    #> -----------------------------------------------
    #> Observations                     3             
    #> R2                             0.987           
    #> Adjusted R2                    0.974           
    #> Residual Std. Error       0.487 (df = 1)       
    #> F Statistic             75.000* (df = 1; 1)    
    #> ===============================================
    #> Note:               *p<0.1; **p<0.05; ***p<0.01
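
If the models are needed later rather than just printed, the same loop can be written so that it keeps them. This is a rough base R sketch, not part of the answer above: it assumes the modified test data from this answer, and the names `models` and `coefs` are made up for illustration.

```r
# Modified test data from this answer, repeated so the snippet is self-contained
df <- data.frame(
  stringsAsFactors = FALSE,
  group.name = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  x = c(1, 2, 3.5, 1, 2.5, 3, 1, 2, 3.5),
  y = c(2, 4, 6, 4, 3, 2, 2, 5, 8)
)

# split() gives one data frame per group; lapply() fits one lm per data frame
# and returns a named list of models ("a", "b", "c")
models <- lapply(split(df, df$group.name), function(d) lm(y ~ x, data = d))

# Coefficients as a matrix: rows are terms, columns are groups
coefs <- sapply(models, coef)
print(round(coefs, 3))

# Any single model is still available by name, e.g. summary(models[["a"]])
```

Keeping the models in a named list avoids refitting when you later want predictions or diagnostics for one particular group.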