Joining various summaries in dplyr

Time: 2018-10-26 02:22:24

Tags: r group-by dplyr

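I need to operate on a few dozen variables by group. The instruction applied usually depends on the variable, typically identified by its name, with some ad hoc changes and renaming here and there.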

A reprex that illustrates this with a modified diamonds dataset follows:

library(tidyverse)

diamond_renamed <- diamonds %>% 
  rename(size_x = x, size_y = y, size_z = z) %>% 
  rename(val_1 = depth, val_2 = table)


diamond_summary <-  bind_cols(diamond_renamed %>% 
                               group_by(cut, color, clarity) %>% 
                               summarise(
                                 cost = sum(price)
                               ), 
                             diamond_renamed %>%
                             group_by(cut, color, clarity) %>%
                               summarise_at(
                                 vars(contains("size")), 
                                 funs(median(.))
                                            ),
                             diamond_renamed %>%
                             group_by(cut, color, clarity) %>% 
                               summarise_at(
                                 vars(contains("val")),
                                 funs(mean(.))
                                 )
                             )

diamond_summary    
#> # A tibble: 276 x 15
#> # Groups:   cut, color [?]
#>    cut   color clarity   cost cut1  color1 clarity1 size_x size_y size_z
#>    <ord> <ord> <ord>    <int> <ord> <ord>  <ord>     <dbl>  <dbl>  <dbl>
#>  1 Fair  D     I1       29532 Fair  D      I1         7.32   7.20   4.70
#>  2 Fair  D     SI2     243888 Fair  D      SI2        6.13   6.06   3.99
#>  3 Fair  D     SI1     247854 Fair  D      SI1        6.08   6.04   3.93
#>  4 Fair  D     VS2     112822 Fair  D      VS2        6.04   6      3.65
#>  5 Fair  D     VS1      14606 Fair  D      VS1        5.56   5.58   3.66
#>  6 Fair  D     VVS2     32463 Fair  D      VVS2       4.95   4.84   3.31
#>  7 Fair  D     VVS1     13419 Fair  D      VVS1       4.92   5.03   3.28
#>  8 Fair  D     IF        4859 Fair  D      IF         4.68   4.73   2.88
#>  9 Fair  E     I1       18857 Fair  E      I1         6.18   6.14   4.03
#> 10 Fair  E     SI2     325446 Fair  E      SI2        6.28   6.20   3.95
#> # ... with 266 more rows, and 5 more variables: cut2 <ord>, color2 <ord>,
#> #   clarity2 <ord>, val_1 <dbl>, val_2 <dbl>

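This yields the desired result, a dataset with the grouped summaries, but it also repeats the grouping variables. Having to repeat the group_by code itself each time is not great either, but I'm not sure of another way to do it. It is probably also not the most efficient use of summarise. How can we avoid the repetition and make this code better?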

Thanks!

1 answer:

Answer 0: (score: 2)

One option is to use mutate instead of summarise in the initial steps, and then add those columns to the group_by:

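diamond_renamed %>%
   group_by(cut, color, clarity) %>% 
   # add the per-group total price as an extra grouping column
   # (dplyr >= 1.0.0 names this argument `.add`; `add` is the older spelling)
   group_by(cost = sum(price), add = TRUE) %>%
   # overwrite each size column with its group median, then group by those too
   mutate_at(vars(contains("size")), median) %>% 
   group_by_at(vars(contains("size")), .add = TRUE) %>% 
   # collapse to one row per group, averaging the val columns
   summarise_at(vars(contains("val")), mean)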
# A tibble: 276 x 9
# Groups:   cut, color, clarity, cost, size_x, size_y [?]
#   cut   color clarity   cost size_x size_y size_z val_1 val_2
#   <ord> <ord> <ord>    <int>  <dbl>  <dbl>  <dbl> <dbl> <dbl>
# 1 Fair  D     I1       29532   7.32   7.20   4.70  65.6  56.8
# 2 Fair  D     SI2     243888   6.13   6.06   3.99  64.7  58.6
# 3 Fair  D     SI1     247854   6.08   6.04   3.93  64.6  58.8
# 4 Fair  D     VS2     112822   6.04   6      3.65  62.7  60.3
# 5 Fair  D     VS1      14606   5.56   5.58   3.66  63.2  57.8
# 6 Fair  D     VVS2     32463   4.95   4.84   3.31  61.7  58.8
# 7 Fair  D     VVS1     13419   4.92   5.03   3.28  61.7  64.3
# 8 Fair  D     IF        4859   4.68   4.73   2.88  60.8  58  
# 9 Fair  E     I1       18857   6.18   6.14   4.03  65.6  58.1
#10 Fair  E     SI2     325446   6.28   6.20   3.95  63.4  59.5
# ... with 266 more rows

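Note: the grouping columns "cut", "color" and "clarity" are not duplicated here as they are in the OP's output, so the result has only 9 columns instead of 15.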
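
For readers on dplyr 1.0.0 or later, the same table can probably be built with a single group_by and a single summarise by using across(), which was not available when this was asked. The following is only a sketch under that version assumption. The name diamond_summary is reused from the question, and `.groups = "drop"` is my choice to return an ungrouped result, not something the original code did.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Same renaming as in the question
diamond_renamed <- diamonds %>%
  rename(size_x = x, size_y = y, size_z = z) %>%
  rename(val_1 = depth, val_2 = table)

# One group_by, one summarise: each across() call applies a
# different function to the columns picked by its name pattern
diamond_summary <- diamond_renamed %>%
  group_by(cut, color, clarity) %>%
  summarise(
    cost = sum(price),
    across(contains("size"), median),
    across(contains("val"), mean),
    .groups = "drop"  # assumption: an ungrouped result is wanted
  )

diamond_summary
```

Because the grouping is declared once, the grouping columns appear once, and adding another family of variables only takes one more across() line.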