Calculating column means by group in R

Time: 2020-03-20 15:22:46

Tags: r dplyr

I have a data frame with more than 190,000 rows that looks something like this:

 library(tibble)
 mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
            "A", 16, 45, 53, 35, 
            "A", 17,  12, 54, 12,
            "A", 19, 12, 54, 35,
            "B", 10, 87, 55, 22,
            "B", 10, 87, 55, 22,
            "B", 12, 23, 12, 67)

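col1 contains repeated values. As the example data frame shows, within a given level of col1 some columns hold the same value on every row, while others differ.

For each repeated level of col1, I want to aggregate the values into a single row showing the mean across all rows of that level. So far I have used the approach from another answer, but it still leaves every distinct row in place: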

 mydf %>% group_by(col1) %>% 
   mutate_each(funs(mean), -(1)) %>% 
   distinct()

 # A tibble: 5 x 5
 # Groups:   col1 [2]
   col1   col2  col3  col4  col5
   <chr> <dbl> <dbl> <dbl> <dbl>
 1 A        16  23    53.7  27.3
 2 A        17  23    53.7  27.3
 3 A        19  23    53.7  27.3
 4 B        10  65.7  40.7  37  
 5 B        12  65.7  40.7  37  

What I actually want is a single row each for A, B, and so on, showing the means.

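2 answers:

Answer 0 (score: 1)

You need to use summarize rather than mutate to aggregate grouped values: mutate keeps one row per input row, whereas summarize returns one row per group. In this case I use summarize_all to summarize every non-grouping column.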

library(tidyverse)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35, 
                "A", 17,  12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

mydf %>% 
  group_by(col1) %>% 
  summarize_all(.funs = list(mean))

# A tibble: 2 x 5
  col1   col2  col3  col4  col5
  <chr> <dbl> <dbl> <dbl> <dbl>
1 A      17.3  23    53.7  27.3
2 B      10.7  65.7  40.7  37  
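
In dplyr 1.0.0 and later, the scoped verbs such as summarize_all are superseded by across(). The following is a minimal sketch of the same aggregation in that style, assuming a dplyr version of at least 1.0.0; the name `result` is only illustrative and does not appear in the answer above. Inside a grouped summarise, across(everything(), ...) skips the grouping column, so col1 does not need to be excluded by hand.

```r
library(dplyr)
library(tibble)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17, 12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

# One row per level of col1; every non-grouping column is replaced by its mean.
# .groups = "drop" returns an ungrouped tibble (assumes dplyr >= 1.0.0).
result <- mydf %>%
  group_by(col1) %>%
  summarise(across(everything(), mean), .groups = "drop")

result
```

If the real data contains missing values, `across(everything(), ~ mean(.x, na.rm = TRUE))` would ignore them; whether that is appropriate depends on the data, which is not shown in the question.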

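Answer 1 (score: 0)

If columns 2 to 5 are all observations of the same measure, it makes sense to use tidyr::pivot_longer(). Then group by col1 and by the new name column, which holds the former column names col2 to col5, and take the mean of each group. Finally, to get back to the desired shape, use tidyr::pivot_wider().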

library(tidyverse)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17,  12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

mydf %>%
  pivot_longer(cols = -col1) %>%
  group_by(col1, name) %>%
  summarise(mean = mean(value)) %>%
  pivot_wider(names_from = name, values_from = mean)
#> # A tibble: 2 x 5
#> # Groups:   col1 [2]
#>   col1   col2  col3  col4  col5
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A      17.3  23    53.7  27.3
#> 2 B      10.7  65.7  40.7  37

Created on 2020-03-20 by the reprex package (v0.3.0)
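
The output above still carries a `Groups: col1 [2]` header, because summarise drops only the last grouping level by default. As a hedged sketch, assuming dplyr >= 1.0.0 (for the `.groups` argument and across()), the pivot approach can return an ungrouped tibble and be checked against a direct grouped summary. The names `via_pivot` and `direct` are illustrative only.

```r
library(dplyr)
library(tidyr)
library(tibble)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17, 12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

# Long format: one row per (col1, original column name, value).
# pivot_longer() names the new columns "name" and "value" by default.
via_pivot <- mydf %>%
  pivot_longer(cols = -col1) %>%
  group_by(col1, name) %>%
  summarise(mean = mean(value), .groups = "drop") %>%  # drop all grouping
  pivot_wider(names_from = name, values_from = mean)

# Direct grouped summary for comparison.
direct <- mydf %>%
  group_by(col1) %>%
  summarise(across(everything(), mean), .groups = "drop")

# Compare as plain data frames to sidestep any tibble-specific attributes.
all.equal(as.data.frame(via_pivot), as.data.frame(direct))
```

For this particular task the direct summary is shorter; the long format is mainly useful when further steps operate on the stacked values.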