I have a data frame with more than 190,000 rows that looks something like this:
library(tibble)
mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
"A", 16, 45, 53, 35,
"A", 17, 12, 54, 12,
"A", 19, 12, 54, 35,
"B", 10, 87, 55, 22,
"B", 10, 87, 55, 22,
"B", 12, 23, 12, 67)
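col1 contains repeated levels; as the example data frame shows, some values across the other columns are identical between rows, while others differ.
For each repeated level in col1, I want to aggregate the values into a single row that shows the mean across all rows of that level. So far I have used this answer, but it leaves behind every distinct row: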
mydf %>% group_by(col1) %>%
mutate_each(funs(mean), -(1)) %>%
distinct()
# A tibble: 5 x 5
# Groups: col1 [2]
col1 col2 col3 col4 col5
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 16 23 53.7 27.3
2 A 17 23 53.7 27.3
3 A 19 23 53.7 27.3
4 B 10 65.7 40.7 37
5 B 12 65.7 40.7 37
What I actually want is a single row each for A, B, and so on, showing the means.
Answer 0 (score: 1)
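You need to use summarize rather than mutate to aggregate the grouped values: mutate keeps every input row, whereas summarize collapses each group to a single row. In this case, I use summarize_all to summarize all of the non-grouping columns.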
library(tidyverse)
mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
"A", 16, 45, 53, 35,
"A", 17, 12, 54, 12,
"A", 19, 12, 54, 35,
"B", 10, 87, 55, 22,
"B", 10, 87, 55, 22,
"B", 12, 23, 12, 67)
mydf %>%
group_by(col1) %>%
summarize_all(.funs = list(mean))
# A tibble: 2 x 5
col1 col2 col3 col4 col5
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 17.3 23 53.7 27.3
2 B 10.7 65.7 40.7 37
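As a minimal sketch of the difference described above, the following compares the row counts produced by mutate and summarise on the same grouped data. It assumes dplyr 1.0.0 or later, where across() is available as the successor to the scoped verbs such as summarize_all and mutate_each; the variable names mutated and summarised are only illustrative.

```r
library(dplyr)
library(tibble)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17, 12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

# mutate() overwrites each value with its group mean but keeps all input rows
mutated <- mydf %>%
  group_by(col1) %>%
  mutate(across(everything(), mean))

# summarise() collapses each group to one row;
# .groups = "drop" returns an ungrouped tibble (assumes dplyr >= 1.0.0)
summarised <- mydf %>%
  group_by(col1) %>%
  summarise(across(everything(), mean), .groups = "drop")

nrow(mutated)     # one row per input row
nrow(summarised)  # one row per level of col1
```

Inside a grouped pipeline, everything() does not select the grouping column, so col1 is left untouched in both calls.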
Answer 1 (score: 0)
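If columns 2 through 5 are all observations of the same measure, it makes sense to use tidyr::pivot_longer() to reshape them into long form. Then group by col1 and by the new name column (which holds the former column names col2 to col5) and take the mean of each group. Finally, to get the desired layout, use tidyr::pivot_wider().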
library(tidyverse)
mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
"A", 16, 45, 53, 35,
"A", 17, 12, 54, 12,
"A", 19, 12, 54, 35,
"B", 10, 87, 55, 22,
"B", 10, 87, 55, 22,
"B", 12, 23, 12, 67)
mydf %>%
pivot_longer(cols = -col1) %>%
group_by(col1, name) %>%
summarise(mean = mean(value)) %>%
pivot_wider(names_from = name, values_from = mean)
#> # A tibble: 2 x 5
#> # Groups: col1 [2]
#> col1 col2 col3 col4 col5
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 17.3 23 53.7 27.3
#> 2 B 10.7 65.7 40.7 37
Created on 2020-03-20 by the reprex package (v0.3.0)
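To make the intermediate step of the pivot approach easier to picture, here is a small sketch that stores the long-format table before aggregating. The names long and wide are illustrative, and .groups = "drop" assumes dplyr 1.0.0 or later; it is added here only so that the result is an ungrouped tibble.

```r
library(dplyr)
library(tidyr)
library(tibble)

mydf <- tribble(~col1, ~col2, ~col3, ~col4, ~col5,
                "A", 16, 45, 53, 35,
                "A", 17, 12, 54, 12,
                "A", 19, 12, 54, 35,
                "B", 10, 87, 55, 22,
                "B", 10, 87, 55, 22,
                "B", 12, 23, 12, 67)

# Long form: one row per (original row, column) pair,
# with the default column names "name" and "value"
long <- pivot_longer(mydf, cols = -col1)
dim(long)    # 6 rows x 4 value columns = 24 rows; columns col1, name, value

# Average within each col1/name pair, then spread back to one column per name
wide <- long %>%
  group_by(col1, name) %>%
  summarise(mean = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = name, values_from = mean)
```

For a table with around 190,000 rows, the long form has four times as many rows, so the direct summarise approach is likely the lighter option when no reshaping is otherwise needed.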