Question

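I have a large data frame containing performance data for several people over a period of time. Instead of one row per individual performance, I want a data frame with the totals/averages for each person. Here is a sample data frame: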

library(dplyr)

name <- c("dwayne", "alf", "christine", "katerina", "dwayne", "christine")
team <- c("halifax", "hamilton", "calgary", "winnipeg", "halifax", "calgary")
pos <- c("left", "middle", "middle", "right", "left", "middle")
amt1 <- c(4, 2, 5, 8, 5, 7)
amt2 <- c(12, 14, 13, 18, 17, 18)
perc1 <- c(.55, .24, .67, .45, .34, .54)
perc2 <- c(.12, .14, .16, .04, .02, .13)

# tibble() replaces the deprecated data_frame()
df <- tibble(team, pos, name, amt1, amt2, perc1, perc2)

So far I have figured out how to do this for the numeric columns using group_by and summarise_at, like this:

tot<-df %>%
  group_by(name) %>%
  summarise_at(vars(amt1:amt2), sum)

av <- df %>%
  group_by(name) %>%
  summarise_at(vars(perc1:perc2), mean)

bnd<-cbind(tot, av)

bnd <- bnd[, !duplicated(colnames(bnd))]

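However, my problem is that this approach returns a data frame that does not include the "pos" or "team" columns. They are key information when analysing this data, but they are not numeric, so they are dropped by summarise.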

How can I get the data frame "bnd" returned with those factor vectors still present?

Answer 1

As long as each name appears with only one combination of team and pos, you can include those variables in the grouping:

tot <- df %>%
  group_by(team, pos, name) %>%
  summarise_at(vars(amt1:amt2), sum) %>%
  ungroup()

# A tibble: 4 x 5
  team     pos    name       amt1  amt2
  <chr>    <chr>  <chr>     <dbl> <dbl>
1 calgary  middle christine    12    31
2 halifax  left   dwayne        9    29
3 hamilton middle alf           2    14
4 winnipeg right  katerina      8    18
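
The assumption above (one team/pos combination per name) can be checked before grouping, and the separate sum and mean steps from the question can be combined into one call. The following is a minimal sketch rather than part of the original answer; it assumes dplyr 1.0.0 or later (for across() and the .groups argument), and the object name multi is only illustrative.

```r
library(dplyr)

# Sample data from the question
df <- tibble(
  team  = c("halifax", "hamilton", "calgary", "winnipeg", "halifax", "calgary"),
  pos   = c("left", "middle", "middle", "right", "left", "middle"),
  name  = c("dwayne", "alf", "christine", "katerina", "dwayne", "christine"),
  amt1  = c(4, 2, 5, 8, 5, 7),
  amt2  = c(12, 14, 13, 18, 17, 18),
  perc1 = c(.55, .24, .67, .45, .34, .54),
  perc2 = c(.12, .14, .16, .04, .02, .13)
)

# Names that appear with more than one team/pos combination.
# An empty result means grouping by team, pos and name is safe.
multi <- df %>%
  distinct(name, team, pos) %>%
  count(name) %>%
  filter(n > 1)

# Sum the amt columns and average the perc columns in a single summarise()
bnd <- df %>%
  group_by(team, pos, name) %>%
  summarise(across(amt1:amt2, sum),
            across(perc1:perc2, mean),
            .groups = "drop")
```

Doing both summaries in one summarise() avoids the cbind() step and the duplicated-column cleanup from the question.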

Answer 2

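If you don't need to summarise each player's results separately by team or position, another way to handle players with multiple teams/positions is to keep all of them. For each name, combine the unique values of team into a single string, and do the same for pos. For example: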

library(tidyverse)

# Added a couple of additional rows for illustration
df = data.frame(name=c("dwayne", "alf", "christine", "katerina", "dwayne", "christine", "christine", "dwayne"),
                team= c("halifax", "hamilton", "calgary", "winnipeg", "halifax", "calgary", "halifax","halifax"),
                pos= c("left", "middle", "middle", "right", "left", "middle", "middle","middle"),
                amt1= c(4, 2, 5, 8, 5, 7,5,5),
                amt2 = c(12, 14, 13, 18, 17, 18,17,13),
                perc1= c(.55, .24, .67, .45, .34, .54,.56,.51),
                perc2= c(.12, .14, .16, .04, .02, .13, .11, .09))

df %>% 
  group_by(name) %>% 
  mutate(team = paste(unique(team), collapse="-"),
         pos = paste(unique(pos), collapse="-")) %>% 
  group_by(name, team, pos) %>% 
  summarise_at(vars(amt1:amt2), sum)

  name      team            pos          amt1  amt2
1 alf       hamilton        middle          2    14
2 christine calgary-halifax middle         17    48
3 dwayne    halifax         left-middle    14    42
4 katerina  winnipeg        right           8    18
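
The same idea can also be written without the intermediate mutate(), collapsing team and pos inside summarise() and adding the means of the perc columns at the same time. This is a hedged sketch, not the answerer's code; it assumes dplyr 1.0.0 or later, and the result name out is only illustrative.

```r
library(dplyr)

# Extended sample data from this answer
df <- data.frame(
  name  = c("dwayne", "alf", "christine", "katerina", "dwayne", "christine", "christine", "dwayne"),
  team  = c("halifax", "hamilton", "calgary", "winnipeg", "halifax", "calgary", "halifax", "halifax"),
  pos   = c("left", "middle", "middle", "right", "left", "middle", "middle", "middle"),
  amt1  = c(4, 2, 5, 8, 5, 7, 5, 5),
  amt2  = c(12, 14, 13, 18, 17, 18, 17, 13),
  perc1 = c(.55, .24, .67, .45, .34, .54, .56, .51),
  perc2 = c(.12, .14, .16, .04, .02, .13, .11, .09)
)

out <- df %>%
  group_by(name) %>%
  summarise(
    # Collapse every distinct team/pos seen for this name into one string
    team = paste(unique(team), collapse = "-"),
    pos  = paste(unique(pos), collapse = "-"),
    # Totals for the amount columns, averages for the percentage columns
    across(amt1:amt2, sum),
    across(perc1:perc2, mean),
    .groups = "drop"
  )
```

This returns one row per name, so the second group_by() is no longer needed.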

Summarise a grouped data frame while keeping all columns that are factor vectors
