Using dplyr summarise on a single column, but with multiple parameter values

Time: 2019-02-04 23:24:44

Tags: r group-by dplyr tidyr data-manipulation

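Apologies for the unclear title (I could use help with it); hopefully the example below clears things up. I have the following data frame of basketball shot results (1 row == 1 basketball shot):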

> dput(zed)
structure(list(shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", 
"DUKE", "DUKE", "DUKE", "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", 
"BC", "BC", "DUKE", "DUKE", "DUKE", "BC", "DUKE"), distanceCategory = c("sht2", 
"sht2", "sht3", "atr2", "mid2", "sht2", "lng3", "sht3", "atr2", 
"sht3", "sht3", "sht2", "mid2", "sht3", "sht3", "sht3", "atr2", 
"atr2", "sht2", "mid2"), eventType = c("twopointmiss", "twopointmade", 
"threepointmade", "twopointmade", "twopointmiss", "twopointmade", 
"threepointmiss", "threepointmiss", "twopointmade", "threepointmiss", 
"threepointmade", "twopointmiss", "twopointmade", "threepointmiss", 
"threepointmade", "threepointmiss", "twopointmade", "twopointmade", 
"twopointmade", "twopointmade")), row.names = c(NA, 20L), class = "data.frame")

> zed
   shooterTeamAlias distanceCategory      eventType
1              DUKE             sht2   twopointmiss
2              DUKE             sht2   twopointmade
3                BC             sht3 threepointmade
4              DUKE             atr2   twopointmade
5              DUKE             mid2   twopointmiss
6              DUKE             sht2   twopointmade
7              DUKE             lng3 threepointmiss
8              DUKE             sht3 threepointmiss
9              DUKE             atr2   twopointmade
10               BC             sht3 threepointmiss
11               BC             sht3 threepointmade
12               BC             sht2   twopointmiss
13             DUKE             mid2   twopointmade
14               BC             sht3 threepointmiss
15               BC             sht3 threepointmade
16             DUKE             sht3 threepointmiss
17             DUKE             atr2   twopointmade
18             DUKE             atr2   twopointmade
19               BC             sht2   twopointmade
20             DUKE             mid2   twopointmade

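This data frame is currently in tidy format, and I need to group by team and then widen it considerably. The full data has 6 distanceCategories: atr2, sht2, mid2, lng2, sht3, lng3 (the example above has only 5). There are also 2 categories that are functions of the other 6: all2 (atr2, sht2, lng2, mid2) and all3 (sht3, lng3). For each of these 8 categories, I want columns for makes, attempts, pct, and attempt frequency. I use the eventType column to determine whether the shot was made. I am currently using the following: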

fat.data <- {zed %>%
    dplyr::group_by(shooterTeamAlias) %>%
    dplyr::summarise(

      shotsCount = n(),
      # Shooting By Distance Stats
      atr2Made = sum(distanceCategory == "atr2" & eventType == "twopointmade"),
      atr2Att = sum(distanceCategory == "atr2" & eventType %in% c("twopointmiss", "twopointmade")),
      atr2AttFreq = atr2Att / shotsCount,
      atr2Pct = ifelse(atr2Att > 0, atr2Made / atr2Att, 0),

      sht2Made = sum(distanceCategory == "sht2" & eventType == "twopointmade"),
      sht2Att = sum(distanceCategory == "sht2" & eventType %in% c("twopointmiss", "twopointmade")),
      sht2AttFreq = sht2Att / shotsCount, 
      sht2Pct = ifelse(sht2Att > 0, sht2Made / sht2Att, 0),

      mid2Made = sum(distanceCategory == "mid2" & eventType == "twopointmade"),
      mid2Att = sum(distanceCategory == "mid2" & eventType %in% c("twopointmiss", "twopointmade")),
      mid2AttFreq = mid2Att / shotsCount,
      mid2Pct = ifelse(mid2Att > 0, mid2Made / mid2Att, 0),

      lng2Made = sum(distanceCategory == "lng2" & eventType == "twopointmade"),
      lng2Att = sum(distanceCategory == "lng2" & eventType %in% c("twopointmiss", "twopointmade")),
      lng2AttFreq = lng2Att / shotsCount,
      lng2Pct = ifelse(lng2Att > 0, lng2Made / lng2Att, 0),

      all2Made = sum(atr2Made, sht2Made, mid2Made, lng2Made),
      all2Att = sum(atr2Att, sht2Att, mid2Att, lng2Att),
      all2AttFreq = all2Att / shotsCount,
      all2Pct = ifelse(all2Att > 0, all2Made / all2Att, 0),

      sht3Made = sum(distanceCategory == "sht3" & eventType == "threepointmade"),
      sht3Att = sum(distanceCategory == "sht3" & eventType %in% c("threepointmiss", "threepointmade")),
      sht3AttFreq = sht3Att / shotsCount,
      sht3Pct = ifelse(sht3Att > 0, sht3Made / sht3Att, 0),

      lng3Made = sum(distanceCategory == "lng3" & eventType == "threepointmade"),
      lng3Att = sum(distanceCategory == "lng3" & eventType %in% c("threepointmiss", "threepointmade")),
      lng3AttFreq = lng3Att / shotsCount,
      lng3Pct = ifelse(lng3Att > 0, lng3Made / lng3Att, 0),

      all3Made = sum(sht3Made, lng3Made),
      all3Att = sum(sht3Att, lng3Att),
      all3AttFreq = all3Att / shotsCount,
      all3Pct = ifelse(all3Att > 0, all3Made / all3Att, 0))}

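...for the 6 categories that appear in the data (all except all2 and all3), the 4 columns are computed in the same way. As you can see, the calculation for all2 and all3 is slightly different.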

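Setting aside the all2 and all3 categories for now, is there a better way to compute makes, attempts, pct, and attempt frequency for the 6 categories in the data? With 8 categories * 4 column types == 32 columns here, this is not too bad, but I have another similar case with 21 categories * 4 column types, and I have to do this several times in my code.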

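I am not sure whether dplyr::group_by and dplyr::summarise are my best option here (they are what I currently use), or whether there is a better way to approach this. Improving this code and possibly speeding it up is critical for my project, so any help is greatly appreciated. I will do my best to remember to award a bounty on this post, even if it is answered within the next two days.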

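EDIT: I just realized it may be easier to group by distanceCategory first, compute the 4 statistics for each distanceCategory, and then reshape the data frame into this wide format. I am currently working on that calculation, along these lines: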

zed %>% 
  dplyr::group_by(shooterTeamAlias, distanceCategory) %>%
  dplyr::summarise(
    attempts = ...,
    makes = ...,
    pct = ...,
    attfreq = ...
  ) %>%
  tidyr::spread(...)

Thanks!!

1 Answer:

Answer 0: (score: 1)

Grouping by distanceCategory as well, and then applying the same logic to each group, makes this much simpler:

library(tidyverse)
zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarize(att = n(),   # n() counts how many rows in this group
            made = sum(str_detect(eventType, "made")),  # TRUE counts as 1
            pct = if_else(att > 0, made / att, 0)) %>%
  mutate(freq = att / sum(att))

# A tibble: 7 x 6
# Groups:   shooterTeamAlias [2]
  shooterTeamAlias distanceCategory   att  made   pct   freq
  <chr>            <chr>            <int> <int> <dbl>  <dbl>
1 BC               sht2                 2     1 0.5   0.286 
2 BC               sht3                 5     3 0.6   0.714 
3 DUKE             atr2                 4     4 1     0.308 
4 DUKE             lng3                 1     0 0     0.0769
5 DUKE             mid2                 3     2 0.667 0.231 
6 DUKE             sht2                 3     2 0.667 0.231 
7 DUKE             sht3                 2     0 0     0.154
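
The all2 and all3 roll-ups from the question are not covered by the summary above. One possible way to add them, shown here only as a sketch, is to re-aggregate the per-category counts. It assumes every category name ends in "2" or "3", which holds for the six categories listed in the question. The names by_category, rollups, and stats are illustrative, and grepl() from base R stands in for str_detect().

```r
library(dplyr)

# The 20-shot sample from the question
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade",
                "twopointmade", "twopointmiss", "twopointmade",
                "threepointmiss", "threepointmiss", "twopointmade",
                "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade",
                "threepointmiss", "twopointmade", "twopointmade",
                "twopointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Attempts and makes per team and category; shotsCount is the team total
by_category <- zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att = n(), made = sum(grepl("made", eventType))) %>%
  mutate(shotsCount = sum(att)) %>%   # still grouped by team at this point
  ungroup()

# Roll-ups: map e.g. "atr2" -> "all2" using the last character (assumption)
rollups <- by_category %>%
  mutate(last_char = substr(distanceCategory, nchar(distanceCategory),
                            nchar(distanceCategory)),
         distanceCategory = paste0("all", last_char)) %>%
  group_by(shooterTeamAlias, shotsCount, distanceCategory) %>%
  summarise(att = sum(att), made = sum(made)) %>%
  ungroup()

# Every row here has att >= 1, so pct needs no divide-by-zero guard
stats <- bind_rows(by_category, rollups) %>%
  mutate(pct = made / att, freq = att / shotsCount) %>%
  arrange(shooterTeamAlias, distanceCategory)
```

Because the roll-ups are built from the already summarised counts, adding more categories does not require any extra code, as long as the naming assumption holds.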

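If you want the result in wide format, you can gather the statistics computed above, unite the distance category with the statistic name, and then spread. Categories in which a team took no shots come out as NA: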

[same code as above] %>%
  gather(stat, value, -distanceCategory, -shooterTeamAlias) %>%
  unite(stat, distanceCategory, stat) %>%
  spread(stat, value)

# A tibble: 2 x 21
# Groups:   shooterTeamAlias [2]
  shooterTeamAlias atr2_att atr2_freq atr2_made atr2_pct lng3_att lng3_freq lng3_made lng3_pct mid2_att mid2_freq mid2_made mid2_pct sht2_att sht2_freq sht2_made sht2_pct sht3_att sht3_freq sht3_made sht3_pct
  <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>     <dbl>    <dbl>
1 BC                     NA    NA            NA       NA       NA   NA             NA       NA       NA    NA            NA   NA            2     0.286         1    0.5          5     0.714         3      0.6
2 DUKE                    4     0.308         4        1        1    0.0769         0        0        3     0.231         2    0.667        3     0.231         2    0.667        2     0.154         0      0
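
With a newer tidyr, the gather/unite/spread chain can probably be replaced by a single pivot_wider() call. The sketch below assumes tidyr 1.1.0 or later (for names_glue and a scalar values_fill) and fills the missing team/category combinations with 0 instead of NA, which matches the zero-attempt handling in the question's original code. The names long and wide are illustrative.

```r
library(dplyr)
library(tidyr)

# The 20-shot sample from the question
zed <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE", "DUKE", "DUKE", "DUKE",
                       "DUKE", "DUKE", "BC", "BC", "BC", "DUKE", "BC", "BC",
                       "DUKE", "DUKE", "DUKE", "BC", "DUKE"),
  distanceCategory = c("sht2", "sht2", "sht3", "atr2", "mid2", "sht2", "lng3",
                       "sht3", "atr2", "sht3", "sht3", "sht2", "mid2", "sht3",
                       "sht3", "sht3", "atr2", "atr2", "sht2", "mid2"),
  eventType = c("twopointmiss", "twopointmade", "threepointmade",
                "twopointmade", "twopointmiss", "twopointmade",
                "threepointmiss", "threepointmiss", "twopointmade",
                "threepointmiss", "threepointmade", "twopointmiss",
                "twopointmade", "threepointmiss", "threepointmade",
                "threepointmiss", "twopointmade", "twopointmade",
                "twopointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Long summary: one row per team and distance category
long <- zed %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att = n(), made = sum(grepl("made", eventType))) %>%
  mutate(pct = made / att, freq = att / sum(att)) %>%  # freq within team
  ungroup()

# Wide result: one row per team, columns named like atr2_att, atr2_made, ...
wide <- long %>%
  pivot_wider(names_from = distanceCategory,
              values_from = c(att, made, pct, freq),
              names_glue = "{distanceCategory}_{.value}",
              values_fill = 0)
```

A category that never occurs in the data at all (lng2 in this sample) still produces no columns; it would have to be added separately.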