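Apologies for the unclear title (I could use help with it); hopefully the example below clears things up. I have the following dataframe of basketball shot results (1 row == 1 basketball shot):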
> dput(zed)
structure(list(shooterTeamAlias = c("DUKE", "DUKE", "BC", "DUKE",
"DUKE", "DUKE", "DUKE", "DUKE", "DUKE", "BC", "BC", "BC", "DUKE",
"BC", "BC", "DUKE", "DUKE", "DUKE", "BC", "DUKE"), distanceCategory = c("sht2",
"sht2", "sht3", "atr2", "mid2", "sht2", "lng3", "sht3", "atr2",
"sht3", "sht3", "sht2", "mid2", "sht3", "sht3", "sht3", "atr2",
"atr2", "sht2", "mid2"), eventType = c("twopointmiss", "twopointmade",
"threepointmade", "twopointmade", "twopointmiss", "twopointmade",
"threepointmiss", "threepointmiss", "twopointmade", "threepointmiss",
"threepointmade", "twopointmiss", "twopointmade", "threepointmiss",
"threepointmade", "threepointmiss", "twopointmade", "twopointmade",
"twopointmade", "twopointmade")), row.names = c(NA, 20L), class = "data.frame")
> zed
shooterTeamAlias distanceCategory eventType
1 DUKE sht2 twopointmiss
2 DUKE sht2 twopointmade
3 BC sht3 threepointmade
4 DUKE atr2 twopointmade
5 DUKE mid2 twopointmiss
6 DUKE sht2 twopointmade
7 DUKE lng3 threepointmiss
8 DUKE sht3 threepointmiss
9 DUKE atr2 twopointmade
10 BC sht3 threepointmiss
11 BC sht3 threepointmade
12 BC sht2 twopointmiss
13 DUKE mid2 twopointmade
14 BC sht3 threepointmiss
15 BC sht3 threepointmade
16 DUKE sht3 threepointmiss
17 DUKE atr2 twopointmade
18 DUKE atr2 twopointmade
19 BC sht2 twopointmade
20 DUKE mid2 twopointmade
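This dataframe is currently in tidy format, and I need to group by team and then widen it considerably. The full data has 6 distanceCategories, atr2, sht2, mid2, lng2, sht3, lng3 (the example above has only 5), plus 2 more categories that are functions of the other 6: all2 is atr2, sht2, lng2, mid2, and all3 is sht3, lng3. For each of these 8 categories, I want one column each for makes, attempts, pct, and attempt frequency. I use the eventType column to determine whether the shot was made. I am currently using the following: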
fat.data <- {zed %>%
dplyr::group_by(shooterTeamAlias) %>%
dplyr::summarise(
shotsCount = n(),
# Shooting By Distance Stats
atr2Made = sum(distanceCategory == "atr2" & eventType == "twopointmade"),
atr2Att = sum(distanceCategory == "atr2" & eventType %in% c("twopointmiss", "twopointmade")),
atr2AttFreq = atr2Att / shotsCount,
atr2Pct = ifelse(atr2Att > 0, atr2Made / atr2Att, 0),
sht2Made = sum(distanceCategory == "sht2" & eventType == "twopointmade"),
sht2Att = sum(distanceCategory == "sht2" & eventType %in% c("twopointmiss", "twopointmade")),
sht2AttFreq = sht2Att / shotsCount,
sht2Pct = ifelse(sht2Att > 0, sht2Made / sht2Att, 0),
mid2Made = sum(distanceCategory == "mid2" & eventType == "twopointmade"),
mid2Att = sum(distanceCategory == "mid2" & eventType %in% c("twopointmiss", "twopointmade")),
mid2AttFreq = mid2Att / shotsCount,
mid2Pct = ifelse(mid2Att > 0, mid2Made / mid2Att, 0),
lng2Made = sum(distanceCategory == "lng2" & eventType == "twopointmade"),
lng2Att = sum(distanceCategory == "lng2" & eventType %in% c("twopointmiss", "twopointmade")),
lng2AttFreq = lng2Att / shotsCount,
lng2Pct = ifelse(lng2Att > 0, lng2Made / lng2Att, 0),
all2Made = sum(atr2Made, sht2Made, mid2Made, lng2Made),
all2Att = sum(atr2Att, sht2Att, mid2Att, lng2Att),
all2AttFreq = all2Att / shotsCount,
all2Pct = ifelse(all2Att > 0, all2Made / all2Att, 0),
sht3Made = sum(distanceCategory == "sht3" & eventType == "threepointmade"),
sht3Att = sum(distanceCategory == "sht3" & eventType %in% c("threepointmiss", "threepointmade")),
sht3AttFreq = sht3Att / shotsCount,
sht3Pct = ifelse(sht3Att > 0, sht3Made / sht3Att, 0),
lng3Made = sum(distanceCategory == "lng3" & eventType == "threepointmade"),
lng3Att = sum(distanceCategory == "lng3" & eventType %in% c("threepointmiss", "threepointmade")),
lng3AttFreq = lng3Att / shotsCount,
lng3Pct = ifelse(lng3Att > 0, lng3Made / lng3Att, 0),
all3Made = sum(sht3Made, lng3Made),
all3Att = sum(sht3Att, lng3Att),
all3AttFreq = all3Att / shotsCount,
all3Pct = ifelse(all3Att > 0, all3Made / all3Att, 0))}
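...for the 6 categories that appear in the data (all except all2 and all3), the 4 columns are computed in the same way. As you can see, the computation for all2 and all3 is slightly different.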
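Setting the all2 and all3 categories aside for now, is there a better way to compute makes, attempts, pct, and attempt frequency for the 6 categories in the data? With 8 categories * 4 column types == 32 columns this is not too bad, but I have another similar case with 21 categories * 4 column types, and I have to do this several times in my code.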
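I am not sure whether dplyr::group_by with dplyr::summarise (which is what I currently use) is my best option, or whether there is a better way to approach this. Improving this code, and possibly speeding it up, is critical to my project, so any help is greatly appreciated. Even if this gets answered within the next two days, I will try to remember to put a bounty on the post.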
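EDIT!!!: I just realized it would probably be easier to group by distanceCategory first, compute the 4 stats for each distanceCategory, and then reshape the dataframe into this fat format... I am currently working on that. Something along these lines: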
zed %>%
dplyr::group_by(shooterTeamAlias, distanceCategory) %>%
dplyr::summarise(
attempts = ...,
makes = ...,
pct = ...,
attfreq = ...
) %>%
tidyr::spread(...)
Thanks!!
Answer 0: (score: 1)
This gets much simpler if you also group by distanceCategory and then apply the same logic to every group:
library(tidyverse)
zed %>%
group_by(shooterTeamAlias, distanceCategory) %>%
summarize(att = n(), # n() counts how many rows in this group
made = sum(eventType %>% str_detect("made")),
pct = if_else(att > 0, made / att, 0)) %>%
mutate(freq = att / sum(att))
# A tibble: 7 x 6
# Groups: shooterTeamAlias [2]
shooterTeamAlias distanceCategory att made pct freq
<chr> <chr> <int> <int> <dbl> <dbl>
1 BC sht2 2 1 0.5 0.286
2 BC sht3 5 3 0.6 0.714
3 DUKE atr2 4 4 1 0.308
4 DUKE lng3 1 0 0 0.0769
5 DUKE mid2 3 2 0.667 0.231
6 DUKE sht2 3 2 0.667 0.231
7 DUKE sht3 2 0 0 0.154
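The question also mentions the all2 and all3 roll-ups. One possible way to get them with the same grouped logic is sketched below; it is not part of the original answer. It assumes every distanceCategory ends in "2" or "3", uses a small made-up `shots` data frame in place of `zed`, and introduces a hypothetical helper, `summarise_shots()`. The `.groups` argument needs dplyr 1.0.0 or later.

```r
library(dplyr)
library(stringr)

# Hypothetical stand-in for `zed`: same three columns, fewer rows
shots <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "DUKE", "BC", "BC"),
  distanceCategory = c("atr2", "sht2", "sht3", "sht3", "sht2"),
  eventType = c("twopointmade", "twopointmiss", "threepointmiss",
                "threepointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: attempts, makes, pct and attempt frequency
# per team and per category
summarise_shots <- function(df) {
  df %>%
    group_by(shooterTeamAlias, distanceCategory) %>%
    summarise(att = n(),
              made = sum(str_detect(eventType, "made")),
              .groups = "drop") %>%
    group_by(shooterTeamAlias) %>%
    mutate(pct = made / att,          # att is never 0 for an observed group
           freq = att / sum(att)) %>% # share of the team's shots
    ungroup()
}

by_distance <- summarise_shots(shots)

# Roll-ups: relabel each shot as "all2" or "all3" from the trailing digit
# (assumption: every category name ends in 2 or 3), then reuse the helper
rollups <- shots %>%
  mutate(distanceCategory = paste0("all", str_sub(distanceCategory, -1))) %>%
  summarise_shots()

long_stats <- bind_rows(by_distance, rollups) %>%
  arrange(shooterTeamAlias, distanceCategory)
```

Because `freq` is computed inside the helper, before the two tables are stacked, the roll-up rows do not inflate the denominator of the per-distance frequencies.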
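If you want the result in wide format, take the per-team, per-distance summary, gather the statistics into a single column, unite the distance with the statistic name, and then spread: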
[same code as above] %>%
gather(stat, value, -distanceCategory, -shooterTeamAlias) %>%
unite(stat, distanceCategory, stat) %>%
spread(stat, value)
# A tibble: 2 x 21
# Groups: shooterTeamAlias [2]
shooterTeamAlias atr2_att atr2_freq atr2_made atr2_pct lng3_att lng3_freq lng3_made lng3_pct mid2_att mid2_freq mid2_made mid2_pct sht2_att sht2_freq sht2_made sht2_pct sht3_att sht3_freq sht3_made sht3_pct
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BC NA NA NA NA NA NA NA NA NA NA NA NA 2 0.286 1 0.5 5 0.714 3 0.6
2 DUKE 4 0.308 4 1 1 0.0769 0 0 3 0.231 2 0.667 3 0.231 2 0.667 2 0.154 0 0
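In the wide output above, categories a team never attempted show up as NA, and a category missing from the data entirely (such as lng2) gets no columns at all. If zeros and a fixed set of columns are preferred, a variant along the following lines may help. It is only a sketch and not part of the original answer: it assumes tidyr 1.1.0 or later (for `pivot_wider()` with `names_glue`) and dplyr 1.0.0 or later, uses a made-up `shots` data frame in place of `zed`, and the vector `all_categories` is an illustrative name.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical stand-in for `zed`: same three columns, fewer rows
shots <- data.frame(
  shooterTeamAlias = c("DUKE", "DUKE", "DUKE", "BC", "BC"),
  distanceCategory = c("atr2", "sht2", "sht3", "sht3", "sht2"),
  eventType = c("twopointmade", "twopointmiss", "threepointmiss",
                "threepointmade", "twopointmade"),
  stringsAsFactors = FALSE
)

# The six base categories named in the question
all_categories <- c("atr2", "sht2", "mid2", "lng2", "sht3", "lng3")

wide_stats <- shots %>%
  group_by(shooterTeamAlias, distanceCategory) %>%
  summarise(att = n(),
            made = sum(str_detect(eventType, "made")),
            .groups = "drop") %>%
  # add team/category combinations that never occurred, with zero counts
  complete(shooterTeamAlias, distanceCategory = all_categories,
           fill = list(att = 0L, made = 0L)) %>%
  group_by(shooterTeamAlias) %>%
  mutate(pct = if_else(att > 0, made / att, 0), # guard is needed here: att can be 0
         freq = att / sum(att)) %>%
  ungroup() %>%
  # one row per team, one column per category/statistic pair, e.g. atr2_att
  pivot_wider(names_from = distanceCategory,
              values_from = c(att, made, pct, freq),
              names_glue = "{distanceCategory}_{.value}")
```

With six categories and four statistics this yields 24 statistic columns plus the team column, regardless of which categories happen to occur in the data.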