Grouping sports data that has Away and Home teams in R - a common frustration

Time: 2018-06-05 19:04:11

Tags: r dplyr

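I often work with sports data in R, and I keep running into the same problem with dplyr::group_by() when I try to compute summary statistics. I have the following data frame of projected points for every match of the World Cup group stage (its dput() output and head() are shown below):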


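I have computed epA and epB as the expected points for Team A and Team B in each match, and I now want a group_by() that computes the total expected points for each of the 32 teams. What I have historically done is along the lines of the three-step code at the end of this question. The data: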

dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal", 
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany", 
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland", 
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France", 
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic", 
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia", 
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland", 
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan", 
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia", 
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark", 
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic", 
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco", 
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica", 
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama", 
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco", 
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany", 
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765, 
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741, 
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816, 
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264, 
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665, 
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932, 
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787, 
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251, 
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294, 
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608, 
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778, 
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503, 
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA", 
"epB"), class = "data.frame", row.names = c(NA, -48L))

head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870

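That is, two separate group_by() calls, one for each of the teamA and teamB columns, followed by a left_join, followed by summing the columns and dropping the extras... yuck. And this is a simple case: there are exactly 4 columns (2 identifier columns, 2 stat columns). Since a great deal of sports data has home/away team columns, this is a common problem.

I feel like I need a single data frame with twice as many rows and half as many columns, so that I can do a single group_by. Any help is appreciated, thanks!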

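Edit: worldcup.df is built from a long %>% chain of dplyr functions, so bonus points if this can be done without creating new variables, simply by continuing the pipe. For reference, my current approach: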

asAgroupby = worldcup.df %>% 
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>% 
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))

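4 answers:

Answer 0: (score: 3)

Here is a tidyverse workflow that works by reshaping the data into long format. It keeps track of who played in the same game (game_id) and whether they were the A or B team, in case that is useful. (To be fair, this is the same basic idea as @Emil's, just a different workflow for getting there.)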

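library(tidyverse)

worldcup.long <- worldcup.df %>% 
  as_tibble() %>%
  mutate(game_id = 1:n()) %>%
  # stack every column; mixing team names and points coerces value to character
  gather(key, value, -game_id) %>%
  mutate(
    AB = str_extract(key, "A|B"),        # which side: "A" or "B"
    key = str_extract(key, "team|ep")    # which variable: "team" or "ep"
  ) %>%
  # one row per game and side; convert = TRUE turns ep back into numeric
  spread(key, value, convert = TRUE)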

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))
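
The same long-format reshaping can also be sketched with tidyr::pivot_longer, which is not what this answer uses; it assumes tidyr 1.0.0 or later. The toy data frame `games` below is an assumption for illustration: it holds only the six Group A fixtures copied from the question's data. The special ".value" name tells pivot_longer to keep `team` and `ep` as separate columns, so no character/numeric conversion is needed.

```r
library(dplyr)
library(tidyr)

# Toy data (assumption): the six Group A fixtures from the question
games <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia", "Uruguay", "Uruguay", "Saudi Arabia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt", "Saudi Arabia", "Russia", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941, 2.2089, 1.9565, 1.2201),
  epB = c(1.094, 2.0809, 1.2319, 0.5991, 0.8105, 1.5069),
  stringsAsFactors = FALSE
)

games_long <- games %>%
  mutate(game_id = row_number()) %>%
  pivot_longer(
    cols = -game_id,
    # first regex group becomes the output column name (team / ep),
    # second group becomes the value of the AB column ("A" / "B")
    names_to = c(".value", "AB"),
    names_pattern = "^(team|ep)([AB])"
  )

# One row per team per game, so a single group_by() is enough
totals <- games_long %>%
  group_by(team) %>%
  summarise(totPts = sum(ep))
```

With this shape, `games_long` has the columns game_id, AB, team and ep, which matches the layout produced by the gather/spread version.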

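Answer 1: (score: 2)

Here is a solution with fewer lines that needs no join:

library(dplyr)

# df is the question's worldcup.df
# swap the A and B columns, then reuse the original names so the copy stacks cleanly
df2 <- df[,c(2,1,4,3)]
names(df2) <- names(df)
# after stacking, teamA_name holds every team and epA holds that team's points
rbind(df, df2) %>% group_by(teamA_name) %>% summarise(sum(epA))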

# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

Same as the OP's:

outputdf
# A tibble: 32 x 2
teamA_name `sum(epA)`
<chr>           <dbl>
 1 Argentina        6.02
 2 Australia        2.38
 3 Belgium          5.70
 4 Brazil           7.03
 5 Colombia         5.82
 6 Costa Rica       2.64
 7 Croatia          4.40
 8 Denmark          3.86
 9 Egypt            3.44
10 England          5.82

Answer 2: (score: 2)

I have run into this problem with fantasy football as well. Here is how I usually handle it:

df %>% select(team = teamA_name, ep = epA) %>% 
     bind_rows(df %>% select(team = teamB_name, ep = epB)) %>% 
     group_by(team) %>% 
     summarize(ep = sum(ep))
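
Since the question asks for something that can sit in the middle of an existing %>% chain, the same idea could be wrapped in a small function. This is only a sketch: `stack_teams()` is a hypothetical helper name, and `games` is a toy data frame holding the six Group A fixtures from the question.

```r
library(dplyr)

# Hypothetical helper: stack the A-side and B-side columns into one
# long data frame with the columns `team` and `ep`
stack_teams <- function(games) {
  bind_rows(
    games %>% select(team = teamA_name, ep = epA),
    games %>% select(team = teamB_name, ep = epB)
  )
}

# Toy data (assumption): the six Group A fixtures from the question
games <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia", "Uruguay", "Uruguay", "Saudi Arabia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt", "Saudi Arabia", "Russia", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941, 2.2089, 1.9565, 1.2201),
  epB = c(1.094, 2.0809, 1.2319, 0.5991, 0.8105, 1.5069),
  stringsAsFactors = FALSE
)

# The helper slots into a pipe without any intermediate variables
totals <- games %>%
  stack_teams() %>%
  group_by(team) %>%
  summarise(totPts = sum(ep))
```

Keeping the stacking step in a function means the upstream chain that builds the data frame does not have to be interrupted or repeated.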

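Answer 3: (score: 1)

Your intuition is right: you do want a data frame with fewer columns and more rows. tidyr::gather does this; here you can pipe two gather calls together. The first gather turns the teamA_name and teamB_name columns into rows. From the entries of that key column you can then extract A or B, giving each team an "A" or "B" label. The second gather does the same for the epA and epB columns. On its own, the second gather pairs every team row with both epA and epB, so I keep only the rows where its key (pts_a_or_b) carries the same A or B as a_or_b from the first gather, and then drop that extra column (select(-pts_a_or_b)).

library(tidyverse)

df_long <- df %>%
  as_tibble() %>%
  gather(key = a_or_b, value = team, teamA_name, teamB_name) %>%
  mutate(a_or_b = str_extract(a_or_b, "(?<=team)\\w")) %>%
  gather(key = pts_a_or_b, value = points, epA, epB) %>%
  # keep A teams with epA and B teams with epB; without this every team gets both
  filter(str_extract(pts_a_or_b, "(?<=ep)\\w") == a_or_b) %>%
  select(-pts_a_or_b)

df_long
#> # A tibble: 96 x 3
#>    a_or_b team       points
#>    <chr>  <chr>       <dbl>
#>  1 A      Russia      1.64 
#>  2 A      Egypt       0.705
#>  3 A      Morocco     1.13 
#>  4 A      Portugal    1.11 
#>  5 A      France      2.20 
#>  6 A      Argentina   1.98 
#>  7 A      Peru        1.58 
#>  8 A      Croatia     1.86 
#>  9 A      Costa Rica  1.28 
#> 10 A      Germany     2.09 
#> # ... with 86 more rows

Feel free to correct me if the summary involves more than each team's total points; if I understand what you're after, you can then do:

df_long %>%
  group_by(team) %>%
  summarise(totPts = sum(points))
#> # A tibble: 32 x 2
#>    team       totPts
#>    <chr>       <dbl>
#>  1 Argentina    6.02
#>  2 Australia    2.38
#>  3 Belgium      5.70
#>  4 Brazil       7.03
#>  5 Colombia     5.82
#>  6 Costa Rica   2.64
#>  7 Croatia      4.40
#>  8 Denmark      3.86
#>  9 Egypt        3.44
#> 10 England      5.82
#> # ... with 22 more rows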
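
Whichever reshaping is used, two invariants are cheap to check before trusting the per-team totals: the long data should have exactly two rows per game, and its grand total should equal sum(epA) + sum(epB). The sketch below assumes a long data frame with a `points` column, as in this answer; `check_long()` is a hypothetical helper, `games` is a toy data frame with the six Group A fixtures from the question, and the long data is built with bind_rows purely for illustration.

```r
library(dplyr)

# Toy data (assumption): the six Group A fixtures from the question
games <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia", "Uruguay", "Uruguay", "Saudi Arabia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt", "Saudi Arabia", "Russia", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941, 2.2089, 1.9565, 1.2201),
  epB = c(1.094, 2.0809, 1.2319, 0.5991, 0.8105, 1.5069),
  stringsAsFactors = FALSE
)

# Long format with one row per team per game (built here for illustration)
long <- bind_rows(
  games %>% transmute(a_or_b = "A", team = teamA_name, points = epA),
  games %>% transmute(a_or_b = "B", team = teamB_name, points = epB)
)

# Hypothetical helper: TRUE when the long data has two rows per game
# and preserves the grand total of expected points
check_long <- function(long, games) {
  nrow(long) == 2 * nrow(games) &&
    isTRUE(all.equal(sum(long$points), sum(games$epA) + sum(games$epB)))
}

ok <- check_long(long, games)
```

A reshaping that duplicates rows, for example by crossing every team with both epA and epB, fails the first condition because it yields four rows per game.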