Moving average with multiple group-bys

Asked: 2015-08-12 19:23:34

Tags: r group-by subset dplyr moving-average

Here is a small representative sample of my data:

Team <- rep(c("ind", "sas", "ind", "sas"),c(4,8,2,4))

Player <- c("Paul George", "David West", "Roy Hibbert",
            "Paul George", "Tim Duncan", "Manuel Ginobili",
            "Tony Parker", "Boris Diaw","Danny Green", 
            "Kawhi Leonard", "Matt Bonner", "Patty Mills",
            "George Hill", "C.J.Miles","Tim Duncan",
            "Manuel Ginobili", "Tony Parker", "Boris Diaw")

Team_PTS <- c(101,101,101,98,105,105,105,105,
              105,105,105,105,98,98,89,89,89,128)

Date <- as.Date(c("2015-05-14", "2015-05-14", "2015-05-14",
               "2015-05-16","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-15", "2015-05-15", "2015-05-15",
               "2015-05-15","2015-05-16","2015-05-16","2015-05-29",
               "2015-05-29","2015-05-29","2015-06-03"))

Team_Gamenumber <- rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))

df <- data.frame(Team,Player,Team_PTS,Date, Team_Gamenumber)

df

   Team          Player Team_PTS       Date Team_Gamenumber Desired_output
1   ind     Paul George      101 2015-05-14               1            101
2   ind      David West      101 2015-05-14               1            101
3   ind     Roy Hibbert      101 2015-05-14               1            101
4   ind     Paul George       98 2015-05-16               2           99.5
5   sas      Tim Duncan      105 2015-05-15               1            105
6   sas Manuel Ginobili      105 2015-05-15               1            105
7   sas     Tony Parker      105 2015-05-15               1            105
8   sas      Boris Diaw      105 2015-05-15               1            105
9   sas     Danny Green      105 2015-05-15               1            105
10  sas   Kawhi Leonard      105 2015-05-15               1            105
11  sas     Matt Bonner      105 2015-05-15               1            105
12  sas     Patty Mills      105 2015-05-15               1            105
13  ind     George Hill       98 2015-05-16               2           99.5
14  ind       C.J.Miles       98 2015-05-16               2           99.5
15  sas      Tim Duncan       89 2015-05-29               2             97
16  sas Manuel Ginobili       89 2015-05-29               2             97
17  sas     Tony Parker       89 2015-05-29               2             97
18  sas      Boris Diaw      128 2015-06-03               3         107.33

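The desired output variable is the moving (cumulative) average of team points, computed separately for each team (sas and ind in this example).

I tried: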

library(dplyr)
df %>% group_by(Team) %>%
       mutate(cumavg_PTS = cumsum(Team_PTS) / seq_along(Team_PTS))

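However, because the data is organized by player, the output is wrong. Note that Boris Diaw missed game 2 but played in game 3.

I also think cumsum is not the right approach here, because the average is affected by the number of players in each game.

The value 107.33 is the average over the first 3 games: (105 + 89 + 128) / 3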
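To make the problem concrete, here is a minimal base R sketch using only the sas values from the data above. The vector names (sas_pts, sas_game, naive, wanted) are made up for illustration. It shows that a row-wise cumulative mean weights each game by its number of player rows, whereas the desired value uses one score per game.

```r
# Team_PTS for the "sas" rows, in the order they appear in df:
# 8 rows for game 1, 3 rows for game 2, 1 row for game 3
sas_pts  <- c(rep(105, 8), rep(89, 3), 128)
sas_game <- c(rep(1, 8), rep(2, 3), 3)

# Row-wise cumulative mean: every player row counts once,
# so game 1 is weighted 8 times and game 3 only once
naive <- cumsum(sas_pts) / seq_along(sas_pts)

# Per-game cumulative mean: keep one score per game first
per_game <- sas_pts[!duplicated(sas_game)]          # 105, 89, 128
wanted   <- cumsum(per_game) / seq_along(per_game)  # 105, 97, 107.33

round(naive[length(naive)], 2)    # 102.92
round(wanted[length(wanted)], 2)  # 107.33
```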

3 Answers:

Answer 0 (score: 5)

Here's another way. I'll use data.table to do this:

require(data.table)
setDT(df)[, cavg := { dups = !duplicated(Team_Gamenumber)
                      cumsum(Team_PTS * dups) / cumsum(dups)
                    }, by = Team]

Or just write a function:

foo <- function(points, game) {
    dups = !duplicated(game)
    cumsum(points * dups) / cumsum(dups)
}
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
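As a rough illustration of why this works, the sketch below runs the same function on a small hand-made vector in base R, without data.table. The sample vectors points and game are assumptions chosen for illustration, and the flag is renamed first because it marks the first row of each game rather than the duplicates.

```r
foo <- function(points, game) {
    first <- !duplicated(game)   # TRUE on the first row of each game
    cumsum(points * first) / cumsum(first)
}

# Two rows for game 1, three for game 2, one for game 3
points <- c(105, 105, 89, 89, 89, 128)
game   <- c(1, 1, 2, 2, 2, 3)

first <- !duplicated(game)       # TRUE FALSE TRUE FALSE FALSE TRUE
running_total <- cumsum(points * first)  # 105 105 194 194 194 322
games_so_far  <- cumsum(first)           # 1 1 2 2 2 3

foo(points, game)                # 105 105 97 97 97 107.3333
```

Multiplying by the logical flag zeroes out the repeated rows, so each game contributes its score to the running total exactly once, and cumsum(first) counts the games seen so far.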

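There is still a difference between @bgoldst's and @jeremycg's solutions. @bgoldst's computes the cumulative average on the data ordered by Team, Team_Gamenumber, whereas @jeremycg's computes it while preserving the original row order.

For example, in your df, swap the game numbers for the ind rows: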

setDT(df)[c(1:4,13:14), Team_Gamenumber := c(2,2,2,1,1,1)]
setDF(df)

Then try both versions.

We can get both answers while preserving the original order of the data, as follows:

# @jeremycg's
setDT(df)[, cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
# @bgoldst's
setDT(df)[order(Team, Team_Gamenumber), cavg := foo(Team_PTS, Team_Gamenumber), by = Team]
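The difference can be checked by hand on the ind rows after the swap. This is a base R sketch, not the data.table code above. The vectors pts and game and the result names are assumptions for illustration, and foo is redefined so the snippet is self-contained.

```r
foo <- function(points, game) {
    first <- !duplicated(game)
    cumsum(points * first) / cumsum(first)
}

# The six "ind" rows after swapping game numbers 1 and 2
pts  <- c(101, 101, 101, 98, 98, 98)
game <- c(2, 2, 2, 1, 1, 1)

# Cumulative average following the rows as they appear
in_row_order <- foo(pts, game)

# Cumulative average following game number, written back to the original rows
o <- order(game)
in_game_order <- numeric(length(pts))
in_game_order[o] <- foo(pts[o], game[o])

in_row_order   # 101 101 101 99.5 99.5 99.5
in_game_order  # 99.5 99.5 99.5 98 98 98
```

Which of the two is appropriate depends on whether row order or Team_Gamenumber should define "earlier" games.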

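Answer 1 (score: 3)

Your Team_PTS column appears to be redundant: it holds the points scored by the whole Team in game Team_Gamenumber, but the data.frame contains one row per player per game (for each game that player played in). As a result, every record for a given Team and Team_Gamenumber has the same Team_PTS value.

You can therefore "aggregate" the original df on Team and Team_Gamenumber, taking the first element of the redundant Team_PTS vector for each group, since all values within a group are identical. As part of this aggregate() call, I also work around the issue of the Team_PTS values being stored as strings, which the data.frame() call converts to factors. The simplest way I know to handle this is to coerce the factor to actual strings and then to numeric.

The aggregated table can then be supplemented with the Desired_Output column by grouping by Team and applying the formula cumsum(x)/seq_along(x). That result can then be merged with df to produce the required output.

Also note that I manually reordered output to match your expected output, so that we can easily verify that it matches.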

df <- data.frame(
    Team=rep(c('ind','sas','ind','sas'),c(4,8,2,4)),
    Player=c('Paul George','David West','Roy Hibbert','Paul George',
             'Tim Duncan','Manuel Ginobili','Tony Parker','Boris Diaw',
             'Danny Green','Kawhi Leonard','Matt Bonner','Patty Mills',
             'George Hill','C.J.Miles','Tim Duncan','Manuel Ginobili',
             'Tony Parker','Boris Diaw'),
    Team_PTS=c(101,101,101,98,105,105,105,105,105,105,105,105,98,98,89,89,89,128),
    Date=as.Date(c('2015-05-14','2015-05-14','2015-05-14','2015-05-16',
                   '2015-05-15','2015-05-15','2015-05-15','2015-05-15',
                   '2015-05-15','2015-05-15','2015-05-15','2015-05-15',
                   '2015-05-16','2015-05-16','2015-05-29','2015-05-29',
                   '2015-05-29','2015-06-03')),
    Team_Gamenumber=rep(c(1,2,1,2,2,3),c(3,1,8,2,3,1))
)
# One row per (Team, Team_Gamenumber), keeping the first Team_PTS of each group
agg <- aggregate(cbind(Team_PTS=as.double(as.character(Team_PTS)))~Team+Team_Gamenumber,df,`[`,1)
# Cumulative mean of the per-game points within each team
agg <- transform(agg,Desired_Output=ave(Team_PTS,Team,FUN=function(x) cumsum(x)/seq_along(x)))
# Merge the per-game result back onto the per-player rows
output <- merge(df,agg)[,c(names(df),'Desired_Output')]
# Reorder manually to match the expected output in the question
output[c(1:4,9,10,7,8,13,14,11,12,5,6,16:18,15),]
##    Team          Player Team_PTS       Date Team_Gamenumber Desired_Output
## 1   ind     Paul George      101 2015-05-14               1       101.0000
## 2   ind      David West      101 2015-05-14               1       101.0000
## 3   ind     Roy Hibbert      101 2015-05-14               1       101.0000
## 4   ind     Paul George       98 2015-05-16               2        99.5000
## 9   sas      Tim Duncan      105 2015-05-15               1       105.0000
## 10  sas Manuel Ginobili      105 2015-05-15               1       105.0000
## 7   sas     Tony Parker      105 2015-05-15               1       105.0000
## 8   sas      Boris Diaw      105 2015-05-15               1       105.0000
## 13  sas     Danny Green      105 2015-05-15               1       105.0000
## 14  sas   Kawhi Leonard      105 2015-05-15               1       105.0000
## 11  sas     Matt Bonner      105 2015-05-15               1       105.0000
## 12  sas     Patty Mills      105 2015-05-15               1       105.0000
## 5   ind     George Hill       98 2015-05-16               2        99.5000
## 6   ind       C.J.Miles       98 2015-05-16               2        99.5000
## 16  sas      Tim Duncan       89 2015-05-29               2        97.0000
## 17  sas Manuel Ginobili       89 2015-05-29               2        97.0000
## 18  sas     Tony Parker       89 2015-05-29               2        97.0000
## 15  sas      Boris Diaw      128 2015-06-03               3       107.3333

Answer 2 (score: 3)

Using dplyr, in a horrible mess:

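# .keep_all = TRUE keeps Team_PTS; newer dplyr versions otherwise drop
# every column not named in distinct()
df %>% distinct(Team, Team_Gamenumber, .keep_all = TRUE) %>%
       group_by(Team) %>%
       mutate(cumavg_PTS = cummean(Team_PTS)) %>%
       select(Team, Team_Gamenumber, cumavg_PTS) %>%
       inner_join(df, .)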

Joining by: c("Team", "Team_Gamenumber")
   Team          Player Team_PTS       Date Team_Gamenumber cumavg_PTS
1   ind     Paul George      101 2015-05-14               1   101.0000
2   ind      David West      101 2015-05-14               1   101.0000
3   ind     Roy Hibbert      101 2015-05-14               1   101.0000
4   ind     Paul George       98 2015-05-16               2    99.5000
5   sas      Tim Duncan      105 2015-05-15               1   105.0000
6   sas Manuel Ginobili      105 2015-05-15               1   105.0000
7   sas     Tony Parker      105 2015-05-15               1   105.0000
8   sas      Boris Diaw      105 2015-05-15               1   105.0000
9   sas     Danny Green      105 2015-05-15               1   105.0000
10  sas   Kawhi Leonard      105 2015-05-15               1   105.0000
11  sas     Matt Bonner      105 2015-05-15               1   105.0000
12  sas     Patty Mills      105 2015-05-15               1   105.0000
13  ind     George Hill       98 2015-05-16               2    99.5000
14  ind       C.J.Miles       98 2015-05-16               2    99.5000
15  sas      Tim Duncan       89 2015-05-29               2    97.0000
16  sas Manuel Ginobili       89 2015-05-29               2    97.0000
17  sas     Tony Parker       89 2015-05-29               2    97.0000
18  sas      Boris Diaw      128 2015-06-03               3   107.3333