CumSum based only on groups

Asked: 2018-03-29 19:05:56

Tags: r data.table data-manipulation cumulative-sum

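I'm currently trying to create a cumulative sum column that accumulates by Game_ID but counts the value associated with each Game_ID only once. For example, Player A takes 20 shots in Game_ID == 1 and 13 shots in Game_ID == 2. For the cumulative sum, I want each Shot_Count value (per Game_ID) to be counted only once, even though it appears multiple times in the Shot_Count column. Consider the following dataset: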

Name         Game_ID       Shot_Count        CumSum_Shots
Player A         1             20                20 
Player B         1             15                15 
Player A         1             20                20
Player A         2             13                33 ## (20 + 13)
Player A         2             13                33 ## (20 + 13)
Player B         2             35                50 ## (15 + 35)
Player A         3             30                63 ## (33 + 30)
Player B         3             20                70 ## (50 + 20)
Player A         3             30                63 ## (33 + 30)
Player A         4             12                75 ## (63 + 12)
Player A         4             12                75 ## (63 + 12)
Player B         4             10                80 ## (70 + 10)

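Keep in mind that there are other variables that make rows such as 1 and 3 non-duplicates. I have just reduced the dataset to the relevant variables.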

I tried using the cumsum function with the data.table library:

library(data.table)
dt[ , CumSum_Shots := cumsum(Shot_Count), by = list(dt$Name, dt$Game_ID)]

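However, this sums the Shot_Count rows within each game (i.e., the third row of CumSum_Shots would be 40). The code does what it says, but I'm not sure what data.table syntax exists to make it consider only the unique values of dt$Game_ID.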
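To make the problem concrete, here is a minimal sketch that rebuilds the sample data and reruns the attempt. The name `attempt` is only illustrative, and `by = .(Name, Game_ID)` is used as the usual shorthand for the `list(dt$Name, dt$Game_ID)` grouping in the question.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  Name = c("Player A", "Player B", "Player A", "Player A", "Player A", "Player B",
           "Player A", "Player B", "Player A", "Player A", "Player A", "Player B"),
  Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Shot_Count = c(20L, 15L, 20L, 13L, 13L, 35L, 30L, 20L, 30L, 12L, 12L, 10L)
)

# Grouping by both columns restarts the sum for every (Name, Game_ID) pair
# and adds up the repeated rows inside each pair
attempt <- copy(dt)[, CumSum_Shots := cumsum(Shot_Count), by = .(Name, Game_ID)]

attempt$CumSum_Shots
# 20 15 40 13 26 35 30 20 60 12 24 10
```

The third row is 40 because Player A's two Game_ID == 1 rows (20 and 20) are summed, and the total never carries over from one game to the next.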

3 answers:

Answer 0 (score: 4)

Take the unique rows, compute, then merge:

dt[unique(dt, by = c('Name', 'Game_ID', 'Shot_Count'))
       [, Cum_Shots := cumsum(Shot_Count), by = Name]
   , on = .(Name, Game_ID), Cum_Shots := Cum_Shots]

R is a messy language.
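The one-liner above can be unpacked into three steps. This is only a sketch of the same idea: `per_game` is an illustrative name, and the `i.` prefix is used here (instead of the bare column name in the answer) to make explicit that the value comes from the joined table.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  Name = c("Player A", "Player B", "Player A", "Player A", "Player A", "Player B",
           "Player A", "Player B", "Player A", "Player A", "Player A", "Player B"),
  Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Shot_Count = c(20L, 15L, 20L, 13L, 13L, 35L, 30L, 20L, 30L, 12L, 12L, 10L)
)

# Step 1 (unique): keep one row per (Name, Game_ID, Shot_Count)
per_game <- unique(dt, by = c("Name", "Game_ID", "Shot_Count"))

# Step 2 (compute): running total per player over the de-duplicated rows
per_game[, Cum_Shots := cumsum(Shot_Count), by = Name]

# Step 3 (merge): update join that copies each total back to every matching row of dt
dt[per_game, on = .(Name, Game_ID), Cum_Shots := i.Cum_Shots]

dt$Cum_Shots
# 20 15 20 33 33 50 63 70 63 75 75 80
```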

Answer 1 (score: 1)

I assume you are already using data.table, in which case you can do this:

Code:

library(data.table)
merge(dt, 
      dt[, Shot_Count[1], .(Name, Game_ID)][, .(CumSum_Shots = cumsum(V1), Game_ID), Name], 
      sort = FALSE)

Output:

        Name Game_ID Shot_Count CumSum_Shots
 1: Player A       1         20           20
 2: Player B       1         15           15
 3: Player A       1         20           20
 4: Player A       2         13           33
 5: Player A       2         13           33
 6: Player B       2         35           50
 7: Player A       3         30           63
 8: Player B       3         20           70
 9: Player A       3         30           63
10: Player A       4         12           75
11: Player A       4         12           75
12: Player B       4         10           80

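Explanation:

  • dt[, Shot_Count[1], .(Name, Game_ID)]: takes the first ([1]) Shot_Count for each Name and Game_ID. This is what the OP wants (each value counted only once).
  • [, .(CumSum_Shots = cumsum(V1), Game_ID), Name]: computes the cumulative sum by Name and keeps the Game_ID information.
  • merge(dt, ..., sort = FALSE): merges with the original data and preserves the original row order.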
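The three steps listed above can also be run one at a time to inspect the intermediate tables. This is a sketch only; the names `first_shots`, `running` and `result` are illustrative and do not appear in the answer.

```r
library(data.table)

# Sample data from the question
dt <- data.table(
  Name = c("Player A", "Player B", "Player A", "Player A", "Player A", "Player B",
           "Player A", "Player B", "Player A", "Player A", "Player A", "Player B"),
  Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  Shot_Count = c(20L, 15L, 20L, 13L, 13L, 35L, 30L, 20L, 30L, 12L, 12L, 10L)
)

# One row per (Name, Game_ID); the unnamed j expression gets the default name V1
first_shots <- dt[, Shot_Count[1], .(Name, Game_ID)]

# Running total per player, carrying Game_ID along for the merge
running <- first_shots[, .(CumSum_Shots = cumsum(V1), Game_ID), Name]

# Merge on the shared columns (Name, Game_ID) without re-sorting the rows
result <- merge(dt, running, sort = FALSE)

running[Name == "Player B", CumSum_Shots]
# 15 50 70 80
```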

Input (dt):

structure(list(Name = c("Player A", "Player B", "Player A", "Player A", 
"Player A", "Player B", "Player A", "Player B", "Player A", "Player A", 
"Player A", "Player B"), Game_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 
3L, 3L, 3L, 4L, 4L, 4L), Shot_Count = c(20L, 15L, 20L, 13L, 13L, 
35L, 30L, 20L, 30L, 12L, 12L, 10L)), .Names = c("Name", "Game_ID", 
"Shot_Count"), row.names = c(NA, -12L), class = c("data.table", 
"data.frame"))

Edit:

When chaining long data.table expressions, I prefer magrittr pipes:

library(magrittr)
dt %>%
    .[, Shot_Count[1], .(Name, Game_ID)] %>%
    .[, .(CumSum_Shots = cumsum(V1), Game_ID), Name] %>%
    merge(dt, ., sort = FALSE)

Answer 2 (score: 1)

Without a merge, you can cumsum the unique values (per Name, Game_ID and Shot_Count) and then use rep to get the right length.

dt[, CumSum_Shots2 := rep(cumsum(Shot_Count[!duplicated(Game_ID)]), times = .SD[,.N,by = .(Game_ID, Shot_Count)]$N) , 
   by = .(Name)]

dt
 #      Name Game_ID Shot_Count CumSum_Shots CumSum_Shots2
 #1: PlayerA       1         20           20            20
 #2: PlayerB       1         15           15            15
 #3: PlayerA       1         20           20            20
 #4: PlayerA       2         13           33            33
 #5: PlayerA       2         13           33            33
 #6: PlayerB       2         35           50            50
 #7: PlayerA       3         30           63            63
 #8: PlayerB       3         20           70            70
 #9: PlayerA       3         30           63            63
#10: PlayerA       4         12           75            75
#11: PlayerA       4         12           75            75
#12: PlayerB       4         10           80            80
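To see what the `rep(cumsum(...))` expression does inside one `by = .(Name)` group, here is a base R sketch for Player A's rows only. The variable names are illustrative, and `rle()` stands in for the per-game row counts that the answer obtains with `.N`. The approach appears to rely on each game's rows being adjacent within a player, since `rep` lays the totals out in order.

```r
# Player A's rows from the question, in their original order
game_id    <- c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
shot_count <- c(20L, 20L, 13L, 13L, 30L, 30L, 12L, 12L)

# Keep the first Shot_Count seen for each game
once_per_game <- shot_count[!duplicated(game_id)]   # 20 13 30 12

# Running total over the de-duplicated values
totals <- cumsum(once_per_game)                     # 20 33 63 75

# Number of consecutive rows belonging to each game
rows_per_game <- rle(game_id)$lengths               # 2 2 2 2

# Stretch each total back out to the original number of rows
cum_per_row <- rep(totals, times = rows_per_game)
cum_per_row
# 20 20 33 33 63 63 75 75
```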