Cumulative mean with a condition

Time: 2015-08-25 19:25:09

Tags: r mean sumifs cumulative-sum

New to R. A small sample of my df:

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway))
df

TeamHome TeamAway PTS_TeamHome PTS_TeamAway
  LAL      IND          101           95
  HOU      LAL           87           89
  SAS      LAL           94          105
  MIA      HOU          110          111
  LAL      NOP           95          121

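Imagine these are the first five games of a season with 1230 games. I want to calculate the cumulative points per game (the mean) for the home team and the away team at any given point in time.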

The output would look like this:

  TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
1  LAL      IND          101           95                101                 95
2  HOU      LAL           87           89                 87                 95
3  SAS      LAL           94          105                 94              98.33
4  MIA      HOU          110          111                110                 99
5  LAL      NOP           95          121               97.5                121

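Notice what the formula does for the home team in the fifth game. Since LAL is the home team, it looks up how many points LAL has scored so far, whether playing at home or on the road. In this case (101 + 89 + 105 + 95) / 4 = 97.5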
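As a quick check of that arithmetic, here is a minimal sketch that follows LAL alone. The vector name lal_points is only illustrative; the values are LAL's points in games 1, 2, 3 and 5 of the sample.

```r
# Points LAL scored in its four games so far (home and away), in game order
lal_points <- c(101, 89, 105, 95)

# Running mean: cumulative sum divided by the number of games played so far
lal_running_mean <- cumsum(lal_points) / seq_along(lal_points)
print(lal_running_mean)

# The last running value is the plain mean of all four games
print(mean(lal_points))
```

The four running values are the ones that appear next to LAL in the desired output, ending with 97.5 for the fifth game.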

Here is what I tried, without much success:

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- ( cumsum(df[which(df$TEAM1[1:i]==df$TEAM1[i]),df$PTS_TeamAway,0]) 
                                 + cumsum(df[which(df$TEAM2[1:i]==df$TEAM1[i]),df$PTS_TeamHome,0]) ) 
                             / #divided by number of games
  df$HOMETEAM_AVGCUMPTS <- unlist(lst)

I wanted to compute the cumulative PTS and then the number of games to divide it by, but none of this worked.

4 answers:

Answer 0 (score: 3)

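I think you should restructure your data into a tidier format, with two rows per game: one for the visiting team and one for the home team. Data in tidy/long format is much easier to work with.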

library(dplyr)
library(tidyr)

df %>%
  mutate(game = row_number()) %>%
  gather(location, team, TeamHome, TeamAway) %>%
  gather(location2, points, PTS_TeamHome, PTS_TeamAway) %>%
  filter(
    (location == "TeamHome" & location2 == "PTS_TeamHome") | 
      (location == "TeamAway" & location2 == "PTS_TeamAway")
  ) %>%
  select(-location2) %>%
  arrange(game) %>%
  group_by(team) %>%
  mutate(run_mean_points = cummean(points))

Data

# note that cbind() is removed.

df <- data.frame(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway, stringsAsFactors = FALSE)

Source: local data frame [10 x 5]
Groups: team

   game location team points run_mean_points
1     1 TeamHome  LAL    101       101.00000
2     1 TeamAway  IND     95        95.00000
3     2 TeamHome  HOU     87        87.00000
4     2 TeamAway  LAL     89        95.00000
5     3 TeamHome  SAS     94        94.00000
6     3 TeamAway  LAL    105        98.33333
7     4 TeamHome  MIA    110       110.00000
8     4 TeamAway  HOU    111        99.00000
9     5 TeamHome  LAL     95        97.50000
10    5 TeamAway  NOP    121       121.00000
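For readers who want the two columns from the question rather than the long table, here is a sketch of the same long-format idea in base R, mapped back to the wide layout at the end. It is an illustration under the sample data above, not part of the original answer; the names long, home and away are assumptions made for the example.

```r
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
PTS_TeamHome <- c(101, 87, 94, 110, 95)
PTS_TeamAway <- c(95, 89, 105, 111, 121)
df <- data.frame(TeamHome, TeamAway, PTS_TeamHome, PTS_TeamAway,
                 stringsAsFactors = FALSE)
n <- nrow(df)

# One row per team per game: all home rows first, then all away rows
long <- data.frame(
  game     = rep(seq_len(n), times = 2),
  location = rep(c("home", "away"), each = n),
  team     = c(df$TeamHome, df$TeamAway),
  points   = c(df$PTS_TeamHome, df$PTS_TeamAway),
  stringsAsFactors = FALSE
)

# Sort chronologically so the running mean follows game order
long <- long[order(long$game), ]

# Running mean of points within each team
long$run_mean_points <- ave(long$points, long$team,
                            FUN = function(x) cumsum(x) / seq_along(x))

# Map the long result back onto the original one-row-per-game layout
home <- long[long$location == "home", ]
away <- long[long$location == "away", ]
df$HOMETEAM_AVGCUMPTS <- home$run_mean_points
df$ROADTEAM_AVGCUMPTS <- away$run_mean_points
print(df)
```

Because the long table is sorted by game before it is split, the home and away subsets line up row for row with the original df.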

Answer 1 (score: 3)

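Here is a short loop version that iterates over each unique team name only once (rather than twice over every row). The idea is to preallocate a matrix of the required size and then run a short for loop over the unique team names, filling in the correct entries of the matrix as we go. Both the matrix and the temporary data set are created in transposed form so that values are filled row-wise rather than column-wise (R's default), because the sequence of games runs row-wise.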

## Transpose the data once
tempdf <- t(df)     
## Create transposed matrix with future column names
mat <- matrix(NA, 2, nrow(df))
rownames(mat) <- c("HOMETEAM_AVGCUMPTS", "ROADTEAM_AVGCUMPTS")    
## Create a vector of unique team names
indx <- as.character(unique(unlist(df[1:2])))
## Run the loop only over the unique team names
for (i in indx) {
  indx2 <- tempdf[1:2, ] == i               
  temp <- tempdf[3:4, ][indx2]
  mat[indx2] <- cumsum(temp)/seq_along(temp)
}
## Combine result with the original data
cbind(df, t(mat))
#   TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1      LAL      IND          101           95              101.0           95.00000
# 2      HOU      LAL           87           89               87.0           95.00000
# 3      SAS      LAL           94          105               94.0           98.33333
# 4      MIA      HOU          110          111              110.0           99.00000
# 5      LAL      NOP           95          121               97.5          121.00000
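The reason for transposing can be seen on a toy example. The sketch below uses three made-up rows taken from the sample (the matrix names are illustrative only) and compares the order in which logical indexing returns values with and without t().

```r
# Toy data: three games, LAL plays in all of them
teams_wide  <- cbind(home = c("LAL", "HOU", "LAL"),
                     away = c("IND", "LAL", "NOP"))
points_wide <- cbind(home = c(101, 87, 95),
                     away = c(95, 89, 121))

# Untransposed: logical indexing walks down the home column first,
# then the away column, so the values are not in game order
wide_order <- points_wide[teams_wide == "LAL"]

# Transposed: each column is one game, so values come out game by game
game_order <- t(points_wide)[t(teams_wide) == "LAL"]

print(wide_order)
print(game_order)
```

Only the transposed version returns LAL's points in the order the games were played, which is what the cumulative sum in the loop relies on.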

Answer 2 (score: 3)

Transposing. Here is one way, reproducing the loop from @DavidArenburg's answer:

sv <- t(df[3:4])
tv <- t(df[1:2])
df[c("homeavg","awayavg")] <- t(ave(sv,tv,FUN=cummean))

cummean comes from library(dplyr); you can swap it for a base R analogue if you prefer, and likewise change the column names.
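One possible base R analogue is sketched below. The function name cummean_base is an assumption made for this example, not something defined by dplyr or base R; the rest repeats the transposing approach on the sample data.

```r
# Hypothetical base R stand-in for dplyr::cummean
cummean_base <- function(x) cumsum(x) / seq_along(x)

df <- data.frame(
  TeamHome     = c("LAL", "HOU", "SAS", "MIA", "LAL"),
  TeamAway     = c("IND", "LAL", "LAL", "HOU", "NOP"),
  PTS_TeamHome = c(101, 87, 94, 110, 95),
  PTS_TeamAway = c(95, 89, 105, 111, 121),
  stringsAsFactors = FALSE
)

sv <- t(df[3:4])  # 2 x n numeric matrix of points, one column per game
tv <- t(df[1:2])  # 2 x n character matrix of team names

# Running mean within each team, then transpose back to n x 2
res <- t(ave(sv, tv, FUN = cummean_base))
df$homeavg <- unname(res[, 1])
df$awayavg <- unname(res[, 2])
print(df)
```

The columns are assigned one at a time here only to keep the result free of the row names carried over by t().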

Or interleaving. All the transposing above is hard to follow. Instead, you can interleave the vectors using Arun's approach:

interleave <- function(a,b) c(a,b)[order(c(seq_along(a), seq_along(b)))]
unleave    <- function(x) split(x,1:2)

sv2 <- interleave(df$PTS_TeamHome,df$PTS_TeamAway)
tv2 <- interleave(df$TeamHome,df$TeamAway)

df[c("homeavg","awayavg")] <- unleave(ave(sv2,tv2,FUN=cummean))
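To see what the two helpers do on their own, here is a small sketch using the first three games of the sample. The vector names home_pts, away_pts, mixed and parts are illustrative.

```r
interleave <- function(a, b) c(a, b)[order(c(seq_along(a), seq_along(b)))]
unleave    <- function(x) split(x, 1:2)

home_pts <- c(101, 87, 94)
away_pts <- c(95, 89, 105)

# Alternate home and away values: game 1 home, game 1 away, game 2 home, ...
mixed <- interleave(home_pts, away_pts)
print(mixed)

# split() recycles 1:2, sending odd positions to "1" and even positions to "2"
parts <- unleave(mixed)
print(parts)
```

Interleaving puts the values in game order, which is what lets ave() compute a chronological running mean per team before unleave() separates the home and away columns again.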

Answer 3 (score: 2)

lst <- list()
for(i in 1:nrow(df)) lst[[i]] <- mean(c(df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamHome[i]],
                                        df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamHome[i]]))
df$HOMETEAM_AVGCUMPTS <- unlist(lst)


lst2 <- list()
for(i in 1:nrow(df)) lst2[[i]] <- mean(c(df$PTS_TeamAway[1:i][df$TeamAway[1:i] == df$TeamAway[i]],
                                        df$PTS_TeamHome[1:i][df$TeamHome[1:i] == df$TeamAway[i]]))
df$ROADTEAM_AVGCUMPTS <- unlist(lst2)


df
#   TeamHome TeamAway PTS_TeamHome PTS_TeamAway HOMETEAM_AVGCUMPTS ROADTEAM_AVGCUMPTS
# 1      LAL      IND          101           95                101                 95
# 2      HOU      LAL           87           89                 87                 95
# 3      SAS      LAL           94          105                 94           98.33333
# 4      MIA      HOU          110          111                110                 99
# 5      LAL      NOP           95          121               97.5                121

The approach is split into two loops. In each one we take the mean of two vectors, combined in the form mean(c(vec1, vec2)).

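The first vector is the set of points the home team scored when playing at home (team in col1, points in col3); the second is the set of points the same team scored when playing away (team in col2, points in col4). We use a for loop because it makes it easy to control how many rows are considered in the subset. With df$PTS_TeamHome[1:i], the set is limited to games played up to and including the current one. We then subset that vector with [df$TeamHome[1:i] == df$TeamHome[i]]. In plain language, this means "the teams in the TeamHome column, up to the current game, that equal the home team currently playing". With these arguments, no "future" games can contaminate the analysis.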
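The same prefix-subsetting logic can also be wrapped in a small helper so that both columns share one definition. This is only a sketch of the idea described above; the helper name team_mean_upto is hypothetical and not part of the original answer.

```r
df <- data.frame(
  TeamHome     = c("LAL", "HOU", "SAS", "MIA", "LAL"),
  TeamAway     = c("IND", "LAL", "LAL", "HOU", "NOP"),
  PTS_TeamHome = c(101, 87, 94, 110, 95),
  PTS_TeamAway = c(95, 89, 105, 111, 121),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean points scored by `team` in games 1..i,
# counting both its home games and its away games
team_mean_upto <- function(df, team, i) {
  rows <- seq_len(i)  # only past and current games, never future ones
  home_pts <- df$PTS_TeamHome[rows][df$TeamHome[rows] == team]
  away_pts <- df$PTS_TeamAway[rows][df$TeamAway[rows] == team]
  mean(c(home_pts, away_pts))
}

n <- nrow(df)
df$HOMETEAM_AVGCUMPTS <- vapply(
  seq_len(n), function(i) team_mean_upto(df, df$TeamHome[i], i), numeric(1))
df$ROADTEAM_AVGCUMPTS <- vapply(
  seq_len(n), function(i) team_mean_upto(df, df$TeamAway[i], i), numeric(1))
print(df)
```

vapply() is used in place of the list-and-unlist pattern so that each result is checked to be a single number.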

For the data, I set the stringsAsFactors argument to FALSE and converted the points columns to class numeric. See below.

Data

PTS_TeamHome <- c(101,87,94,110,95)
PTS_TeamAway <- c(95,89,105,111,121)
TeamHome <- c("LAL", "HOU", "SAS", "MIA", "LAL")
TeamAway <- c("IND", "LAL", "LAL", "HOU", "NOP")
df <- data.frame(cbind(TeamHome, TeamAway,PTS_TeamHome,PTS_TeamAway), stringsAsFactors=F)
df[3:4] <- lapply(df[3:4], function(x) as.numeric(x))