Applying a function in a loop across two different data frames

Date: 2019-07-04 23:12:08

Tags: r function dataframe

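I have two data frames. The first (games) shows, for each of several games, the year and which players achieved certain unspecified goals (player1, player2, player3). The second (rankings) shows each player's ranking in a given year.

My goal is to add a column to the games data frame giving the average ranking of all the players who achieved those goals in each game.

Reproducible example: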

set.seed(0)
players <- c("Abe", "Bob", "Chris", "John", "Jane", "Linda", "Mason", "Zoe", "NA")
years <- c(2000:2005)
season <- sample(years, 20, replace = TRUE)
player1 <- sample(players, 20, replace = TRUE)
player2 <- sample(players, 20, replace = TRUE)
player3 <- sample(players, 20, replace = TRUE)
games <- data.frame(season, player1, player2, player3, stringsAsFactors = FALSE)
rankings <- data.frame(replicate(6,sample(1:5,8,rep=TRUE)))
colnames(rankings) <- years
ranked_players <- players[-9]
rankings <- cbind(ranked_players, rankings)

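games is the first data frame. It shows the year of each game (season) and who was player1, player2, and player3. Not every game has a player in every category.

rankings is the second data frame. It shows each player's ranking, from 1 to 5, in a given year.

For each game, I want to look up the rankings of the players who filled player1, player2, and player3, and average those rankings.

To look up a ranking, I tried the following function: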

calc_ranking <- function(x, y) {
  z <- select(filter(rankings, ranked_players==x), c(y))
  z <- as.integer(z[1,1])
  z
}

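It appears to work. Now I need to apply it to every player in every game, using that game's year.

I tried this loop:

new_col <- mapply(calc_ranking, games$player1, games$season)

But it doesn't work. It gives me this error: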

 Error in inds_combine(.vars, ind_list) : Position must be between 0 and n 

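However, even if it worked, I would still have to repeat this solution three times to create three columns, one each for player1, player2, and player3, and then create the column I actually want (the mean of the three). I suspect there is a more efficient way to do this without repeating the loop (assuming I can fix it)? That would be very useful, because in my real dataset I have 13 "roles" for which I need to compute rankings.

Hopefully this second question is better than my first. Sorry, I have only been learning R for one week (and that is all of my coding experience).

Thanks a lot!

2 answers:

Answer 0: (score: 1)

My understanding is that each row in games is a separate game with its own ID. So: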

season    player1    player2   player3
  2001        Joe       Bill      Jane

player    season     ranking
   Joe      2001           1
  Bill      2001           3
  Jane      2001           5

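The expected answer for this game is 3. The simplest way to get there is to melt the data into long format and then merge the two data.frames on season and player name. Edit: added a dcast() line to produce output similar to @heds's.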

library(data.table)

setDT(games)
games[, game_id := seq_len(.N), keyby = season]

molten_games <- melt(games, id.vars = c('season', 'game_id'), variable.name = 'player_number', value.name = 'player')

setDT(rankings)
molten_rankings <- melt(rankings, id.vars = 'ranked_players', variable.name = 'season', value.name = 'ranking', variable.factor = F)[, season:= as.integer(season)]

merged_dt <- molten_rankings[molten_games
                             , on = .(season
                                      , ranked_players = player )
                             , nomatch = 0L
                             ]

merged_dt[, mean(ranking, na.rm = T), by = .(season, game_id)]
    season game_id       V1
 1:   2000       1 2.666667
 2:   2001       2 2.000000
 3:   2001       3 1.333333
 4:   2001       4 4.000000
 5:   2002       1 3.333333
...
#or if you want all the players and rankings
dcast(merged_dt, season + game_id ~ player_number, value.var = c('ranked_players', 'ranking')
      )[, means := rowMeans(.SD), .SDcols = c('ranking_player1', 'ranking_player2', 'ranking_player3')][]

    season game_id ranked_players_player1 ranked_players_player2 ranked_players_player3 ranking_player1 ranking_player2 ranking_player3    means
 1:   2000       1                   John                  Mason                    Zoe               2               2               4 2.666667
 2:   2001       1                   <NA>                    Zoe                   <NA>              NA               4              NA       NA
 3:   2001       2                  Chris                   Jane                  Linda               1               1               4 2.000000
 4:   2001       3                   Jane                   Jane                   John               1               1               2 1.333333
 5:   2001       4                  Linda                    Zoe                    Zoe               4               4               4 4.000000
...
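The melt-and-merge idea can also be sketched in base R on the small Joe/Bill/Jane example above. This is only an illustration under assumptions: the second game, the `role_cols` vector, and a single running `game_id` are made up here, and `rankings` is assumed to be in long format already.

```r
# Toy data: the worked example plus one hypothetical game with a missing player
games <- data.frame(
  season  = c(2001L, 2001L),
  player1 = c("Joe", "Bill"),
  player2 = c("Bill", "Jane"),
  player3 = c("Jane", NA),
  stringsAsFactors = FALSE
)
rankings <- data.frame(
  player  = c("Joe", "Bill", "Jane"),
  season  = 2001L,
  ranking = c(1, 3, 5),
  stringsAsFactors = FALSE
)

games$game_id <- seq_len(nrow(games))
role_cols <- c("player1", "player2", "player3")

# "Melt" by hand: one row per (game, role)
long_games <- data.frame(
  season        = rep(games$season, times = length(role_cols)),
  game_id       = rep(games$game_id, times = length(role_cols)),
  player_number = rep(role_cols, each = nrow(games)),
  player        = unlist(games[role_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Inner join on season and player; rows without a ranked player drop out
merged <- merge(long_games, rankings, by = c("season", "player"))

# Average ranking per game
game_means <- aggregate(ranking ~ season + game_id, data = merged, FUN = mean)
game_means$ranking
# 3 4
```

Because the join is an inner join, a game's mean is taken over only the players that were found in rankings, which mirrors what nomatch = 0L does in the data.table version.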

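Since you seem to be using dplyr, here is a similar approach, although I couldn't work out how to spread the result back to wide format at the end: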

library(dplyr)
library(tidyr)

long_rankings <- rankings%>%
  gather(key = 'season', value = 'ranking', - ranked_players)%>%
  mutate(season = as.integer(season))

long_games <- games%>%
  arrange(season)%>%
  group_by(season)%>%
  mutate(game_id = row_number())%>%
  ungroup()%>%
  gather(key = 'player_number', value = 'player', -season, - game_id)

inner_join(long_rankings
           ,long_games
           , by = c('season' = 'season'
                    , 'ranked_players' = 'player'))%>%
  group_by(season, game_id)%>%
  summarize(game_rank_ave = mean(ranking, na.rm = T))

   season game_id game_rank_ave
    <int>   <int>         <dbl>
 1   2000       1          2.67
 2   2001       1          4   
 3   2001       2          2   
 4   2001       3          1.33
 5   2001       4          4   
 6   2002       1          3.33
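For the missing widening step, one possible sketch uses base R's reshape() on the joined long table. The `long` data frame below is a hypothetical stand-in for the result of the join (before summarizing), with made-up values; the column names follow the ones used above.

```r
# Hypothetical long table: one row per (season, game_id, role)
long <- data.frame(
  season        = c(2001, 2001, 2001, 2002, 2002, 2002),
  game_id       = c(1, 1, 1, 1, 1, 1),
  player_number = rep(c("player1", "player2", "player3"), times = 2),
  player        = c("Joe", "Bill", "Jane", "Bill", "Jane", "Joe"),
  ranking       = c(1, 3, 5, 1, 2, 3),
  stringsAsFactors = FALSE
)

# One row per game; player/ranking columns are suffixed with the role name,
# e.g. player.player1, ranking.player1, ...
wide <- reshape(
  long,
  idvar     = c("season", "game_id"),
  timevar   = "player_number",
  direction = "wide"
)

# Row-wise mean over the ranking columns only
rank_cols <- grep("^ranking\\.", names(wide), value = TRUE)
wide$means <- rowMeans(wide[rank_cols])
wide$means
# 3 2
```

Unlike the summarize() call above, rowMeans() without na.rm = TRUE returns NA for any game that has a missing ranking; passing na.rm = TRUE would average over the players that are present.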

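About generating the data: be careful with cbind()! It coerces objects to a matrix, and a matrix can hold only one class, such as character or numeric. I changed how the data.frame is built to avoid that problem.

Data used: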
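A quick check of when this coercion happens, using toy values that are not from the question: with plain vectors cbind() returns a matrix, whereas if one of the arguments is a data frame, cbind() uses its data frame method and the column types are kept. Assigning the new column directly, as done in the data below, avoids the question either way.

```r
# Two plain vectors: the result is a matrix, so the numbers become character
m <- cbind(c("Abe", "Bob"), c(1, 2))
is.matrix(m)     # TRUE
is.character(m)  # TRUE

# One argument is a data frame: the result is a data frame, numbers stay numeric
ranks <- data.frame(y2000 = c(1, 2), y2001 = c(3, 4))
df <- cbind(ranked_players = c("Abe", "Bob"), ranks)
is.data.frame(df)     # TRUE
is.numeric(df$y2000)  # TRUE

# Adding the column by assignment never involves a matrix
ranks$ranked_players <- c("Abe", "Bob")
is.numeric(ranks$y2001)  # TRUE
```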

set.seed(0)
players <- c("Abe", "Bob", "Chris", "John", "Jane", "Linda", "Mason", "Zoe", "NA")
years <- c(2000:2005)
season <- sample(years, 20, replace = TRUE)
player1 <- sample(players, 20, replace = TRUE)
player2 <- sample(players, 20, replace = TRUE)
player3 <- sample(players, 20, replace = TRUE)
games <- data.frame(season, player1, player2, player3, stringsAsFactors = FALSE)
rankings <- data.frame(replicate(6,sample(1:5,8,rep=TRUE)))
colnames(rankings) <- years
ranked_players <- players[-9]
#rankings <- cbind(ranked_players, rankings) ##don't cbind unless you're making a matrix!
rankings$ranked_players <- players[-9] 

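Answer 1: (score: 1)

My interpretation of your expected output differs from @Cole's: I assume you want a column holding the mean ranking of the players in each game. My strategy is to extract each player into their own data frame (which also requires changing the column names to something non-numeric). Then, for each of player1, player2, and player3, I find the player in that "position" and look up their ranking in that player's data frame. I'm sure there are better ways to do this, but it works in this case (though I'm not sure it is exactly what you're after).

Output: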

> head(games,10)
     season player1 player2 player3 player1_rank player2_rank player3_rank    means
1  year2005   Mason    John    John            3            3            3 3.000000
2  year2001    <NA>     Zoe    <NA>           NA            4           NA       NA
3  year2002     Bob   Linda   Chris            3            4            3 3.333333
4  year2003   Linda     Zoe    Jane            5            5            3 4.333333
5  year2005     Bob    Jane   Chris            5            1            3 3.000000
6  year2001   Chris    Jane   Linda            1            1            4 2.000000
7  year2005    John     Zoe   Chris            3            3            3 3.000000
8  year2005     Abe     Abe    Jane            4            4            1 3.000000
9  year2003    John    Jane   Mason            1            3            3 2.333333
10 year2003     Zoe   Mason     Abe            5            3            5 4.333333

Code:

library(dplyr)  # filter() below comes from dplyr

for (player in players){
    temp_player <- filter(rankings, ranked_players == player)
    colnames(temp_player) <- c("player","year2000","year2001","year2002","year2003","year2004","year2005")
    assign(paste(player), temp_player)
}

games$season <- paste0("year",games$season)
games[games=="NA"] <- NA

i <- 1
for (rows in games$player1){
        if (!is.na(games$player1[i])) {
            season <- games$season[i]
            games$player1_rank[i] <- get(games$player1[i])[,season]
        }
        else
        {
            games$player1_rank[i] <- NA
        }
        i <- i + 1
    }

i <- 1
for (rows in games$player2){
        if (!is.na(games$player2[i])) {
            season <- games$season[i]
            games$player2_rank[i] <- get(games$player2[i])[,season]
        }
        else
        {
            games$player2_rank[i] <- NA
        }
        i <- i + 1
    }

i <- 1
for (rows in games$player3){
        if (!is.na(games$player3[i])) {
            season <- games$season[i]
            games$player3_rank[i] <- get(games$player3[i])[,season]
        }
        else
        {
            games$player3_rank[i] <- NA
        }
        i <- i + 1
    }

games$means <- rowMeans(games[, 5:7])
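The three loops differ only in the column they read, so they can be folded into a single loop over the role columns, which also scales to 13 roles. The following is a sketch with made-up toy data, not a drop-in replacement: `rank_matrix` and `role_cols` are names introduced here, and the lookup uses base R matrix indexing instead of one data frame per player.

```r
# Toy data (hypothetical values), shaped like the data after the renaming above
rankings <- data.frame(
  ranked_players = c("Abe", "Bob", "Chris"),
  year2000 = c(1, 2, 3),
  year2001 = c(4, 5, 1),
  stringsAsFactors = FALSE
)
games <- data.frame(
  season  = c("year2000", "year2001", "year2001"),
  player1 = c("Abe", "Bob", NA),
  player2 = c("Bob", "Chris", "Abe"),
  stringsAsFactors = FALSE
)

# Rankings as a matrix: rows are players, columns are seasons
rank_matrix <- as.matrix(rankings[, -1])
rownames(rank_matrix) <- rankings$ranked_players

# Every column named player1, player2, ... is treated as a role
role_cols <- grep("^player[0-9]+$", names(games), value = TRUE)
col_idx <- match(games$season, colnames(rank_matrix))

for (role in role_cols) {
  row_idx <- match(games[[role]], rownames(rank_matrix))
  # A two-column index matrix picks one (player, season) cell per game;
  # rows containing NA yield NA
  games[[paste0(role, "_rank")]] <- rank_matrix[cbind(row_idx, col_idx)]
}

games$means <- rowMeans(games[paste0(role_cols, "_rank")])
games$means
# 1.5 3.0 NA
```

As in the output above, a game with a missing player gets NA as its mean; rowMeans(..., na.rm = TRUE) would average over the players that are present instead.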

Data:

set.seed(0)
players <- c("Abe", "Bob", "Chris", "John", "Jane", "Linda", "Mason", "Zoe", "NA")
years <- c(2000:2005)
season <- sample(years, 20, replace = TRUE)
player1 <- sample(players, 20, replace = TRUE)
player2 <- sample(players, 20, replace = TRUE)
player3 <- sample(players, 20, replace = TRUE)
games <- data.frame(season, player1, player2, player3, stringsAsFactors = FALSE)
rankings <- data.frame(replicate(6,sample(1:5,8,rep=TRUE)))
colnames(rankings) <- years
ranked_players <- players[-9]
rankings <- cbind(ranked_players, rankings)