我正在将这个篮球比赛数据与大约50,000行的数据框游戏一起使用。我正在尝试比较每场比赛中每支球队(A和B)的统计数据。
我还有一个名为teamStats的数据框,该数据框每个季节都有大约3000行与每个团队。
到目前为止,我已经汇编了以下代码:
for (i in 1:nrow(games)) {
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i]))) == 1) {
selectTeamA <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamA[i])),4:45]
} else {
selectTeamA <- as.numeric(rep(NA, ncol(differences)))
}
if (length(which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i]))) == 1) {
selectTeamB <- teamStats[which(((teamStats$Year == games$Season[i])==1) & (teamStats$teamID == games$teamB[i])),4:45]
} else {
selectTeamB <- as.numeric(rep(NA, ncol(differences)))
}
differences[i,] <- selectTeamA - selectTeamB
}
基本上,此代码在设置了正确的赛季之后为每个A和B团队搜索了正确的teamID。由于每个赛季的每支球队都没有出现在teamstats中,因此我现在用NA填补了缺失的行。 “差异”数据框是一个空的数据框,该填充框将填充我和A队在for循环中的状态差异。
让您对数据有所了解:
游戏-前6行
Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100
108126 2010 1107 1393 B 43 75
108127 2010 1143 1178 A 95 61
teamStats-前6行,仅前6列用于空间-整个数据帧中具有不同统计信息的许多列。代码为teamID找到正确的行,然后减去stat列,例如G W L等
School Year teamID G W L
1 abilene christian 2018 1101 32 16 16
2 air force 2018 1102 31 12 19
3 akron 2018 1103 32 14 18
4 alabama a&m 2018 1105 31 3 28
5 alabama-birmingham 2018 1412 33 20 13
关闭这个很长的帖子,我的问题。我的for循环代码可以正常工作并填充差异数据帧。问题是运行此代码需要20到30分钟。我对处理这么多数据不是很有经验。有我不知道的技术吗?如何以更有效的方式重写此代码?
答案 0 :(得分:1)
一种方法是合并games
和teamStats
,作为遍历行的替代方法。
一些代码可以复制您的设置,以创建一个最小的工作示例:
library(dplyr)
library(purrr)
set.seed(123)
n_games <- 50000
n_teams <- 400
n_years <- 10
games <- data.frame(Season = rep(2005:(2005 + n_years - 1),
each = n_games / n_years)) %>%
mutate(teamA = sample(1000:(1000 + n_teams - 1), n_games, r = TRUE),
teamB = map_int(teamA, ~sample(setdiff(1000:(1000 + n_teams - 1), .), 1)),
scoreA = as.integer(rnorm(n_games, 80, 20)),
scoreB = as.integer(rnorm(n_games, 80, 20)),
scoreB = ifelse(scoreA == scoreB, scoreA + sample(c(-1, 1), n_games, r = TRUE), scoreB),
winner = ifelse(scoreA > scoreB, "A", "B"))
gen_random_string <- function(...) {
paste(sample(c(letters, " "), rpois(1, 10), r = TRUE), collapse = "")
}
schools_ids <- data.frame(teamID = 1000:(1000 + n_teams - 1)) %>%
mutate(School = map_chr(teamID, gen_random_string))
teamStats <- data.frame(Year = rep(2005:(2005 + n_years - 1), each = 300)) %>%
mutate(teamID = as.vector(replicate(n_years, sample(schools_ids$teamID, 300))),
G = 32, W = rpois(length(teamID), 16), L = G - W) %>%
left_join(schools_ids)
我们有games
(行数为5万)和teamStats(行数为3000)。现在,我们用teamStats
和Year
将teamID
折叠成小标题:
teamStats <- teamStats %>%
group_by(Year, teamID) %>%
nest()
# # A tibble: 3,000 x 3
# Year teamID data
# <int> <int> <list>
# 1 2005 1321 <tibble [1 x 4]>
# 2 2005 1192 <tibble [1 x 4]>
# 3 2005 1074 <tibble [1 x 4]>
# <snip>
制作一个小的便利函数来计算差异:
calculate_diff <- function(x, y) {
if (is.null(x) | is.null(y)) {
data.frame(G = NA, W = NA, L = NA)
} else {
x[, 1:3] - y[, 1:3]
}
}
现在,我们(1)将games
与teamStats
连接(或合并),(2)使用连接的数据集计算差异,以及(3)unnest
(或取消折叠) )数据框。
start <- Sys.time()
differences <- games %>%
left_join(teamStats, c("Season" = "Year", "teamA" = "teamID")) %>%
rename(teamA_stats = data) %>%
left_join(teamStats, c("Season" = "Year", "teamB" = "teamID")) %>%
rename(teamB_stats = data) %>%
mutate(diff = map2(teamA_stats, teamB_stats, calculate_diff)) %>%
select(Season, teamA, teamB, diff) %>%
unnest()
difftime(Sys.time(), start)
# Time difference of 11.27832 secs
结果
head(differences)
# Season teamA teamB G W L
# 1 2005 1115 1085 NA NA NA
# 2 2005 1315 1177 NA NA NA
# 3 2005 1163 1051 0 -9 9
# 4 2005 1353 1190 0 -4 4
# 5 2005 1376 1286 NA NA NA
# 6 2005 1018 1362 0 -1 1
答案 1 :(得分:1)
这是一种使用tidyverse
软件包的方法,我希望它比OP中的循环解决方案要快得多。我希望这种速度来自于更多地依赖数据库联接操作(例如,基础merge
或dplyr的left_join
)来连接两个表。
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
为了对此进行测试,我更改了给定的样本数据,以使两个表相互关联:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
然后我得到以下输出,这似乎很有意义。 (我只是意识到我应用的聚集/变异/扩展改变了列的顺序;如果有时间,我可能会尝试使用mutate_if来保留顺序。)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30