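I'm looking for help optimizing my sqldf code, which generates aggregated historical statistics using a non-equi join, i.e. for each row the data is aggregated only up to that row's date.
It is important that any solution can be applied to many different groupings, e.g. restricting the aggregation by tourney_name in the sqldf example.
Get the data: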
library(dplyr)
library(sqldf)

# Download one CSV of ATP matches per year (2000-2018) and stack them
data_list <- list()
for (i in 2000:2018) {
  data_list[[as.character(i)]] <-
    readr::read_csv(paste0('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_', i, '.csv')) %>%
    as.data.frame()
}
data <- data.table::rbindlist(data_list)
# Keep only the columns used below
data <- select(data, tourney_name, tourney_date, match_num, winner_id, winner_name, loser_id, loser_name)
system.time(
  data2 <- sqldf("select a.*,
                         count(b.winner_id) as winner_overall_wins
                  from data a
                  left join data b
                    on (a.winner_id = b.winner_id and a.tourney_date > b.tourney_date)
                  group by a.tourney_name, a.tourney_date, a.match_num, a.winner_id
                  order by tourney_date desc, tourney_name, match_num desc",
                 stringsAsFactors = FALSE)
) # takes 16 sec, would like to look for a vectorized solution
head(data2)
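To make the intended result concrete, here is a minimal sketch on made-up data. The `toy` data frame and the `prior_count` helper are hypothetical names used only for illustration. The helper is a slow row-by-row reference definition of the output I want, not the fast solution I am looking for. The second call shows the kind of extra grouping (by tourney_name) that a solution should also handle.

```r
# Toy data (made up for illustration; not taken from the ATP files)
toy <- data.frame(
  tourney_name = c("A", "A", "B", "B", "A"),
  tourney_date = c(20000103, 20000103, 20000110, 20000110, 20010101),
  match_num    = c(1, 2, 1, 2, 1),
  winner_id    = c(101, 102, 101, 102, 101),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each row, count the rows with a strictly earlier
# tourney_date that agree with it on every column named in `by`.
prior_count <- function(df, by) {
  vapply(seq_len(nrow(df)), function(i) {
    same_group <- Reduce(`&`, lapply(by, function(col) df[[col]] == df[[col]][i]))
    sum(same_group & df$tourney_date < df$tourney_date[i])
  }, integer(1))
}

# Same logic as the sqldf query: prior wins by the same winner_id
toy$winner_overall_wins <- prior_count(toy, "winner_id")
# Variant with an extra grouping: prior wins at the same tournament
toy$winner_tourney_wins <- prior_count(toy, c("winner_id", "tourney_name"))
print(toy)
```

For the last row (winner 101 at tournament A in 2001), winner_overall_wins is 2 and winner_tourney_wins is 1, because only one of that player's two earlier wins was at tournament A.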
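Approaches I have tried to speed up the code:
For loop - too slow
dplyr full join / filter - uses more than 60 GB of memory.
data.table / cumsum - could not get the code to work. I would prefer a non-data.table approach, but I am willing to learn it for a solution that generalizes.
Thanks!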