我有一个5m观测值的数据框,其简化版本如下所示:
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=c("A","B","C"), seller =c("B","A","D"),amount=c(1,4,2))
上面的示例如下:2020年5月5日,代理商A从代理商B购买1笔金额,依此类推。
在数据集中,大约有800,000个不同日期的约50万唯一买方和卖方。
对于每个日期,我想创建一个nxn矩阵,该矩阵代表正在交易的代理商的每日库存变化。然后应将此每日计算的矩阵存储在列表中。因此,对于上面的示例,结果将是:
╔══════════════╗
║ A B C D ║
╠══════════════╣
║ A 0 -3 0 0 ║
║ B +3 0 0 0 ║
║ C 0 0 0 2 ║
║ D 0 0 -2 0 ║
╚══════════════╝
代理商A首先从代理商B购买了1个单位,然后又卖回了4个,因此有-3。
我的代码如下:
library("tidyverse")
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=as.character(c("A","B","C")), seller =as.character(c("B","A","D")),amount=c(1,4,2))
daily_matrices <- list() #create empty list to store matrices
dates <- unique(as.Date(df$date))
for (i in 1: length(dates)) { # loop over every date
loop_date <- dates[i]
daily_subset <- df %>% filter(date==loop_date) #filter data for each date
daily_subset_long <- daily_subset %>%
gather(key="var", value="agent",buyer,seller)
daily_agents <- distinct(daily_subset_long, agent) # find unique agents
daily_pairs<-combn(daily_agents$agent,2) # find each possible pair
ndim <- dim(daily_agents)[1]
daily_matrices[[i]] <- matrix(data=0,nrow=ndim, ncol=ndim) #span matrix
colnames(daily_matrices[[i]])<-daily_agents$agent #name columns with agents
rownames(daily_matrices[[i]])<-daily_agents$agent #name rows with agents
for (j in 1: dim(daily_pairs)[2]) { # for each possible pair call below function
trading_partner(daily_pairs[1,j],daily_pairs[2,j])
}
print(i) # just to track progress
}
trading_partner <-function(x,y) {
agent_daily_subset <- daily_subset %>% filter(buyer== x & seller== y | buyer== y & seller== x) # filter trades for each pair
agent_daily_subset_long <- agent_daily_subset %>%
gather(key="var", value="agent",buyer,seller)
agent_daily_subset_long <- agent_daily_subset_long %>% group_by(agent) %>%
mutate(delta_inventory = case_when(var =="buyer" ~ amount,
var =="seller" ~ -amount)) # calculates change in inventory for each trade
subgroup_inventory <- agent_daily_subset_long %>% group_by(agent) %>% summarise(inventory = sum(delta_inventory)) # summarisses change in inventory for each of the two agents in a pair
if (dim(subgroup_inventory)[1] >0) { #if there has been a trade between the pair paste the inventory change in the list of matrices and find the correct row and column by the name of the agents
daily_matrices[[i]][as.character(subgroup_inventory[1,1]),as.character(subgroup_inventory[2,1])] <<- as.double(subgroup_inventory[1,2])
daily_matrices[[i]][as.character(subgroup_inventory[2,1]),as.character(subgroup_inventory[1,1])] <<- as.double(subgroup_inventory[2,2])
}
}
这可以按预期工作,但是由于原始数据集中每天大约有1000种不同的代理,因此我遇到了问题,因此我创建了大量矩阵。
我知道,在R中使用循环并不是一开始的首选方式,但是无法提出其他解决方案。上面的代码对于每个每日矩阵大约需要30分钟。如果有800天的时间,则需要2个星期才能完成工作。
是否有可能以最佳实践的方式加快速度?
答案 0 :(得分:2)
尝试一下:
library(dplyr)
library(tidyr)
df %>%
group_by(date) %>%
do(bind_rows(., transmute(., date, b = buyer, buyer = seller, seller = b, amount = -amount) %>%
select(-b))) %>%
group_by(date, buyer, seller) %>%
summarize(amount = sum(amount)) %>%
group_by(date) %>%
complete(buyer=c(buyer,seller), seller=c(buyer,seller), fill = list(amount = 0)) %>%
ungroup() %>%
pivot_wider(names_from = seller, values_from = amount, values_fill=list(amount=0))
# # A tibble: 4 x 6
# date buyer A B C D
# <date> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020-05-05 A 0 -3 0 0
# 2 2020-05-05 B 3 0 0 0
# 3 2020-05-05 C 0 0 0 2
# 4 2020-05-05 D 0 0 -2 0
仅供参考:tidyr
中用于重塑的推荐功能现在为pivot_longer
和pivot_wider
; expand
和gather
尚未被弃用,但是pivot_*
函数具有更大的功能。
有时,data.table
可以更快和/或更高效地使用内存。如果要使用更大的数据进行测试。
注意:我正在使用tidyr::complete
,因为它做得很好。由于这些操作中的许多操作都是在汇总或扩展中,data.table
的引用语义没有获得太多优势,因此,我认为跨包使用对我们的危害不大。
此外,我正在使用magrittr
的{{1}}运算符(对tidyverse很熟悉)来介绍每个步骤。这不是必需的,但我认为它可以使代码更具可读性。如果从magrittr管道流转换为仅%>%
的流,您的执行可能会缩短一纳秒。
data.table
没有library(data.table)
library(tidyr)
library(magrittr)
DT <- as.data.table(df)
copy(DT) %>%
.[, c("buyer", "seller", "amount") := .(seller, buyer, -amount) ] %>%
list(., DT) %>%
rbindlist(.) %>%
.[, .(amount = sum(amount)), by = .(date, buyer, seller) ] %>%
.[, tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date) ] %>%
dcast(date + buyer ~ seller, value.var = "amount")
# date buyer A B C D
# 1: 2020-05-05 A 0 -3 0 0
# 2: 2020-05-05 B 3 0 0 0
# 3: 2020-05-05 C 0 0 0 2
# 4: 2020-05-05 D 0 0 -2 0
的传统data.table
流程:
%>%