加快大型矩阵的创建和填充

时间:2020-06-08 17:31:17

标签: r matrix

我有一个5m观测值的数据框,其简化版本如下所示:

df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=c("A","B","C"), seller =c("B","A","D"),amount=c(1,4,2))

上面的示例如下:2020年5月5日,代理商A从代理商B购买1笔金额,依此类推。

在数据集中,大约有800,000个不同日期的约50万唯一买方和卖方。

对于每个日期,我想创建一个nxn矩阵,该矩阵代表正在交易的代理商的每日库存变化。然后应将此每日计算的矩阵存储在列表中。因此,对于上面的示例,结果将是:

╔══════════════╗
║  A  B  C  D  ║
╠══════════════╣
║ A 0 -3  0  0 ║
║ B +3 0  0  0 ║
║ C 0  0  0  2 ║
║ D 0  0 -2  0 ║
╚══════════════╝

代理商A首先从代理商B购买了1个单位,然后又卖回了4个,因此有-3。

我的代码如下:

library("tidyverse")

df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=as.character(c("A","B","C")), seller =as.character(c("B","A","D")),amount=c(1,4,2))

    daily_matrices <- list() #create empty list to store matrices
    dates <- unique(as.Date(df$date))

for (i in 1: length(dates)) { # loop over every date
  loop_date <- dates[i]
  daily_subset <- df %>% filter(date==loop_date) #filter data for each date
  daily_subset_long <- daily_subset %>%
    gather(key="var", value="agent",buyer,seller) 
  daily_agents <- distinct(daily_subset_long, agent) # find unique agents
  daily_pairs<-combn(daily_agents$agent,2) # find each possible pair
  ndim <- dim(daily_agents)[1]
  daily_matrices[[i]] <- matrix(data=0,nrow=ndim, ncol=ndim) #span matrix
  colnames(daily_matrices[[i]])<-daily_agents$agent #name columns with agents
  rownames(daily_matrices[[i]])<-daily_agents$agent #name rows with agents

  for (j in 1: dim(daily_pairs)[2]) { # for each possible pair call below function 
    trading_partner(daily_pairs[1,j],daily_pairs[2,j])
  }
  print(i) # just to track progress

}

trading_partner <-function(x,y) {
    agent_daily_subset <- daily_subset %>% filter(buyer== x & seller== y | buyer== y & seller== x) # filter trades for each pair

  agent_daily_subset_long <- agent_daily_subset %>%
    gather(key="var", value="agent",buyer,seller) 

  agent_daily_subset_long <- agent_daily_subset_long %>% group_by(agent) %>%
    mutate(delta_inventory = case_when(var =="buyer" ~ amount,
                                    var =="seller" ~ -amount)) # calculates change in inventory for each trade

  subgroup_inventory <- agent_daily_subset_long %>% group_by(agent) %>% summarise(inventory = sum(delta_inventory)) # summarisses change in inventory for each of the two agents in a pair
  if (dim(subgroup_inventory)[1] >0) { #if there has been a trade between the pair paste the inventory change in the list of matrices and find the correct row and column by the name of the agents

  daily_matrices[[i]][as.character(subgroup_inventory[1,1]),as.character(subgroup_inventory[2,1])] <<- as.double(subgroup_inventory[1,2])
  daily_matrices[[i]][as.character(subgroup_inventory[2,1]),as.character(subgroup_inventory[1,1])] <<- as.double(subgroup_inventory[2,2])


    }
  }

这可以按预期工作,但是由于原始数据集中每天大约有1000种不同的代理,因此我遇到了问题,因此我创建了大量矩阵。

我知道,在R中使用循环并不是一开始的首选方式,但是无法提出其他解决方案。上面的代码对于每个每日矩阵大约需要30分钟。如果有800天的时间,则需要2个星期才能完成工作。

是否有可能以最佳实践的方式加快速度?

1 个答案:

答案 0 :(得分:2)

尝试一下:

library(dplyr)
library(tidyr)
df %>%
  group_by(date) %>%
  do(bind_rows(., transmute(., date, b = buyer, buyer = seller, seller = b, amount = -amount) %>%
  select(-b))) %>%
  group_by(date, buyer, seller) %>%
  summarize(amount = sum(amount)) %>%
  group_by(date) %>%
  complete(buyer=c(buyer,seller), seller=c(buyer,seller), fill = list(amount = 0)) %>%
  ungroup() %>%
  pivot_wider(names_from = seller, values_from = amount, values_fill=list(amount=0))
# # A tibble: 4 x 6
#   date       buyer     A     B     C     D
#   <date>     <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020-05-05 A         0    -3     0     0
# 2 2020-05-05 B         3     0     0     0
# 3 2020-05-05 C         0     0     0     2
# 4 2020-05-05 D         0     0    -2     0

仅供参考:tidyr中用于重塑的推荐功能现在为pivot_longerpivot_widerexpandgather尚未被弃用,但是pivot_*函数具有更大的功能。


有时,data.table可以更快和/或更高效地使用内存。如果要使用更大的数据进行测试。

注意:我正在使用tidyr::complete,因为它做得很好。由于这些操作中的许多操作都是在汇总或扩展中,data.table的引用语义没有获得太多优势,因此,我认为跨包使用对我们的危害不大。

此外,我正在使用magrittr的{​​{1}}运算符(对tidyverse很熟悉)来介绍每个步骤。这不是必需的,但我认为它可以使代码更具可读性。如果从magrittr管道流转换为仅%>%的流,您的执行可能会缩短一纳秒。

data.table

没有library(data.table) library(tidyr) library(magrittr) DT <- as.data.table(df) copy(DT) %>% .[, c("buyer", "seller", "amount") := .(seller, buyer, -amount) ] %>% list(., DT) %>% rbindlist(.) %>% .[, .(amount = sum(amount)), by = .(date, buyer, seller) ] %>% .[, tidyr::complete(.SD, buyer, seller, fill = list(amount = 0)), by = .(date) ] %>% dcast(date + buyer ~ seller, value.var = "amount") # date buyer A B C D # 1: 2020-05-05 A 0 -3 0 0 # 2: 2020-05-05 B 3 0 0 0 # 3: 2020-05-05 C 0 0 0 2 # 4: 2020-05-05 D 0 0 -2 0 的传统data.table流程:

%>%
相关问题