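Short version: I have a merge operation that is a bit more complex than usual, and I would like help optimizing it with dplyr or merge. I already have several solutions, but they run quite slowly on large datasets, and I am curious whether a faster approach exists in R (or, alternatively, in SQL or Python).
I have two data.frames: df, an asynchronous log of observations by period t and Store, and Stores, a table of store-level details.
The problem: a store ID is a unique identifier for a specific location, but a location can change ownership from one period to the next (and, for completeness, no two owners can hold the same store at the same time). So when I merge in the store-level information, I need some kind of condition that matches the store-level information to the correct period.
Reproducible example: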
# Asynchronous log.
# t is the period.
# Store is the store location ID.
# var1 is just some variable.
set.seed(1)
df <- data.frame(
  t = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 = runif(11,0,1)
)
# Store table.
# As you can see, lots of store locations open and close.
# StartDate is when this business came into existence.
# Store is the store ID from df.
# CloseDate is when this business went out of business.
# storeVar1 is just some important variable to merge over.
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)
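Now, I only want to merge the information from Stores into df if that Store was open for business in that period (t). CloseDate and StartDate indicate the last and first periods of this business's operation, respectively. (For completeness, though not very important: a StartDate of 0 means the store has existed since before the sample began, and a CloseDate of 9 means the store had not gone out of business at that location by the end of the sample.)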
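To make the matching rule concrete, here is a minimal base-R sketch (not one of the solutions discussed in this thread, just an illustration): pair each log row with every owner of the same Store, then keep the owner whose StartDate and CloseDate bracket t. The names candidates and naive are made up for this example, and stringsAsFactors = FALSE is added only so the result does not depend on the R version.

```r
# Toy data copied from the reproducible example above.
set.seed(1)
df <- data.frame(
  t     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Store = c(1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3),
  var1  = runif(11, 0, 1)
)
Stores <- data.frame(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Step 1: pair every log row with every owner of the same Store.
candidates <- merge(df, Stores, by = "Store")

# Step 2: keep only the owner that was operating in period t.
keep  <- candidates$StartDate <= candidates$t &
  candidates$t <= candidates$CloseDate
naive <- candidates[keep, ]

# Restore the (t, Store) ordering of the log and reset row names.
naive <- naive[order(naive$t, naive$Store), ]
rownames(naive) <- NULL
naive[, c("t", "Store", "var1", "storeVar1")]
```

On the toy data the intermediate table has 18 rows, which the filter reduces to the 11 rows of the log. Two caveats: the intermediate table grows with the number of owners per store, which is why this is an illustration rather than a scalable approach, and merge() here is an inner join, so a log row with no operating owner would be dropped rather than kept with NA.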
One solution relies on split() at the level of period t and dplyr::bind_rows() (dplyr::rbind_all() in older dplyr versions), e.g.
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store") {
  # Split the log into one data.frame per period t.
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    # Join only the stores that were open during this period.
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores,
                    StartDate <= as.numeric(Period) &
                      CloseDate >= as.numeric(Period)),
      by = by
    )
  }
  # Stack the per-period results back into a single data.frame.
  dplyr::bind_rows(temp)
}
complxMerge_v1(df, Stores, "Store")
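Functionally, this seems to work (at least I have not run into any significant errors yet). However, we are dealing with (increasingly common) billions of rows of log data.
If you want to benchmark against it, I made a larger reproducible example on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
Two questions: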
Answer 0 (score: 5)
In R, you could take a look at the data.table::foverlaps function:
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
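As a possible alternative to the interval-overlap call, the same condition can be written as a non-equi update join. This is a sketch, not part of the answer above, and it assumes a data.table version that accepts inequalities in `on`. The names df_dt and stores_dt are made up here so that the objects above are left untouched.

```r
library(data.table)

# Toy data copied from the question, built directly as data.tables.
set.seed(1)
df_dt <- data.table(
  t     = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4),
  Store = c(1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3),
  var1  = runif(11, 0, 1)
)
stores_dt <- data.table(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e")
)

# For each row of stores_dt, find the log rows with the same Store whose
# period t lies in [StartDate, CloseDate], and copy storeVar1 onto them
# by reference. Log rows without an operating owner keep NA.
df_dt[stores_dt,
      on = .(Store, t >= StartDate, t <= CloseDate),
      storeVar1 := i.storeVar1]
df_dt
```

Because the update happens by reference, the log keeps its original row order and no expanded copy of the store table is built. Whether this is faster than foverlaps on the full dataset is not something this sketch establishes; it would need to be benchmarked.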
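Answer 1 (score: 1)
You can transform the Stores data.frame by adding a t column that contains all the values of t for which a given store record is valid, and then use the unnest function from Hadley's tidyr package to convert it to "long" form.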
require("tidyr")
require("dplyr")
complxMerge_v2 <- function(df, Stores, by = NULL) {
Stores %>% mutate(., t = lapply(1:nrow(.),
function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
unnest(t) %>% left_join(df, ., by = by)
}
complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")
microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)
# Unit: milliseconds
#                        expr      min       lq      mean    median        uq       max neval
#  complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962    10
#  complxMerge_v2(df, Stores)  532.744  539.743  567.7207  561.9635  588.0637  636.5775    10
Here are the results of going through the process step by step.
Stores_with_t <-
Stores %>% mutate(., t = lapply(1:nrow(.),
function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
#   StartDate Store CloseDate storeVar1                            t
# 1         0     1         9         a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2         0     2         2         b                      0, 1, 2
# 3         0     3         3         c                   0, 1, 2, 3
# 4         4     2         9         d             4, 5, 6, 7, 8, 9
# 5         4     3         9         e             4, 5, 6, 7, 8, 9
# After that, `unnest(t)`
Stores_with_t_unnest <-
  Stores_with_t %>% unnest(t)
#    StartDate Store CloseDate storeVar1 t
# 1          0     1         9         a 0
# 2          0     1         9         a 1
# 3          0     1         9         a 2
# 4          0     1         9         a 3
# 5          0     1         9         a 4
# 6          0     1         9         a 5
# 7          0     1         9         a 6
# 8          0     1         9         a 7
# 9          0     1         9         a 8
# 10         0     1         9         a 9
# 11         0     2         2         b 0
# 12         0     2         2         b 1
# 13         0     2         2         b 2
# 14         0     3         3         c 0
# 15         0     3         3         c 1
# 16         0     3         3         c 2
# 17         0     3         3         c 3
# 18         4     2         9         d 4
# 19         4     2         9         d 5
# 20         4     2         9         d 6
# 21         4     2         9         d 7
# 22         4     2         9         d 8
# 23         4     2         9         d 9
# 24         4     3         9         e 4
# 25         4     3         9         e 5
# 26         4     3         9         e 6
# 27         4     3         9         e 7
# 28         4     3         9         e 8
# 29         4     3         9         e 9
# And then simple `left_join`
left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
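The left join keeps one output row per log row only if the expanded table has at most one row per (Store, t) pair, which is the "no two owners at the same time" assumption from the question. A minimal base-R sketch of that check follows. The names periods, expanded and has_overlap are made up for this example, and Map(seq, ...) stands in for the lapply() expansion used above.

```r
# Store table copied from the question.
Stores <- data.frame(
  StartDate = c(0, 0, 0, 4, 4),
  Store     = c(1, 2, 3, 2, 3),
  CloseDate = c(9, 2, 3, 9, 9),
  storeVar1 = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Expand each ownership spell into one row per period it covers.
periods  <- Map(seq, Stores$StartDate, Stores$CloseDate)
expanded <- data.frame(
  Store     = rep(Stores$Store, lengths(periods)),
  t         = unlist(periods),
  storeVar1 = rep(Stores$storeVar1, lengths(periods)),
  stringsAsFactors = FALSE
)

# TRUE if some (Store, t) pair is claimed by more than one owner.
has_overlap <- anyDuplicated(expanded[, c("Store", "t")]) > 0
has_overlap
```

If has_overlap were TRUE, the join would silently duplicate log rows, so it may be worth running a check like this on the real store table before merging. Note also that the expanded table has one row per store per open period (29 rows here), so its size grows with the length of the ownership spells.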