R - Faster merging of two data.frames with row-level conditionals

Date: 2015-05-17 21:02:09

Tags: python mysql r merge dplyr

Short version: I have a more complicated merge operation than usual, and I'd like help optimizing it with dplyr or merge. I already have several working solutions, but they run quite slowly on large datasets, and I'm curious whether a faster approach exists in R (or in SQL or Python).

I have two data.frames:

  1. an asynchronous log of events associated with stores, and
  2. a table providing more detail about the stores that appear in that log.

The problem: a store ID uniquely identifies a particular location, but a store location can change ownership from one period to the next (and, for completeness, no two owners can hold the same store at the same time). So when I merge over store-level information, I need some kind of condition that attaches the store-level information to the correct time period.

    Reproducible example:

    # asynchronous log. 
    #  t for period. 
    #  Store for store loc ID
    #  var1 just some variable. 
    set.seed(1)
    df <- data.frame(
      t     = c(1,1,1,2,2,2,3,3,4,4,4),
      Store = c(1,2,3,1,2,3,1,3,1,2,3),
      var1 =  runif(11,0,1)
    )
    
    # Store table
    # You can see, lots of store locations opening and closing.
    #  StartDate is when this business came into existence
    #  Store is the store id from df
    #  CloseDate is when this store went out of business
    #  storeVar1 is just some important var to merge over
    Stores <- data.frame(
      StartDate = c(0,0,0,4,4),
      Store     = c(1,2,3,2,3),
      CloseDate = c(9,2,3,9,9),
      storeVar1 = c("a","b","c","d","e")
    )
    

    Now, I simply want to merge the information from the Stores data.frame onto the log whenever a store was open for business in that period (t). CloseDate and StartDate give the last and first periods, respectively, in which the business operated. (Less important, but for completeness: a StartDate of 0 means the store existed before the sample began, and a CloseDate of 9 means the store had not closed at that location by the end of the sample.)

    One solution relies on split()-ting the log by period t and dplyr::rbind_all()-ing it back together, e.g.

    # The following seems to do the trick. 
    complxMerge_v1 <- function(df, Stores, by = "Store"){
      library("dplyr")
      temp <- split(df, df$t)
      for (Period in names(temp)) {
        temp[[Period]] <- dplyr::left_join(
          temp[[Period]],
          dplyr::filter(Stores, 
                        StartDate <= as.numeric(Period) & 
                        CloseDate >= as.numeric(Period)),
          by = "Store"
        )
      }
      df <- dplyr::rbind_all(temp); rm(temp)
      df
    }
    complxMerge_v1(df, Stores, "Store")
    

    Functionally, this appears to work (at least I haven't run into any major errors with it yet). However, we are dealing with (increasingly common) log data on the order of billions of rows.

    If you'd like to benchmark it, I made a larger reproducible example on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals

    Three questions:

    1. First, is there another way to approach this problem with a similar method that runs faster?
    2. Is there a quick and simple solution in SQL or Python (neither of which I know well, but which I could lean on if needed)?
    3. Also, can you help me express this problem in more general, abstract terms? Right now I can only discuss it in context-specific terms, but I'd like to be able to talk about these kinds of problems in more appropriate, generic programming or data-manipulation vocabulary.
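    For the Python part of question 2, the row-level condition can be emulated in pandas by merging on Store and then filtering on the date window. This is a sketch, not from the original post; the data values are copied from the reproducible example above (var1 omitted for brevity):

```python
import pandas as pd

# The two tables from the reproducible example.
df = pd.DataFrame({
    "t":     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "Store": [1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3],
})
stores = pd.DataFrame({
    "StartDate": [0, 0, 0, 4, 4],
    "Store":     [1, 2, 3, 2, 3],
    "CloseDate": [9, 2, 3, 9, 9],
    "storeVar1": ["a", "b", "c", "d", "e"],
})

# Merge on Store alone, then keep only rows whose period t falls inside
# the [StartDate, CloseDate] window of that ownership spell.
merged = df.merge(stores, on="Store", how="left")
merged = merged[(merged["t"] >= merged["StartDate"]) &
                (merged["t"] <= merged["CloseDate"])].reset_index(drop=True)
```

    Note that the intermediate merge fans out when a store has many ownership spells, so this trades memory for simplicity.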

2 Answers:

Answer 0 (score: 5)

In R, you can look at the data.table::foverlaps function:

library(data.table)

# Set start and end values in `df`, and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]      
setkey(df, Store, StartDate, CloseDate)

# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
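
As an aside on the SQL part of the question (not part of the original answer): what foverlaps computes here is generally called an interval join, a special case of a non-equi join, and in SQL the row-level condition can go directly into the join predicate. A sketch using Python's built-in sqlite3, with the tables from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (t INT, Store INT)")
con.execute("CREATE TABLE Stores (StartDate INT, Store INT, CloseDate INT, storeVar1 TEXT)")
con.executemany("INSERT INTO df VALUES (?, ?)",
                [(1,1),(1,2),(1,3),(2,1),(2,2),(2,3),(3,1),(3,3),(4,1),(4,2),(4,3)])
con.executemany("INSERT INTO Stores VALUES (?, ?, ?, ?)",
                [(0,1,9,"a"),(0,2,2,"b"),(0,3,3,"c"),(4,2,9,"d"),(4,3,9,"e")])

# The row-level condition is simply part of the join predicate.
rows = con.execute("""
    SELECT df.t, df.Store, Stores.storeVar1
    FROM df
    LEFT JOIN Stores
      ON df.Store = Stores.Store
     AND df.t BETWEEN Stores.StartDate AND Stores.CloseDate
    ORDER BY df.t, df.Store
""").fetchall()
```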

Answer 1 (score: 1)

You can transform the Stores data.frame by adding a t column that contains all the t values for each distinct store, then use the unnest function from Hadley Wickham's tidyr package to convert it to "long" form.

require("tidyr")
require("dplyr")

complxMerge_v2 <- function(df, Stores, by = NULL)    {
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
    unnest(t) %>% left_join(df, ., by = by)
}

complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
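
This expand-then-equi-join idea also translates to pandas (a sketch, assuming the tables from the question): build the list of periods each ownership spell covers, explode it to long form, and do a plain merge on (Store, t), the direct analogue of mutate + unnest + left_join.

```python
import pandas as pd

stores = pd.DataFrame({
    "StartDate": [0, 0, 0, 4, 4],
    "Store":     [1, 2, 3, 2, 3],
    "CloseDate": [9, 2, 3, 9, 9],
    "storeVar1": ["a", "b", "c", "d", "e"],
})
df = pd.DataFrame({
    "t":     [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "Store": [1, 2, 3, 1, 2, 3, 1, 3, 1, 2, 3],
})

# Analogue of mutate(t = StartDate:CloseDate): a list of periods per spell.
stores["t"] = [list(range(s, c + 1))
               for s, c in zip(stores["StartDate"], stores["CloseDate"])]

# Analogue of unnest(t): one row per (Store, t) the spell covers.
# explode() yields an object column, so cast t back to int before joining.
stores_long = stores.explode("t").astype({"t": "int64"})

# Plain equi-join on both keys, as in left_join(df, ., by = c("t", "Store")).
result = df.merge(stores_long, on=["t", "Store"], how="left")
```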

require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")

microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)

# Unit: milliseconds
#                       expr      min       lq      mean    median        uq       max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962    10
# complxMerge_v2(df, Stores)  532.744  539.743  567.7207  561.9635  588.0637  636.5775    10

Here is the process step by step.

Stores_with_t <- 
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
#   StartDate Store CloseDate storeVar1                            t
# 1         0     1         9         a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2         0     2         2         b                      0, 1, 2
# 3         0     3         3         c                   0, 1, 2, 3
# 4         4     2         9         d             4, 5, 6, 7, 8, 9
# 5         4     3         9         e             4, 5, 6, 7, 8, 9

# After that `unnest(t)`

Stores_with_t_unnest <- 
  Stores_with_t %>% unnest(t)
#    StartDate Store CloseDate storeVar1 t
# 1          0     1         9         a 0
# 2          0     1         9         a 1
# 3          0     1         9         a 2
# 4          0     1         9         a 3
# 5          0     1         9         a 4
# 6          0     1         9         a 5
# 7          0     1         9         a 6
# 8          0     1         9         a 7
# 9          0     1         9         a 8
# 10         0     1         9         a 9
# 11         0     2         2         b 0
# 12         0     2         2         b 1
# 13         0     2         2         b 2
# 14         0     3         3         c 0
# 15         0     3         3         c 1
# 16         0     3         3         c 2
# 17         0     3         3         c 3
# 18         4     2         9         d 4
# 19         4     2         9         d 5
# 20         4     2         9         d 6
# 21         4     2         9         d 7
# 22         4     2         9         d 8
# 23         4     2         9         d 9
# 24         4     3         9         e 4
# 25         4     3         9         e 5
# 26         4     3         9         e 6
# 27         4     3         9         e 7
# 28         4     3         9         e 8
# 29         4     3         9         e 9

# And then simple `left_join`

left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store          var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e