Replacing loop-based deduplication code with Redshift SQL

Time: 2018-12-05 20:26:28

Tags: r loops amazon-redshift lag dense-rank

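We are trying to migrate a lot of legacy R code used for dataset manipulation to Redshift SQL. All of it has been easy to port except the bit below, which has proved tricky. That is why I come to you, gentle SO reader. I suspect that what I am asking is impossible, but I lack the ability to prove it.

The R code below dedupes unique integer identifiers using a looping mechanism. You will see the full details in the embedded comments.

Before that, here is a small annotated example set to give you a rough idea of the effect the desired SQL code should have: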

[Image: annotated example set]

Here is the annotated R code we are trying to replace with Redshift SQL:

# the purpose of this function is to dedupe a set of identifiers
    # so that each month, the set of identifiers grouped under that month
    # will not have appeared in the previous two months
    # it does this by building 3 sets:
        # current month
        # previous month
        # 2 months ago
        # In a loop, it sets the current month set for the current year-month value in the loop
            # then filters that set against the contents of the previous 2 months' sets
            # then unions the surviving month's set with the survivors of previous months so far

# I believe the functionality below is mainly taken from library(dplyr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(multidplyr) 
library(purrr)
library(stringr)
library(RJDBC)

dedupeIdentifiers <- function(dataToDedupe, YearToStart = 2014, YearToEnd = 2016) { 
    # dataToDedupe is input set
    # YearToStart = default starting year
    # YearToEnd = default ending year

    monthYearSeq <- expand.grid(Month = 1:12, Year = YearToStart:YearToEnd) %>% as_tibble() # grid of every month 1:12 for each year from start to end
    twoMonthsAgoIdentifiers <- tibble(uniqueidentifier = integer(0)) # empty set: identifiers that survived two months ago
    oneMonthAgoIdentifiers  <- tibble(uniqueidentifier = integer(0)) # empty set: identifiers that survived one month ago
    identifiersToKeep <- dataToDedupe %>% slice(0) # zero-row copy of the input, collects every surviving row

    for(i in 1:nrow(monthYearSeq)) {
        curMonth <- monthYearSeq$Month[i] # get current month for row in loop of monthYearSeq
        curYear <- monthYearSeq$Year[i] # get current year for row in loop of monthYearSeq

        curIdentifiers <- dataToDedupe %>% filter(year(initialdate) == curYear, month(initialdate) == curMonth) %>%
            # initialdate is the date variable in the set by which the set is filtered
            # start by filtering to make a subset, curIdentifiers, which is the set where initialdate == current month and year in the loop
            group_by(uniqueidentifier) %>% slice(1) %>% ungroup() %>%  # take just 1 example of each unique identifier in the subset
            anti_join(twoMonthsAgoIdentifiers, by = "uniqueidentifier") %>% # filter out identifiers that survived two months ago
            anti_join(oneMonthAgoIdentifiers, by = "uniqueidentifier") # filter out identifiers that survived one month ago

        twoMonthsAgoIdentifiers <- oneMonthAgoIdentifiers # move one month set into two month set
        oneMonthAgoIdentifiers <- curIdentifiers %>% select(uniqueidentifier) # move current month set into one month set
        identifiersToKeep <- bind_rows(identifiersToKeep, curIdentifiers) # add "surviving" unique identifiers after filtering for last 2 months
            # to updated set of deduped identifiers
    } # lather, rinse, repeat

    return(identifiersToKeep) # return all survivors
}
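
As a cross-check for any SQL port, the loop's rule can be restated per identifier: a month survives only if that identifier did not survive in either of the two preceding months. The following is a minimal Python sketch of that restatement, not the original R. The names `dedupe_identifiers`, `rows`, and `month_index` are hypothetical, and `month_index` is assumed to be a running month number such as `year * 12 + month`. Unlike the R function, it does not restrict the data to a year range.

```python
from collections import defaultdict

def dedupe_identifiers(rows, window=2):
    """rows: iterable of (identifier, month_index) pairs.

    month_index is an assumed running month number (e.g. year * 12 + month).
    Returns the surviving (identifier, month_index) pairs in month order.
    """
    by_month = defaultdict(set)
    for identifier, month in rows:
        by_month[month].add(identifier)  # one entry per identifier per month

    kept = []
    last_kept = {}  # identifier -> month_index of its most recent surviving row
    for month in sorted(by_month):
        for identifier in sorted(by_month[month]):
            previous = last_kept.get(identifier)
            # Survives only if it did not survive in the previous `window` months.
            if previous is None or month - previous > window:
                kept.append((identifier, month))
                last_kept[identifier] = month
    return kept

# Identifier 123 present in five consecutive months.
print(dedupe_identifiers([(123, m) for m in range(1, 6)]))
# [(123, 1), (123, 4)]
```

The point to notice is that each decision depends on the previous decision for the same identifier (`last_kept`), not only on which months the identifier appears in.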

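Finally, here are some of the things we have tried so far without success:

  1. A recursive CTE has been suggested. Redshift does not allow recursive CTEs.
  2. Using lag to evaluate the date difference between the "current" date value and previous date values, partitioned by unique identifier. This does not work if, say, the same unique identifier 123 appears in months 1-5 consecutively. In that case months 4 and 5 would both be kept, when in fact month 5 should be dropped.
  3. Left-joining the set to itself on the unique identifier so that every permutation of months can be evaluated. This effectively has the same problem as using lag.
  4. Using a dummy date set containing all the required months and years to inject the missing months and years into the set to be filtered. Flag the rows that come from the original to-be-filtered set. Then use dense_rank, partitioned on the unique identifier and the flag, to pick every row whose rank % 3 = 0. The problem is that you cannot reliably get dense_rank values that count as needed within each partition, so the % 3 values come out wrong.
  5. Using a combination of the above.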
  6. Replacing loop with set-based operation
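
The difficulty with the lag-style attempts can be seen in a small Python sketch. It implements one possible lag formulation, which is an assumption and not necessarily the variant tried above: keep a month when the identifier's previous occurrence, kept or not, is more than two months back. The function name `lag_rule` is hypothetical.

```python
def lag_rule(months, window=2):
    """Keep a month when the previous occurrence of the identifier is more
    than `window` months earlier, the way a LAG(month) comparison within
    one identifier's partition would see it."""
    kept = []
    previous = None
    for month in sorted(set(months)):
        if previous is None or month - previous > window:
            kept.append(month)
        previous = month  # LAG sees every row, not only the survivors
    return kept

# Identifier present in months 1-5: the loop keeps month 4 and drops month 5,
# but this lag rule never keeps month 4 because month 3 is always "too close".
print(lag_rule(range(1, 6)))
# [1]
```

A fixed lag compares a row with rows that may themselves have been dropped, whereas the loop compares it only with rows that survived.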

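We can reach 90% parity with the original loop code, but unfortunately we must have a perfect replacement.

Please respect our goal of either reproducing this in SQL or proving that, in this case, SQL cannot replicate the loop's results. Responses such as "just stick with R", "do the loop in Python", or "try this new package" will not help.

Any positive suggestions are much appreciated.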

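1 answer:

Answer 0 (score: 1)

Your process can be done in Redshift using the "SQL sessionization" technique.

Basically, you use a number of LAG() statements to compare data over specific windows, then compare the results to arrive at the final classification.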
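
To illustrate what sessionization usually means, here is a minimal Python sketch of the labelling step for a single identifier: compare each month with the lagged month, flag a new session when the gap exceeds a threshold, and take a running sum of the flags. The name `sessionize` and the `max_gap` threshold of 2 are assumptions for illustration, and this is not Redshift SQL.

```python
from itertools import accumulate

def sessionize(months, max_gap=2):
    """Return (month, session_id) pairs for one identifier's months."""
    months = sorted(set(months))
    lagged = [None] + months[:-1]  # equivalent of LAG(month) over the partition
    # 1 marks the first row of a new session, 0 a continuation.
    new_session = [
        1 if prev is None or cur - prev > max_gap else 0
        for cur, prev in zip(months, lagged)
    ]
    session_ids = list(accumulate(new_session))  # running SUM of the flags
    return list(zip(months, session_ids))

print(sessionize([1, 2, 3, 4, 5, 9, 10]))
# [(1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (9, 2), (10, 2)]
```

This sketch covers only the session labelling. Choosing which rows inside a session survive would still need a further comparison step, which the answer does not spell out.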