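We are migrating a lot of legacy R code used for dataset manipulation to Redshift SQL. All of it has been easy to port except for the bit below, which has proven tricky. That is why I come to you, gentle SO reader. I suspect that what I am asking for is impossible, but I lack the ability to prove it.
The R code below uses a looping mechanism to deduplicate unique integer identifiers. You will see the full details in the embedded comments.
Before that, here is a small annotated example set to give you a rough idea of the effect the desired SQL code should have:
Here is the annotated R code we are trying to replace with Redshift SQL: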
# the purpose of this function is to dedupe a set of identifiers
# so that each month, the set of identifiers grouped under that month
# will not have appeared in the previous two months
# it does this by building 3 sets:
# current month
# previous month
# 2 months ago
# In a loop, it sets the current month set for the current year-month value in the loop
# then filters that set against the contents of previous 2 months' sets
# then unions the surviving month's set against the survivors of previous months so far
# I believe the functionality below is mainly taken from library(dplyr)
library(dplyr)
library(tidyverse)
library(lubridate)
library(multidplyr)
library(purrr)
library(stringr)
library(RJDBC)
dedupeIdentifiers <- function(dataToDedupe, YearToStart = 2014, YearToEnd = 2016) {
# dataToDedupe is input set
# YearToStart = default starting year
# YearToEnd = default ending year
monthYearSeq <- expand.grid(Month = 1:12, Year = YearToStart:YearToEnd) %>% as_tibble() # make a grid having all months 1:12 from starting to ending year
twoMonthsAgoIdentifiers <- tibble(uniqueidentifier = integer(0)) # make empty data frame to hold list of unique identifiers; column name must match the join key used below
oneMonthAgoIdentifiers <- tibble(uniqueidentifier = integer(0)) # make empty data frame to hold list of unique identifiers; column name must match the join key used below
identifiersToKeep <- dataToDedupe %>% slice(0) # make empty data frame to hold list of unique identifiers
for(i in 1:nrow(monthYearSeq)) {
curMonth <- monthYearSeq$Month[i] # get current month for row in loop of monthYearSeq
curYear <- monthYearSeq$Year[i] # get current year for row in loop of monthYearSeq
curIdentifiers <- dataToDedupe %>% filter(year(initialdate) == curYear, month(initialdate) == curMonth)%>%
# initialdate is the date variable in the set by which the set is filtered
# start by filtering to make a subset, curIdentifiers, which is the set where initialdate == current month and year in the loop
group_by(uniqueidentifier) %>% slice(1) %>% ungroup() %>% # take just 1 example of each unique identifier in the subset
anti_join(twoMonthsAgoIdentifiers, by = "uniqueidentifier") %>% # filter out uniqueidentifier that were in set two months ago
anti_join(oneMonthAgoIdentifiers, by = "uniqueidentifier") # filter out uniqueidentifier that were in set one month ago
twoMonthsAgoIdentifiers <- oneMonthAgoIdentifiers # move one month set into two month set
oneMonthAgoIdentifiers <- curIdentifiers %>% select(uniqueidentifier) # move current month set into one month set
identifiersToKeep <- bind_rows(identifiersToKeep, curIdentifiers) # add "surviving" unique identifiers after filtering for last 2 months
# to updated set of deduped identifiers
} # lather, rinse, repeat
return(identifiersToKeep) # return all survivors
}
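As far as the loop above reads, its rule can be restated per identifier: an appearance survives only if that identifier did not survive in either of the two preceding months. The base R sketch below illustrates this on made-up data. The names `toy`, `month_index` (months counted consecutively from the start of the window) and `dedupe_flags` are hypothetical and not part of the original code.

```r
# Hypothetical toy input: one row per (identifier, month) appearance.
# month_index counts months consecutively from the start of the window.
toy <- data.frame(
  uniqueidentifier = c(1, 1, 1, 1, 2, 2),
  month_index      = c(1, 2, 3, 4, 1, 5)
)

# Flag the rows that would survive: walk each identifier's appearances in
# month order and keep one only if the last KEPT appearance of that
# identifier is at least 3 months back (not kept in either of the two
# preceding months).
dedupe_flags <- function(id, month_index) {
  kept <- logical(length(id))
  prev_id <- NA
  last_kept <- -Inf
  for (i in order(id, month_index)) {
    if (!identical(id[i], prev_id)) {  # new identifier: reset the state
      prev_id <- id[i]
      last_kept <- -Inf
    }
    if (month_index[i] - last_kept >= 3) {
      kept[i] <- TRUE
      last_kept <- month_index[i]
    }
  }
  kept
}

toy$kept <- dedupe_flags(toy$uniqueidentifier, toy$month_index)
print(toy)
```

In this toy set, identifier 1 appears in months 1 through 4 and is kept in months 1 and 4 only. Month 4 survives because the identifier was filtered out of months 2 and 3, so it is absent from both lookback sets.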
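Finally, here are a few things we have tried so far without success:
We can get to 90% parity with the original loop code, but unfortunately we need a perfect replacement.
Please respect our goal of reproducing this in SQL, or of proving that the loop's results cannot be replicated in SQL in this case. Responses such as "just stick with R", "do the loop in Python", or "try this new package" will not help.
Any positive suggestions are much appreciated.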
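Answer 0 (score: 1)
Your process can be done in Redshift using the "SQL sessionization" technique.
Basically, you use a number of LAG() statements to compare data over specific windows, then compare the results to complete the final classification.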
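To show what one such comparison computes, here is a minimal base R sketch of a per-identifier lag, roughly what `LAG(month_index) OVER (PARTITION BY uniqueidentifier ORDER BY month_index)` would return in SQL. The data, the column `month_index`, and the helper `lag_within_group` are assumptions made for illustration; this is not the answerer's actual query.

```r
# Hypothetical appearances: one row per (identifier, month).
appearances <- data.frame(
  uniqueidentifier = c(1, 1, 1, 1, 2, 2),
  month_index      = c(1, 2, 3, 4, 1, 5)
)
appearances <- appearances[order(appearances$uniqueidentifier,
                                 appearances$month_index), ]

# Previous value within a group, NA for the group's first row.
lag_within_group <- function(x) c(NA, head(x, -1))

# Previous appearance of the same identifier, like a partitioned LAG().
appearances$prev_month <- ave(appearances$month_index,
                              appearances$uniqueidentifier,
                              FUN = lag_within_group)
appearances$gap <- appearances$month_index - appearances$prev_month

# Single-lag rule: keep the first appearance, or one at least 3 months
# after the previous RAW appearance.
appearances$kept_by_lag <- is.na(appearances$gap) | appearances$gap >= 3
print(appearances)
```

On this toy data the single-lag rule drops identifier 1 in month 4, because its previous raw appearance was in month 3. The R loop in the question would keep that row, since it compares against the survivors of earlier months rather than against raw appearances. Cases like this one may be where the additional LAG() comparisons mentioned above are needed.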