Alternative to correlated subqueries in Redshift

Time: 2017-06-05 20:55:47

Tags: sql amazon-redshift correlated-subquery

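I am using Redshift and need an alternative to a correlated subquery, because I get an error saying that the correlated subquery is not supported. However, for this particular exercise, which tries to identify all sales transactions made by the same customer within a given time of an original transaction, I am not sure a traditional left join would work either, since the query depends on the context (the current value) of the parent select. I have also tried something similar with the row_number() window function, but again I would need a way to window/partition on a date range, not just on customer_id.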

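The overall goal is to find the first sales transaction for a given customer ID, and then find all subsequent transactions made within 60 minutes of that first transaction. This logic then continues for the remaining transactions of the same customer (and ultimately for all customers in the database). That is, once the initial 60-minute window has been established starting from the time of the first transaction, a second 60-minute window begins when the first one ends, all transactions inside the second window are identified and grouped as well, and the process repeats for the remaining transactions.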

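The output would list the first transaction ID that starts a 60-minute window, followed by the other transaction IDs created within that window. The second row would show the first transaction ID created by the same customer in the next 60-minute window (again, the first transaction made after the first 60-minute window marks the start of the second 60-minute window), followed by the subsequent transactions made within that second window.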

In its most basic form, the query would look like this:

select
s1.customer_id,
s1.transaction_id,
s1.order_time,
(
    select
        s2.transaction_id
    from
        sales s2
    where
        s2.order_time > s1.order_time and
        s2.order_time <= dateadd(m,60,s1.order_time) and
        s2.customer_id = s1.customer_id
    order by
        s2.order_time asc
    limit 1
) as sales_transaction_id_1,
(
    select
        s3.transaction_id
    from
        sales s3
    where
        s3.order_time > s1.order_time and
        s3.order_time <= dateadd(m,60,s1.order_time) and
        s3.customer_id = s1.customer_id
    order by
        s3.order_time asc
    limit 1 offset 1
) as sales_transaction_id_2,
(
    select
        s4.transaction_id
    from
        sales s4
    where
        s4.order_time > s1.order_time and
        s4.order_time <= dateadd(m,60,s1.order_time) and
        s4.customer_id = s1.customer_id
    order by
        s4.order_time asc
    limit 1 offset 2
) as sales_transaction_id_3
from
    (
        select 
            sales.customer_id,
            sales.transaction_id,
            sales.order_time
        from
            sales
        order by
            sales.order_time desc
    ) s1;

For example, if a customer made the following transactions:

customer_id     transaction_id      order_time          
1234                33453           2017-06-05 13:30
1234                88472           2017-06-05 13:45
1234                88477           2017-06-05 14:10

1234                99321           2017-06-07 8:30
1234                99345           2017-06-07 8:45

The expected output would be:

customer_id     transaction_id  sales_transaction_id_1 sales_transaction_id_2   sales_transaction_id_3
1234                33453           88472                   88477                   NULL
1234                99321           99345                   NULL                    NULL
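
To make the intended rule concrete outside of SQL, here is a minimal procedural sketch in Python rather than Redshift. It assumes the first transaction not covered by the previous window becomes the next anchor, and that the upper bound is inclusive, as in the `<=` comparison of the query above. The helper name `group_into_windows` is hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta

# Sample rows from the example above: (customer_id, transaction_id, order_time)
sales = [
    (1234, 33453, datetime(2017, 6, 5, 13, 30)),
    (1234, 88472, datetime(2017, 6, 5, 13, 45)),
    (1234, 88477, datetime(2017, 6, 5, 14, 10)),
    (1234, 99321, datetime(2017, 6, 7, 8, 30)),
    (1234, 99345, datetime(2017, 6, 7, 8, 45)),
]


def group_into_windows(rows, minutes=60):
    """Hypothetical helper: map (customer_id, anchor_id) to follower ids."""
    window = timedelta(minutes=minutes)
    groups = {}
    current = {}  # customer_id -> (anchor_id, anchor_time) of the open window
    # Process each customer's transactions in chronological order.
    for customer_id, transaction_id, order_time in sorted(
        rows, key=lambda r: (r[0], r[2])
    ):
        anchor = current.get(customer_id)
        if anchor is not None and order_time <= anchor[1] + window:
            # Still inside the open window: attach to its anchor.
            groups[(customer_id, anchor[0])].append(transaction_id)
        else:
            # Outside every open window: this transaction opens a new one.
            current[customer_id] = (transaction_id, order_time)
            groups[(customer_id, transaction_id)] = []
    return groups


windows = group_into_windows(sales)
for (customer_id, anchor_id), followers in windows.items():
    print(customer_id, anchor_id, followers)
```

For the sample rows this yields anchor 33453 with followers 88472 and 88477, and anchor 99321 with follower 99345, matching the expected output. Each window depends on where the previous one ended, which is what makes the rule awkward to express with a plain join or a fixed partition.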

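In addition, it appears that Redshift does not support lateral joins, which seems to further limit the options available to me. Any help would be greatly appreciated.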

2 Answers:

Answer 0: (score: 0)

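Based on your description, you just need group by and some sort of date difference. I am not sure how you want to combine the rows, but this is the basic idea: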

select s.customer_id,
       min(order_time) as first_order_in_hour,
       max(order_time) as last_order_in_hour,
       count(*) as num_orders
from (select s.*,
             min(order_time) over (partition by customer_id) as min_ot
      from sales s
     ) s
group by customer_id, floor(datediff(second, min_ot, order_time) / (60 * 60));

This formulation (or something similar, since Postgres does not have datediff()) would also be much faster in Postgres.
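
The bucketing in that query can be illustrated with a small Python sketch (not Redshift SQL). It assumes the sample rows from the question, and the helper name `hourly_buckets` is hypothetical. Each order is assigned to bucket number floor(seconds since the customer's first order / 3600), mirroring the `group by` expression.

```python
from collections import defaultdict
from datetime import datetime

# Sample rows from the question: (customer_id, transaction_id, order_time)
sales = [
    (1234, 33453, datetime(2017, 6, 5, 13, 30)),
    (1234, 88472, datetime(2017, 6, 5, 13, 45)),
    (1234, 88477, datetime(2017, 6, 5, 14, 10)),
    (1234, 99321, datetime(2017, 6, 7, 8, 30)),
    (1234, 99345, datetime(2017, 6, 7, 8, 45)),
]


def hourly_buckets(rows):
    """Hypothetical helper: group ids by (customer_id, hour bucket number)."""
    # Equivalent of min(order_time) over (partition by customer_id).
    first_order = {}
    for customer_id, _, order_time in rows:
        if customer_id not in first_order or order_time < first_order[customer_id]:
            first_order[customer_id] = order_time

    buckets = defaultdict(list)
    for customer_id, transaction_id, order_time in sorted(rows, key=lambda r: r[2]):
        elapsed = int((order_time - first_order[customer_id]).total_seconds())
        # Equivalent of floor(datediff(second, min_ot, order_time) / (60 * 60)).
        buckets[(customer_id, elapsed // 3600)].append(transaction_id)
    return dict(buckets)


buckets = hourly_buckets(sales)
for key, ids in buckets.items():
    print(key, ids)
```

With the sample rows, the first three transactions land in bucket 0 and the last two in bucket 43. Note that the buckets form a fixed grid measured from the customer's first order, so a window is not re-anchored on the first transaction that follows a gap; the two approaches happen to agree on this sample but can differ on other data.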

Answer 1: (score: 0)

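You can use window functions to get the subsequent transactions for each transaction. The window would be customer/hour, and you can rank the records to find the first "anchor" transaction and get all the subsequent transactions you need: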

with
transaction_chains as (
    select
     customer_id
    ,transaction_id
    ,order_time
    -- rank transactions within window to find the first "anchor" transaction
    ,row_number() over (partition by customer_id,date_trunc('hour',order_time) order by order_time)
    -- 1st next order
    ,lead(transaction_id,1) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_1
    ,lead(order_time,1) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_1
    -- 2nd next order
    ,lead(transaction_id,2) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_2
    ,lead(order_time,2) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_2
    -- 3rd next order
    ,lead(transaction_id,3) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as transaction_id_3
    ,lead(order_time,3) over (partition by customer_id,date_trunc('hour',order_time) order by order_time) as order_time_3
    from sales
)
select 
 customer_id
,transaction_id
,transaction_id_1
,transaction_id_2
,transaction_id_3
from transaction_chains
where row_number=1;
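
As a rough check of what a customer/hour window produces, the following Python sketch (not Redshift SQL) emulates partitioning by customer and calendar hour, taking the first row of each partition as the anchor and up to three following rows, as lead() would. It assumes the sample rows from the question, and the helper name `calendar_hour_chains` is hypothetical.

```python
from datetime import datetime
from itertools import groupby

# Sample rows from the question: (customer_id, transaction_id, order_time)
sales = [
    (1234, 33453, datetime(2017, 6, 5, 13, 30)),
    (1234, 88472, datetime(2017, 6, 5, 13, 45)),
    (1234, 88477, datetime(2017, 6, 5, 14, 10)),
    (1234, 99321, datetime(2017, 6, 7, 8, 30)),
    (1234, 99345, datetime(2017, 6, 7, 8, 45)),
]


def calendar_hour_chains(rows):
    """Hypothetical helper: one row per (customer, calendar hour) partition."""

    def partition_key(row):
        # Truncate order_time to the hour, like date_trunc('hour', order_time).
        return (row[0], row[2].replace(minute=0, second=0, microsecond=0))

    ordered = sorted(rows, key=lambda r: (partition_key(r), r[2]))
    chains = []
    for (customer_id, _hour), part in groupby(ordered, key=partition_key):
        ids = [row[1] for row in part]
        anchor, followers = ids[0], ids[1:4]
        # lead() returns NULL when no later row exists in the partition.
        followers += [None] * (3 - len(followers))
        chains.append((customer_id, anchor, *followers))
    return chains


chains = calendar_hour_chains(sales)
for chain in chains:
    print(chain)
```

On the sample rows, transaction 88477 (14:10) falls into a different calendar hour than 33453 (13:30), so it becomes its own anchor instead of appearing as a follower of 33453. Calendar-hour partitions are therefore only an approximation of a 60-minute window measured from the anchor transaction.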