Question

Redshift doesn't support DISTINCT aggregates in its window functions. The AWS documentation for COUNT states this, and DISTINCT isn't supported for any of the window functions.

My use case: count customers over varying time intervals and traffic channels

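I want monthly and YTD unique customer counts for the current year, split by traffic channel as well as totalled across all channels. Since a customer can visit more than once, I need to count only distinct customers, and therefore the Redshift window aggregates won't help.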

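I can count distinct customers using count(distinct customer_id)...group by, but that gives me only one of the four results I need.
I don't want to get into the habit of running a full query for each desired count and stacking them between a bunch of union all. I hope this is not the only solution.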

This is what I would write in postgres (or Oracle, for that matter):

select order_month
       , traffic_channel
       , count(distinct customer_id) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , count(distinct customer_id) over(partition by traffic_channel) as ytd_customers_by_channel
       , count(distinct customer_id) over(partition by order_month) as monthly_customers_all_channels
       , count(distinct customer_id) over() as ytd_total_customers

from orders_traffic_channels
/* otc is a table of dated transactions of customers, channels, and month of order */

where to_char(order_month, 'YYYY') = '2017'

How do I solve this in Redshift?

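The result needs to work on a Redshift cluster. Furthermore, this is a simplified problem: the actual desired result also includes product category and customer type, which multiplies the number of partitions needed. A stack of union all rollups is therefore not a good solution.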

Answer 1

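A blog post from 2016 calls out this problem and provides a rudimentary workaround, so thank you Mark D. Adams. Strangely, I couldn't find much else on the web, so I'm sharing my (tested) solution.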

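The key insight is that dense_rank(), ordered by the item in question, gives identical items the same rank, so the highest rank is also the count of unique items. This becomes a horrible mess if you try to swap in the following for each partition I want: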

dense_rank() over(partition by order_month, traffic_channel order by customer_id)

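Since you need the highest rank, you have to put all of the rankings in a subquery and select the max value of each one in the outer query. It's important to match each partition in the outer query to the corresponding partition in the subquery.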

/* multigrain windowed distinct count, additional grains are one dense_rank and one max over() */
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel)  ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month)  monthly_customers_all_channels
       , max(cust_rnk) over()  ytd_total_customers

from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id)  tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id)  tc_rnk
              , dense_rank() over(partition by order_month order by customer_id)  mth_rnk
              , dense_rank() over(order by customer_id)  cust_rnk

       from orders_traffic_channels

       where to_char(order_month, 'YYYY') = '2017'
     ) ranked

order by order_month, traffic_channel
;
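
As a quick sanity check of the dense_rank()/max() pattern above, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. This is not Redshift: the toy rows, the text-typed order_month, the ranked alias, and the omitted year filter are all assumptions for illustration, and window functions need SQLite 3.25 or newer. The last query cross-checks one grain against a plain count(distinct ...) group by.

```python
import sqlite3

# Hypothetical toy rows: (order_month, traffic_channel, customer_id).
rows = [
    ("2017-01", "email", 1),
    ("2017-01", "email", 1),   # repeat visit, must not be counted twice
    ("2017-01", "email", 2),
    ("2017-01", "search", 2),
    ("2017-02", "email", 3),
    ("2017-02", "search", 2),
    ("2017-02", "search", 4),
]

con = sqlite3.connect(":memory:")
con.execute(
    "create table orders_traffic_channels "
    "(order_month text, traffic_channel text, customer_id integer)"
)
con.executemany("insert into orders_traffic_channels values (?, ?, ?)", rows)

# Same shape as the answer's query: one dense_rank per grain in the subquery,
# one max() over the matching partition in the outer query.
query = """
select distinct
       order_month
       , traffic_channel
       , max(tc_mth_rnk) over(partition by order_month, traffic_channel) as customers_by_channel_and_month
       , max(tc_rnk) over(partition by traffic_channel) as ytd_customers_by_channel
       , max(mth_rnk) over(partition by order_month) as monthly_customers_all_channels
       , max(cust_rnk) over() as ytd_total_customers
from (
       select order_month
              , traffic_channel
              , dense_rank() over(partition by order_month, traffic_channel order by customer_id) as tc_mth_rnk
              , dense_rank() over(partition by traffic_channel order by customer_id) as tc_rnk
              , dense_rank() over(partition by order_month order by customer_id) as mth_rnk
              , dense_rank() over(order by customer_id) as cust_rnk
       from orders_traffic_channels
     ) ranked
order by order_month, traffic_channel
"""
result = con.execute(query).fetchall()

# Cross-check the finest grain against an ordinary grouped distinct count.
check = con.execute(
    "select order_month, traffic_channel, count(distinct customer_id) "
    "from orders_traffic_channels "
    "group by order_month, traffic_channel "
    "order by order_month, traffic_channel"
).fetchall()

for row in result:
    print(row)
# ('2017-01', 'email', 2, 3, 2, 4)
# ('2017-01', 'search', 1, 2, 2, 4)
# ('2017-02', 'email', 1, 3, 3, 4)
# ('2017-02', 'search', 2, 2, 3, 4)
```

Each additional grain costs one more dense_rank() in the subquery and one more max() over the same partition in the outer query.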

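Notes

The partitions of max() and dense_rank() must match.
dense_rank() will rank null values (all at the same rank, the max). If you don't want to count null values, you need a case when customer_id is not null then dense_rank() ...etc..., or you can subtract one from the max() if you know there are nulls.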
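
One way to automate the "subtract one if there are nulls" idea from the note above is to subtract a per-partition null flag from the max rank. The sketch below is an assumption-laden illustration, not the answer author's code: it runs on SQLite (3.25 or newer) from Python with a hypothetical two-column toy table. SQLite sorts nulls first while Redshift sorts them last in ascending order, but the subtraction works either way because the nulls occupy exactly one rank.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "create table orders_traffic_channels "
    "(traffic_channel text, customer_id integer)"
)
# Hypothetical rows; the email channel contains one null customer_id.
con.executemany(
    "insert into orders_traffic_channels values (?, ?)",
    [("email", 1), ("email", 1), ("email", 2), ("email", None),
     ("search", 7), ("search", 8)],
)

# rank_based_count still counts the null group as one "customer";
# non_null_customers subtracts 1 only in partitions that contain a null.
query = """
select distinct
       traffic_channel
       , max(tc_rnk) over(partition by traffic_channel) as rank_based_count
       , max(tc_rnk) over(partition by traffic_channel)
         - max(case when customer_id is null then 1 else 0 end)
             over(partition by traffic_channel) as non_null_customers
from (
       select traffic_channel
              , customer_id
              , dense_rank() over(partition by traffic_channel order by customer_id) as tc_rnk
       from orders_traffic_channels
     ) ranked
order by traffic_channel
"""
null_result = con.execute(query).fetchall()
print(null_result)  # [('email', 3, 2), ('search', 2, 2)]
```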

Answer 2

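Although Redshift doesn't support DISTINCT aggregates in its window functions, it does have a listaggdistinct function (documented as LISTAGG with the DISTINCT keyword, which can be used as a window function). So you can do this: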

regexp_count(
   listaggdistinct(customer_id, ',') over (partition by field2), 
   ','
) + 1

Of course, if a , occurs naturally in your customer_id strings, you'll have to find a safe delimiter.
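
To see why the delimiter matters, here is a plain-Python stand-in for the string logic only (join the distinct ids, count the delimiters, add one). It does not call Redshift, and the function name and sample ids are hypothetical.

```python
def count_via_delimiters(customer_ids, delimiter=","):
    """Mimic regexp_count(<joined distinct ids>, delimiter) + 1 for one partition."""
    joined = delimiter.join(sorted(set(customer_ids)))
    return joined.count(delimiter) + 1


clean_ids = ["a1", "b2", "a1", "c3"]
print(count_via_delimiters(clean_ids))        # 3 distinct ids, as expected

tricky_ids = ["smith,j", "jones,k", "smith,j"]
print(count_via_delimiters(tricky_ids))       # 4: commas inside the ids inflate the count
print(count_via_delimiters(tricky_ids, "|"))  # 2: a delimiter absent from the ids is safe
```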

Redshift: count distinct customers over window partition
