Question

I have a table like the following:

app_id  supplier_reached    creation_date   platform
10001       1            9/11/2018         iOS
10001       2            9/18/2018         iOS
10002       1            5/16/2018       android
10003       1            5/6/2018        android
10004       1            10/1/2018       android
10004       1            2/3/2018        android
10004       2            2/2/2018           web
10005       4            1/5/2018           web
10005       2            5/1/2018        android
10006       3            10/1/2018         iOS
10005       4            1/1/2018          iOS

The goal is to find the unique number of app_id submitted per month.

If I simply use count(distinct app_id), I get the following result:

Group by month  count(app number)
     Jan              1
     Feb              1
     May              3
  September           1
   October            2

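However, an app is also considered unique based on a combination of other fields. For example, for January the app_id is the same, but the combination of app_id, supplier_reached, and platform shows different values, so the app_id should be counted twice. Following the same pattern, the desired result is: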

Group by month  Desired answer
     Jan              2
     Feb              2
     May              3
  September           2
   October            2

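Finally, the table may contain many other columns, which may or may not affect the uniqueness of an app.

Is there a way to do this kind of count in SQL?

I am using Redshift.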

Answer 1

As noted, count(distinct ...) does not work with multiple fields in Redshift.

You can first group by the columns that should be unique and then count the resulting records.
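
A minimal sketch of this group-then-count idea, run against the question's sample data. It uses Python's sqlite3 with an in-memory SQLite database as a stand-in for Redshift, which is an assumption made purely for illustration. The table name tbl, the integer month column, and the helper distinct_combos_per_month are hypothetical. The inner query collapses each (month, app_id, supplier_reached, platform) combination to a single row, and the outer query counts those rows per month.

```python
import sqlite3

# Sample rows from the question: (app_id, supplier_reached, creation_date, platform).
ROWS = [
    (10001, 1, "9/11/2018", "iOS"),
    (10001, 2, "9/18/2018", "iOS"),
    (10002, 1, "5/16/2018", "android"),
    (10003, 1, "5/6/2018", "android"),
    (10004, 1, "10/1/2018", "android"),
    (10004, 1, "2/3/2018", "android"),
    (10004, 2, "2/2/2018", "web"),
    (10005, 4, "1/5/2018", "web"),
    (10005, 2, "5/1/2018", "android"),
    (10006, 3, "10/1/2018", "iOS"),
    (10005, 4, "1/1/2018", "iOS"),
]


def distinct_combos_per_month(rows):
    """Return {month_number: count of distinct (app_id, supplier_reached, platform)}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl (app_id INTEGER, supplier_reached INTEGER, "
        "month INTEGER, platform TEXT)"
    )
    # Assumption: the month number is taken from the M/D/YYYY text before loading.
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?, ?, ?)",
        [(a, s, int(d.split("/")[0]), p) for a, s, d, p in rows],
    )
    query = """
        SELECT month, COUNT(*) AS dist_apps
        FROM (
            -- one row per unique combination within each month
            SELECT month, app_id, supplier_reached, platform
            FROM tbl
            GROUP BY month, app_id, supplier_reached, platform
        ) AS combos
        GROUP BY month
        ORDER BY month
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


print(distinct_combos_per_month(ROWS))
```

On the sample rows this prints {1: 2, 2: 2, 5: 3, 9: 2, 10: 2}, which matches the desired result in the question. The SQL itself uses only GROUP BY and COUNT(*), so it should carry over to Redshift largely unchanged, although that has not been verified here.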

Answer 2

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:

count(distinct app_id || ':' || supplier_reached || ':' || platform)

Answer 3

Your goal is stated incorrectly.

You don't want

to find the unique number of app_id submitted per month

You want

to find the unique number of app_id + supplier_reached + platform submitted per month.

So you need to use either a) a combination of columns, such as count(distinct col1||col2||col3), or b)

-- month is assumed to be available as a column (e.g. derived from creation_date)
select t1.month, count(*)
from
  (select distinct
         app_id,
         supplier_reached,
         platform,
         month
   from sometable) t1
group by t1.month

Answer 4

Actually, you can conveniently count distinct ROW values in Postgres:

SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM   tbl
GROUP  BY 1;

The ROW keyword would just be noise here:

count(DISTINCT ROW(app_id, supplier_reached, platform))

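I would not recommend concatenating columns for this purpose. It is comparatively expensive, error-prone (think of different data types and locale-dependent text representations), and introduces corner-case errors if the separator you use can appear in the column values.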
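
The separator corner case can be illustrated with a small sketch. It again uses Python's sqlite3 with an in-memory SQLite database as a stand-in (an assumption, not Redshift or Postgres itself). The table t, columns a and b, the sample values, and the helper count_both_ways are all hypothetical. Two different (a, b) pairs concatenate to the same string because the ':' separator also appears inside the values.

```python
import sqlite3


def count_both_ways(rows):
    """Return (count via concatenation, count via SELECT DISTINCT subquery)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a TEXT, b TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # Concatenation workaround: distinct strings of the form a:b.
    concat = conn.execute(
        "SELECT COUNT(DISTINCT a || ':' || b) FROM t"
    ).fetchone()[0]
    # Subquery approach: distinct column combinations, no string building.
    grouped = conn.execute(
        "SELECT COUNT(*) FROM (SELECT DISTINCT a, b FROM t) AS d"
    ).fetchone()[0]
    conn.close()
    return concat, grouped


# Both rows concatenate to 'x:y:z', although the column pairs differ.
rows = [("x:y", "z"), ("x", "y:z")]
print(count_both_ways(rows))
```

Here the concatenation-based count is 1 while the subquery-based count is 2, so the concatenated version undercounts whenever the separator can occur in the data.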

Alas, not supported by Redshift:

...
Value expressions
    Subscripted expressions  
    Array constructors  
    Row constructors
...

Count distinct values in one column based on other columns
