How to run SUM() OVER PARTITION BY for COUNT DISTINCT

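Time: 2021-02-02 06:43:46

Tags: sql presto trino partition-by

I am trying to get the number of distinct users per event per day, while maintaining a running total per hour. I am using Athena/Presto as the query engine.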

I first tried the following query:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    count,
    SUM(count) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum_count
FROM (
    SELECT 
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        COUNT(DISTINCT moengageuserid) as count
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
    GROUP BY 1,2
    ORDER BY 1,2
);

After seeing the results, however, I realized that taking the SUM of COUNT DISTINCT values is incorrect, because distinct counts are not additive: a user who is active in two different hours is counted twice.

So I tried the following query instead:

SELECT
    eventname,
    date(from_unixtime(time_bucket)) AS date,
    (time_bucket % 86400)/3600 as hour,
    SUM(COUNT(DISTINCT moengageuserid)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket) AS running_sum
FROM (
    SELECT
        eventname,
        CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
        moengageuserid
    FROM clickstream.moengage
    WHERE date = '2020-08-20'
    AND eventname IN ('e1', 'e2', 'e3', 'e4')
);

However, this query fails with an error.
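
To make the non-additivity concrete, here is a minimal Python sketch on made-up data (the user IDs and hours are purely illustrative and are not taken from the real table). It compares a running sum of per-hour distinct counts, which is what SUM(count) OVER (...) produces, with a true running distinct count.

```python
# Toy data (hypothetical): hour -> set of users who fired the event in that hour.
events_by_hour = {
    0: {"u1", "u2"},
    1: {"u2", "u3"},
    2: {"u1", "u3"},
}

def running_sum_of_hourly_distinct(by_hour):
    """Running sum of per-hour distinct counts (overcounts returning users)."""
    total, out = 0, {}
    for hour in sorted(by_hour):
        total += len(by_hour[hour])
        out[hour] = total
    return out

def running_distinct(by_hour):
    """Distinct users seen from the first hour up to and including each hour."""
    seen, out = set(), {}
    for hour in sorted(by_hour):
        seen |= by_hour[hour]
        out[hour] = len(seen)
    return out

print(running_sum_of_hourly_distinct(events_by_hour))  # {0: 2, 1: 4, 2: 6}
print(running_distinct(events_by_hour))                # {0: 2, 1: 3, 2: 3}
```

Only three users exist in this toy data, yet the running sum reaches 6 because every user is active in two different hours.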

2 answers:

Answer 0: (score: 0)

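To compute a running distinct count, you can collect the user IDs into a set (an array of distinct values) and take its size:

cardinality(set_agg(moengageuserid) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY eventname, time_bucket)) AS running_sum

This is an analytic (window) function, so it returns a value for every input row instead of collapsing rows: all rows with the same eventname, date and time_bucket receive the same value. You can reduce them to one row per hour in an outer query with an aggregate such as max().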
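
As a rough illustration of that row-level behaviour, the following Python sketch emulates the window expression on made-up rows and then collapses the result with max(). The rows, the helper name `windowed_distinct`, and the simplification of partitioning by event name only (instead of event name and date) are assumptions made for this example.

```python
from itertools import groupby

# Hypothetical raw rows: (eventname, time_bucket, user_id), one row per event.
rows = [
    ("e1", 0, "u1"), ("e1", 0, "u2"), ("e1", 0, "u1"),
    ("e1", 3600, "u2"), ("e1", 3600, "u3"),
    ("e2", 0, "u1"),
]

def windowed_distinct(rows):
    """Emulate a set-size window partitioned by event and ordered by bucket.

    With the default frame, rows that tie on the ORDER BY key (same bucket)
    are peers and all receive the same running value.
    """
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for event, partition in groupby(ordered, key=lambda r: r[0]):
        seen = set()
        for bucket, peers in groupby(partition, key=lambda r: r[1]):
            peers = list(peers)
            seen.update(user for _, _, user in peers)
            # One output row per input row, as a window function would return.
            out.extend((event, bucket, len(seen)) for _ in peers)
    return out

per_row = windowed_distinct(rows)

# Outer aggregation: keep one value per (event, bucket) using max().
collapsed = {}
for event, bucket, n in per_row:
    key = (event, bucket)
    collapsed[key] = max(n, collapsed.get(key, 0))

print(collapsed)  # {('e1', 0): 2, ('e1', 3600): 3, ('e2', 0): 1}
```

The per-row output has as many rows as the input, which is why the outer aggregation step is needed.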

Answer 1: (score: 0)

Count only the first time each user appears:

SELECT eventname, date(from_unixtime(time_bucket)) AS date,
       (time_bucket % 86400)/3600 AS hour,
       COUNT(DISTINCT moengageuserid) AS hour_count,
       -- inner SUM: users first seen in this hour; outer SUM ... OVER: running total
       SUM(SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)) OVER (PARTITION BY eventname, date(from_unixtime(time_bucket)) ORDER BY (time_bucket % 86400)/3600) AS running_distinct_count
FROM (SELECT eventname,
             CAST(eventtimestamp AS bigint) - CAST(eventtimestamp AS bigint) % 3600 AS time_bucket,
             moengageuserid,
             -- seqnum = 1 marks the first row for each (eventname, user) pair
             ROW_NUMBER() OVER (PARTITION BY eventname, moengageuserid ORDER BY eventtimestamp) AS seqnum
      FROM clickstream.moengage
      WHERE date = '2020-08-20' AND
            eventname IN ('e1', 'e2', 'e3', 'e4')
    ) m
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3;
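
The first-appearance idea can be tried on toy data with the sketch below, which uses Python's built-in sqlite3 module rather than Presto. The table `events`, its columns, and the sample rows are hypothetical, and the aggregation is split into two steps (a per-hour CTE followed by a windowed SUM) to keep the SQL portable. It assumes the bundled SQLite is version 3.25 or newer, which is needed for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (eventname TEXT, hour INTEGER, user_id TEXT, ts INTEGER)"
)
# Hypothetical sample rows: u1 and u2 return in later hours.
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("e1", 0, "u1", 10), ("e1", 0, "u2", 20), ("e1", 0, "u1", 30),
        ("e1", 1, "u2", 3700), ("e1", 1, "u3", 3800),
        ("e1", 2, "u1", 7300),
    ],
)

query = """
WITH flagged AS (
    -- seqnum = 1 marks each user's first row for an event
    SELECT eventname, hour, user_id,
           ROW_NUMBER() OVER (PARTITION BY eventname, user_id ORDER BY ts) AS seqnum
    FROM events
),
hourly AS (
    -- per-hour distinct users and number of users seen for the first time
    SELECT eventname, hour,
           COUNT(DISTINCT user_id) AS hour_count,
           SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) AS new_users
    FROM flagged
    GROUP BY eventname, hour
)
SELECT eventname, hour, hour_count,
       SUM(new_users) OVER (PARTITION BY eventname ORDER BY hour) AS running_distinct_count
FROM hourly
ORDER BY eventname, hour
"""

result = conn.execute(query).fetchall()
for row in result:
    print(row)
# ('e1', 0, 2, 2)
# ('e1', 1, 2, 3)
# ('e1', 2, 1, 3)
```

Because each user contributes 1 only in the hour of their first appearance, the running sum of those flags equals the running distinct count.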