Question

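There is a fair amount of material on counting distinct things per month using dense_rank() and similar functions, but I have not been able to find anything that counts distinct IDs per month while also excluding any ID that was already seen in an earlier month.

The data can be pictured like this: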

id (int8 type) | observed time (timestamp utc)
------------------
1  | 2017-01-01
2  | 2017-01-02
1  | 2017-01-02
1  | 2017-02-02
2  | 2017-02-03
3  | 2017-02-04
1  | 2017-03-01
3  | 2017-03-01
4  | 2017-03-01
5  | 2017-03-02

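The counting process works like this:

1: In 2017-01 we see devices 1 and 2, so the count is 2.

2: In 2017-02 we see devices 1, 2 and 3. We already know devices 1 and 2, but not 3, so the count is 1.

3: In 2017-03 we see devices 1, 3, 4 and 5. We already know 1 and 3, but not 4 or 5, so the count is 2.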
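To make the rule concrete, here is a minimal Python sketch of the counting process (plain Python, not Redshift SQL). The names `rows` and `count_new_ids_per_month` are made up for illustration, and the dates are simplified to ISO date strings rather than real timestamps.

```python
# Sample rows from the table above: (id, observed date as an ISO string).
rows = [
    (1, "2017-01-01"), (2, "2017-01-02"), (1, "2017-01-02"),
    (1, "2017-02-02"), (2, "2017-02-03"), (3, "2017-02-04"),
    (1, "2017-03-01"), (3, "2017-03-01"), (4, "2017-03-01"), (5, "2017-03-02"),
]


def count_new_ids_per_month(rows):
    """Return {month: number of ids whose first sighting falls in that month}."""
    seen = set()   # every id observed so far, across all months
    counts = {}    # month ("YYYY-MM") -> count of never-before-seen ids
    # Walk the rows in chronological order so "seen before" is well defined.
    for device_id, observed in sorted(rows, key=lambda r: r[1]):
        month = observed[:7]
        counts.setdefault(month, 0)   # a month with no new ids still gets a 0
        if device_id not in seen:
            seen.add(device_id)
            counts[month] += 1
    return counts


new_ids = count_new_ids_per_month(rows)
print(new_ids)  # {'2017-01': 2, '2017-02': 1, '2017-03': 2}
```

Repeat sightings of the same id inside a month do not change the result, because an id is only counted the first time it enters `seen`.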

The desired output would look something like:

observed time | count of new id
--------------------------
2017-01       | 2
2017-02       | 1
2017-03       | 2

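To be clear, I want a new table with one row per month, counting how many new IDs appeared in that month that had never been seen before.

The real-world case allows a device to appear multiple times within a month, but that should not affect the count. It also stores the id as an integer (positive and negative), and the observed time is a true timestamp with second-level precision. The dataset is also large, so performance matters.

My initial attempt was: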

WITH records_months AS (
    SELECT *,
        date_trunc('month', observed_time) AS month_group
    FROM my_table
    -- >= keeps rows stamped exactly at midnight on 2017-01-01
    WHERE observed_time >= '2017-01-01'
),
id_months AS (
    -- one row per (month, id); DISTINCT alone is enough, no GROUP BY needed
    SELECT DISTINCT
        month_group,
        id
    FROM records_months
)
SELECT *
FROM id_months;

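However, I am stuck on the next part: counting the number of new IDs that were not seen in earlier months. I believe the solution may be a window function, but I cannot work out which one or how to apply it.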

Answer 1

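Here is the first thing that came to mind. The idea is:

(innermost query) compute the earliest month in which each id was seen,
(next level) join that back to the main my_table dataset, and then
(outer query) count distinct ids per month after nulling out any id that had already been seen.

I tested it and got the desired result set. Joining the earliest month back to the original table seemed like the most natural thing to do (versus a window function). Hopefully this is efficient enough for your Redshift!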

select observed_month,
    -- Null out the id if the observed_month that we're grouping by
    -- is NOT the earliest month that the id was seen.
    -- Then count distinct id
    count(distinct(case when observed_month != earliest_month then null else id end)) as num_new_ids
from (
    select t.id,
        date_trunc('month', t.observed_time) as observed_month,
        earliest.earliest_month
    from my_table t
        join (
            -- What's the earliest month an id was seen?
            select id,
                date_trunc('month', min(observed_time)) as earliest_month
            from my_table
            group by 1
        ) earliest
        on t.id = earliest.id
) months_seen  -- alias for the derived table
group by 1
order by 1;
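To try the same three-step shape without a Redshift cluster, here is a rough sketch that runs an equivalent query against the question's sample data using Python's built-in sqlite3 module. This is only a local stand-in: SQLite has no date_trunc, so `strftime('%Y-%m', ...)` is substituted as an assumption, and the alias `months_seen` and the variable names are illustrative.

```python
import sqlite3

# In-memory database loaded with the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, observed_time TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [
        (1, "2017-01-01"), (2, "2017-01-02"), (1, "2017-01-02"),
        (1, "2017-02-02"), (2, "2017-02-03"), (3, "2017-02-04"),
        (1, "2017-03-01"), (3, "2017-03-01"), (4, "2017-03-01"), (5, "2017-03-02"),
    ],
)

# Same structure as the query above; strftime('%Y-%m', ...) stands in for
# date_trunc('month', ...), which SQLite does not provide.
query = """
SELECT observed_month,
       -- keep the id only in the month it was first seen, then count distinct
       COUNT(DISTINCT CASE WHEN observed_month = earliest_month THEN id END)
           AS num_new_ids
FROM (
    SELECT t.id,
           strftime('%Y-%m', t.observed_time) AS observed_month,
           earliest.earliest_month
    FROM my_table t
    JOIN (
        -- earliest month in which each id was seen
        SELECT id,
               strftime('%Y-%m', MIN(observed_time)) AS earliest_month
        FROM my_table
        GROUP BY id
    ) earliest ON t.id = earliest.id
) AS months_seen
GROUP BY observed_month
ORDER BY observed_month
"""

result = conn.execute(query).fetchall()
print(result)  # [('2017-01', 2), ('2017-02', 1), ('2017-03', 2)]
```

Because the outer query groups by every observed month rather than only by first-seen months, a month in which no new id appears still shows up with a count of 0.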

How can I count "brand new, never-before-seen" IDs per month in Redshift?