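I have been banging my head for hours trying to figure out how to calculate the total number of newsletter subscribers per month in Redshift.
The calculation is based on an events table that tracks every user action, in particular whether a user subscribed to or unsubscribed from the newsletter. Simplified, it looks like this: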
+----------------------+---------+---------------+
| timestamp            | user_id | action        |
+----------------------+---------+---------------+
| 2017-01-01T12:10:31Z | 1       | subscribed    |
| 2017-01-01T13:11:51Z | 2       | subscribed    |
| 2017-01-01T13:15:53Z | 3       | subscribed    |
| ...                  | ...     | ...           |
| 2017-02-17T09:42:33Z | 4       | subscribed    |
| ...                  | ...     | ...           |
| 2017-03-15T16:59:13Z | 1       | unsubscribed  |
| 2017-03-17T02:19:56Z | 2       | unsubscribed  |
| 2017-03-17T05:33:05Z | 2       | subscribed    |
| ...                  | ...     | ...           |
+----------------------+---------+---------------+
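For each month, I want to count the users who subscribed to the newsletter that month plus the users who were already subscribed and have not unsubscribed. In the example above we would have 3 users in January and gain another one in February, for a total of 4. Then in March we lose one user, while another user only unsubscribes temporarily. Our subscriber total for March is 3.
The final result I am looking for is this: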
+------------+-------------+
| month      | subscribers |
+------------+-------------+
| 2017-01-01 | 3           |
| 2017-02-01 | 4           |
| 2017-03-01 | 3           |
| ...        | ...         |
+------------+-------------+
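Any ideas whether and how this can be solved with a SQL query (preferably one that works in Redshift or Postgres)?
Answer 0 (score: 1)
You can use a recursive CTE to generate each required month. Then match subscriptions with unsubscriptions (in another CTE, for simplicity). Note that a lateral join is used to pick the top 1 matching unsubscription. Finally, take the count of distinct user_id values for each month.
This is Postgres. Here is the SQL Fiddle where you can run this, adjust the data set, etc.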
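The query itself did not survive in this copy of the answer, so the following is only a sketch of the approach described above, not the original code. It swaps in SQLite (through Python's built-in sqlite3 module) so it can run without a database server. The table name `events`, the column name `ts`, and the CTE names `months` and `periods` are assumptions. SQLite has no LATERAL, so a correlated subquery picks the first later unsubscription instead; in Postgres that step would be the lateral join the answer mentions, and the month arithmetic would use DATE_TRUNC and intervals.

```python
import sqlite3

# Sample rows from the question; table and column names are assumed.
EVENTS = [
    ("2017-01-01T12:10:31Z", 1, "subscribed"),
    ("2017-01-01T13:11:51Z", 2, "subscribed"),
    ("2017-01-01T13:15:53Z", 3, "subscribed"),
    ("2017-02-17T09:42:33Z", 4, "subscribed"),
    ("2017-03-15T16:59:13Z", 1, "unsubscribed"),
    ("2017-03-17T02:19:56Z", 2, "unsubscribed"),
    ("2017-03-17T05:33:05Z", 2, "subscribed"),
]

MONTHLY_SQL = """
WITH RECURSIVE months(month_start) AS (
    -- first month that has any event
    SELECT substr(MIN(ts), 1, 7) || '-01' FROM events
    UNION ALL
    -- keep adding one month until the last month with an event
    SELECT date(month_start, '+1 month')
    FROM months
    WHERE month_start < (SELECT substr(MAX(ts), 1, 7) || '-01' FROM events)
),
periods AS (
    -- pair each subscription with the first unsubscription that follows it
    SELECT s.user_id,
           s.ts AS sub_ts,
           (SELECT MIN(u.ts)
              FROM events u
             WHERE u.user_id = s.user_id
               AND u.action = 'unsubscribed'
               AND u.ts > s.ts) AS unsub_ts
    FROM events s
    WHERE s.action = 'subscribed'
)
-- a user counts for a month if one of their periods is still open at month end
SELECT m.month_start, COUNT(DISTINCT p.user_id) AS subscribers
FROM months m
JOIN periods p
  ON p.sub_ts < date(m.month_start, '+1 month')
 AND (p.unsub_ts IS NULL OR p.unsub_ts >= date(m.month_start, '+1 month'))
GROUP BY m.month_start
ORDER BY m.month_start
"""


def monthly_subscribers(events):
    """Load the events into an in-memory SQLite table and run the sketch query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, user_id INTEGER, action TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
    rows = conn.execute(MONTHLY_SQL).fetchall()
    conn.close()
    return rows


print(monthly_subscribers(EVENTS))
```

On the sample rows this prints `[('2017-01-01', 3), ('2017-02-01', 4), ('2017-03-01', 3)]`, which matches the table in the question. The sketch counts subscribers as of the end of each month, which is one reading of the question.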
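Answer 1 (score: 1)
This seems to require a lot of joins, which may take a long time to finish depending on the size of your table. If space is not an issue and these kinds of queries are frequent, I would add an additional column containing a (binary) flag that marks the latest action, which you can then filter on. My attempt: SQL Fiddle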
-- get starting month
WITH start_month AS (
    SELECT MIN(CAST(DATE_TRUNC('month', ts) AS DATE)) AS earliest
    FROM test
),
-- bucket each date into months
month_buckets AS (
    SELECT CAST(DATE_TRUNC('month', ts) AS DATE) AS month_bucket
    FROM test
    GROUP BY 1
),
-- for each month bucket, find all actions taken by each user up to that month
master AS (
    SELECT mb.month_bucket, user_id, actions, ts
    FROM month_buckets mb
    LEFT JOIN test
        ON CAST(DATE_TRUNC('month', test.ts) AS DATE) <= mb.month_bucket
)
-- for each user, get the latest action and timestamp
-- group by month_bucket, count
SELECT m1.month_bucket AS month,
       COUNT(m1.user_id) AS subscribers
FROM master m1
JOIN (
    SELECT month_bucket, user_id, MAX(ts) AS ts
    FROM master
    GROUP BY 1, 2
) m2
    ON m1.month_bucket = m2.month_bucket
    AND m1.user_id = m2.user_id
    AND m1.ts = m2.ts
    AND m1.actions = 'subscribed'
GROUP BY 1
ORDER BY 1;
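The flag idea mentioned before the query could look roughly like the sketch below. This is one guess at what the answer means, and it uses SQLite through Python's sqlite3 module rather than Redshift so that it runs standalone. The `is_latest` column and the `events` and `ts` names are made up for illustration. Note that such a flag only answers "who is subscribed right now", not the per-month history.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# is_latest is a hypothetical flag column, not part of the question's table.
conn.execute(
    "CREATE TABLE events (ts TEXT, user_id INTEGER, action TEXT, "
    "is_latest INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO events (ts, user_id, action) VALUES (?, ?, ?)",
    [
        ("2017-01-01T12:10:31Z", 1, "subscribed"),
        ("2017-01-01T13:11:51Z", 2, "subscribed"),
        ("2017-01-01T13:15:53Z", 3, "subscribed"),
        ("2017-02-17T09:42:33Z", 4, "subscribed"),
        ("2017-03-15T16:59:13Z", 1, "unsubscribed"),
        ("2017-03-17T02:19:56Z", 2, "unsubscribed"),
        ("2017-03-17T05:33:05Z", 2, "subscribed"),
    ],
)

# Set the flag to 1 on the most recent event of every user, 0 elsewhere.
conn.execute(
    """
    UPDATE events
       SET is_latest = (
           ts = (SELECT MAX(e2.ts) FROM events e2
                  WHERE e2.user_id = events.user_id)
       )
    """
)

# Filtering on the flag gives the current subscriber count without any join.
current_subscribers = conn.execute(
    "SELECT COUNT(*) FROM events WHERE is_latest = 1 AND action = 'subscribed'"
).fetchone()[0]
flagged_rows = conn.execute(
    "SELECT COUNT(*) FROM events WHERE is_latest = 1"
).fetchone()[0]
print(current_subscribers, flagged_rows)
```

With the sample rows, one row per user is flagged and three of those four rows are subscriptions. In a real table the flag would have to be maintained on every insert, which is the storage and write cost the answer alludes to.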
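Answer 2 (score: 1)
The solution is:
1) Create a calendar table that stores dates (one row per unique date); see this question for more information. This is very handy for most BI queries.
2) Write a query with the following steps:
2a) Based on the subscribe/unsubscribe events, build the time ranges during which each user was subscribed (first use the lead function to identify the next event for each given event and get the necessary pairs). If a user has only a single subscribe event, use coalesce to set date_to to the current date.
2b) Join these ranges to the calendar table so that each row is one date/user pair.
2c) Count the rows using one method or another (unique IDs, daily average, first day of the month, last day of the month).
The query looks something like this: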
with
next_events as (
    select
        user_id
        ,"timestamp"::date as date_from
        ,action
        ,lead("timestamp") over (partition by user_id order by "timestamp")::date as date_to
        ,lead(action) over (partition by user_id order by "timestamp") as next_action
    from your_table
    where action in ('subscribed','unsubscribed')
)
,ranges as (
    select
        user_id
        ,date_from
        ,coalesce(date_to,current_date) as date_to
    from next_events
    where (action='subscribed' and next_action='unsubscribed')
        or (action='subscribed' and next_action is null)
)
,subscriber_days as (
    select
        t1.user_id
        ,t2.date
    from ranges t1
    join calendar t2
        on t2.date between t1.date_from and t1.date_to
)
-- use whatever method needed to identify monthly N from daily N (first day, last day, average, etc.)
-- below is the unique count
select
    date_trunc('month',date) as date
    ,count(distinct user_id) as subscribers
from subscriber_days
group by 1
order by 1
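To see what step 2a produces on the question's sample rows, here is a small sketch of the `ranges` step only. It runs on SQLite through Python's sqlite3 module instead of Redshift (window functions need SQLite 3.25 or newer). The `events` table name, the `ts` column name, and the `as_of` parameter, which stands in for `current_date` so the output is reproducible, are all assumptions.

```python
import sqlite3

RANGES_SQL = """
WITH next_events AS (
    SELECT user_id,
           substr(ts, 1, 10) AS date_from,
           action,
           -- next event of the same user, if any
           substr(LEAD(ts) OVER (PARTITION BY user_id ORDER BY ts), 1, 10) AS date_to,
           LEAD(action) OVER (PARTITION BY user_id ORDER BY ts) AS next_action
    FROM events
)
-- keep subscriptions that are either closed by an unsubscription or still open
SELECT user_id, date_from, COALESCE(date_to, :as_of) AS date_to
FROM next_events
WHERE action = 'subscribed'
  AND (next_action = 'unsubscribed' OR next_action IS NULL)
ORDER BY user_id, date_from
"""


def subscription_ranges(as_of):
    """Return (user_id, date_from, date_to) rows; open ranges end at as_of."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, user_id INTEGER, action TEXT)")
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("2017-01-01T12:10:31Z", 1, "subscribed"),
            ("2017-01-01T13:11:51Z", 2, "subscribed"),
            ("2017-01-01T13:15:53Z", 3, "subscribed"),
            ("2017-02-17T09:42:33Z", 4, "subscribed"),
            ("2017-03-15T16:59:13Z", 1, "unsubscribed"),
            ("2017-03-17T02:19:56Z", 2, "unsubscribed"),
            ("2017-03-17T05:33:05Z", 2, "subscribed"),
        ],
    )
    rows = conn.execute(RANGES_SQL, {"as_of": as_of}).fetchall()
    conn.close()
    return rows


for row in subscription_ranges("2017-03-31"):
    print(row)
```

User 2 ends up with two ranges, one closed on 2017-03-17 and one still open, which is why the final query counts distinct user_id values rather than rows.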
Answer 3 (score: 0)
The total number of subscribed users is:
select count(*)
from
(
    select distinct id
    from subscribers
    group by id
    having count(*) % 2 = 1 -- odd number of events: the latest one is a subscription
) a
The number subscribed during a given period:
select count(distinct a.id)
from
(
    select distinct id
    from subscribers
    group by id
    having count(*) % 2 = 1 -- odd number of events: the latest one is a subscription
) a join
subscribers s on a.id = s.id
where timestamp between @date1 and @date2 -- @date1 and @date2 are placeholders for the period bounds
Note: I have not tried this in Redshift or Postgres.
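For what it is worth, the odd-count idea can be checked against the question's sample rows with a quick sketch. It uses SQLite through Python's sqlite3 module, and the `events`, `ts` and `user_id` names are assumptions taken from the question rather than from this answer. The trick only holds if every user's events strictly alternate, starting with a subscription.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-01-01T12:10:31Z", 1, "subscribed"),
        ("2017-01-01T13:11:51Z", 2, "subscribed"),
        ("2017-01-01T13:15:53Z", 3, "subscribed"),
        ("2017-02-17T09:42:33Z", 4, "subscribed"),
        ("2017-03-15T16:59:13Z", 1, "unsubscribed"),
        ("2017-03-17T02:19:56Z", 2, "unsubscribed"),
        ("2017-03-17T05:33:05Z", 2, "subscribed"),
    ],
)

# Users with an odd number of events are treated as currently subscribed.
subscribed_ids = [
    row[0]
    for row in conn.execute(
        """
        SELECT user_id
        FROM events
        GROUP BY user_id
        HAVING COUNT(*) % 2 = 1
        ORDER BY user_id
        """
    )
]
print(subscribed_ids)
```

This prints `[2, 3, 4]`: user 1 has two events (subscribed, then unsubscribed) and drops out, while user 2 has three and stays in. It gives the current total only, so it does not by itself produce the per-month table the question asks for.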