Redshift - Calculating monthly active users

Time: 2017-02-15 22:42:28

Tags: sql aggregate aggregate-functions amazon-redshift

I have a table that looks like this:

Date       | User_ID
2017-1-1   |  1
2017-1-1   |  2
2017-1-1   |  4
2017-1-2   |  3
2017-1-2   |  2
...        |  ..
...        |  ..
...        |  ..
...        |  ..
2017-2-1   |  1
2017-2-2   |  2
...        |  ..
...        |  ..
...        |  ..

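I want to calculate the number of monthly active users (MAU) over a rolling 30-day window. I know that Redshift does not support COUNT(DISTINCT) as a window function. How can I get the following output?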

Date      | MAU
2017-1-1  | 3
2017-1-2  | 4    <- We don't want to count user_id 2 twice.
...       | ..
...       | ..
...       | ..
2017-2-1  | ..
2017-2-2  | ..
...       | ..
...       | ..

I tried the following (which obviously fails). Here is my code:

SELECT event_date
    ,sum(user_count) mau_count
    ,CASE
        WHEN event_date = date_trunc('week', event_date)
            THEN 1
        ELSE 0
        END week_starting FROM (
    SELECT event_date
        ,count(*) OVER (PARTITION BY event_date ORDER BY event_date ROWS BETWEEN 30 PRECEDING
                    AND CURRENT ROW
            ) AS user_count    -- I know this is wrong. Just my attempt :)
    FROM (
        SELECT DISTINCT (user_id)
            ,event_date
        FROM event_table
        ) daily_distinct_users
    GROUP BY event_date
    ) cumulative_daily_distinct_users GROUP BY event_date;

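Please tell me how to get an accurate MAU count. Thanks!

3 Answers:

Answer 0: (score: 1)

Assuming there are no missing dates, you can first use the MIN function to get the first date on which each user appeared. Then get the user count for each date, and finally use the SUM window function to get the rolling sum.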

SELECT DISTINCT EVENT_DATE,
SUM(CNT) OVER(ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS MAU
FROM
 (SELECT E.EVENT_DATE,
         COUNT(DISTINCT T.USER_ID) AS CNT
  FROM EVENT_TABLE E
  LEFT JOIN
   (SELECT DISTINCT USER_ID,
     MIN(EVENT_DATE) OVER(PARTITION BY USER_ID
                          ORDER BY EVENT_DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) AS FIRST_APPEARED_ON
    FROM EVENT_TABLE 
   ) T ON T.FIRST_APPEARED_ON=E.EVENT_DATE AND T.USER_ID=E.USER_ID
  GROUP BY E.EVENT_DATE
) T1

Sample Demo using SQL Server

Answer 1: (score: 1)

This seems to work (the columns in the log table are dt and userid):

SELECT
  end_date,
  -- The number of distinct users during the 30 days prior
  COUNT(DISTINCT userid) distinct_users
FROM log
JOIN
( -- A list of dates to appear in the output first column
  SELECT DISTINCT dt AS end_date
  FROM log
  WHERE dt BETWEEN date '2017-01-01' AND date '2017-01-31'
) ON dt BETWEEN end_date - interval '30 days' AND end_date
GROUP BY end_date
ORDER BY end_date

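Basically, the sub-select generates the list of end_dates that appears as the first output column. Each end_date is then joined to every log row whose dt falls within the 30 days up to and including that date, and the distinct userid values in that window are counted.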
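To try the self-join idea on a small data set without a Redshift cluster, here is a minimal sketch using Python's built-in sqlite3 module. It is not the answer's exact query: SQLite has no interval arithmetic, so date(end_date, '-30 days') stands in for end_date - interval '30 days', dates are stored as ISO text (an assumption of this sketch), and the table aliases d and l are added for clarity. The rows are the visible ones from the question's sample table.

```python
import sqlite3

# In-memory database; the table mirrors the answer's log(dt, userid) layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE log (dt TEXT, userid INTEGER)")
conn.executemany(
    "INSERT INTO log VALUES (?, ?)",
    [
        ("2017-01-01", 1), ("2017-01-01", 2), ("2017-01-01", 4),
        ("2017-01-02", 3), ("2017-01-02", 2),
        ("2017-02-01", 1), ("2017-02-02", 2),
    ],
)

# Self-join: each end_date is paired with every log row in its window,
# then distinct users are counted per end_date.
query = """
SELECT d.end_date,
       COUNT(DISTINCT l.userid) AS distinct_users
FROM (SELECT DISTINCT dt AS end_date FROM log) AS d
JOIN log AS l
  ON l.dt BETWEEN date(d.end_date, '-30 days') AND d.end_date
GROUP BY d.end_date
ORDER BY d.end_date
"""
rows = conn.execute(query).fetchall()
for end_date, distinct_users in rows:
    print(end_date, distinct_users)
```

Note that BETWEEN is inclusive on both ends, so a 30-day offset covers 31 calendar days; use 29 if the window should span exactly 30 days including the end date.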

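Answer 2: (score: 0)

@John Rotenstein's answer works well.

For those who stumble upon this question and are looking for more, the following blog post describes an alternative pre-computation strategy for calculating rolling MAU quickly. It is overkill for this question, but it may come in handy if you:

  • are annoyed by slow growth-metric calculations in interactive queries,
  • need to compute other rolling growth metrics (e.g., signups, activations, retention, reactivation), or
  • regularly perform analyses that involve some kind of rolling user count.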
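
The blog post's actual method is not reproduced here. As a rough illustration of what pre-computation can look like in general, the sketch below deduplicates events into one row per user per day, then stores the rolling count per day in a small table that interactive queries can read directly. It runs on Python's built-in sqlite3 module; the table names daily_active and rolling_mau, the sample rows, and the SQLite date() call are all assumptions of this sketch, not Redshift syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE event_table (event_date TEXT, user_id INTEGER);
INSERT INTO event_table VALUES
  ('2017-01-01', 1), ('2017-01-01', 1), ('2017-01-01', 2),
  ('2017-01-02', 2), ('2017-01-02', 3),
  ('2017-02-05', 1);

-- Step 1: collapse raw events to one row per user per day.
CREATE TABLE daily_active AS
SELECT DISTINCT event_date, user_id FROM event_table;

-- Step 2: compute the rolling distinct count once and store it.
CREATE TABLE rolling_mau AS
SELECT d.event_date,
       COUNT(DISTINCT a.user_id) AS mau
FROM (SELECT DISTINCT event_date FROM daily_active) AS d
JOIN daily_active AS a
  ON a.event_date BETWEEN date(d.event_date, '-30 days') AND d.event_date
GROUP BY d.event_date;
""")

# Interactive queries now read the small precomputed table.
mau_rows = conn.execute(
    "SELECT event_date, mau FROM rolling_mau ORDER BY event_date"
).fetchall()
for event_date, mau in mau_rows:
    print(event_date, mau)
```

The trade-off is the usual one for materialized results: reads become cheap, but the stored tables have to be refreshed whenever new events arrive.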