Question

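I have a daily sessions table with the columns user_id and date. I'd like to graph DAU/MAU (daily active users / monthly active users) on a daily basis. For example: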

Date         MAU      DAU     DAU/MAU
2014-06-01   20,000   5,000   20%
2014-06-02   21,000   4,000   19%
2014-06-03   20,050   3,050   17%
...          ...      ...     ...

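Calculating the daily actives is straightforward, but calculating the monthly actives (the number of users who logged in during the 30 days up to that date) is causing problems. How can this be done without a left join for each day?

Edit: I'm using Postgres.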

Answer 1

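Assuming you have values for each day, you can get the total counts using a subquery and a rows between window frame:

with dau as (
      select date, count(user_id) as dau
      from dailysessions ds
      group by date
     )
select date, dau,
       -- current day plus the 29 days before it = a 30-day window
       sum(dau) over (order by date rows between 29 preceding and current row) as mau
from dau;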

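Unfortunately, I think you want distinct users rather than just user counts. That makes the problem much harder, especially because Postgres doesn't support count(distinct) as a window function.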
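To make the difference concrete, here is a minimal Python sketch (not part of the original answer) that compares the two definitions on made-up data. The sessions list and the helper names summed_dau and distinct_mau are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical sample data: (user_id, session date).
sessions = [
    (1, date(2014, 6, 1)),
    (2, date(2014, 6, 1)),
    (1, date(2014, 6, 2)),
    (3, date(2014, 6, 2)),
    (1, date(2014, 6, 3)),
]

def summed_dau(rows, day, window=30):
    """Sum the daily distinct-user counts over the trailing window."""
    by_day = defaultdict(set)
    for user_id, d in rows:
        by_day[d].add(user_id)
    start = day - timedelta(days=window - 1)
    return sum(len(users) for d, users in by_day.items() if start <= d <= day)

def distinct_mau(rows, day, window=30):
    """Count distinct users over the same trailing window."""
    start = day - timedelta(days=window - 1)
    return len({user_id for user_id, d in rows if start <= d <= day})

last_day = date(2014, 6, 3)
print(summed_dau(sessions, last_day))    # user 1 contributes once per active day
print(distinct_mau(sessions, last_day))  # user 1 contributes once in total
```

A user who shows up on several days is counted once per day by the summed version, so it overstates MAU whenever users return within the window.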

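I think you have to do some sort of self join for this. Here is one method:

with dau as (
      select date, count(distinct user_id) as dau
      from dailysessions ds
      group by date
     )
select date, dau,
       (select count(distinct user_id)
        from dailysessions ds
        -- qualify with dau. so the bounds refer to the outer row, not to ds.date
        where ds.date between dau.date - 29 * interval '1 day' and dau.date
       ) as mau
from dau;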
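For readers who want to check the numbers outside the database, the same trailing 30-day distinct count can be sketched in Python with a sliding window. This is an illustration under assumptions, not a translation of the query: the function name dau_mau and the (user_id, date) tuple layout are hypothetical, and the window is taken to be the current day plus the 29 days before it.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

def dau_mau(rows, window=30):
    """Return [(day, dau, mau)] for each calendar day from first to last session.

    mau is the number of distinct users seen in the `window` days ending on
    `day`, inclusive.
    """
    by_day = defaultdict(set)
    for user_id, day in rows:
        by_day[day].add(user_id)
    if not by_day:
        return []

    active = Counter()  # user_id -> number of in-window days with a session
    out = []
    day, last = min(by_day), max(by_day)
    while day <= last:
        for user_id in by_day.get(day, ()):      # this day enters the window
            active[user_id] += 1
        expired = day - timedelta(days=window)
        for user_id in by_day.get(expired, ()):  # this day leaves the window
            active[user_id] -= 1
            if active[user_id] == 0:
                del active[user_id]
        out.append((day, len(by_day.get(day, ())), len(active)))
        day += timedelta(days=1)
    return out

# Hypothetical sample data: (user_id, session date).
sample = [
    (1, date(2014, 5, 1)),
    (2, date(2014, 5, 1)),
    (1, date(2014, 5, 31)),
    (3, date(2014, 5, 31)),
]
for day, dau, mau in dau_mau(sample):
    ratio = dau / mau if mau else None
    print(day, dau, mau, ratio)
```

Each day's users are added once and removed once, so the work grows with the number of session rows rather than with rows times window length.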

Answer 2

This one uses COUNT DISTINCT to get a rolling 30-day DAU/MAU:

(It calculates reddit user engagement in BigQuery, but the SQL is standard enough to be used on other databases.)

SELECT day, dau, mau, INTEGER(100*dau/mau) daumau
FROM (
  SELECT day, EXACT_COUNT_DISTINCT(author) dau, FIRST(mau) mau
  FROM (
    SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) day, author
    FROM [fh-bigquery:reddit_comments.2015_09]
    WHERE subreddit='AskReddit') a
  JOIN (
    SELECT stopday, EXACT_COUNT_DISTINCT(author) mau
    FROM (SELECT created_utc, subreddit, author FROM [fh-bigquery:reddit_comments.2015_09], [fh-bigquery:reddit_comments.2015_08]) a
    CROSS JOIN (
      SELECT DATE(SEC_TO_TIMESTAMP(created_utc)) stopday
      FROM [fh-bigquery:reddit_comments.2015_09]
      GROUP BY 1
    ) b
    WHERE subreddit='AskReddit'
    AND SEC_TO_TIMESTAMP(created_utc) BETWEEN DATE_ADD(stopday, -30, 'day') AND TIMESTAMP(stopday)
    GROUP BY 1
  ) b
  ON a.day=b.stopday
  GROUP BY 1
)
ORDER BY 1

I went further with this in How to calculate DAU/MAU with BigQuery (engagement).

Answer 3

You didn't show us your complete table definition, but maybe something like this:

select date,
       count(*) over (partition by date_trunc('day', date) order by date) as dau,
       count(*) over (partition by date_trunc('month', date) order by date) as mau
from sessions
order by date;

To get the percentage without repeating the window functions, just wrap it in a derived table:

select date, 
       dau,
       mau,
       dau::numeric / (case when mau = 0 then null else mau end) as pct
from (
    select date,
           count(*) over (partition by date_trunc('day', date) order by date) as dau,
           count(*) over (partition by date_trunc('month', date) order by date) as mau
    from sessions
) t
order by date;

Here is an example output:

postgres=> select * from sessions;
 session_date | user_id
--------------+---------
 2014-05-01   |       1
 2014-05-01   |       2
 2014-05-01   |       3
 2014-05-02   |       1
 2014-05-02   |       2
 2014-05-02   |       3
 2014-05-02   |       4
 2014-05-02   |       5
 2014-06-01   |       1
 2014-06-01   |       2
 2014-06-01   |       3
 2014-06-02   |       1
 2014-06-02   |       2
 2014-06-02   |       3
 2014-06-02   |       4
 2014-06-03   |       1
 2014-06-03   |       2
 2014-06-03   |       3
 2014-06-03   |       4
 2014-06-03   |       5
(20 rows)

postgres=> select session_date,
postgres->        dau,
postgres->        mau,
postgres->        round(dau::numeric / (case when mau = 0 then null else mau end),2) as pct
postgres-> from (
postgres(>     select session_date,
postgres(>            count(*) over (partition by date_trunc('day', session_date) order by session_date) as dau,
postgres(>            count(*) over (partition by date_trunc('month', session_date) order by session_date) as mau
postgres(>     from sessions
postgres(> ) t
postgres-> order by session_date;
 session_date | dau | mau | pct
--------------+-----+-----+------
 2014-05-01   |   3 |   3 | 1.00
 2014-05-01   |   3 |   3 | 1.00
 2014-05-01   |   3 |   3 | 1.00
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-05-02   |   5 |   8 | 0.63
 2014-06-01   |   3 |   3 | 1.00
 2014-06-01   |   3 |   3 | 1.00
 2014-06-01   |   3 |   3 | 1.00
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-02   |   4 |   7 | 0.57
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
 2014-06-03   |   5 |  12 | 0.42
(20 rows)

postgres=>

Answer 4

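I've written about this on my blog.

As you've noticed, DAU is easy. You can solve MAU by first creating a view with boolean values for when a user activates and when they churn, like so: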

CREATE OR REPLACE VIEW "vw_login" AS 
 SELECT *
    , LEAST (LEAD("date") OVER w, "date" + 30) AS "activeExpiry"
    , CASE WHEN LAG("date") OVER w IS NULL THEN true ELSE false END AS "activated"
    , CASE
 WHEN LEAD("date") OVER w IS NULL THEN true
 WHEN LEAD("date") OVER w - "date" > 30 THEN true
 ELSE false
 END AS "churned"
    , CASE
 WHEN LAG("date") OVER w IS NULL THEN false
 WHEN "date" - LAG("date") OVER w <= 30 THEN false
 WHEN row_number() OVER w > 1 THEN true
 ELSE false
 END AS "resurrected"
   FROM "login"
   WINDOW w AS (PARTITION BY "user_id" ORDER BY "date")

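This creates a boolean value per user per day for when they become active, when they churn, and when they are reactivated.

Then do a daily aggregate of the same: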

CREATE OR REPLACE VIEW "vw_activity" AS
SELECT 
    SUM("activated"::int) "activated"
  , SUM("churned"::int) "churned"
  , SUM("resurrected"::int) "resurrected"
  , "date"
  FROM "vw_login"
  GROUP BY "date"
  ;

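Finally, calculate the running total of active users (MAU) as a cumulative sum over these columns. You need to join vw_activity twice, because the second join targets the day the user becomes inactive (i.e. 30 days after their last login).

I've included a date series to ensure that every day is present in the result. You can do without it, but then days with no activity will be missing from the output.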

SELECT
 d."date"
 , SUM(COALESCE(a.activated::int,0)
   - COALESCE(a2.churned::int,0)
   + COALESCE(a.resurrected::int,0)) OVER w AS "mau"
 , d."date", a."activated", a2."churned", a."resurrected" FROM
 -- name the series column "date" so that d."date" resolves
 generate_series('2010-01-01'::date, CURRENT_DATE, '1 day'::interval) AS d("date")
 LEFT OUTER JOIN vw_activity a ON d."date" = a."date"
 LEFT OUTER JOIN vw_activity a2 ON d."date" = (a2."date" + INTERVAL '30 days')::date
 WINDOW w AS (ORDER BY d."date") ORDER BY d."date";

You could of course do this in a single query, but splitting it up helps in understanding the structure.
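The activation / churn / resurrection bookkeeping can also be sketched in a few lines of Python, which may help when checking the views against real data. This is an illustrative sketch rather than the code from the blog post: the function name running_mau and the (user_id, date) tuple layout are hypothetical, and it assumes at most one login per user per day matters.

```python
from collections import defaultdict
from datetime import date, timedelta

def running_mau(rows, window=30):
    """Return {day: mau} as a running total of daily net changes.

    A login adds one active user if it is the user's first login or follows
    a gap of more than `window` days. It schedules the removal of one active
    user `window` days later if no further login arrives within that gap.
    """
    logins = defaultdict(set)
    for user_id, day in rows:
        logins[user_id].add(day)  # collapse repeat logins on the same day

    delta = defaultdict(int)  # day -> net change in active users
    for days in logins.values():
        ordered = sorted(days)
        for i, day in enumerate(ordered):
            prev = ordered[i - 1] if i > 0 else None
            nxt = ordered[i + 1] if i + 1 < len(ordered) else None
            if prev is None or (day - prev).days > window:
                delta[day] += 1  # activated or resurrected
            if nxt is None or (nxt - day).days > window:
                delta[day + timedelta(days=window)] -= 1  # churn takes effect

    if not delta:
        return {}
    result, total = {}, 0
    day = min(delta)
    last = max(d for days in logins.values() for d in days)
    while day <= last:  # walk every calendar day, like the date series
        total += delta.get(day, 0)
        result[day] = total
        day += timedelta(days=1)
    return result

# Hypothetical sample data: user 1 returns after 31 days, user 2 after 30.
sample = [
    (1, date(2014, 1, 1)),
    (1, date(2014, 2, 1)),
    (2, date(2014, 1, 1)),
    (2, date(2014, 1, 31)),
]
for day, mau in running_mau(sample).items():
    print(day, mau)
```

The appeal of this design is that each login row is visited once and the per-day totals come from a cumulative sum, instead of recounting distinct users for every day.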

Querying DAU/MAU over time (daily)
