Get each user's status for every day from the status change history

Time: 2016-09-15 13:44:17

Tags: sql postgresql greatest-n-per-group

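I'm using Postgres and have a non-trivial query. I have two solutions, but neither of them is fast.

There is a table user_status_changes, which holds the history of user status changes: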

 user_id |         created_at  | from_status | to_status
---------+---------------------+-------------+-----------
       3 | 2016-03-24 04:00:00 | active      | pending
       3 | 2016-03-27 19:59:21 | pending     | banned
       6 | 2016-03-16 10:00:00 | pending     | active
       6 | 2016-03-21 15:00:00 | active      | banned
       6 | 2016-03-25 19:52:46 | banned      | pending
       6 | 2016-03-25 20:53:22 | pending     | canceled

users

id |         created_at
----+----------------------------
  3 | 2016-03-21 19:54:09.831252
  6 | 2016-03-14 13:04:09.134358
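
To make the queries below easy to try out, the sample rows above can be loaded with something like the following. This is only a sketch: the column types and constraints are assumptions inferred from the sample data, not the real schema.

```sql
-- Assumed types: the real schema is not shown in the question.
CREATE TABLE users (
  id         integer   PRIMARY KEY,
  created_at timestamp NOT NULL
);

CREATE TABLE user_status_changes (
  user_id     integer   NOT NULL REFERENCES users (id),
  created_at  timestamp NOT NULL,
  from_status text      NOT NULL,
  to_status   text      NOT NULL
);

INSERT INTO users (id, created_at) VALUES
  (3, '2016-03-21 19:54:09.831252'),
  (6, '2016-03-14 13:04:09.134358');

INSERT INTO user_status_changes (user_id, created_at, from_status, to_status) VALUES
  (3, '2016-03-24 04:00:00', 'active',  'pending'),
  (3, '2016-03-27 19:59:21', 'pending', 'banned'),
  (6, '2016-03-16 10:00:00', 'pending', 'active'),
  (6, '2016-03-21 15:00:00', 'active',  'banned'),
  (6, '2016-03-25 19:52:46', 'banned',  'pending'),
  (6, '2016-03-25 20:53:22', 'pending', 'canceled');
```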

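What I want is a list of every day from users.created_at until today, showing the user's status on that date and the user's status on the previous day.

Example result (assuming today is 2016-03-27):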

 user_id   | date        | status_at | previous_status
-----------+-------------+-----------+-----------------
         3 | 2016-03-21  |           |
         3 | 2016-03-22  |           |
         3 | 2016-03-23  |           |
         3 | 2016-03-24  | pending   |
         3 | 2016-03-25  | pending   | pending
         3 | 2016-03-26  | pending   | pending
         3 | 2016-03-27  | banned    | pending
         6 | 2016-03-14  |           | 
         6 | 2016-03-15  |           | 
         6 | 2016-03-16  | active    | 
         6 | 2016-03-17  | active    | active
         6 | 2016-03-18  | active    | active
         6 | 2016-03-19  | active    | active
         6 | 2016-03-20  | active    | active
         6 | 2016-03-21  | banned    | active
         6 | 2016-03-22  | banned    | banned
         6 | 2016-03-23  | banned    | banned
         6 | 2016-03-24  | banned    | banned
         6 | 2016-03-25  | canceled  | banned
         6 | 2016-03-26  | canceled  | canceled
         6 | 2016-03-27  | canceled  | canceled

I have two solutions. One uses subqueries (very slow):

WITH possible_dates AS (
  -- one row per calendar day from the earliest signup up to "today"
  SELECT date(generate_series) AS "date"
    FROM generate_series(
      (SELECT min(created_at) FROM users)::date,
      '2016-03-27'::date,
      '1 day'
    )
)
SELECT 
  users.id AS user_id,
  possible_dates.date,
  (
    -- latest status change on or before the given day
    SELECT to_status 
    FROM user_status_changes 
    WHERE user_status_changes.user_id = users.id
      AND date(user_status_changes.created_at) <= possible_dates.date
    ORDER BY user_status_changes.created_at DESC
    LIMIT 1
  ) AS status_at,
  -- a scalar subquery passed as a function argument needs its own parentheses
  LAG((
      SELECT to_status 
      FROM user_status_changes 
      WHERE user_status_changes.user_id = users.id
        AND date(user_status_changes.created_at) <= possible_dates.date
      ORDER BY user_status_changes.created_at DESC
      LIMIT 1
    )) OVER (PARTITION BY users.id ORDER BY possible_dates.date ASC) AS previous_status
FROM users
CROSS JOIN possible_dates
WHERE date(users.created_at) <= possible_dates.date;

The other uses joins (seems faster):

WITH status_changes AS (
  -- keep only the last status change of each user on each day
  SELECT
    DISTINCT ON(user_id, date)
    user_id,
    created_at::date AS date,
    to_status,
    from_status
  FROM user_status_changes
  ORDER BY user_id, date, created_at DESC
),
possible_dates AS (
  SELECT date(generate_series) AS "date"
        FROM generate_series(
          (SELECT min(created_at) FROM users)::date,
          '2016-03-27'::date,
          '1 day'
        )
)
SELECT
  DISTINCT ON (users.id, possible_dates.date)
  users.id AS user_id,
  possible_dates.date AS date,
  s1.to_status AS status_at,
  s2.to_status AS previous_status
FROM users
CROSS JOIN possible_dates
-- s1: every change up to and including the day (trimmed by DISTINCT ON)
LEFT OUTER JOIN status_changes s1
   ON s1.date <= possible_dates.date
  AND s1.user_id = users.id
-- s2: every change strictly before the day
LEFT JOIN LATERAL (
  SELECT
    status_changes.to_status,
    status_changes.date
  FROM status_changes
  WHERE
    status_changes.date < possible_dates.date AND
    status_changes.user_id = users.id
) s2 ON true
WHERE date(users.created_at) <= possible_dates.date
ORDER BY users.id, possible_dates.date DESC, s1.date DESC, s2.date DESC;

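Currently we have about 20,000 users, with roughly 10 payments and 2 status changes per user per month. The first user was created 1 year ago.

I think the problem with the join approach is that we join all previous status changes and then throw away the redundant rows with DISTINCT ON.

Any better solution is much appreciated, and index suggestions are welcome too.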

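2 answers:

Answer 0: (score: 1)

Never, ever apply "date(field) >=" or any other function to a column that might be indexed. Doing so rules out any use of an ordinary (non-functional) index on that column.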
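
As an illustration of that point, here is a sketch of the same lookup written both ways. The fixed user id and date are just examples taken from the sample data, and it assumes created_at is a timestamp without time zone.

```sql
-- Function wrapped around the column: a plain index on created_at cannot be used.
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
  AND date(created_at) <= DATE '2016-03-25'
ORDER BY created_at DESC
LIMIT 1;

-- Same condition expressed as a range on the bare column:
-- "on or before 2016-03-25" is "before midnight of 2016-03-26".
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
  AND created_at < DATE '2016-03-25' + 1
ORDER BY created_at DESC
LIMIT 1;
```

With the sample rows both forms return canceled; only the second one leaves the planner free to use an index on (user_id, created_at).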

select user_id, s_date, status_at,
       lag(status_at) over(partition by user_id order by part,s_date) previous_status
  from
  (
   select user_id, s_date, part,
          first_value(to_status)
          over(partition by user_id,part order by s_date) status_at
     from
     (
       select U.id as user_id, s_date,
              first_value(to_status) over(partition by U.id,s_date order by S.created_at desc) to_status,
              count(to_status) over (partition by U.id order by s_date) as part,
              row_number() over (partition by U.id,s_date order by S.created_at desc) rn
         from users U
         left join
              generate_series(date(U.created_at),'2016-03-27'::date,'1 day') s_date ON true
         left join user_status_changes S
           on S.user_id=U.id
             and S.created_at >= s_date and S.created_at < s_date + interval '1 day'
     ) D where rn=1
   ) C
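
The part column is what carries a status forward over the days without a change: count() ignores NULLs, so the running count only increases on days that have a change, and every following empty day falls into the same group. A stripped-down sketch of just that step, on made-up rows for a single user (the daily CTE and its values are purely illustrative):

```sql
-- Toy input: one row per day, status is NULL on days without a change.
WITH daily (day, status) AS (
  VALUES
    (DATE '2016-03-14', NULL::text),
    (DATE '2016-03-15', NULL),
    (DATE '2016-03-16', 'active'),
    (DATE '2016-03-17', NULL),
    (DATE '2016-03-18', 'banned'),
    (DATE '2016-03-19', NULL)
),
grouped AS (
  -- running count of non-NULL statuses: 0, 0, 1, 1, 2, 2
  SELECT day, status,
         count(status) OVER (ORDER BY day) AS part
  FROM daily
)
SELECT day,
       -- the first row of each group is the day the status was set
       first_value(status) OVER (PARTITION BY part ORDER BY day) AS status_at
FROM grouped
ORDER BY day;
```

The two days before the first change stay NULL, 2016-03-16 and 2016-03-17 come out as active, and the last two days as banned.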

You will probably need: create index user_status_dt on user_status_changes(user_id, created_at)

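Answer 1: (score: 1)

My query does this without LATERAL, which has to be evaluated for every row as in your query or @Mike's, so it should be faster.

Explanation

First, generate the series of dates, as you are already doing. CTE: generate_dates

Then restrict the output to the dates since each user was created, and pick up the status set on each of those dates. CTE: basic_status

In the inner select, use a LEFT JOIN and COALESCE() to fill the NULLs between status changes with the status that is in effect at that point, and use DISTINCT ON to keep, out of all the statuses set before a given date, only the most recent one.

The outer select only serves to compute the previous status with the LAG() window function.

Query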

WITH generate_dates AS (
SELECT date(generate_series) AS date
    FROM generate_series(
      (SELECT min(created_at) FROM users)::date,
      '2016-03-27'::date,
      '1 day'
    )
)
, basic_status AS (
SELECT 
  u.id AS user_id, 
  g.date,
  s.to_status AS status_at,
  -- s.created_at as tie-breaker: several changes can fall on the same day
  row_number() OVER (PARTITION BY u.id ORDER BY g.date, s.created_at) AS rownum
FROM users u
JOIN generate_dates g ON
  g.date > u.created_at - interval '1 day'
LEFT JOIN user_status_changes s ON
  u.id = s.user_id
  -- half-open range, so a change at exactly midnight is counted for one day only
  AND s.created_at >= g.date
  AND s.created_at < g.date + interval '1 day'
)
SELECT 
  *,
  LAG(status_at) OVER (PARTITION BY user_id ORDER BY date) AS previous_status
FROM (
  SELECT 
    DISTINCT ON ( b1.user_id, b1.date )
    b1.user_id,
    b1.date,
    COALESCE(b1.status_at, b2.status_at) AS status_at
  FROM basic_status b1
  LEFT JOIN basic_status b2 ON
    b1.user_id = b2.user_id
    AND b1.status_at IS NULL
    AND b2.status_at IS NOT NULL
    AND b1.rownum > b2.rownum
  -- b1.rownum DESC keeps the last change of a day, b2.rownum DESC the latest earlier one
  ORDER BY b1.user_id, b1.date DESC, b1.rownum DESC, b2.rownum DESC
  ) foo;

Indexes

You can create the following indexes to speed things up:

  • users(id)
  • user_status_changes(user_id, created_at)
  • users(created_at) - this one is probably less important
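
Written out as statements, the list above could look like this. The index names are made up for illustration, and IF NOT EXISTS assumes PostgreSQL 9.5 or later.

```sql
-- users(id): if id is the primary key, its index already exists
-- and nothing needs to be created for it.

-- Supports "changes of one user within a time range" lookups.
CREATE INDEX IF NOT EXISTS user_status_changes_user_id_created_at_idx
  ON user_status_changes (user_id, created_at);

-- Probably the least important of the three.
CREATE INDEX IF NOT EXISTS users_created_at_idx
  ON users (created_at);
```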

Notes

Be sure to refresh the statistics with ANALYZE table so that the planner can estimate costs more accurately.
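
For example, a quick way to refresh the statistics and then check whether an index is actually picked up. The lookup query here is only a stand-in; substitute the query you are tuning.

```sql
-- Refresh planner statistics for both tables.
ANALYZE users;
ANALYZE user_status_changes;

-- Show the chosen plan together with actual timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
  AND created_at < DATE '2016-03-28'
ORDER BY created_at DESC
LIMIT 1;
```

On a table as small as the sample data the planner will most likely choose a sequential scan regardless of indexes, so the plan is only meaningful against realistic volumes.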