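I'm using Postgres and have a non-trivial query. I have 2 solutions, but the problem is that neither of them is fast.
There is a table user_status_changes, which holds the history of user status changes: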
user_id | created_at | from_status | to_status
---------+---------------------+-------------+-----------
3 | 2016-03-24 04:00:00 | active | pending
3 | 2016-03-27 19:59:21 | pending | banned
6 | 2016-03-16 10:00:00 | pending | active
6 | 2016-03-21 15:00:00 | active | banned
6 | 2016-03-25 19:52:46 | banned | pending
6 | 2016-03-25 20:53:22 | pending | canceled
users
id | created_at
----+----------------------------
3 | 2016-03-21 19:54:09.831252
6 | 2016-03-14 13:04:09.134358
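For anyone who wants to reproduce the problem, the two tables above could be set up roughly as follows. This is only a sketch: the column types and the primary key are assumptions, since the question shows the data but not the schema.

```sql
-- Assumed schema: the question only shows the rows, not the column types.
CREATE TABLE users (
    id         integer PRIMARY KEY,   -- assumption: id is the primary key
    created_at timestamp
);

CREATE TABLE user_status_changes (
    user_id     integer,
    created_at  timestamp,
    from_status text,
    to_status   text
);

-- Sample rows copied from the question.
INSERT INTO users (id, created_at) VALUES
    (3, '2016-03-21 19:54:09.831252'),
    (6, '2016-03-14 13:04:09.134358');

INSERT INTO user_status_changes (user_id, created_at, from_status, to_status) VALUES
    (3, '2016-03-24 04:00:00', 'active',  'pending'),
    (3, '2016-03-27 19:59:21', 'pending', 'banned'),
    (6, '2016-03-16 10:00:00', 'pending', 'active'),
    (6, '2016-03-21 15:00:00', 'active',  'banned'),
    (6, '2016-03-25 19:52:46', 'banned',  'pending'),
    (6, '2016-03-25 20:53:22', 'pending', 'canceled');
```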
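What I want to get is a list with one row per day, from users.created_at to today, showing the user's status on that date and the user's status on the previous day.
Example result (assuming today is 2016-03-27):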
user_id | date | status_at | previous_status
-----------+-------------+-----------+-----------------
3 | 2016-03-21 | |
3 | 2016-03-22 | |
3 | 2016-03-23 | |
3 | 2016-03-24 | pending |
3 | 2016-03-25 | pending | pending
3 | 2016-03-26 | pending | pending
3 | 2016-03-27 | banned | pending
6 | 2016-03-14 | |
6 | 2016-03-15 | |
6 | 2016-03-16 | active |
6 | 2016-03-17 | active | active
6 | 2016-03-18 | active | active
6 | 2016-03-19 | active | active
6 | 2016-03-20 | active | active
6 | 2016-03-21 | banned | active
6 | 2016-03-22 | banned | banned
6 | 2016-03-23 | banned | banned
6 | 2016-03-24 | banned | banned
6 | 2016-03-25 | canceled | banned
6 | 2016-03-26 | canceled | canceled
6 | 2016-03-27 | canceled | canceled
I have two solutions. One uses subqueries (very slow):
WITH possible_dates AS (
    -- one row per calendar day, from the earliest user creation date to "today"
    SELECT date(generate_series) AS "date"
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
SELECT
    users.id AS user_id,
    possible_dates.date,
    (
        -- latest status set on or before this date
        SELECT to_status
        FROM user_status_changes
        WHERE user_status_changes.user_id = users.id
          AND date(user_status_changes.created_at) <= possible_dates.date
        ORDER BY user_status_changes.created_at DESC
        LIMIT 1
    ) AS status_at,
    LAG((
        -- same lookup, shifted back one day by LAG
        SELECT to_status
        FROM user_status_changes
        WHERE user_status_changes.user_id = users.id
          AND date(user_status_changes.created_at) <= possible_dates.date
        ORDER BY user_status_changes.created_at DESC
        LIMIT 1
    )) OVER (PARTITION BY users.id ORDER BY possible_dates.date ASC) AS previous_status
FROM users
CROSS JOIN possible_dates
WHERE date(users.created_at) <= possible_dates.date;
The other uses joins (seems faster):
WITH status_changes AS (
    -- last status change per user per day
    SELECT
        DISTINCT ON (user_id, date)
        user_id,
        created_at::date AS date,
        to_status,
        from_status
    FROM user_status_changes
    ORDER BY user_id, date, created_at DESC
),
possible_dates AS (
    SELECT date(generate_series) AS "date"
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
SELECT
    DISTINCT ON (users.id, possible_dates.date)
    users.id AS user_id,
    possible_dates.date AS date,
    s1.to_status AS status_at,
    s2.to_status AS previous_status
FROM users
CROSS JOIN possible_dates
-- every status change on or before the date; DISTINCT ON keeps the latest
LEFT OUTER JOIN status_changes s1
    ON s1.date <= possible_dates.date
    AND s1.user_id = users.id
-- every status change strictly before the date; DISTINCT ON keeps the latest
LEFT JOIN LATERAL (
    SELECT
        status_changes.to_status,
        status_changes.date
    FROM status_changes
    WHERE
        status_changes.date < possible_dates.date AND
        status_changes.user_id = users.id
) s2 ON true
WHERE date(users.created_at) <= possible_dates.date
ORDER BY users.id, possible_dates.date DESC, s1.date DESC, s2.date DESC;
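Currently we have about 20k users, with about 10 payments and 2 status changes per user per month. The first user was created 1 year ago.
I think the problem with the join approach is that we join all previous status changes and then remove the redundant rows with DISTINCT ON.
We would greatly appreciate any better solution; index suggestions are also welcome.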
Answer 0 (score: 1)
Never, ever use "date(field) >=" or other functions on potentially indexed columns. This kills any chance of using a plain (non-functional) index.
select user_id, s_date, status_at,
       lag(status_at) over (partition by user_id order by part, s_date) previous_status
from
(
    -- within each group of days, carry the group's first status forward
    select user_id, s_date, part,
           first_value(to_status)
               over (partition by user_id, part order by s_date) status_at
    from
    (
        select U.id as user_id, s_date,
               -- last status set on this day, if any
               first_value(to_status) over (partition by U.id, s_date order by S.created_at desc) to_status,
               -- running count of status changes; constant on days without a change,
               -- so it labels each run of days that share one status
               count(to_status) over (partition by U.id order by s_date) as part,
               row_number() over (partition by U.id, s_date order by S.created_at desc) rn
        from users U
        left join
            generate_series(date(U.created_at), '2016-03-27'::date, '1 day') s_date ON true
        left join user_status_changes S
            on S.user_id = U.id
            -- half-open day range, so no fraction of a second is missed
            and S.created_at >= s_date
            and S.created_at < s_date + '1 day'::interval
    ) D where rn = 1
) C;
You will probably also need: create index user_status_dt on user_status_changes(user_id, created_at)
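To illustrate the point about functions on indexed columns, here is a minimal sketch comparing the two forms of the date filter. It assumes created_at is a timestamp column and that an index on (user_id, created_at) exists; the literal user id and date are only for illustration.

```sql
-- Wrapping the column in date() means the created_at part of a plain
-- (user_id, created_at) index cannot be used as a range condition.
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
  AND date(created_at) <= '2016-03-25'::date
ORDER BY created_at DESC
LIMIT 1;

-- Equivalent filter on the bare column: "on or before the 25th" becomes
-- "before midnight of the 26th" (date + integer yields a date in Postgres).
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
  AND created_at < '2016-03-25'::date + 1
ORDER BY created_at DESC
LIMIT 1;
```

Running both through EXPLAIN on real data should show whether the planner actually picks the index for the second form.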
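Answer 1 (score: 1)
My query does not use LATERAL, which has to be evaluated for every row as in yours or @Mike's, so it should be faster.
First, generate the series of dates, as you are already doing. CTE: generate_dates.
Then limit the output to the dates from each user's creation date onward, and get the statuses that were set on those dates. CTE: basic_status.
In the inner select, use LEFT JOIN and COALESCE() to fill the null values between status changes with the status that is in effect at that point, and use DISTINCT ON to keep only the most recent of all statuses set before that date.
The outer select is only there to compute the previous status using the LAG() window function.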
WITH generate_dates AS (
    SELECT date(generate_series) AS date
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
, basic_status AS (
    -- one row per user per day (from the creation day on), with the status set that day, if any
    SELECT
        u.id AS user_id,
        g.date,
        s.to_status AS status_at,
        row_number() OVER (PARTITION BY u.id ORDER BY g.date) AS rownum
    FROM users u
    JOIN generate_dates g ON
        g.date > u.created_at - interval '1 day'
    LEFT JOIN user_status_changes s ON
        u.id = s.user_id
        -- half-open day range, so a change at exactly midnight is not matched to two days
        AND s.created_at >= g.date
        AND s.created_at < g.date + interval '1 day'
)
SELECT
    *,
    LAG(status_at) OVER (PARTITION BY user_id ORDER BY date) AS previous_status
FROM (
    -- fill days without a change with the most recent earlier status
    SELECT
        DISTINCT ON ( b1.user_id, b1.date )
        b1.user_id,
        b1.date,
        COALESCE(b1.status_at, b2.status_at) AS status_at
    FROM basic_status b1
    LEFT JOIN basic_status b2 ON
        b1.user_id = b2.user_id
        AND b1.status_at IS NULL
        AND b2.status_at IS NOT NULL
        AND b1.rownum > b2.rownum
    ORDER BY b1.user_id, b1.date DESC, b2.rownum DESC
) foo;
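You can create the following indexes to speed things up:
users(id)
user_status_changes(user_id, created_at)
users(created_at) - this one is probably less important
Be sure to run ANALYZE table to update the statistics, so that the cost estimates are more accurate.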
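Written out as statements, these suggestions might look like the sketch below. The index names are hypothetical, and the first index is only needed if users.id is not already a primary key (a primary key already has its own index).

```sql
-- Index names are made up for illustration.
CREATE INDEX users_id_idx ON users (id);   -- redundant if id is the primary key
CREATE INDEX user_status_changes_user_id_created_at_idx
    ON user_status_changes (user_id, created_at);
CREATE INDEX users_created_at_idx ON users (created_at);   -- probably the least important one

-- Refresh planner statistics for both tables.
ANALYZE users;
ANALYZE user_status_changes;
```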