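My table contains answers to repeatable questionnaires. Each questionnaire can be filled in within a 30-day window and is scheduled every 60 days. The answers from a single questionnaire instance are therefore spread over a date range that is always shorter than 30 days, and the first answer to the next questionnaire instance comes at least 31 days after the last answer to the previous one. How can I create a view that computes the score (basically the sum of the answers to a single questionnaire) over the answers dated within 30 days of the start date (the minimum date)?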
Table raw_data
------------------------------------------------
user_name | question_id | answer | answer_date |
------------------------------------------------
user001 | 1 | 2 | 2019-02-04 |
user001 | 2 | 1 | 2019-02-04 |
user001 | 3 | 2 | 2019-02-05 |
user001 | 4 | 2 | 2019-02-05 |
user001 | 5 | 2 | 2019-02-09 |
user002 | 1 | 2 | 2019-01-09 |
user002 | 2 | 2 | 2019-01-10 |
user002 | 3 | 1 | 2019-02-01 |
user002 | 4 | 2 | 2019-02-01 |
user002 | 5 | 1 | 2019-02-01 |
user002 | 1 | 2 | 2019-03-11 |
user002 | 2 | 2 | 2019-03-11 |
user002 | 3 | 1 | 2019-03-12 |
user002 | 4 | 1 | 2019-03-13 |
user002 | 5 | 1 | 2019-03-14 |
Expected result
------------------------------
user_name | sum | start_date |
------------------------------
user001 | 9 | 2019-02-04 |
user002 | 8 | 2019-01-09 |
user002 | 7 | 2019-03-11 |
The solution I tried only works for the first group:
SELECT user_name, SUM(answer::int),
CASE
WHEN answer_date - MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) < 30
THEN MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC)
ELSE answer_date END AS start_date,
FROM public.raw_data
GROUP BY user_name, answer_date
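Answer 0 (score: 0)
Use lag() to find the gaps. Then use a running count of those gaps to assign a "questionnaire period" to each row, and aggregate by that period: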
select user_name, min(answer_date) as start_date, sum(answer) as sum
from (select rd.*,
             -- running count of period starts; the order by makes the count cumulative
             count(*) filter (where prev_ad is null or prev_ad < answer_date - interval '30 day') over (partition by user_name order by answer_date) as period
      from (select rd.*,
                   lag(answer_date) over (partition by user_name order by answer_date) as prev_ad
            from raw_data rd
           ) rd
     ) rd
group by user_name, period;
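To see how the period numbers come about, the intermediate result can be inspected for a single user. This is only a sketch: it uses the column names from the question's raw_data table and assumes answer_date is of type date.

```sql
-- Sketch: list each distinct answer date of user002 with its computed period.
-- Assumes raw_data.answer_date is of type date.
select distinct user_name, answer_date, period
from (select rd.*,
             -- a row starts a new period when it has no predecessor
             -- or when the predecessor is more than 30 days older
             count(*) filter (where prev_ad is null
                                 or prev_ad < answer_date - interval '30 day')
               over (partition by user_name order by answer_date) as period
      from (select rd.*,
                   lag(answer_date) over (partition by user_name order by answer_date) as prev_ad
            from raw_data rd
           ) rd
     ) rd
where user_name = 'user002'
order by answer_date;
```

With the sample data, the dates from 2019-01-09 through 2019-02-01 should fall into period 1 and the dates from 2019-03-11 onward into period 2, because 2019-02-01 is more than 30 days before 2019-03-11.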
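Answer 1 (score: 0)
Thanks to @Gordon and this answer, I finally found the step I was missing: determining the groups by date range.
I will create a view from the following query and then sum the answers grouped by grp2: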
WITH query AS (
SELECT r.*,
SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS grp
FROM (SELECT r.*,
LAG(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS prev_date
FROM raw_data r
) r
)
SELECT user_name, question_id, answer_date, answer, DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query
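The view and the grouping step described above could look like the following sketch. The view name questionnaire_scores is hypothetical, the query groups directly by user_name and grp instead of computing grp2 first, and it assumes answer_date is of type date (so that prev_date + 30 adds 30 days). The cast answer::int is taken from the question.

```sql
-- Sketch only: questionnaire_scores is a hypothetical view name.
-- Assumes raw_data.answer_date is of type date.
CREATE OR REPLACE VIEW questionnaire_scores AS
SELECT user_name,
       SUM(answer::int) AS sum,         -- cast as in the question
       MIN(answer_date) AS start_date   -- first answer of the questionnaire instance
FROM (
    SELECT r.*,
           -- 1 marks the first row of a new questionnaire instance;
           -- the running sum turns those marks into a group number per user
           SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END)
               OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
    FROM (
        SELECT r.*,
               LAG(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) AS prev_date
        FROM raw_data r
    ) r
) q
GROUP BY user_name, grp;
```

Querying it with SELECT * FROM questionnaire_scores ORDER BY user_name, start_date; should then return one row per user and questionnaire instance.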
Answer 2 (score: 0)
This is a classic gaps-and-islands problem. You will find many related questions under the tag I added.
An optimized query for your case:
SELECT user_name
, sum(answer)
, min(answer_date) AS start_date
FROM (
SELECT user_name, answer, answer_date
, count(*) FILTER (WHERE step) OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
FROM (
SELECT user_name, answer, answer_date
, lag(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) < answer_date - 30 AS step
FROM raw_data
) sub1
) sub2
GROUP BY user_name, grp
ORDER BY user_name, start_date; -- ORDER BY optional
db<>fiddle here
Closely related, with more explanation:
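Answer 3 (score: 0)
You can use the following query, which relies on the row_number() window function. Note that it only works if every questionnaire instance contains each question_id exactly once, as in the sample data: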
with raw_data( user_name, question_id, answer, answer_date ) as
(
select 'user001',1,2, '2019-02-04' union all
select 'user001',2,1, '2019-02-04' union all
select 'user001',3,2, '2019-02-05' union all
select 'user001',4,2, '2019-02-05' union all
select 'user001',5,2, '2019-02-09' union all
select 'user002',1,2, '2019-01-09' union all
select 'user002',2,2, '2019-01-10' union all
select 'user002',3,1, '2019-02-01' union all
select 'user002',4,2, '2019-02-01' union all
select 'user002',5,1, '2019-02-01' union all
select 'user002',1,2, '2019-03-11' union all
select 'user002',2,2, '2019-03-11' union all
select 'user002',3,1, '2019-03-12' union all
select 'user002',4,1, '2019-03-13' union all
select 'user002',5,1, '2019-03-14'
)
select user_name, sum(answer) as sum, min(answer_date) as start_date
from
(
select row_number() over (partition by question_id order by user_name, answer_date) as rn,
t.*
from raw_data t
) t
group by user_name, rn
order by rn;
user_name sum start_date
--------- --- ----------
user001 9 2019-02-04
user002 8 2019-01-09
user002 7 2019-03-11