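I have a list of dates and IDs, and I would like to roll them up into periods of consecutive dates within each ID.
For a table named "data" with the columns "testid" and "pulldate":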
| A79 | 2010-06-02 |
| A79 | 2010-06-03 |
| A79 | 2010-06-04 |
| B72 | 2010-04-22 |
| B72 | 2010-06-03 |
| B72 | 2010-06-04 |
| C94 | 2010-04-09 |
| C94 | 2010-04-10 |
| C94 | 2010-04-11 |
| C94 | 2010-04-12 |
| C94 | 2010-04-13 |
| C94 | 2010-04-14 |
| C94 | 2010-06-02 |
| C94 | 2010-06-03 |
| C94 | 2010-06-04 |
I would like to generate a table with the columns "testid", "group", "start_date" and "end_date":
| A79 | 1 | 2010-06-02 | 2010-06-04 |
| B72 | 2 | 2010-04-22 | 2010-04-22 |
| B72 | 3 | 2010-06-03 | 2010-06-04 |
| C94 | 4 | 2010-04-09 | 2010-04-14 |
| C94 | 5 | 2010-06-02 | 2010-06-04 |
This is the code I came up with:
SELECT t2.testid,
       t2.group,
       MIN(t2.pulldate) AS start_date,
       MAX(t2.pulldate) AS end_date
FROM (SELECT t1.pulldate,
             t1.testid,
             SUM(t1.check) OVER (ORDER BY t1.testid, t1.pulldate) AS group
      FROM (SELECT data.pulldate,
                   data.testid,
                   CASE
                     WHEN data.testid = LAG(data.testid, 1)
                                          OVER (ORDER BY data.testid, data.pulldate)
                      AND data.pulldate = date (LAG(data.pulldate, 1)
                                                  OVER (PARTITION BY data.testid
                                                        ORDER BY data.pulldate)) + integer '1'
                     THEN 0
                     ELSE 1
                   END AS check
            FROM data
            ORDER BY data.testid, data.pulldate) AS t1) AS t2
GROUP BY t2.testid, t2.group
ORDER BY t2.group;
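I use the LAG window function to compare each row with the previous one and set a flag to 1 whenever a new group has to start. I then take a running sum of that flag column, and finally aggregate on the combination of "group" and "testid".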
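To make those three steps easier to follow, here is a minimal sketch that stops before the final aggregation and shows the flag and the running sum side by side. It assumes "pulldate" is of type date; the names is_new_group, group_num and flagged are illustrative and do not appear in the query above. Because LAG is partitioned by testid here, it returns NULL on the first row of each testid, so the separate testid comparison is not needed.

```sql
-- Sketch: expose the flag and its running sum before aggregating.
-- Assumes pulldate is a date column; is_new_group, group_num and
-- flagged are illustrative names.
SELECT testid,
       pulldate,
       is_new_group,
       SUM(is_new_group) OVER (ORDER BY testid, pulldate) AS group_num
FROM (SELECT testid,
             pulldate,
             CASE
               -- LAG is NULL on the first row of each testid, so the
               -- comparison is not true there and the row starts a group.
               WHEN pulldate = (LAG(pulldate, 1)
                                  OVER (PARTITION BY testid
                                        ORDER BY pulldate)) + 1
               THEN 0
               ELSE 1
             END AS is_new_group
      FROM data) AS flagged
ORDER BY testid, pulldate;
```

On the sample data, is_new_group should be 1 on the first row of each run of consecutive dates and 0 elsewhere, so group_num counts from 1 to 5 and matches the "group" column in the desired output.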
Is there a better way to accomplish my goal, or does this operation have a name?
I am using PostgreSQL 8.4.
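Answer 0 (score: 1)
I don't know of any established name for this technique. I tried writing it myself and came up with something essentially the same as yours, just with one fewer WindowAgg.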
select testid,
       group_num as group,
       min(pulldate) as start_date,
       max(pulldate) as end_date
from (select testid,
             pulldate,
             sum(case when projected_pulldate is null
                        or pulldate <> projected_pulldate
                      then 1 else 0 end)
               over (order by testid, pulldate) as group_num
      from (select testid,
                   pulldate,
                   (lag(pulldate, 1) over (partition by testid order by pulldate)
                   ) + 1 as projected_pulldate
            from data) x
     ) grouped
group by testid, group_num
order by 1, 2
It's not very pretty, and I wonder whether this is simply a case where plpgsql or something similar would be more appropriate.
create or replace function data_extents()
  returns table(testid char(3), "group" int, start_date date, end_date date)
  language plpgsql
  stable as $$
declare
  rec data%rowtype;
begin
  "group" := 1;
  -- Column names are qualified with the alias d so they cannot be
  -- confused with the output parameters testid, start_date and end_date.
  for rec in select d.* from data d order by d.testid, d.pulldate loop
    if testid is null then
      -- first row
      testid := rec.testid;
      start_date := rec.pulldate;
      end_date := rec.pulldate;
    elsif rec.testid <> testid or rec.pulldate <> (end_date + 1) then
      -- discontinuity: emit the finished run and start a new one
      return next;
      testid := rec.testid;
      start_date := rec.pulldate;
      end_date := rec.pulldate;
      "group" := "group" + 1;
    else
      -- consecutive date: extend the current run by one day
      end_date := end_date + 1;
    end if;
  end loop;
  if testid is not null then
    -- emit the last run
    return next;
  end if;
end;
$$;
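That isn't great either... although it does, in principle, produce the output from a single scan without several separate aggregation steps, so it at least feels better. On a tiny data set it takes about the same time; on a larger data set? I haven't tried.
Since neither of our solutions allows a predicate such as "testid = XXX" to be pushed down into the scan of data (afaict), perhaps a function is the only way to filter efficiently?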
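One way to sidestep the question of what the planner can push down is to write the filter into the innermost subquery by hand, so that only the rows for one testid ever reach the window functions. The following is a minimal sketch under that assumption; 'C94' is simply a value from the sample data, and pulldate is assumed to be of type date.

```sql
-- Sketch: restrict to a single testid before any window function runs.
-- 'C94' is just a value taken from the sample data.
SELECT testid,
       group_num,
       MIN(pulldate) AS start_date,
       MAX(pulldate) AS end_date
FROM (SELECT testid,
             pulldate,
             SUM(CASE WHEN projected_pulldate IS NULL
                        OR pulldate <> projected_pulldate
                      THEN 1 ELSE 0 END)
               OVER (ORDER BY pulldate) AS group_num
      FROM (SELECT testid,
                   pulldate,
                   (LAG(pulldate, 1) OVER (ORDER BY pulldate)) + 1
                     AS projected_pulldate
            FROM data
            WHERE testid = 'C94') AS x) AS grouped
GROUP BY testid, group_num
ORDER BY group_num;
```

Because the filter is applied first, the group numbers restart at 1 for the selected testid instead of continuing the global numbering used in the unfiltered queries.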
Answer 1 (score: 1)
Here is another approach:
WITH TEMP_TAB AS (
  SELECT testid, pulldate,
         (pulldate + (row_number || ' days')::interval)::date AS dummydate
  FROM ( SELECT *, row_number() OVER () FROM
         ( SELECT * FROM data ORDER BY testid, pulldate DESC
         ) AS tab1
       ) AS tab2
)
SELECT * FROM (
  SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
  FROM TEMP_TAB GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate
Caveat: this strategy breaks if there are duplicate (testid, pulldate) pairs. In that case, a DISTINCT over those fields should be applied first.
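As a minimal sketch of that de-duplication step: the query below produces one row per (testid, pulldate) pair, in the same order the approach above uses. distinct_data is a hypothetical name; the idea is that its result would take the place of "data" in the innermost SELECT.

```sql
-- Sketch: collapse duplicate (testid, pulldate) pairs before numbering rows.
-- distinct_data is a hypothetical name for the de-duplicated input.
WITH distinct_data AS (
    SELECT DISTINCT testid, pulldate
    FROM data
)
SELECT testid, pulldate
FROM distinct_data
ORDER BY testid, pulldate DESC;
```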
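Explanation: the intermediate table has a dummydate column, obtained by adding a number of days equal to the "row number" (in the ordered select) to pulldate. Its only significance is that rows with the same dummydate belong to the same run of consecutive dates. For example, here are the intermediate results: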
test=# SELECT *, row_number() OVER () FROM
test-# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1;
 testid |  pulldate  | row_number
--------+------------+------------
 A79    | 2010-06-04 |          1
 A79    | 2010-06-03 |          2
 A79    | 2010-06-02 |          3
 B72    | 2010-06-04 |          4
 B72    | 2010-06-03 |          5
 B72    | 2010-04-22 |          6
 C94    | 2010-06-04 |          7
 C94    | 2010-06-03 |          8
 C94    | 2010-06-02 |          9
 C94    | 2010-04-14 |         10
 C94    | 2010-04-13 |         11
 C94    | 2010-04-12 |         12
 C94    | 2010-04-11 |         13
 C94    | 2010-04-10 |         14
 C94    | 2010-04-09 |         15
test=# SELECT
test-# testid,pulldate,(pulldate + (row_number || 'days')::interval)::date AS dummydate
test-# FROM ( SELECT *, row_number() OVER () FROM
test(# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )
test-# AS tab2;
 testid |  pulldate  | dummydate
--------+------------+------------
 A79    | 2010-06-04 | 2010-06-05
 A79    | 2010-06-03 | 2010-06-05
 A79    | 2010-06-02 | 2010-06-05
 B72    | 2010-06-04 | 2010-06-08
 B72    | 2010-06-03 | 2010-06-08
 B72    | 2010-04-22 | 2010-04-28
 C94    | 2010-06-04 | 2010-06-11
 C94    | 2010-06-03 | 2010-06-11
 C94    | 2010-06-02 | 2010-06-11
 C94    | 2010-04-14 | 2010-04-24
 C94    | 2010-04-13 | 2010-04-24
 C94    | 2010-04-12 | 2010-04-24
 C94    | 2010-04-11 | 2010-04-24
 C94    | 2010-04-10 | 2010-04-24
 C94    | 2010-04-09 | 2010-04-24
Edit: the WITH is not needed here (though I like it); this is equivalent:
SELECT * FROM (
  SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
  FROM (
    SELECT testid, pulldate,
           (pulldate + (row_number || ' days')::interval)::date AS dummydate
    FROM ( SELECT *, row_number() OVER () FROM
           ( SELECT * FROM data ORDER BY testid, pulldate DESC
           ) AS tab1
         ) AS tab2
  ) AS temp_tab
  GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate
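The queries above depend on row_number() OVER () numbering rows in the order produced by the inner ORDER BY, which works in practice but is, as far as I know, not something SQL guarantees. A variant that states the ordering inside the window itself is sketched below. It subtracts an ascending per-testid row number from pulldate instead of adding a descending global one; this difference-of-row-number form is the one usually shown under the name "gaps and islands". The names grp, d and numbered are illustrative, and pulldate is assumed to be of type date.

```sql
-- Sketch: same idea, with the row ordering stated inside the window.
-- grp, d and numbered are illustrative names; pulldate is assumed to be a date.
SELECT testid,
       MIN(pulldate) AS mindate,
       MAX(pulldate) AS maxdate
FROM (SELECT testid,
             pulldate,
             -- Within a run of consecutive dates, the date and the row
             -- number both advance by 1, so their difference is constant.
             -- row_number() returns bigint; date arithmetic needs integer.
             pulldate - CAST(ROW_NUMBER() OVER (PARTITION BY testid
                                                ORDER BY pulldate) AS integer) AS grp
      -- DISTINCT guards against duplicate (testid, pulldate) pairs.
      FROM (SELECT DISTINCT testid, pulldate FROM data) AS d) AS numbered
GROUP BY testid, grp
ORDER BY testid, mindate;
```

On the sample data this should return the same five (testid, mindate, maxdate) rows as the queries above.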