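Aggregating a list of dates into start and end dates

Time: 2010-06-08 05:23:41

Tags: postgresql

I have a list of dates and IDs, and within each ID I would like to roll them up into periods of consecutive dates.

For a table named "data" with the columns "testid" and "pulldate":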

| A79 | 2010-06-02 |
| A79 | 2010-06-03 |
| A79 | 2010-06-04 |
| B72 | 2010-04-22 |
| B72 | 2010-06-03 |
| B72 | 2010-06-04 |
| C94 | 2010-04-09 |
| C94 | 2010-04-10 |
| C94 | 2010-04-11 |
| C94 | 2010-04-12 |
| C94 | 2010-04-13 |
| C94 | 2010-04-14 |
| C94 | 2010-06-02 |
| C94 | 2010-06-03 |
| C94 | 2010-06-04 |
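
To try the queries in this thread, the sample rows above can be loaded with a script along the following lines. This is only a sketch: the thread never shows the table definition, so the column types (char(3) and date) and the NOT NULL constraints are assumptions.

```sql
-- Assumed schema; the original post does not include the DDL.
CREATE TABLE data (
    testid   char(3) NOT NULL,
    pulldate date    NOT NULL
);

-- The fifteen sample rows listed above.
INSERT INTO data (testid, pulldate) VALUES
    ('A79', '2010-06-02'), ('A79', '2010-06-03'), ('A79', '2010-06-04'),
    ('B72', '2010-04-22'), ('B72', '2010-06-03'), ('B72', '2010-06-04'),
    ('C94', '2010-04-09'), ('C94', '2010-04-10'), ('C94', '2010-04-11'),
    ('C94', '2010-04-12'), ('C94', '2010-04-13'), ('C94', '2010-04-14'),
    ('C94', '2010-06-02'), ('C94', '2010-06-03'), ('C94', '2010-06-04');
```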

I would like to generate a table with the columns "testid", "group", "start_date" and "end_date":

| A79 | 1 | 2010-06-02 | 2010-06-04 |
| B72 | 2 | 2010-04-22 | 2010-04-22 |
| B72 | 3 | 2010-06-03 | 2010-06-04 |
| C94 | 4 | 2010-04-09 | 2010-04-14 |
| C94 | 5 | 2010-06-02 | 2010-06-04 |

This is the code I came up with:

SELECT t2.testid,
  t2.group,
  MIN(t2.pulldate) AS start_date,
  MAX(t2.pulldate) AS end_date
FROM(SELECT t1.pulldate,
  t1.testid,
  SUM(t1.check) OVER (ORDER BY t1.testid,t1.pulldate) AS group
FROM(SELECT data.pulldate,
  data.testid,
  CASE
  WHEN data.testid=LAG(data.testid,1) 
    OVER (ORDER BY data.testid,data.pulldate)
  AND data.pulldate=date (LAG(data.pulldate,1) 
    OVER (PARTITION BY data.testid 
    ORDER BY data.pulldate)) + integer '1'
  THEN 0
  ELSE 1
  END AS check
FROM data 
ORDER BY data.testid, data.pulldate) AS t1) AS t2
GROUP BY t2.testid,t2.group
ORDER BY t2.group;

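I use the LAG window function to compare each row with the previous one and set a flag to 1 whenever a new group has to start. I then take a running sum of that flag column and aggregate by the combination of "group" and "testid".

Is there a better way to accomplish my goal, or does this operation have a name?

I am using PostgreSQL 8.4.

2 answers:

Answer 0 (score: 1)

I don't know of any established name for this technique. I tried writing it myself and came up with something essentially the same as yours, just with one fewer WindowAgg.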

select testid, group_num as group,
       min(pulldate) as start_date,
       max(pulldate) as end_date
from (select testid,
             pulldate,
             sum(case when projected_pulldate is null or pulldate <> projected_pulldate
                      then 1 else 0 end) over (order by testid, pulldate) as group_num
      from (select testid, pulldate,
                   (lag(pulldate, 1) over (partition by testid order by pulldate)
                   ) + 1 as projected_pulldate
            from data) x
     ) grouped
group by testid, group_num
order by 1, 2

This isn't very pretty, and I wonder whether this is simply a case where plpgsql or something similar would be more appropriate.

create or replace function data_extents()
 returns table(testid char(3), "group" int, start_date date, end_date date)
 language plpgsql
 stable as $$
declare
  rec data%rowtype;
begin
  "group" := 1;
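  -- Table-qualify the columns: on PostgreSQL 9.0 and later an unqualified
  -- testid here is reported as ambiguous with the output parameter.
  for rec in select d.* from data d order by d.testid, d.pulldate loop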
    if testid is null then
      -- first row
      testid := rec.testid;
      start_date := rec.pulldate;
      end_date := rec.pulldate;
    elsif rec.testid <> testid or rec.pulldate <> (end_date + 1) then
      -- discontinuity
      return next;
      testid := rec.testid;
      start_date := rec.pulldate;
      end_date := rec.pulldate;
      "group" := "group" + 1;
    else
      end_date := end_date + 1;
    end if;
  end loop;
  if testid is not null then
    return next;
  end if;
end;
$$;

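This isn't great either, although in principle it produces the output from a single scan without performing several separate aggregations, which at least feels better. On a tiny dataset both take about the same time. On larger datasets? I haven't tried.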
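
For reference, a set-returning function like this is called in the FROM clause. The sketch below assumes data_extents() was created successfully and that "data" holds the sample rows from the question; the filter value 'C94' is just an illustration.

```sql
-- All extents, in the order the function emits them.
SELECT * FROM data_extents();

-- Only one testid. The function still reads every row of data;
-- the WHERE clause is applied to its output afterwards.
SELECT * FROM data_extents() WHERE testid = 'C94';
```

With the sample rows, the first call should return the same five rows as the desired output in the question.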

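Since neither of our solutions allows a predicate such as "testid = XXX" to be pushed down into the scan of data (afaict), might a function be the only way to filter efficiently?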
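
One possible workaround, sketched here rather than taken from the thread, is to put the filter in the innermost subquery and partition both window functions by testid. This assumes group numbers only need to be unique within a testid (they restart at 1 for each one), and 'C94' is a hypothetical filter value.

```sql
SELECT testid,
       group_num     AS "group",
       min(pulldate) AS start_date,
       max(pulldate) AS end_date
FROM (SELECT testid,
             pulldate,
             -- running count of "a new run starts here" flags, per testid
             sum(CASE WHEN projected_pulldate IS NULL
                        OR pulldate <> projected_pulldate
                      THEN 1 ELSE 0 END)
                 OVER (PARTITION BY testid ORDER BY pulldate) AS group_num
      FROM (SELECT testid,
                   pulldate,
                   -- previous date within the same testid, plus one day
                   lag(pulldate) OVER (PARTITION BY testid ORDER BY pulldate) + 1
                       AS projected_pulldate
            FROM data
            WHERE testid = 'C94'   -- filter applied before any window work
           ) x
     ) grouped
GROUP BY testid, group_num
ORDER BY testid, group_num;
```

Because every window is partitioned by testid, filtering the base rows first does not change which dates end up in the same run for the remaining testid.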

Answer 1 (score: 1)

Here is another approach:

WITH TEMP_TAB AS (
SELECT testid, pulldate,
       (pulldate + (row_number || ' days')::interval)::date AS dummydate
 FROM ( SELECT *, row_number() OVER () FROM
    ( SELECT * FROM data ORDER BY testid,pulldate DESC
    ) AS tab1 
 ) AS tab2 
)
SELECT * FROM (
  SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate 
    FROM TEMP_TAB GROUP BY testid,dummydate 
  )  AS tab3 
ORDER BY testid, mindate

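Caveat: this strategy breaks if (testid, pulldate) pairs are duplicated. In that case, apply DISTINCT to those fields first.

Explanation: the intermediate table has a dummydate column, obtained by adding a number of days equal to the row number (in the ordered select). Its only significance is that rows with the same dummydate belong to the same run of consecutive dates. For example, the intermediate results: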

test=#  SELECT *, row_number() OVER  () FROM
test-#   ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1;
 testid |  pulldate  | row_number
--------+------------+------------
 A79    | 2010-06-04 |          1
 A79    | 2010-06-03 |          2
 A79    | 2010-06-02 |          3
 B72    | 2010-06-04 |          4
 B72    | 2010-06-03 |          5
 B72    | 2010-04-22 |          6
 C94    | 2010-06-04 |          7
 C94    | 2010-06-03 |          8
 C94    | 2010-06-02 |          9
 C94    | 2010-04-14 |         10
 C94    | 2010-04-13 |         11
 C94    | 2010-04-12 |         12
 C94    | 2010-04-11 |         13
 C94    | 2010-04-10 |         14
 C94    | 2010-04-09 |         15



test=# SELECT
test-#  testid,pulldate,(pulldate + (row_number || 'days')::interval)::date AS dummydate
test-#  FROM ( SELECT *, row_number() OVER  () FROM
test(#   ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )
test-#  AS tab2;
 testid |  pulldate  | dummydate
--------+------------+------------
 A79    | 2010-06-04 | 2010-06-05
 A79    | 2010-06-03 | 2010-06-05
 A79    | 2010-06-02 | 2010-06-05
 B72    | 2010-06-04 | 2010-06-08
 B72    | 2010-06-03 | 2010-06-08
 B72    | 2010-04-22 | 2010-04-28
 C94    | 2010-06-04 | 2010-06-11
 C94    | 2010-06-03 | 2010-06-11
 C94    | 2010-06-02 | 2010-06-11
 C94    | 2010-04-14 | 2010-04-24
 C94    | 2010-04-13 | 2010-04-24
 C94    | 2010-04-12 | 2010-04-24
 C94    | 2010-04-11 | 2010-04-24
 C94    | 2010-04-10 | 2010-04-24
 C94    | 2010-04-09 | 2010-04-24

Edit: WITH isn't needed here (though I like it); this is equivalent:

SELECT * FROM (
  SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate 
  FROM (
    SELECT
      testid,pulldate,
      (pulldate + (row_number || ' days')::interval)::date AS dummydate
    FROM ( SELECT *, row_number() OVER  () FROM
      ( 
       SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )  
       AS tab2 
    ) as temp_tab
  GROUP BY testid,dummydate 
)  AS tab3
ORDER BY testid, mindate
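
Regarding the caveat about duplicated (testid, pulldate) pairs, a minimal sketch of a de-duplicating variant follows. It is not the query above: instead of adding a descending row number, it subtracts an ascending per-testid row number, which avoids relying on the row order of a subquery. The names "deduped", "numbered" and "anchor" are made up for illustration. This family of problems is often referred to as "gaps and islands".

```sql
SELECT testid,
       min(pulldate) AS mindate,
       max(pulldate) AS maxdate
FROM (SELECT testid,
             pulldate,
             -- within a run of consecutive dates, date minus row number
             -- is constant, so it can serve as the grouping key;
             -- row_number() returns bigint, hence the cast to int
             pulldate - (row_number() OVER (PARTITION BY testid
                                            ORDER BY pulldate))::int AS anchor
      FROM (SELECT DISTINCT testid, pulldate FROM data) AS deduped
     ) AS numbered
GROUP BY testid, anchor
ORDER BY testid, mindate;
```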