Aggregate function that excludes overlapping periods

Time: 2017-05-23 12:18:33

Tags: sql postgresql

I have a table in which every row has a start date and an end date:

DROP TABLE IF EXISTS temp_period;

CREATE TABLE public.temp_period
(
  id integer NOT NULL,
  "startDate" date,
  "endDate" date
);

INSERT INTO temp_period(id,"startDate","endDate") VALUES(1,'2010-01-01','2010-03-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(2,'2013-05-17','2013-07-18');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(3,'2010-02-15','2010-05-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(7,'2014-01-01','2014-12-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(56,'2014-03-31','2014-06-30');

Now I want to know the total duration of all periods stored there. I only need the time as an interval. That is easy:

SELECT sum(age("endDate","startDate")) FROM temp_period;

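The problem, however, is that these periods overlap. I want to eliminate all overlaps, so that I get the total time covered by at least one record in the table.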
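
To make the double counting concrete, here is a minimal sketch that computes the naive total in whole days. It is meant to run on its own: demo_naive is a hypothetical scratch table (not part of my schema) holding the same five rows, with unquoted column names for brevity. It relies on the fact that subtracting two dates in PostgreSQL yields an integer number of days.

```sql
-- Hypothetical scratch copy of the sample data, so the sketch is self-contained.
CREATE TEMP TABLE demo_naive (id integer, start_date date, end_date date);

INSERT INTO demo_naive VALUES
    (1,  '2010-01-01', '2010-03-31'),
    (2,  '2013-05-17', '2013-07-18'),
    (3,  '2010-02-15', '2010-05-31'),
    (7,  '2014-01-01', '2014-12-31'),
    (56, '2014-03-31', '2014-06-30');

-- Naive total: every row counts in full, so days shared by two rows count twice.
SELECT sum(end_date - start_date) AS naive_days
FROM demo_naive;
```

On this data the naive sum is 711 days. Rows 1 and 3 share 44 days and rows 7 and 56 share 91 days, so the figure I am actually after is 576 days.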

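Note that there are fairly large gaps between the periods, so simply passing the earliest start date and the latest end date to the age function will not work. I did consider doing that and then subtracting the total length of the gaps, but I could not find an elegant way to do it.

I am using PostgreSQL 9.6.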

5 answers:

Answer 0 (score: 1)

How about this:

WITH
   /* get all time points where something changes */
   points AS (
       SELECT "startDate" AS p
       FROM temp_period
       UNION SELECT "endDate"
       FROM temp_period
   ),
   /*
    * Get all date ranges between these time points.
    * The first time range will start with NULL,
    * but that will be excluded in the next CTE anyway.
    */
   inter AS (
      SELECT daterange(
                lag(p) OVER (ORDER BY p),
                p
             ) i
      FROM points
   ),
   /*
    * Get all date ranges that are contained
    * in at least one of the intervals.
    */
   overlap AS (
      SELECT DISTINCT i
      FROM inter
         CROSS JOIN temp_period
      WHERE i <@ daterange("startDate", "endDate")
   )
/*
 * Sum the lengths of the date ranges in whole days.
 * (age() would report months and days instead.)
 */
SELECT sum(upper(i) - lower(i)) * interval '1 day' AS "interval"
FROM overlap;

For your data, this returns:

┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1 row)
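
To see what the intermediate steps produce, the following sketch lists each elementary range between two change points, its length in days, and whether at least one period covers it. It is only an illustration under a few assumptions: demo_slices and demo_period are hypothetical names, the sample rows are inlined with unquoted column names, and the coverage test is written as an EXISTS subquery instead of the CROSS JOIN plus DISTINCT used above.

```sql
-- Hypothetical scratch table: one row per elementary range between change points.
CREATE TEMP TABLE demo_slices AS
WITH demo_period(start_date, end_date) AS (
    VALUES (DATE '2010-01-01', DATE '2010-03-31'),
           (DATE '2013-05-17', DATE '2013-07-18'),
           (DATE '2010-02-15', DATE '2010-05-31'),
           (DATE '2014-01-01', DATE '2014-12-31'),
           (DATE '2014-03-31', DATE '2014-06-30')
),
points AS (
    -- every date at which coverage can change
    SELECT start_date AS p FROM demo_period
    UNION
    SELECT end_date FROM demo_period
),
inter AS (
    -- consecutive change points form half-open ranges [previous, current)
    SELECT daterange(lag(p) OVER (ORDER BY p), p) AS i
    FROM points
)
SELECT i,
       upper(i) - lower(i) AS days,
       EXISTS (SELECT 1
               FROM demo_period d
               WHERE inter.i <@ daterange(d.start_date, d.end_date)) AS covered
FROM inter
WHERE NOT lower_inf(i);   -- drop the open-ended range before the first point

SELECT * FROM demo_slices ORDER BY i;
```

With the sample data this yields nine ranges, seven of which are covered; the lengths of the covered ones add up to 576 days.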

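Answer 1 (score: 1)

You can try a recursive CTE to compute the periods. For each record, we check whether it overlaps the records before it. If it does, we count only the part that does not overlap.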

WITH RECURSIVE days_count AS 
  ( 
         SELECT startDate, 
                endDate, 
                AGE(endDate, startDate) AS total_days, 
                rowSeq 
         FROM   ordered_data 
         WHERE  rowSeq = 1 
         UNION ALL 
         SELECT     GREATEST(curr.startDate, prev.endDate)                                            AS startDate,
                    GREATEST(curr.endDate, prev.endDate)                                              AS endDate,
                    AGE(GREATEST(curr.endDate, prev.endDate), GREATEST(curr.startDate, prev.endDate)) AS total_days,
                    curr.rowSeq 
         FROM       ordered_data curr 
         INNER JOIN days_count prev 
         ON         curr.rowSeq > 1 
         AND        curr.rowSeq = prev.rowSeq + 1), 
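ordered_data AS 
  ( 
           /* alias the question's quoted columns to the unquoted names used above */
           SELECT   "startDate" AS startDate, 
                    "endDate"   AS endDate, 
                    ROW_NUMBER() OVER (ORDER BY "startDate") AS rowSeq 
           FROM     temp_period) 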
SELECT SUM(total_days) AS total_days
FROM   days_count;

I created a demo here.

Answer 2 (score: 1)

There is actually a case that the previous examples do not cover. What if we have a period like this one?

INSERT INTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');

We then have the following intervals:

 Interval No. | start_date |  end_date
--------------+------------+------------
            1 | 2010-01-01 | 2010-03-31
            2 | 2010-01-03 | 2010-02-10
            3 | 2010-02-15 | 2010-05-31
            4 | 2013-05-17 | 2013-07-18
            5 | 2014-01-01 | 2014-12-31
            6 | 2014-03-31 | 2014-06-30

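Even though interval 3 overlaps interval 1, it is treated as the start of a new period, because lag() only looks at the immediately preceding row (interval 2, which ends on 2010-02-10). Hence the (wrong) result: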

 sum
-----
 620
(1 row)

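The fix is to adjust the core of the query. The expression

CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END

needs to be replaced with

CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END

so that each start date is compared with the latest end date seen so far, not just with the previous row's end date. Then it works as expected: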

 sum
-----
 576
(1 row)
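
The difference between the two expressions can be traced row by row. The sketch below is only an illustration: demo_trace and demo_period are hypothetical names, and the six intervals from the table above are inlined. For every row it shows the end date that lag() sees next to the running maximum of all earlier end dates.

```sql
-- Hypothetical scratch table comparing lag() with a running maximum.
CREATE TEMP TABLE demo_trace AS
WITH demo_period(start_date, end_date) AS (
    VALUES (DATE '2010-01-01', DATE '2010-03-31'),
           (DATE '2010-01-03', DATE '2010-02-10'),
           (DATE '2010-02-15', DATE '2010-05-31'),
           (DATE '2013-05-17', DATE '2013-07-18'),
           (DATE '2014-01-01', DATE '2014-12-31'),
           (DATE '2014-03-31', DATE '2014-06-30')
)
SELECT start_date,
       end_date,
       -- end date of the immediately preceding row only
       lag(end_date) OVER (ORDER BY start_date, end_date) AS prev_end,
       -- latest end date among all preceding rows
       max(end_date) OVER (ORDER BY start_date, end_date
                           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS max_prev_end
FROM demo_period;

SELECT * FROM demo_trace ORDER BY start_date, end_date;
```

For the row starting on 2010-02-15, prev_end is 2010-02-10 while max_prev_end is 2010-03-31, so only the running maximum recognizes that the row still overlaps interval 1. For all other rows the two columns agree.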

To summarize:

SELECT sum(e - s)
  FROM (
    SELECT left_edge AS s, max(end_date) AS e
    FROM (
      SELECT start_date, end_date,
             max(new_start) OVER (ORDER BY start_date, end_date) AS left_edge
      FROM (
        SELECT start_date, end_date,
               CASE WHEN start_date < max(end_date) OVER (
                         ORDER BY start_date, end_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                    THEN NULL
                    ELSE start_date
               END AS new_start
        -- alias the question's quoted columns to the names used here
        FROM (SELECT "startDate" AS start_date, "endDate" AS end_date
              FROM temp_period) t
      ) s1
    ) s2
    GROUP BY left_edge
  ) s3;
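
If the merged periods themselves are of interest and not only their total, the outer sum can be left out. The following is a sketch along those lines, not a tuned implementation: demo_islands and demo_period are hypothetical names, the six sample intervals are inlined, and the nested subqueries are written as CTEs purely for readability.

```sql
-- Hypothetical scratch table: one row per merged (non-overlapping) period.
CREATE TEMP TABLE demo_islands AS
WITH demo_period(start_date, end_date) AS (
    VALUES (DATE '2010-01-01', DATE '2010-03-31'),
           (DATE '2010-01-03', DATE '2010-02-10'),
           (DATE '2010-02-15', DATE '2010-05-31'),
           (DATE '2013-05-17', DATE '2013-07-18'),
           (DATE '2014-01-01', DATE '2014-12-31'),
           (DATE '2014-03-31', DATE '2014-06-30')
),
flagged AS (
    -- new_start is set only on rows that open a new merged period
    SELECT start_date, end_date,
           CASE WHEN start_date < max(end_date) OVER (
                     ORDER BY start_date, end_date
                     ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN NULL
                ELSE start_date
           END AS new_start
    FROM demo_period
),
grouped AS (
    -- carry the most recent new_start forward as the group key
    SELECT end_date,
           max(new_start) OVER (ORDER BY start_date, end_date) AS left_edge
    FROM flagged
)
SELECT left_edge                 AS island_start,
       max(end_date)             AS island_end,
       max(end_date) - left_edge AS days
FROM grouped
GROUP BY left_edge;

SELECT * FROM demo_islands ORDER BY island_start;
```

For the sample data this gives three merged periods of 150, 62 and 364 days, which add up to the 576 days shown above.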

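Answer 3 (score: 0)

This one needs two outer joins in a fairly complex query. One join identifies all overlaps with a start date later than that of This and widens the time span to cover the larger of the two. The second join is needed to match the records that have no overlap. Then take the minimum of the minimums and the maximum of the maximums, including the unmatched rows. I am using MSSQL, so the syntax may differ slightly.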

DECLARE @temp_period TABLE
(
  id int NOT NULL,
  startDate datetime,
  endDate datetime
)

INSERT INTO @temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')


;WITH OverLaps AS
(
    SELECT 
        Main.id,
        OverlappedID=Overlaps.id,
        OverlapMinDate,
        OverlapMaxDate
    FROM
        @temp_period Main
        LEFT OUTER JOIN
        (
            SELECT 
                This.id,
                OverlapMinDate=CASE WHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
                OverlapMaxDate=CASE WHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
                PriorID=Prior.id
            FROM
                @temp_period This
                LEFT OUTER JOIN @temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
        ) Overlaps ON Main.Id=Overlaps.PriorId
)

SELECT
    T.Id,
    -- If this record has overlaps, take the combined span of the overlapping records; otherwise use its own start and end
    MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
    MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM
    @temp_period T
    LEFT OUTER JOIN OverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
    LEFT OUTER JOIN OverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
    IsAOverlap.OverlappedID IS NULL -- Exclude older records that have overlaps
GROUP BY
    T.Id

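Answer 4 (score: 0)

Note: Laurenz Albe's answer has a serious scalability problem.

I was delighted when I found it, and I adapted it to our needs. We deployed it to staging, and before long the server was taking several minutes to return results.

Then I found this answer on postgresql.org, which is far more efficient: https://wiki.postgresql.org/wiki/Range_aggregation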

SELECT sum(e - s)
FROM (
  SELECT left_edge as s, max(end_date) as e
  FROM (   
    SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
    FROM ( 
      SELECT start_date, end_date, CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END AS new_start
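      -- alias the question's quoted columns to the names this query uses
      FROM (SELECT "startDate" AS start_date, "endDate" AS end_date FROM temp_period) t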
      ) s1
    ) s2
  GROUP BY left_edge
  ) s3;

Result:

 sum
-----
 576
(1 row)