Question

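I have a table like this:

ID    BEGIN    END

If there are overlapping episodes for the same ID (e.g. 2000-01-01 - 2001-12-31 and 2000-06-01 - 2002-06-30), I would like the rows to be merged, using MIN(BEGIN) and MAX(END).

The same should be done if episodes are in direct succession (e.g. 2000-01-01 - 2000-06-30 and 2000-07-01 - 2000-12-31).

If there are "missing" days between episodes (e.g. 2000-01-01 - 2000-06-15 and 2000-07-01 - 2000-12-31), they should not be merged.

How can this be achieved?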
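The three rules can be condensed into a single comparison. The following is only a minimal sketch in Python rather than SQL; the helper name `should_merge` is hypothetical, both ends of an episode are assumed to be inclusive calendar days, and the first episode is assumed to begin no later than the second.

```python
from datetime import date, timedelta


def should_merge(first, second):
    """Return True if two (begin, end) episodes overlap or are in direct succession.

    Assumes inclusive end dates and that `first` begins no later than `second`.
    """
    first_begin, first_end = first
    second_begin, second_end = second
    # Merge when the later episode begins at most one day after the earlier one ends.
    return second_begin <= first_end + timedelta(days=1)


# Example dates of the same shape as the three cases in the question.
overlapping = should_merge((date(2000, 1, 1), date(2001, 12, 31)),
                           (date(2000, 6, 1), date(2002, 6, 30)))
successive = should_merge((date(2000, 1, 1), date(2000, 6, 30)),
                          (date(2000, 7, 1), date(2000, 12, 31)))
with_gap = should_merge((date(2000, 1, 1), date(2000, 6, 15)),
                        (date(2000, 7, 1), date(2000, 12, 31)))
```

Under these assumptions only the third pair, which has missing days between the episodes, is left unmerged.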

Currently my code looks like this:

SELECT "ID", MIN("BEGIN"), MAX("END")
FROM ...
GROUP BY "ID"

But of course, this does not fulfil the last condition (no merging if there are "missing" days).

Thank you in advance!

[Edit]

I am working on a solution in which I join the table with itself. It is an improvement, but it does not do the job yet. I think the other suggestions are better (but more complex). Still, I would like to share my unfinished work:

SELECT tab1."ID", LEAST(tab1."BEGIN", tab2."BEGIN"), GREATEST(tab1."END", tab2."END")
  FROM <mytable> AS tab1
  JOIN <mytable> AS tab2
    ON tab1."ID" = tab2."ID"
    AND  (tab1."BEGIN", tab1."END" + INTERVAL '2 day') OVERLAPS (tab2."BEGIN", tab2."END")
  ORDER BY tab1."ID"

[Edit 2]

Thank you for your help!

I spent a few hours trying to figure out how window functions and WITH queries work, until I realised that my database runs on PostgreSQL 8.3, which does not support them. Is there a way to do this without window functions and WITH queries?

Thanks again!

[Edit 3]

Sample data:

ID        BEGIN         END
1;"2000-01-01";"2000-03-31"
1;"2000-04-01";"2000-05-31"
1;"2000-04-15";"2000-07-31"
1;"2000-09-01";"2000-10-31"
2;"2000-02-01";"2000-03-15"
2;"2000-01-15";"2000-03-31"
2;"2000-04-01";"2000-04-15"
3;"2000-06-01";"2000-06-15"
3;"2000-07-01";"2000-07-15"

Sample output:

ID        BEGIN         END
1;"2000-01-01";"2000-07-31"
1;"2000-09-01";"2000-10-31"
2;"2000-01-15";"2000-04-15"
3;"2000-06-01";"2000-06-15"
3;"2000-07-01";"2000-07-15"

[Edit 4]

One possible solution is described in this article (many thanks to its author): http://blog.developpez.com/sqlpro/p9821/langage-sql-norme/agregation-d-intervalles-en-sql-1/

Answer 1

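Edit: It is great news that your DBA agreed to upgrade to a newer version of PostgreSQL. The window functions alone make the upgrade a worthwhile investment.

My original answer had, as you noted, a major flaw: a limitation of one row per id.
Below is a better solution without that limitation. I have tested it with test tables on my system (8.4).

If/when you get a moment, I would like to know how it performs on your data. I have also written up an explanation here: http://adam-bernier.appspot.com/post/91001/recursive-sql-example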

WITH RECURSIVE t1_rec ( id, "begin", "end", n ) AS (
    SELECT id, "begin", "end", n
      FROM (
        SELECT
            id, "begin", "end",
            CASE 
                WHEN LEAD("begin") OVER (
                PARTITION BY    id
                ORDER BY        "begin") <= ("end" + interval '2' day) 
                THEN 1 ELSE 0 END AS cl,
            ROW_NUMBER() OVER (
                PARTITION BY    id
                ORDER BY        "begin") AS n
        FROM mytable 
    ) s
    WHERE s.cl = 1
  UNION ALL
    SELECT p1.id, p1."begin", p1."end", a.n
      FROM t1_rec a 
           JOIN mytable p1 ON p1.id = a.id
       AND p1."begin" > a."begin"
       AND (a."begin",  a."end" + interval '2' day) OVERLAPS 
           (p1."begin", p1."end")
)
SELECT t1.id, min(t1."begin"), max(t1."end")
  FROM t1_rec t1
       LEFT JOIN t1_rec t2 ON t1.id = t2.id 
       AND t2."end" = t1."end"
       AND t2.n < t1.n
 WHERE t2.n IS NULL
 GROUP BY t1.id, t1.n
 ORDER BY t1.id, t1.n;
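One detail of the query above that is easy to miss is why the end date is extended by two days rather than one. PostgreSQL documents OVERLAPS as treating each period as the half-open interval start <= time < end, so a one-day extension would still miss an episode that begins on the very next day. The following Python sketch models that behaviour; the function name `overlaps_half_open` is hypothetical and the model is simplified (it ignores zero-length periods and swapped endpoints).

```python
from datetime import date, timedelta


def overlaps_half_open(start1, end1, start2, end2):
    """Simplified model of OVERLAPS for non-empty periods: start <= time < end."""
    return start1 < end2 and start2 < end1


a_begin, a_end = date(2000, 1, 1), date(2000, 3, 31)
b_begin, b_end = date(2000, 4, 1), date(2000, 5, 31)   # begins the day after a_end
c_begin, c_end = date(2000, 4, 2), date(2000, 5, 31)   # leaves one missing day

# Extending by one day is not enough to catch direct succession ...
plus_one = overlaps_half_open(a_begin, a_end + timedelta(days=1), b_begin, b_end)
# ... extending by two days is, and it still rejects a real gap.
plus_two = overlaps_half_open(a_begin, a_end + timedelta(days=2), b_begin, b_end)
plus_two_gap = overlaps_half_open(a_begin, a_end + timedelta(days=2), c_begin, c_end)
```

In this model only `plus_two` is true, which matches the intent of merging direct successors while keeping episodes with missing days apart.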

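The original (deprecated) answer follows.
Note: limitation of one row per id.

Denis is probably right about using lead() and lag(), but there is another way!
You can also solve this problem using so-called recursive SQL. The overlaps function comes in handy, too.

I have fully tested this solution on my system (8.4). It works well.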

WITH RECURSIVE rec_stmt ( id, "begin", "end" ) AS (
    /* seed statement: 
           start with only first start and end dates for each id 
           ("end" is a reserved word in PostgreSQL, so both column
           names are quoted, as in the query above)
    */
      SELECT id, MIN("begin"), MIN("end")
        FROM mytable seed_stmt
    GROUP BY id

    UNION ALL

    /* iterative (not really recursive) statement: 
           append qualifying rows to resultset 
    */
      SELECT t1.id, t1."begin", t1."end"
        FROM rec_stmt r
             JOIN mytable t1 ON t1.id = r.id
         AND t1."begin" > r."end"
         AND (r."begin", r."end" + INTERVAL '1' DAY) OVERLAPS 
             (t1."begin" - INTERVAL '1' DAY, t1."end")
)
  SELECT MIN("begin"), MAX("end") 
    FROM rec_stmt
GROUP BY id;

Answer 2

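I did not fully understand your question, but I am absolutely sure that you need to look into the lead()/lag() window functions.

For example, something like this would be a good starting point to place in a subquery or a common table expression, in order to identify whether rows overlap for each id: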

-- "end" is a reserved word in PostgreSQL, so it has to be quoted
select id,
       lag(start) over w as prev_start,
       lag("end") over w as prev_end,
       start,
       "end",
       lead(start) over w as next_start,
       lead("end") over w as next_end
from yourtable
window w as (
       partition by id
       order by start, "end"
       )
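To illustrate how the previous row's values can be turned into a "new episode starts here" flag, here is a rough Python sketch of the same idea outside the database. The function name `flag_group_starts` is hypothetical. One deliberate deviation, labelled as an assumption: instead of looking only at the immediately preceding end date, as a plain lag() would, it keeps the running maximum of all earlier end dates for the id, so that a long episode containing several short ones is still handled.

```python
from datetime import date, timedelta
from itertools import groupby

# Rows for ID 1 from the question's sample data: (id, begin, end)
rows = [
    (1, date(2000, 1, 1), date(2000, 3, 31)),
    (1, date(2000, 4, 1), date(2000, 5, 31)),
    (1, date(2000, 4, 15), date(2000, 7, 31)),
    (1, date(2000, 9, 1), date(2000, 10, 31)),
]


def flag_group_starts(rows):
    """Mark each row with True if it begins a new merged episode for its id."""
    flagged = []
    # Sorting tuples orders by id, then begin, then end (like the window definition).
    for _, group in groupby(sorted(rows), key=lambda row: row[0]):
        latest_end = None  # running maximum of "end" over the preceding rows
        for row_id, begin, end in group:
            is_start = latest_end is None or begin > latest_end + timedelta(days=1)
            flagged.append((row_id, begin, end, is_start))
            latest_end = end if latest_end is None else max(latest_end, end)
    return flagged


flags = [is_start for _, _, _, is_start in flag_group_starts(rows)]
```

Counting the flags cumulatively per id would give a group number, and MIN(begin) / MAX(end) per group would then produce the merged rows.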

Answer 3

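Regarding your second question: I am not sure about PostgreSQL, but SQL Server has DATEDIFF(interval, start_date, end_date), which gives you the number of the specified intervals between two dates. You could use MIN(Begin) as the start date and MAX(End) as the end date to get the difference. You could then use that in a CASE statement to produce your output, although you may need a subquery or the equivalent for your scenario.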

Answer 4

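Pure SQL

For a pure SQL solution, look at Adam's post and read this article (it is written in French, but you will find it is not that hard to read). The article was recommended to me after I consulted the postgresql mailing list (thank you for that!).

For my data this was not suitable, because all possible solutions need to self-join a table at least 3 times. This turned out to be a problem for (very) large amounts of data.

Semi SQL, semi imperative language

If you primarily care about speed and you are able to use an imperative language, you can get much quicker results (depending on the amount of data, of course). In my case, using R, the task ran (at least) 1,000 times faster.

Steps:

(1) Get a .csv file. Take care of the sorting!!!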

COPY (
  SELECT "ID", "BEGIN", "END"
  <sorry, for a reason I don't know StackOverflow won't let me finish my code here...>

(2) Do something like this (this code is R, but you could do something similar in any imperative language):

data <- read.csv2("</path/to.csv>")
data$BEGIN <- as.Date(data$BEGIN)
data$END <- as.Date(data$END)

smoothingEpisodes <- function (theData) {

    theLength <- nrow(theData)
    # nothing to merge with fewer than two rows
    if (theLength < 2L) return(theData)

    ID <- as.integer(theData[["ID"]])
    BEGIN <- as.numeric(theData[["BEGIN"]])
    END <- as.numeric(theData[["END"]])

    # the episode currently being built
    curId <- ID[[1L]]
    curBEGIN <- BEGIN[[1L]]
    curEND <- END[[1L]]

    # preallocated output vectors; only the first j entries are used
    out.1 <- integer(length = theLength)
    out.2 <- out.3 <- numeric(length = theLength)

    j <- 1L

    for(i in 2:nrow(theData)) {
        nextId <- ID[[i]]
        nextBEGIN <- BEGIN[[i]]
        nextEND <- END[[i]]

        # a new ID or at least one missing day closes the current episode
        if (curId != nextId | (curEND + 1) < nextBEGIN) {
            out.1[[j]] <- curId
            out.2[[j]] <- curBEGIN
            out.3[[j]] <- curEND

            j <- j + 1L

            curId <- nextId
            curBEGIN <- nextBEGIN
            curEND <- nextEND
        } else {
            # overlapping or directly successive: extend the current episode
            curEND <- max(curEND, nextEND, na.rm = TRUE)
        }
    }

    # write out the last open episode
    out.1[[j]] <- curId
    out.2[[j]] <- curBEGIN
    out.3[[j]] <- curEND

    theOutput <- data.frame(ID = out.1[1:j], BEGIN = as.Date(out.2[1:j], origin = "1970-01-01"), END = as.Date(out.3[1:j], origin = "1970-01-01"))

    theOutput
}

data1 <- smoothingEpisodes(data)

# TAGE: length of each merged episode in days, both ends included
data2 <- transform(data1, TAGE = (as.numeric(data1$END - data1$BEGIN) + 1))

write.csv2(data2, file = "</path/to/output.csv>")

You can find a detailed discussion of this R code here: "smoothing" time data - can it be done more efficient?
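For readers who prefer another imperative language, the same two steps can be sketched in Python. This is a loose port under stated assumptions, not the code used for the timings above: the function names `read_episodes` and `smooth_episodes` are hypothetical, the input is assumed to be semicolon-separated like the question's sample data, and the sorting is done in Python here instead of in the database.

```python
import csv
import io
from datetime import date, timedelta

# The question's sample data in semicolon-separated form (assumed export format).
SAMPLE_CSV = "\n".join([
    'ID;BEGIN;END',
    '1;"2000-01-01";"2000-03-31"',
    '1;"2000-04-01";"2000-05-31"',
    '1;"2000-04-15";"2000-07-31"',
    '1;"2000-09-01";"2000-10-31"',
    '2;"2000-02-01";"2000-03-15"',
    '2;"2000-01-15";"2000-03-31"',
    '2;"2000-04-01";"2000-04-15"',
    '3;"2000-06-01";"2000-06-15"',
    '3;"2000-07-01";"2000-07-15"',
])


def read_episodes(text):
    """Parse semicolon-separated rows into (id, begin, end) tuples."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [(int(row["ID"]),
             date.fromisoformat(row["BEGIN"]),
             date.fromisoformat(row["END"])) for row in reader]


def smooth_episodes(episodes):
    """Merge overlapping or directly successive episodes per id in one pass."""
    merged = []
    # The single pass only works on input ordered by id and begin date.
    for ep_id, begin, end in sorted(episodes):
        if merged:
            last_id, last_begin, last_end = merged[-1]
            if ep_id == last_id and begin <= last_end + timedelta(days=1):
                # Same id and no missing day: extend the open episode.
                merged[-1] = (last_id, last_begin, max(last_end, end))
                continue
        merged.append((ep_id, begin, end))
    return merged


result = smooth_episodes(read_episodes(SAMPLE_CSV))
```

On the sample data this yields the five rows listed as the sample output in the question. The design choice is the same as in the R version: one sorted pass with a single "open" episode, so no self-join is needed.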

Merging DATE rows if episodes are in direct succession or overlapping
