Find all intersections of all sets of ranges in PostgreSQL

Time: 2014-07-25 16:55:20

Tags: sql postgresql date-range

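I'm looking for an efficient way to find all the intersections between sets of timestamp ranges. It needs to work with PostgreSQL 9.2.

Let's say the ranges represent the times when a person is available to meet. Each person may have one or more ranges of time when they are available. I want to find all the time periods in which a meeting can take place (i.e. the periods during which everyone is available).

This is what I've got so far. It seems to work, but I don't think it's very efficient, since it considers one person's availability at a time.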

WITH RECURSIVE td AS
(
    -- Test data. Returns:
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00")
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00")
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00")
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 3, '2014-01-20', '2014-04-20'
)
, ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
, min_max AS
(
    SELECT MIN(entity_id), MAX(entity_id)
    FROM td
)
, inter AS
(
    -- Ranges for the lowest ID
    SELECT entity_id AS last_id, the_range
    FROM ranges r
    WHERE r.entity_id = (SELECT min FROM min_max)

    UNION ALL

    -- Iteratively intersect with ranges for the next higher ID
    SELECT entity_id, r.the_range * i.the_range
    FROM ranges r
    JOIN inter i ON r.the_range && i.the_range
    WHERE r.entity_id > i.last_id
        AND NOT EXISTS
        (
            SELECT *
            FROM ranges r2
            WHERE r2.entity_id < r.entity_id AND r2.entity_id > i.last_id
        )
)
-- Take the final set of intersections
SELECT *
FROM inter
WHERE last_id = (SELECT max FROM min_max)
ORDER BY the_range;

3 Answers:

Answer 0 (score: 7)

I created the tsrange_interception_agg aggregate:

create function tsrange_interception (
    internal_state tsrange, next_data_values tsrange
) returns tsrange as $$
    select internal_state * next_data_values;
$$ language sql;

create aggregate tsrange_interception_agg (tsrange) (
    sfunc = tsrange_interception,
    stype = tsrange,
    initcond = $$[-infinity, infinity]$$
);

and then this query:

with td (id, begin_time, end_time) as
(
    values
    (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp),
    (1, '2014-02-01', '2014-02-28'),
    (1, '2014-04-01', '2014-04-30'),
    (2, '2014-01-15', '2014-02-20'),
    (2, '2014-04-15', '2014-05-05'),
    (3, '2014-01-20', '2014-04-20')
), ranges as (
    select
        id,
        row_number() over(partition by id) as rn,
        tsrange(begin_time, end_time) as tr
    from td
), cr as (
    select r0.tr tr0, r1.tr as tr1
    from ranges r0 cross join ranges r1
    where
        r0.id < r1.id and
        r0.tr && r1.tr and
        r0.id = (select min(id) from td)
)
select tr0 * tsrange_interception_agg(tr1) as interseptions
from cr
group by tr0
having count(*) = (select count(distinct id) from td) - 1
;
                 interseptions                 
-----------------------------------------------
 ["2014-02-01 00:00:00","2014-02-20 00:00:00")
 ["2014-01-20 00:00:00","2014-01-31 00:00:00")
 ["2014-04-15 00:00:00","2014-04-20 00:00:00")

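Answer 1 (score: 1)

If you have a fixed number of entities to cross-reference, you can use a cross join for each of them and build the intersection (using the * operator on ranges).

Using cross joins like this is probably less efficient, though. The following example mainly serves to explain the more complex example further below.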

WITH td AS
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 4, '2014-01-20', '2014-04-20'
)
,ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
SELECT r1.the_range * r2.the_range * r3.the_range AS r
FROM ranges r1
CROSS JOIN ranges r2
CROSS JOIN ranges r3
WHERE r1.entity_id=1 AND r2.entity_id=2 AND r3.entity_id=4
  AND NOT isempty(r1.the_range * r2.the_range * r3.the_range)
ORDER BY r

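Multiple cross joins are probably less efficient in this case because you don't actually need every possible combination of ranges: isempty(r1.the_range * r2.the_range) is enough to make isempty(r1.the_range * r2.the_range * r3.the_range) true.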
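
That observation suggests discarding non-overlapping pairs early. A minimal sketch, assuming the same test data as above: each join condition uses the && (overlaps) operator, so a combination that is already empty is dropped before the next entity is joined. Whether this is actually faster has not been measured here.

```sql
WITH td (entity_id, begin_time, end_time) AS
(
    VALUES
    (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp),
    (1, '2014-02-01', '2014-02-28'),
    (1, '2014-04-01', '2014-04-30'),
    (2, '2014-01-15', '2014-02-20'),
    (2, '2014-04-15', '2014-05-05'),
    (4, '2014-01-20', '2014-04-20')
)
, ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
SELECT r1.the_range * r2.the_range * r3.the_range AS r
FROM ranges r1
-- Keep only the pairs (r1, r2) that overlap at all
JOIN ranges r2
    ON r2.entity_id = 2
   AND r1.the_range && r2.the_range
-- Then keep only the r3 ranges that overlap the surviving intersection
JOIN ranges r3
    ON r3.entity_id = 4
   AND (r1.the_range * r2.the_range) && r3.the_range
WHERE r1.entity_id = 1
ORDER BY r;
```

For this data the query should return the same three ranges as the cross join version, without evaluating the three-way intersection for pairs that are already disjoint.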

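I don't think you can avoid going through each person's availability one at a time, since you want all of them to be able to meet anyway.

What may help is to build the set of intersections incrementally: cross join each person's availability against the previous intermediate result, which is computed by another recursive CTE (intersections in the example below). At each step you compute the intersections and discard the empty ranges, with both the per-entity ranges and the intermediate results stored as arrays: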

WITH RECURSIVE td AS
(
    SELECT 1 AS entity_id, '2014-01-01'::timestamp AS begin_time, '2014-01-31'::timestamp AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 4, '2014-01-20', '2014-04-20'
)
,ranges AS
(
    -- Convert to tsrange type
    SELECT entity_id, tsrange(begin_time, end_time) AS the_range
    FROM td
)
,ranges_arrays AS (
    -- Prepare an array of all possible intervals per entity
    SELECT entity_id, array_agg(the_range) AS ranges_arr
    FROM ranges
       GROUP BY entity_id
)
,numbered_ranges_arrays AS (
    -- We'll join using pos+1 next, so we want continuous integers
    -- I've changed the example entity_id from 3 to 4 to demonstrate this.
    SELECT ROW_NUMBER() OVER () AS pos, entity_id, ranges_arr
    FROM ranges_arrays
)
,intersections (pos, subranges) AS (
    -- We start off with the infinite range.
    SELECT 0::bigint, ARRAY['[,)'::tsrange]
    UNION ALL
    -- Then, we unnest the previous intermediate result,
    -- cross join it against the array of ranges from the
    -- next row in numbered_ranges_arrays (joined via pos+1).
    -- We take the intersections and remove the empty ranges.
    SELECT r.pos,
           ARRAY(SELECT x * y FROM unnest(r.ranges_arr) x CROSS JOIN unnest(i.subranges) y WHERE NOT isempty(x * y))
    FROM numbered_ranges_arrays r
        INNER JOIN intersections i ON r.pos=i.pos+1
)
,last_intersections AS (
    -- We just really want the result from the last operation (with the max pos).
    SELECT subranges FROM intersections ORDER BY pos DESC LIMIT 1
)
SELECT unnest(subranges) r FROM last_intersections ORDER BY r

Unfortunately, I'm not sure whether this would perform any better. You would probably need a larger dataset to get meaningful benchmarks.
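
One way to obtain such a dataset is to generate it. This is only a sketch: the table name availability, the number of entities (20), the number of slots per entity (36) and the slot lengths are all arbitrary assumptions, chosen so that the ranges of a single entity never overlap each other.

```sql
-- Hypothetical benchmark data: 20 entities, each with 36 availability
-- slots that start every 10 days and last between 3 and 7 days.
CREATE TEMP TABLE availability AS
SELECT e AS entity_id,
       '2014-01-01'::timestamp + d * interval '10 days' AS begin_time,
       '2014-01-01'::timestamp + d * interval '10 days'
           + (3 + (e + d) % 5) * interval '1 day' AS end_time
FROM generate_series(1, 20) AS e
CROSS JOIN generate_series(0, 35) AS d;

-- Sanity check: 20 * 36 = 720 rows spread over 20 entities
SELECT count(*) AS row_count, count(DISTINCT entity_id) AS entities
FROM availability;
```

The td CTE in the queries above could then be replaced with a SELECT from availability, and each variant run under EXPLAIN ANALYZE to compare plans and timings.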

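Answer 2 (score: 0)

OK, I wrote and tested this in T-SQL, but it should run, or at least be close enough for you to translate, since it only uses fairly ordinary constructs. The one possible exception is BETWEEN, but that can be split into a >= clause and a <= clause. (Thanks @Horse)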

WITH cteSched AS ( --Schedule for everyone
    -- Test data. Returns:
    -- ["2014-01-20 00:00:00","2014-01-31 00:00:00")
    -- ["2014-02-01 00:00:00","2014-02-20 00:00:00")
    -- ["2014-04-15 00:00:00","2014-04-20 00:00:00")
    SELECT 1 AS entity_id, '2014-01-01' AS begin_time, '2014-01-31' AS end_time
    UNION SELECT 1, '2014-02-01', '2014-02-28'
    UNION SELECT 1, '2014-04-01', '2014-04-30'
    UNION SELECT 2, '2014-01-15', '2014-02-20'
    UNION SELECT 2, '2014-04-15', '2014-05-05'
    UNION SELECT 3, '2014-01-20', '2014-04-20'
), cteReq as (  --List of people to schedule (or is everyone in Sched required? Not clear, doesn't hurt)
    SELECT 1 as entity_id UNION SELECT 2 UNION SELECT 3
), cteBegins as (
    SELECT distinct begin_time FROM cteSched as T 
    WHERE NOT EXISTS (SELECT entity_id FROM cteReq as R 
                      WHERE NOT EXISTS (SELECT * FROM cteSched as X 
                                        WHERE X.entity_id = R.entity_id 
                                            AND T.begin_time BETWEEN X.begin_time AND X.end_time ))
) SELECT B.begin_time, MIN(S.end_time ) as end_time  
  FROM cteBegins as B cross join cteSched as S 
  WHERE B.begin_time between S.begin_time and S.end_time 
  GROUP BY B.begin_time
-- NOTE: This assumes users do not have schedules that overlap with themselves! That is, nothing like
-- John is available 2014-01-01 to 2014-01-15 and 2014-01-10 to 2014-01-20. 

EDIT: Added the output of the query above (when executed on SQL Server 2008 R2):
    begin_time end_time
    2014-01-20 2014-01-31
    2014-02-01 2014-02-20
    2014-04-15 2014-04-20
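
For PostgreSQL, the same idea can be sketched with explicit timestamp casts and with BETWEEN split into >= and <= comparisons. This is an adaptation rather than the query that was tested on SQL Server; the CTE names sched, req and begins are shortened stand-ins for the ones above, and req is assumed to be every entity that appears in the schedule.

```sql
WITH sched (entity_id, begin_time, end_time) AS
(
    VALUES
    (1, '2014-01-01'::timestamp, '2014-01-31'::timestamp),
    (1, '2014-02-01', '2014-02-28'),
    (1, '2014-04-01', '2014-04-30'),
    (2, '2014-01-15', '2014-02-20'),
    (2, '2014-04-15', '2014-05-05'),
    (3, '2014-01-20', '2014-04-20')
)
, req AS
(
    -- Assumption: everyone with a schedule is required
    SELECT DISTINCT entity_id FROM sched
)
, begins AS
(
    -- Begin times at which no required entity is unavailable
    SELECT DISTINCT t.begin_time
    FROM sched t
    WHERE NOT EXISTS
    (
        SELECT 1
        FROM req r
        WHERE NOT EXISTS
        (
            SELECT 1
            FROM sched x
            WHERE x.entity_id = r.entity_id
              AND t.begin_time >= x.begin_time
              AND t.begin_time <= x.end_time
        )
    )
)
-- Each common period ends at the earliest end time among the
-- schedule rows that contain its begin time
SELECT b.begin_time, min(s.end_time) AS end_time
FROM begins b
JOIN sched s
    ON b.begin_time >= s.begin_time
   AND b.begin_time <= s.end_time
GROUP BY b.begin_time
ORDER BY b.begin_time;
```

On the test data this should yield the same three periods as the output above. Note that the comparisons treat end_time as inclusive, unlike the half-open tsrange values used in the other answers.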