How to check a Type 2 SCD dimension

Time: 2017-07-07 09:29:22

Tags: sql sql-server tsql

For an SCD Type 2 dimension, I am having trouble identifying and repairing some records that have overlapping time intervals. What I have is:

Bkey   Uid  startDate                       endDate
'John'  1   1990-01-01 (some time stamp)    2017-01-10 (some time stamp)
'John'  2   2016-11-03 (some time stamp)    2016-11-14 (some time stamp)
'John'  3   2016-11-14 (some time stamp)    2016-12-29 (some time stamp)
'John'  4   2016-12-29 (some time stamp)    2017-01-10 (some time stamp)
'John'  5   2017-01-10 (some time stamp)    2017-04-22 (some time stamp)
......

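First, I want to find all the Johns that have overlapping time periods, in a table with many, many Johns, and then work out a way to correct those overlapping periods. For the latter I know there are functions such as LAG and LEAD that can handle it, but I cannot work out how to find the overlaps. Any hints? Regards,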

2 answers:

Answer 0: (score: 1)

[1] The following query returns the overlapping time ranges:

SELECT  *,
        (
            SELECT  *
            FROM    @Dimension1 y
            WHERE   x.Bkey = y.Bkey
            AND     x.Uid <> y.Uid
            AND     NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
            FOR XML RAW, ROOT, TYPE
        ) OverlappingTimeRanges
FROM    @Dimension1 x

Full script:

DECLARE @Dimension1 TABLE (
    Bkey        VARCHAR(50) NOT NULL,
    Uid         INT NOT NULL,
    startDate   DATE NOT NULL,
    endDate     DATE NOT NULL,
        CHECK(startDate < endDate)
);
INSERT  @Dimension1 
SELECT 'John',  1,   '1990-01-01', '2017-01-10' UNION ALL
SELECT 'John',  2,   '2016-11-03', '2016-11-14' UNION ALL
SELECT 'John',  3,   '2016-11-14', '2016-12-29' UNION ALL
SELECT 'John',  4,   '2016-12-29', '2017-01-10' UNION ALL
SELECT 'John',  5,   '2017-01-11', '2017-04-22';

SELECT  *,
        (
            SELECT  *
            FROM    @Dimension1 y
            WHERE   x.Bkey = y.Bkey
            AND     x.Uid <> y.Uid
            AND     NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
            FOR XML RAW, ROOT, TYPE
        ) OverlappingTimeRanges
FROM    @Dimension1 x


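[2] To find the distinct time-range groups together with the original rows that overlap within each group, I would use the following approach: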

-- Edit 1
DECLARE @Groups TABLE (
    Bkey            VARCHAR(50) NOT NULL,
    Uid             INT NOT NULL,
    startDateNew    DATE NOT NULL,
    endDateNew      DATE NOT NULL,
        CHECK(startDateNew < endDateNew)
);
INSERT  @Groups
SELECT  x.Bkey, x.Uid, z.startDateNew, z.endDateNew
FROM    @Dimension1 x
OUTER APPLY (
    SELECT  MIN(y.startDate) AS startDateNew, MAX(y.endDate) AS endDateNew
    FROM    @Dimension1 y
    WHERE   x.Bkey = y.Bkey
    AND     NOT(x.startDate > y.endDate OR x.endDate < y.startDate)
) z
-- End of Edit 1

-- This returns distinct groups identified by DistinctGroupId together with all overlapping Uid(s) from current group
SELECT  *
FROM (
    SELECT ROW_NUMBER() OVER(ORDER BY b.Bkey, b.startDateNew, b.endDateNew) AS DistinctGroupId, b.*
    FROM (
        SELECT  DISTINCT a.Bkey, a.startDateNew, a.endDateNew
        FROM    @Groups a
    ) b
) c
OUTER APPLY (
    SELECT  d.Uid AS Overlapping_Uid
    FROM    @Groups d
    WHERE   c.Bkey = d.Bkey
    AND     c.startDateNew = d.startDateNew
    AND     c.endDateNew = d.endDateNew
) e

-- This returns distinct groups identified by DistinctGroupId together with an XML (XmlCol) which includes overlapping Uid(s)
SELECT  *
FROM (
    SELECT ROW_NUMBER() OVER(ORDER BY b.Bkey, b.startDateNew, b.endDateNew) AS DistinctGroupId, b.*
    FROM (
        SELECT  DISTINCT a.Bkey, a.startDateNew, a.endDateNew
        FROM    @Groups a
    ) b
) c
OUTER APPLY (
    SELECT (
    SELECT  d.Uid AS Overlapping_Uid
    FROM    @Groups d
    WHERE   c.Bkey = d.Bkey
    AND     c.startDateNew = d.startDateNew
    AND     c.endDateNew = d.endDateNew
    FOR XML RAW, TYPE
    ) AS XmlCol
) e


Note: the last range used in my example is 'John', 5, '2017-01-11', '2017-04-22' rather than 'John', 5, '2017-01-10', '2017-04-22'. Also, the data type used is DATE rather than DATETIME.
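The change to the last range matters because the NOT(...) test above treats two ranges that merely touch (one row's endDate equal to another row's startDate) as overlapping. In the question's data every version starts exactly where the previous one ends, so here is a minimal sketch for that case. It assumes the intervals are half-open (endDate is exclusive) and therefore uses strict comparisons; that assumption and the OverlappingUid alias are not part of the original answer.

```sql
-- Sketch only: assumes half-open intervals [startDate, endDate).
DECLARE @Dimension1 TABLE (
    Bkey        VARCHAR(50) NOT NULL,
    Uid         INT NOT NULL,
    startDate   DATE NOT NULL,
    endDate     DATE NOT NULL,
        CHECK(startDate < endDate)
);
-- The rows as given in the question (Uid 5 starts on 2017-01-10).
INSERT  @Dimension1 (Bkey, Uid, startDate, endDate)
VALUES  ('John', 1, '1990-01-01', '2017-01-10'),
        ('John', 2, '2016-11-03', '2016-11-14'),
        ('John', 3, '2016-11-14', '2016-12-29'),
        ('John', 4, '2016-12-29', '2017-01-10'),
        ('John', 5, '2017-01-10', '2017-04-22');

SELECT  x.Bkey, x.Uid, y.Uid AS OverlappingUid
FROM    @Dimension1 x
INNER JOIN @Dimension1 y
        ON  y.Bkey = x.Bkey
        AND y.Uid  > x.Uid                  -- report each pair only once
        AND x.startDate < y.endDate         -- strict: touching ranges do not count
        AND y.startDate < x.endDate;
```

With these rows the query reports only the pairs (1, 2), (1, 3) and (1, 4); Uid 5 is no longer flagged, because it starts on the same day Uid 1 and Uid 4 end.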

Answer 1: (score: 0)

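I think the tricky part of your query is expressing the overlap logic clearly. We can self-join the table on the condition that a row on the left overlaps some other row on the right. Every row that finds a match is an overlapping row.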

We can consider four possible scenarios:

|---------|   |---------|    no overlap

|---------|
       |---------|           1st end and 2nd start overlap

       |---------|
 |---------|                 1st start and 2nd end overlap

 |---------|
    |---|                    2nd completely contained inside 1st
                             (could be 1st inside 2nd also)

SELECT DISTINCT
    t1.Uid
FROM yourTable t1
INNER JOIN yourTable t2
    ON t1.Uid <> t2.Uid AND              -- do not match a row with itself
       t1.startDate <= t2.endDate AND
       t2.startDate <= t1.endDate
WHERE
    t1.Bkey = 'John' AND t2.Bkey = 'John'

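This should at least let you identify the overlapping records. Updating and separating them in a meaningful way will probably turn into an ugly gaps-and-islands problem, and perhaps warrants a separate question.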
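Since the question mentions LAG and LEAD, here is a hedged sketch of a window-function variant. It assumes SQL Server 2012 or later, half-open intervals (endDate exclusive), and a repair rule in which each version simply ends where the next one starts. That rule is an assumption and may not match your business logic. The table variable @Dim and the aliases prevMaxEnd and proposedEndDate are illustrative names only.

```sql
-- Sketch only: sample rows copied from the question.
DECLARE @Dim TABLE (
    Bkey      VARCHAR(50) NOT NULL,
    Uid       INT NOT NULL,
    startDate DATE NOT NULL,
    endDate   DATE NOT NULL
);
INSERT @Dim (Bkey, Uid, startDate, endDate)
VALUES ('John', 1, '1990-01-01', '2017-01-10'),
       ('John', 2, '2016-11-03', '2016-11-14'),
       ('John', 3, '2016-11-14', '2016-12-29'),
       ('John', 4, '2016-12-29', '2017-01-10'),
       ('John', 5, '2017-01-10', '2017-04-22');

-- Detection: a row overlaps if it starts before the latest endDate
-- seen among all earlier rows of the same Bkey.
WITH Ordered AS (
    SELECT Bkey, Uid, startDate, endDate,
           MAX(endDate) OVER (
               PARTITION BY Bkey
               ORDER BY startDate, Uid
               ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
           ) AS prevMaxEnd
    FROM   @Dim
)
SELECT Bkey, Uid, startDate, endDate, prevMaxEnd
FROM   Ordered
WHERE  startDate < prevMaxEnd;

-- Possible repair (assumed rule): end each version at the next version's
-- startDate; the last version of each Bkey keeps its own endDate.
SELECT Bkey, Uid, startDate,
       COALESCE(
           LEAD(startDate) OVER (PARTITION BY Bkey ORDER BY startDate, Uid),
           endDate
       ) AS proposedEndDate
FROM   @Dim
ORDER BY Bkey, startDate, Uid;
```

On the sample rows the first query flags Uid 2, 3 and 4, that is, the rows that start before an earlier row (here Uid 1) has ended. The second query only proposes new end dates; review them before turning it into an UPDATE.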