Aggregating adjacent rows within partitions

Time: 2019-06-16 09:57:58

Tags: sql sql-server aggregate

I have a large dataset on MS SQL Server 2012 that requires a special kind of aggregation. Here is a sample of the data:

Key PartitionID StartTime                   Duration    Name
1   1           23/05/2019 18:18:28.125     1           X   
2   1           23/05/2019 18:18:28.480     2           Y   
3   1           23/05/2019 18:18:29.622     1           X   
4   1           23/05/2019 18:18:32.513     2           X   
5   2           23/05/2019 18:21:13.973     3           X   
6   2           23/05/2019 18:21:14.945     4           X   
7   2           23/05/2019 18:21:21.949     5           X   
8   2           23/05/2019 18:21:30.871     2           X   
9   2           23/05/2019 18:21:35.710     4           X   
10  2           23/05/2019 18:21:48.550     1           X   
11  2           23/05/2019 18:22:00.144     3           X   
12  2           23/05/2019 18:22:01.094     6           X   
13  2           23/05/2019 18:22:03.354     1           X   
14  3           23/05/2019 18:24:44.219     6           X   
15  3           23/05/2019 18:24:46.076     1           Y   
16  3           23/05/2019 18:24:52.399     4           X   
17  3           23/05/2019 18:25:03.620     6           X   
18  3           23/05/2019 18:25:11.208     1           X   
19  3           23/05/2019 18:25:12.616     4           X   
20  3           23/05/2019 18:25:28.019     6           X   
21  3           23/05/2019 18:25:31.384     2           Y   
21  3           23/05/2019 18:25:32.334     2           Y   
21  3           23/05/2019 18:25:33.344     2           X   

I need to create a new column that partitions the data into sets based on Name: consecutive rows with the same Name must share the same CalculatedID, and when occurrences of a Name are separated by a different Name, each run must get a different CalculatedID. In other words, adjacent rows have the same CalculatedID if and only if they have the same Name.

The result should look like this:

Key PartitionID StartTime                   Duration    Name    CalculatedID
1   1           23/05/2019 18:18:28.125     1           X       1
2   1           23/05/2019 18:18:28.480     2           Y       2
3   1           23/05/2019 18:18:29.622     1           X       3
4   1           23/05/2019 18:18:32.513     2           X       3
5   2           23/05/2019 18:21:13.973     3           X       1
6   2           23/05/2019 18:21:14.945     4           X       1
7   2           23/05/2019 18:21:21.949     5           X       1
8   2           23/05/2019 18:21:30.871     2           X       1
9   2           23/05/2019 18:21:35.710     4           X       1
10  2           23/05/2019 18:21:48.550     1           X       1
11  2           23/05/2019 18:22:00.144     3           X       1
12  2           23/05/2019 18:22:01.094     6           X       1
13  2           23/05/2019 18:22:03.354     1           X       1
14  3           23/05/2019 18:24:44.219     6           X       1
15  3           23/05/2019 18:24:46.076     1           Y       2
16  3           23/05/2019 18:24:52.399     4           X       3
17  3           23/05/2019 18:25:03.620     6           X       3
18  3           23/05/2019 18:25:11.208     1           X       3
19  3           23/05/2019 18:25:12.616     4           X       3
20  3           23/05/2019 18:25:28.019     6           X       3
21  3           23/05/2019 18:25:31.384     2           Y       4
21  3           23/05/2019 18:25:32.334     2           Y       4
21  3           23/05/2019 18:25:33.344     2           X       5

I would really like to avoid looping over the data, because the dataset can easily exceed 10 million rows.

1 answer:

Answer 0 (score: 3)

This can be done with a common table expression that uses LAG to get the previous value of Name for each row (partitioned by PartitionID, ordered by StartTime), and then uses SUM as a window function to compute a running count of the rows whose Name differs from the previous row's Name. Note that LAG returns NULL for the first row in each partition, and `Name = NULL` evaluates to unknown, so the first row always contributes 1 to the running sum.
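The two-step logic (lag, then a running sum of change flags) can be sketched outside the database. This is an illustrative Python snippet for a single partition, not part of the original answer:

```python
# Illustrative sketch of the LAG + running-SUM logic from the SQL answer.
# Input is assumed to be pre-sorted by StartTime within one PartitionID.

def calculated_ids(names):
    """Assign a group id that increments whenever Name changes."""
    ids = []
    prev = None           # LAG(Name) is NULL for the first row
    group_id = 0
    for name in names:
        if name != prev:  # None never equals a Name, so the first row counts as a change
            group_id += 1
        ids.append(group_id)
        prev = name
    return ids

# Partition 1 of the sample data: X, Y, X, X  ->  1, 2, 3, 3
print(calculated_ids(["X", "Y", "X", "X"]))
```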

First, create and populate the sample table (please save us this step in your future questions):

DECLARE @T AS TABLE
(
    [Key] int,
    PartitionID int,
    StartTime datetime,
    Duration int,   
    Name char(1)
)

INSERT INTO @T ([Key] ,PartitionID, StartTime, Duration, Name) VALUES
(1 , 1, '2019-05-23T18:18:28.125', 1, 'X'),   
(2 , 1, '2019-05-23T18:18:28.480', 2, 'Y'),   
(3 , 1, '2019-05-23T18:18:29.622', 1, 'X'),   
(4 , 1, '2019-05-23T18:18:32.513', 2, 'X'),   
(5 , 2, '2019-05-23T18:21:13.973', 3, 'X'),   
(6 , 2, '2019-05-23T18:21:14.945', 4, 'X'),   
(7 , 2, '2019-05-23T18:21:21.949', 5, 'X'),   
(8 , 2, '2019-05-23T18:21:30.871', 2, 'X'),   
(9 , 2, '2019-05-23T18:21:35.710', 4, 'X'),   
(10, 2, '2019-05-23T18:21:48.550', 1, 'X'),   
(11, 2, '2019-05-23T18:22:00.144', 3, 'X'),   
(12, 2, '2019-05-23T18:22:01.094', 6, 'X'),   
(13, 2, '2019-05-23T18:22:03.354', 1, 'X'),   
(14, 3, '2019-05-23T18:24:44.219', 6, 'X'),   
(15, 3, '2019-05-23T18:24:46.076', 1, 'Y'),   
(16, 3, '2019-05-23T18:24:52.399', 4, 'X'),   
(17, 3, '2019-05-23T18:25:03.620', 6, 'X'),   
(18, 3, '2019-05-23T18:25:11.208', 1, 'X'),   
(19, 3, '2019-05-23T18:25:12.616', 4, 'X'),   
(20, 3, '2019-05-23T18:25:28.019', 6, 'X'),   
(21, 3, '2019-05-23T18:25:31.384', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:32.334', 2, 'Y'),   
(21, 3, '2019-05-23T18:25:33.344', 2, 'X')

The common table expression:

;WITH CTE AS
(
    SELECT  [Key] ,PartitionID, StartTime, Duration, Name,
            LAG(Name) OVER(PARTITION BY PartitionID ORDER BY StartTime) As PrevName
    FROM @T
)

The query:

SELECT  [Key] ,PartitionID, StartTime, Duration, Name, 
        SUM(IIF(Name = PrevName, 0, 1)) OVER(PARTITION BY PartitionID ORDER BY StartTime) As CalculatedId
FROM CTE
ORDER BY [Key]
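Since the window functions amount to a single ordered scan per partition, the computation is linear in the number of rows, which matters for a 10M-row table. As a sketch of that per-partition pass in Python (input assumed already sorted by PartitionID and StartTime; this is an illustration, not the answer's method):

```python
from itertools import groupby

def add_calculated_id(rows):
    """rows: (PartitionID, Name) pairs sorted by PartitionID, StartTime.
    Returns (PartitionID, Name, CalculatedID) with the counter reset per partition."""
    out = []
    for _, partition in groupby(rows, key=lambda r: r[0]):
        prev = None
        group_id = 0
        for pid, name in partition:
            if name != prev:   # a Name change starts a new group
                group_id += 1
            out.append((pid, name, group_id))
            prev = name
    return out

# Crossing a partition boundary resets the counter, matching the
# PARTITION BY PartitionID clause in the SQL:
sample = [(1, "X"), (1, "Y"), (1, "X"), (1, "X"),
          (3, "X"), (3, "Y"), (3, "X")]
print(add_calculated_id(sample))
```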

Results:

Key PartitionID StartTime               Duration    Name    CalculatedId
1   1           23.05.2019 18:18:28     1           X       1
2   1           23.05.2019 18:18:28     2           Y       2
3   1           23.05.2019 18:18:29     1           X       3
4   1           23.05.2019 18:18:32     2           X       3
5   2           23.05.2019 18:21:13     3           X       1
6   2           23.05.2019 18:21:14     4           X       1
7   2           23.05.2019 18:21:21     5           X       1
8   2           23.05.2019 18:21:30     2           X       1
9   2           23.05.2019 18:21:35     4           X       1
10  2           23.05.2019 18:21:48     1           X       1
11  2           23.05.2019 18:22:00     3           X       1
12  2           23.05.2019 18:22:01     6           X       1
13  2           23.05.2019 18:22:03     1           X       1
14  3           23.05.2019 18:24:44     6           X       1
15  3           23.05.2019 18:24:46     1           Y       2
16  3           23.05.2019 18:24:52     4           X       3
17  3           23.05.2019 18:25:03     6           X       3
18  3           23.05.2019 18:25:11     1           X       3
19  3           23.05.2019 18:25:12     4           X       3
20  3           23.05.2019 18:25:28     6           X       3
21  3           23.05.2019 18:25:31     2           Y       4
21  3           23.05.2019 18:25:32     2           Y       4
21  3           23.05.2019 18:25:33     2           X       5