How to group similar rows in SQL Server

Time: 2017-12-13 16:28:21

Tags: sql sql-server tsql

I have a table like this:

Date        ConfigID    ItemID    ClientName    Metric1    Metric2
====        ========    ======    ==========    =======    =======
2017-01-01  1           1         A             2.0        2.0
2017-01-01  3           1         A             2.0        2.0
2017-01-01  4           2         B             5.0        5.0
2017-01-02  4           3         A             6.0        6.0
2017-01-01  2           1         A             2.0        2.0
....
(20 million rows here)

I want to detect duplicates based on Date, ItemID, ClientName, Metric1, and Metric2, so I wrote:

CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);    

WITH Dupes          
AS (            
    SELECT *        
        ,ROW_NUMBER() OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            ORDER BY [Date] DESC
    ) AS RowNum 
    FROM myTable)

SELECT *
FROM Dupes

But what it returns looks like this:

Date        ConfigID    ItemID    ClientName    Metric1    Metric2    RowNum
====        ========    ======    ==========    =======    =======    ======
2017-01-01  1           1         A             2.0        2.0        1
2017-01-01  3           1         A             2.0        2.0        2
2017-01-01  4           2         B             5.0        5.0        1
2017-01-02  4           3         A             6.0        6.0        1
2017-01-01  2           1         A             2.0        2.0        3
....
(20 million rows here)

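I want similar rows grouped together according to the PARTITION BY clause. In other words, I would like to see something like this (I don't really need RowNum):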

Date        ConfigID    ItemID    ClientName    Metric1    Metric2    RowNum
====        ========    ======    ==========    =======    =======    ======
2017-01-01  1           1         A             2.0        2.0        1
2017-01-01  3           1         A             2.0        2.0        2
2017-01-01  2           1         A             2.0        2.0        3
2017-01-01  4           2         B             5.0        5.0        1
2017-01-02  4           3         A             6.0        6.0        1
....
(20 million rows here)

What SQL query would help me group the duplicate/similar rows in the table? Thanks for your suggestions and answers!

4 answers:

Answer 0: (score: 1)

Just add an ORDER BY to the final SELECT:

;           
WITH Dupes          
AS (            
    SELECT *        
        ,ROW_NUMBER() OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            ORDER BY [Date] DESC
    ) AS RowNum 
    FROM myTable)

SELECT *
FROM Dupes
ORDER BY [Date]
        ,[ItemID]
        ,[ClientName]
        ,[Metric1]
        ,[Metric2]
        ,RowNum

Answer 1: (score: 1)

I think you just need ORDER BY. And the CTE is not necessary:

. . .
SELECT *
FROM Dupes
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2];
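
As a minimal sketch of what a CTE-free version might look like, the query below sorts MyTable directly on the grouping columns so that similar rows end up next to each other. The explicit column list and the trailing ConfigID tie-breaker are assumptions added for illustration; they are not part of the original answer.

```sql
-- Sketch: no CTE and no window function, only a sort on the grouping columns.
-- ConfigID is appended as a tie-breaker (an assumption) so that rows inside
-- each group of similar rows come back in a stable order.
SELECT [Date], ConfigID, ItemID, ClientName, Metric1, Metric2
FROM MyTable
ORDER BY [Date], ItemID, ClientName, Metric1, Metric2, ConfigID;
```

With the five sample rows from the question, the three rows for ItemID 1 come back together, followed by the ItemID 2 row and then the 2017-01-02 row.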

Answer 2: (score: 1)

Could using DENSE_RANK instead of ROW_NUMBER help?

;           
WITH Dupes          
AS (            
    SELECT *        
        ,DENSE_RANK ( )
        OVER (    
            ORDER BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
             DESC -- applies only to [Metric2]; the grouping itself is unaffected
    ) AS GroupID 
    FROM myTable)

SELECT *
FROM Dupes

Here is the suggested solution:

;           
WITH D1          
AS (            
    SELECT *        
        ,DENSE_RANK ( )
        OVER (    
            ORDER BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
             DESC
    ) AS GroupID 
    FROM myTable)
, Dupes AS (
    SELECT *
        , COUNT(*) OVER (PARTITION BY GroupID) AS GroupItemsCount
    FROM D1
)
SELECT *
FROM Dupes
WHERE GroupItemsCount <> 1

But a better approach might be:

;           
WITH Dupes          
AS (            
    SELECT *        
        ,COUNT(*)
        OVER (    
            partition BY
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]           
    ) AS GroupItemsCount
    FROM myTable)

SELECT *
FROM Dupes
WHERE GroupItemsCount > 1

Answer 3: (score: 1)

As Luca suggested in the comments, using COUNT(*) OVER (PARTITION BY ...) seems to work:

CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);    

WITH Dupes          
AS (            
    SELECT *        
        ,COUNT(*) OVER (    
            PARTITION BY 
                [Date]
               ,[ItemID]
               ,[ClientName]
               ,[Metric1]
               ,[Metric2]
            ORDER BY [Date] DESC -- redundant: [Date] is a partition column, so the count is the same without it
    ) AS DupeCount 
    FROM myTable)

SELECT *
FROM Dupes
WHERE DupeCount > 1
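
To also get the duplicates listed next to each other, as the question asks, the count filter can be combined with the ORDER BY from the other answers. This is only a sketch under a few assumptions: it leaves out the ORDER BY inside OVER (it does not change the count here), and it adds ConfigID as a final sort key purely for a stable row order.

```sql
-- Sketch: keep only rows whose grouping columns occur more than once,
-- then sort so that each group of duplicates is contiguous.
-- The statement before WITH must be terminated with a semicolon.
WITH Dupes AS (
    SELECT *
          ,COUNT(*) OVER (
               PARTITION BY [Date], ItemID, ClientName, Metric1, Metric2
           ) AS DupeCount
    FROM MyTable
)
SELECT *
FROM Dupes
WHERE DupeCount > 1
ORDER BY [Date], ItemID, ClientName, Metric1, Metric2, ConfigID; -- ConfigID: assumed tie-breaker
```

On the sample data this should return only the three rows for ItemID 1, each with DupeCount = 3.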