I have a table like this:
Date        ConfigID  ItemID  ClientName  Metric1  Metric2
==========  ========  ======  ==========  =======  =======
2017-01-01  1         1       A           2.0      2.0
2017-01-01  3         1       A           2.0      2.0
2017-01-01  4         2       B           5.0      5.0
2017-01-02  4         3       A           6.0      6.0
2017-01-01  2         1       A           2.0      2.0
....
(20 million rows here)
I want to detect duplicates based on Date, ItemID, ClientName, Metric1, and Metric2, so I wrote:
CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);
WITH Dupes
AS (
    SELECT *
        ,ROW_NUMBER() OVER (
            PARTITION BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2]
            ORDER BY [Date] DESC
        ) AS RowNum
    FROM myTable)
SELECT *
FROM Dupes
But it returns the following:
Date        ConfigID  ItemID  ClientName  Metric1  Metric2  RowNum
==========  ========  ======  ==========  =======  =======  ======
2017-01-01  1         1       A           2.0      2.0      1
2017-01-01  3         1       A           2.0      2.0      2
2017-01-01  4         2       B           5.0      5.0      1
2017-01-02  4         3       A           6.0      6.0      1
2017-01-01  2         1       A           2.0      2.0      3
....
(20 million rows here)
I want rows that match on the PARTITION BY columns to appear together. In other words, I would like to see something like this (I don't really need RowNum):
Date        ConfigID  ItemID  ClientName  Metric1  Metric2  RowNum
==========  ========  ======  ==========  =======  =======  ======
2017-01-01  1         1       A           2.0      2.0      1
2017-01-01  3         1       A           2.0      2.0      2
2017-01-01  2         1       A           2.0      2.0      3
2017-01-01  4         2       B           5.0      5.0      1
2017-01-02  4         3       A           6.0      6.0      1
....
(20 million rows here)
What SQL query would group the duplicate/similar rows in the table together? Thanks for any suggestions and answers!
Answer 0 (score: 1)
Just add an ORDER BY to the outer SELECT:
;WITH Dupes
AS (
    SELECT *
        ,ROW_NUMBER() OVER (
            PARTITION BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2]
            ORDER BY [Date] DESC
        ) AS RowNum
    FROM myTable)
SELECT *
FROM Dupes
ORDER BY [Date]
    ,[ItemID]
    ,[ClientName]
    ,[Metric1]
    ,[Metric2]
    ,RowNum
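One detail worth noting: the window's ORDER BY [Date] sorts on a column that is already in the PARTITION BY list, so every row in a group ties and the numbering within a group is arbitrary. A minimal sketch of a repeatable variant follows. It assumes ConfigID is unique within each duplicate group, which the sample data suggests but the question does not state.

```sql
-- Sketch only: assumes ConfigID is unique within each duplicate group.
;WITH Dupes AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]
               ORDER BY [ConfigID]   -- tie-breaker that is not a partition column
           ) AS RowNum
    FROM MyTable
)
SELECT *
FROM Dupes
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2], RowNum;
```

With this tie-breaker, RowNum follows ConfigID inside each group, so repeated runs number the rows the same way.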
Answer 1 (score: 1)
I think you just need ORDER BY. And the CTE is not necessary:
. . .
SELECT *
FROM Dupes
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2];
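Since the question says RowNum is not really needed, the CTE-free form could look like the sketch below. It queries MyTable from the question directly. The trailing ConfigID sort key is an addition for a stable order inside each group, not part of the answer above.

```sql
-- Sketch: no CTE and no window function. Sorting on the columns that define
-- a duplicate places matching rows next to each other.
SELECT *
FROM MyTable
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2],
         [ConfigID];   -- assumed tie-breaker, optional
```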
Answer 2 (score: 1)
Could using DENSE_RANK instead of ROW_NUMBER help?
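;WITH Dupes
AS (
    SELECT *
        -- Rows that tie on all five columns share one GroupID.
        -- DESC applies only to [Metric2]; the sort direction does not change which rows tie.
        ,DENSE_RANK() OVER (
            ORDER BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2] DESC
        ) AS GroupID
    FROM myTable)
SELECT *
FROM Dupes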
Here is the suggested solution, which keeps only the groups that contain more than one row:
;WITH D1
AS (
    SELECT *
        ,DENSE_RANK() OVER (
            ORDER BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2] DESC
        ) AS GroupID
    FROM myTable)
, Dupes AS (
    SELECT *
        ,COUNT(*) OVER (PARTITION BY GroupID) AS GroupItemsCount
    FROM D1
)
SELECT *
FROM Dupes
WHERE GroupItemsCount <> 1
But a better approach might be to count the rows in each partition directly:
;WITH Dupes
AS (
    SELECT *
        ,COUNT(*) OVER (
            PARTITION BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2]
        ) AS GroupItemsCount
    FROM myTable)
SELECT *
FROM Dupes
WHERE GroupItemsCount > 1
Answer 3 (score: 1)
Following Luca's suggestion in the comments, using COUNT(*) OVER (PARTITION BY ...) seems to work:
CREATE TABLE MyTable ([Date] date,
                      ConfigID int,
                      ItemID int,
                      ClientName char(1),
                      Metric1 decimal(3,1),
                      Metric2 decimal(3,1));
INSERT INTO MyTable
VALUES ('2017-01-01',1,1,'A',2.0,2.0),
       ('2017-01-01',3,1,'A',2.0,2.0),
       ('2017-01-01',4,2,'B',5.0,5.0),
       ('2017-01-02',4,3,'A',6.0,6.0),
       ('2017-01-01',2,1,'A',2.0,2.0);
WITH Dupes
AS (
    SELECT *
        ,COUNT(*) OVER (
            PARTITION BY
                [Date]
                ,[ItemID]
                ,[ClientName]
                ,[Metric1]
                ,[Metric2]
            ORDER BY [Date] DESC
        ) AS DupeCount
    FROM myTable)
SELECT *
FROM Dupes
WHERE DupeCount > 1
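A possible refinement, offered as a sketch rather than a tested fix for the 20-million-row table: the ORDER BY [Date] inside the COUNT(*) window sorts on a partition column, so every row in a partition ties and, as far as I can tell, the count comes out the same without it. Dropping it and sorting the outer query keeps each group of duplicates together in the result.

```sql
-- Sketch: count rows per duplicate group, keep only groups with more than one
-- row, and sort so each group's rows come out next to each other.
;WITH Dupes AS (
    SELECT *,
           COUNT(*) OVER (
               PARTITION BY [Date], [ItemID], [ClientName], [Metric1], [Metric2]
           ) AS DupeCount
    FROM MyTable
)
SELECT *
FROM Dupes
WHERE DupeCount > 1
ORDER BY [Date], [ItemID], [ClientName], [Metric1], [Metric2];
```

Removing the WHERE clause would list every row, with DupeCount showing the size of its group.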